© 2017 by Subhro Roy. All rights reserved.

REASONING ABOUT QUANTITIES IN NATURAL LANGUAGE

BY

SUBHRO ROY

DISSERTATION

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017

Urbana, Illinois

Doctoral Committee:

Professor Dan Roth, Chair
Professor Gerald DeJong
Associate Professor Julia Hockenmaier
Associate Professor Luke Zettlemoyer, University of Washington

Abstract

Quantitative reasoning involves understanding the use of quantities and numeric relations in text, and reasoning with respect to them. It forms an essential part of everyday interaction. However, little work from the Natural Language Processing community has focused on quantitative reasoning. In this thesis, we investigate the challenges in performing automated quantitative reasoning over natural language text. We formulate several tasks to tackle some of the fundamental problems of quantitative reasoning, and address the problem of developing robust statistical methods for these tasks. We show that standard NLP tools are not sufficient to obtain the abstraction needed for quantitative reasoning; the standard NLP pipeline needs to be extended in various ways. We propose several technical ideas for these extensions.

We first look at the problem of detecting and normalizing quantities expressed in free-form text, and show that correct detection and normalization can support several simple quantitative inferences. We then focus on numeric relation extraction from sentences, and show that several natural properties of language can be leveraged to effectively extract numeric relations from a sentence. We finally investigate the problem of quantitative reasoning over multiple quantities mentioned across several sentences. We develop a decomposition strategy which allows reasoning over pairs of numbers to be combined effectively to perform global reasoning.

We also look at the problem of effectively using math domain knowledge in quantitative reasoning. On this front, we first propose graph representations called “unit dependency graphs”, and show that these graph representations can be used to effectively incorporate dimensional analysis knowledge in quantitative reasoning. Next, we develop a general framework to incorporate any declarative knowledge into quantitative reasoning. This framework is used to incorporate several mathematical concepts into textual quantitative reasoning, leading to robust reasoning systems.

To my mom.

Acknowledgments

I consider myself fortunate to have Dan as my research advisor. He took me on as a PhD student when I was a confused theory deserter with no clue about NLP. He had the most encouraging words about my progress, even when I felt frustrated with the lack of results. He always encouraged me to strive for better results, and to understand the bigger picture of the research. He has taught me to not only be a better researcher, but also to be a kind and compassionate human being.

I would like to thank my committee members: Prof. Gerald DeJong, Prof. Julia Hockenmaier and Prof. Luke Zettlemoyer, from whom I got valuable feedback on my research. I would also like to thank all my internship mentors and collaborators, namely Ming-Wei Chang, Scott Yih, Hannaneh Hajishirzi, Rik Koncel-Kedziorski, Mark Hopkins and JD Chen; I have learnt a lot working with them.
Special thanks to my friends Shyam and John; I will miss our long afternoon walks discussing research, among other things. I would like to thank Snigdha; her help was instrumental in finishing this thesis on time. I would also like to thank the past and present members of Cogcomp, namely Daniel, Stephen, Haoruo, Vinod, Rajhans, Kai-Wei, Gourab, Nitish, Colin, Shashank (Srivastava and Gupta), Michael, Christos, Parisa, Chen-Tse, Pavan, Yangqiu and Mark. They gifted me an ideal, stimulating environment for research. Also, special thanks to my friends Arka, Arun, Anish, Sangeetha and Mainak, who provided me with a lot of support in troubled times.

Finally, I would like to thank my family – my mom, dad and brother. My parents left no stone unturned to provide me with the best quality education, in spite of facing many hardships themselves to make it possible. This thesis could not have been completed without my family’s encouragement, support and love. I wish my mom were here today to see me graduate. Ma, words cannot describe how much I miss you.

Table of Contents

List of Tables
List of Figures
Chapter 1 Introduction
    1.1 Motivation
    1.2 Math Word Problems: An Abstraction for Quantitative Reasoning
    1.3 Challenges in Automated Quantitative Reasoning
        1.3.1 Variability of Natural Language Text
        1.3.2 Quantity Argument Detection
        1.3.3 Reasoning about Multiple Quantities
        1.3.4 Incorporating Domain Knowledge
    1.4 Contributions of the Thesis
    1.5 Overview of the Thesis
Chapter 2 Background
    2.1 What is a Structure?
    2.2 Structured Prediction
    2.3 Inference in Structured Models
    2.4 Learning Structured Prediction Models
        2.4.1 Structural Support Vector Machines
    2.5 Discussion and Special Case
    2.6 Structured Prediction with Latent Variables
Chapter 3 Early Work in Quantitative Reasoning
    3.1 Early Algebra and Arithmetic Solvers
        3.1.1 STUDENT Program
        3.1.2 Other Arithmetic Word Problem Solvers
    3.2 Solvers for other Domains
    3.3 Recent Advances in Statistical Solvers
Chapter 4 Quantity Entailment
    4.1 Introduction
    4.2 Related Work
    4.3 Representing Quantities
    4.4 Extraction of Quantities
        4.4.1 Features
        4.4.2 Mapping Text Segments into QVR
        4.4.3 Extraction of Units
    4.5 Quantity Entailment
        4.5.1 Reasoning Framework
        4.5.2 Scope of QE Inference
    4.6 Experimental Study
        4.6.1 Datasets
        4.6.2 Quantity Segmentation
        4.6.3 Quantity Entailment
        4.6.4 Currency Range Search
        4.6.5 Qualitative Analysis
    4.7 Conclusion
Chapter 5 Equation Parsing
    5.1 Introduction
    5.2 Related Work
    5.3 The Equation Parsing Task
        5.3.1 Equation Parse Representation
    5.4 Projectivity
    5.5 Predicting Equation Parse
        5.5.1 Predicting Quantity Trigger List
        5.5.2 Predicting Variable Trigger List
        5.5.3 Predicting Equation Tree
        5.5.4 Alternatives
    5.6 Experimental Results
        5.6.1 Dataset
        5.6.2 Equation Parsing Modules
        5.6.3 Equation Parsing Results
        5.6.4 Error Analysis
    5.7 Conclusion
Chapter 6 Arithmetic Word Problems
    6.1 Introduction
    6.2 Related Work
    6.3 Expression Tree and Problem Decomposition
    6.4 Mapping Problems to Expression Trees
        6.4.1 Global Inference for Expression Trees
        6.4.2 Quantity Schema
        6.4.3 Relevance Classifier
        6.4.4 LCA Operation Classifier
    6.5 Experimental Results
        6.5.1 Datasets
        6.5.2 Relevance Classifier
        6.5.3 LCA Operation Classifier
        6.5.4 Global Inference Module
        6.5.5 Discussion
    6.6 Conclusion
Chapter 7 Unit Dependency Graphs
    7.1 Introduction
    7.2 Unit Dependency Graph
    7.3 Learning to Predict UDGs
        7.3.1 Vertex Classifier
        7.3.2 Edge Classifier
        7.3.3 Constrained Inference
    7.4 Joint Inference With An Arithmetic Solver
        7.4.1 Monotonic Expression Tree
        7.4.2 Arithmetic Word Problem Solver
        7.4.3 Joint Inference
        7.4.4 Consistent Rate Unit Graphs
    7.5 Experiments
        7.5.1 Dataset
        7.5.2 Data Acquisition
        7.5.3 UDG Prediction
        7.5.4 Solving Arithmetic Word Problems
        7.5.5 Discussion
    7.6 Conclusion
Chapter 8 Mapping to Declarative Knowledge
    8.1 Introduction
    8.2 Knowledge Representation
        8.2.1 Transfer
        8.2.2 Dimensional Analysis
        8.2.3 Explicit Math
        8.2.4 Part-Whole Relation
    8.3 Mapping of Word Problems to Declarative Knowledge
        8.3.1 Scoring Answer Derivations
        8.3.2 Learning
        8.3.3 Inference
    8.4 Model and Implementation Details
        8.4.1 Supervision
        8.4.2 Features
    8.5 Experiments
        8.5.1 Results on Existing Dataset
        8.5.2 New Dataset Creation
        8.5.3 Generalization from Biased Dataset
        8.5.4 Results on the New Dataset
        8.5.5 Analysis
    8.6 Related Work
    8.7 Conclusion
Chapter 9 Conclusion and Future Work
    9.1 Summary
    9.2 Limitations
    9.3 Future Directions
        9.3.1 Commonsense Quantitative Reasoning
        9.3.2 Unified Framework for Solvers
        9.3.3 Automated Math Tutoring System
References
Appendix A Lexicon for Equation Parsing
Appendix B Declarative Rules for Arithmetic Word Problems

List of Tables

3.1 Schemas used in WORDPRO
4.1 10-fold cross-validation results of segmentation accuracy and time required for segmentation; the columns for runtime have been normalized and expressed as ratios
4.2 Results of QE; adding semantics (+SEM) consistently improves performance; only 43.3% of entailing quantities can be recovered by simple string matching
4.3 Micro-averaged accuracy in detecting monetary mentions
5.1 Input and output for Equation Parsing
5.2 Summary of notations used in this chapter
5.3 Statistics of the dataset
5.4 Performance of Quantity Trigger List Prediction
5.5 Performance of Variable Trigger List Prediction
5.6 Performance of Equation Tree Prediction
5.7 Performance on equation parsing
6.1 Performance of the LCA Operation classifier on the datasets AI2, IL and CC
6.2 Performance of the Relevance classifier on the datasets AI2 and IL
6.3 Accuracy in correctly solving arithmetic problems. The first four rows represent various configurations of our system. We achieve state of the art results on both the AI2 and IL datasets.
7.1 Units of rate quantities
7.2 Performance of system components for predicting vertex and edge labels for unit dependency graphs
7.3 Performance in predicting UDGs
7.4 Performance in solving arithmetic word problems
7.5 Examples of problems which UnitDep gets correct, but LCA++ does not
8.1 Two examples of arithmetic word problems, and derivation of the answer. For each combination, first a knowledge type is chosen, and then a declarative rule from that knowledge type is chosen to infer the operation.
8.2 Accuracy in solving arithmetic word problems. All columns except the last report 5-fold cross-validation results. ∗ indicates statistically significant improvement (p = 0.05) over the second highest score in the column.
8.3 Pairs of perturbed problems, along with the systems which get them correct
8.4 Examples which Knowledge gets correct, but UnitDep does not
8.5 Examples of errors made by Knowledge

List of Figures

5.1 A sentence with its trigger list and equation tree. −r indicates subtraction with order rl.
6.1 An arithmetic word problem, solution expression and the corresponding expression tree
6.2 Two different expression trees for the numeric expression (3 × 5) + 7 − 8 − 9. The right one is monotonic, whereas the left one is not.
7.1 An arithmetic word problem, its UDG, and a tree representation of the solution (66 − 10)/8. Several dependencies exist between the UDG and the final solution of a problem. Here, “66” and “10” are connected via a Same Unit edge, hence they can be added or subtracted; “8” is connected by a Den Unit edge to the question, indicating that some expression will be divided by “8” to get the answer’s unit.
8.1 An example arithmetic word problem and its solution, along with the type of domain knowledge required to generate each operation of the solution
8.2 Annotations in our dataset. Number List refers to the numbers detected in the problem. The subscripts in the solution indicate the position of the numbers in the number list.

Chapter 1

Introduction

1.1 Motivation

Numbers are an integral part of natural language. We use numbers extensively to communicate how hot the weather is, how well our favorite team played in the last game, etc. Newspaper articles regularly report statistics to present an objective assessment of a situation, ranging from the number of people killed in a bomb blast to the amount of money embezzled by a politician. To understand such articles, one often needs to figure out several properties of the numbers used in them, and how these numbers interact with the rest of the story. Consider the following sentence taken from a news article in the Chicago Tribune [Chi, 2017]:

    Emanuel spent $13.6 million from July until the Feb. 24 election and spent an additional $6.3 million in the following five weeks.

In order to understand the sentence, one has to understand the following for each numeric mention:

1. Unit: Indicates which object is being referred to. One has to understand which numbers indicate monetary amounts, which ones indicate dates and times, etc.

2. Associated Verb: Indicates the effect of the number on the world, whether it is indicating the state of an economy, the amount of money in a transaction, etc. In this case, both “13.6 million” and “6.3 million” have the verb “spend” associated with them, indicating that both are expenditures made by the campaign.

3. Arguments: Indicates the entities interacting with the number. For example, for a monetary transaction, one wants to know the recipient and the donor. In the example above, knowing that both “13.6 million” and “6.3 million” are expenditures made by Emanuel enables the reader to understand that, in total, Emanuel spent 19.9 million dollars.

There is also a need to understand the interactions of several numbers mentioned across several sentences in a text. Consider the following excerpt from a recent news article [Syr, 2017].

    Local sources told The New Arab that a car bomb had targeted a Free Syria Army checkpoint at the refugee camp on the Syrian border with Jordan, killing two FSA fighters and injuring several others ...
    Jordanian border police responded by opening fire on the attacking vehicle, killing a child, according to the sources. A car bomb in January killed at least seven people at the same camp. Around 85,000 people live in the Rukban camp ... An earlier car bomb attack near the camp, in the hinterland border area on June 21 last year, resulted in the deaths of six Jordanian military personnel.

In order to report the total number of deaths in the article, one has to analyze how the different numbers in the story interact. One has to understand that the number of people living in the camp (“85,000”) is irrelevant for calculating the number of casualties. All numbers stating the number of victims need to be identified from all sentences in the article, like “two FSA fighters”, “a child”, “at least seven people”, etc. Finally, all such numbers need to be summed up.

All these challenges come under the umbrella of quantitative reasoning. In spite of the difficulties involved, human beings are adept at performing quantitative reasoning to understand natural language text. However, this is not true for automated language understanding systems. Relatively little work in Natural Language Processing has analyzed the use of quantities in text. Even in areas where we have relatively mature solutions, like information retrieval, we fail to deal with quantities; for example, one cannot search the financial media for “transactions in the 1-2 million pounds range.” In this dissertation, we investigate the challenges in performing automated quantitative reasoning. We formulate several tasks that we believe to be fundamental to addressing these challenges, and address the problem of developing robust natural language understanding methods for these tasks, using statistical machine learning.

1.2 Math Word Problems: An Abstraction for Quantitative Reasoning

In this thesis, we suggest that the vast majority of quantitative reasoning problems that occur frequently in text can be viewed as math word problems of the kind solved by elementary and middle school kids. Let us look at a simple math word problem which students solve in elementary school.

    Ryan has 72 marbles and 17 blocks. If he shares the marbles among 9 friends, how many marbles does each friend get?

Even in such a simple quantitative reasoning problem, there are several challenges involving multiple interactions among the numbers in the text. One needs to understand that the question asks for the number of “marbles” each “friend” gets, so the number of “blocks” should have no effect on the answer. One also needs to understand the impact of the associated verb “shares”, and that division is needed to calculate the correct answer (72 marbles shared among 9 friends gives 72/9 = 8 marbles per friend). Note that this bears clear resemblance to the challenges of quantitative reasoning discussed earlier.

There are also several advantages in focusing on math word problems. The language in these problems is relatively simple, and requires understanding of only a few related concepts. This helps us focus on quantitative reasoning, without addressing overly complicated NLP issues. These problems are readily available from textbooks and tutoring websites. As a result, developing benchmark datasets is relatively easy. Several pieces of work described in this thesis, particularly in the later chapters (Chapters 6, 7 and 8), have been evaluated on the ability to automatically solve math word problems. However, the solutions are applicable to general quantitative reasoning problems.
1.3 Challenges in Automated Quantitative Reasoning

Building automated systems for quantitative reasoning involves developing an information extraction system which detects key properties of a quantity, as well as developing a reasoning system which captures the interaction of several numbers in text. Developing such a system is challenging for several reasons, which we explain and illustrate in the following subsections.

1.3.1 Variability of Natural Language Text

An initial step in performing quantitative reasoning is to identify quantities mentioned in natural language text, and to understand what they mean and what they refer to. We address this by identifying the quantity and its unit, and representing them in a canonical form. This step is challenging because the same quantity can be expressed in several forms, a problem we refer to as variability. For example, the quantity “$13.6 million” can be written as “USD 13.6 million”, “13600000 dollars”, or “13600000$”. In addition, the same amount of money can be expressed in some other currency unit (like euros), which requires unit conversion to understand that it refers to the same amount of money. Another challenge is the effect of modifiers on quantities. Consider the difference between “more than 5 hours”, “5 hours”, and “around 5 hours”. Although all three phrases are similar in terms of lexical overlap, they represent different quantity values.

In order to capture the difficulties in detecting and normalizing numbers, and their impact on quantitative reasoning, we formulate a new task called Quantity Entailment (QE). This task draws its motivation from the general Recognizing Textual Entailment (RTE) problem [Dagan et al., 2013]. Given two sentences T and H, the task of QE is to determine whether a given quantity in H can be inferred from a given text snippet T. An example is shown below.

    T: A bomb in a Hebrew University cafeteria killed five Americans and four Israelis.
    H: A bombing at Hebrew University in Jerusalem killed nine people, including five Americans.

For this example, the output of QE is that the phrase “nine people” can be inferred from the phrase “five Americans and four Israelis”. We investigate this task in Chapter 4, where we discuss effective solutions to quantity detection and normalization, and ultimately show their effectiveness in determining quantity entailment.

1.3.2 Quantity Argument Detection

To reason with respect to a number, it is important to extract all information related to it from the sentence. For example, if a number indicates a transaction, one needs to know the donor and the recipient; if a number represents a relation between two entities, one needs to know what those entities are. Consider the following example:

    Emanuel’s campaign contributions total thrice those of his opponents put together.

To understand the sentence above, it is important to know that “thrice” denotes a relation between “Emanuel’s campaign contributions” and “the contributions of his opponents”. Often, not all the entities associated with a relation are explicitly mentioned in the sentence; they have to be inferred from the context. Consider the following sentence [Gun, 2015].

    On an average day this year, 36 Americans were killed by guns.

Here, one needs to use the understanding of “average”, and infer that the relation described involves the number of gun violence deaths on each day of the year.

In order to capture these challenges, we formulate the task of equation parsing.
The task takes as input a sentence representing a mathematical relation among at most two entities. The output of equation parsing is an equation which represents the relation expressed in the sentence. In addition to the equation, the output also provides a grounding for the variables in the equation; that is, it provides a mapping of the variables in the equation to text snippets indicating what the variables stand for. We discuss this problem in Chapter 5.

1.3.3 Reasoning about Multiple Quantities

Reasoning over multiple quantities often requires correct hierarchical composition of numbers with mathematical operations, as in the case of the following math word problem:

    Isabel picked 66 flowers for her friend’s wedding. She was making bouquets with 8 flowers in each one. If 10 of the flowers wilted before the wedding, how many bouquets could she still make?

In order to answer this question, 10 needs to be subtracted from 66, and only then can the result be divided by 8, giving (66 − 10)/8 = 7 bouquets. First, we have to decide the order in which the numbers need to be composed, and second, we have to decide the math operation for each combination. This kind of hierarchical composition poses a challenge for quantitative reasoning. To address this, we introduce novel decomposition strategies in Chapters 6 and 8. Our decomposition strategies allow reducing a math word problem with any number of quantities into simple quantitative reasoning problems over at most two quantities each. We utilize this decomposition to develop a system for automatically solving math word problems like the example above, and achieve state of the art results.

1.3.4 Incorporating Domain Knowledge

Domain knowledge often significantly influences the outcome of quantitative reasoning. A key reason humans do very well in quantitative reasoning is that they can easily use their understanding of the world and background knowledge to make inferences. For example, in the math word problem in Subsection 1.3.3, a child can easily infer that “wilt” should result in subtraction, since no one uses wilted flowers in bouquets. Also, since the unit of “8” is “flowers per bouquet”, it should be used in a division operation. Indeed, understanding of verb interaction and rate relationships seems essential to solving the problem. All this indicates that there is a need to effectively integrate domain knowledge into any quantitative reasoning system, which will be the focus of the last part of the thesis.

1.4 Contributions of the Thesis

In this thesis, we investigate each of the challenges mentioned above for automated quantitative reasoning. In particular, we make the following claim:

    The current NLP pipeline is not sufficient to extract an abstraction of text required to perform quantitative reasoning. However, quantitative reasoning over text can be facilitated by identifying quantities and units, and by integrating domain knowledge about mathematical relations into machine learning methods. To perform quantitative reasoning over multiple quantities, we can decompose the problem into simpler components, where each component is a simpler quantitative reasoning problem over at most two quantities.

The primary contributions of the thesis are summarized below:

1. We propose a statistical approach to quantity detection and normalization based on numbers, units and modifiers. We show that this is an effective approach to support simple quantitative reasoning.
2. We exemplify the drawbacks of current syntactic and semantic parsers in extracting the abstraction needed to support quantitative reasoning.

3. We exploit natural properties of equations to reduce the search space, and propose a pipeline architecture to efficiently and effectively predict mathematical relations expressed in a single sentence.

4. We propose a novel decomposition procedure to support quantitative reasoning over multiple mentions of numbers. We show that this decomposition allows simple reasoning over pairs of quantities to be composed effectively, to perform quantitative reasoning over multiple numbers across several sentences.

5. We introduce a structure called “unit dependency graphs” to represent the relationships between the units of various numbers in text, and the question one needs to answer using that text. We show that unit dependency graphs capture domain knowledge about unit compatibility and dimensional analysis, and can be effectively used to incorporate such domain knowledge in quantitative reasoning systems.

6. We propose a framework to integrate any form of declarative knowledge into word problem solving. We show that this declarative knowledge based method learns better models from limited data, and obtains the right abstraction even in the presence of biased data.

1.5 Overview of the Thesis

This section provides a guide for the rest of the thesis. The thesis can be broadly categorized into four logical segments.

Part I: Background

The first part of this thesis surveys background work in the areas of quantitative reasoning, structured learning and inference. These chapters do not provide exhaustive surveys of these research directions; we only go over relevant material here and include additional pointers as necessary. A reader familiar with machine learning and structured learning paradigms can simply skim these parts, and focus only on the background work in quantitative reasoning. Chapter 2 is devoted to this part.

Part II: Quantity Extraction, Normalization and Equation Parsing

This part of the thesis discusses work related to quantities at the mention or sentence level. We discuss effective approaches to extract and normalize quantities into a standard form. We also show how we can leverage certain natural properties of a sentence to extract mathematical relations between entities in the sentence. This part comprises two chapters.

• Chapter 4 discusses our approach to quantity extraction and normalization, and its application to textual entailment problems.

• Chapter 5 describes methods to analyze mathematical relations described within a single sentence, as well as to extract the relevant entities for the relation.

Part III: Solving Arithmetic Word Problems

This part focuses on quantitative reasoning problems which span multiple sentences. We found that math word problems form a natural abstraction of the challenges we usually encounter in cross-sentence quantitative reasoning. As a result, this part focuses on building robust solvers for math word problems. We do not, however, look at the entire range of word problems, which might require a wide range of background knowledge. We restrict our focus to arithmetic word problems; these are story problems which students solve in elementary school. The answers to these problems can usually be found by combining the numbers in the problem text with simple basic operations.
This part comprises three chapters:

• Chapter 6 describes a decomposition strategy which is key in applying quantitative reasoning over multiple numbers mentioned across several sentences. We show how this decomposition can be used to develop an end-to-end arithmetic word problem solver.

• Chapter 7 introduces a structure called “unit dependency graphs” to capture the relations among the units of different numbers mentioned in text. We show how these graphs can be used to incorporate domain knowledge about rate relationships, unit compatibility and dimensional analysis into word problem solving and general quantitative reasoning.

• Chapter 8 describes a framework to integrate any form of declarative knowledge into a word problem solver. We use the framework to integrate four common forms of domain knowledge for arithmetic word problems. We show that the declarative knowledge helps our system learn more effectively in the presence of limited data.

Part IV: Conclusion

The final part of the thesis offers concluding remarks. It summarizes the contributions of this dissertation and identifies directions for future research that can be built on top of this work. This can be found in Chapter 9.

Chapter 2

Background

Statistical machine learning approaches are widely used in many areas of natural language processing, in particular, in predicting semantic annotations and linguistic structures. In this chapter, we cover a few basic machine learning concepts which will be used in the rest of the thesis. We mainly focus on structured prediction, since this is the paradigm most widely used in this thesis. We introduce the problem of structured prediction via the example of part-of-speech tagging, and outline common approaches to solving this problem.

2.1 What is a Structure?

The concept of a structure is an important one in computational linguistics and machine learning. Burton-Roberts (1997) informally defines a structure as a complex object that is divisible into component parts, each of which can belong to different categories, have specific functions in the complete object, and are arranged in a specifiable way. We will use the task of part-of-speech tagging as a running example in this section, with the following sentence as an illustrative example:

    I have twenty dollars in my pocket .

The goal of part-of-speech tagging is to label each word in the sentence with one of 45 part-of-speech tags. Any valid output for the sentence above is a sequence of 8 part-of-speech tags, one for each word in the sentence. This sequence of 8 part-of-speech tags can be referred to as a structure.

2.2 Structured Prediction

Let us now formulate the part-of-speech tagging problem as a structured prediction problem. We denote the input sentence to the problem as x, and the output structure as y. In this case, the output structure y can be decomposed into a set of m variables y_1, y_2, ..., y_m, where m is the number of words in the input sentence x. Each variable y_i denotes the part-of-speech tag for the i-th word of the input sentence x. Therefore, each y_i can be assigned one of the 45 part-of-speech tags. Finally, we define X as the space of all possible inputs to a problem; here, it denotes the space of all sentences. We also define Y as the space of all possible outputs; in our example, this denotes the set of all sequences of part-of-speech tags. The structured prediction problem is to learn a mapping function f(x; w) from an input space X to an output space Y.
This mapping function is parameterized with a weight vector w over task-relevant features φ(x, y). Note that the feature function is defined over both the input x and the entire output structure y. This definition of a feature function allows us to design features that express the dependencies between the local output variables of an instance. In general, designing feature templates requires domain knowledge. For example, in our running example, it is common to use words in x conjoined with the corresponding part-of-speech tags in y as features. We use the weight vector w and the feature function φ(x, y) to define a scoring function for output structures. It often takes the following linear form:

    S(y; w, x) = w^T φ(x, y)

We want this scoring function to assign a higher score to the correct output structure, and a lower score to all other structures.

2.3 Inference in Structured Models

Suppose we have a weight vector w such that the scoring function S(y; w, x) scores the correct output structure higher than all other structures. How do we use such a scoring function to predict an output structure for an input x? This involves searching for the best structure according to the scoring function; this is referred to as the inference problem in structured prediction. Mathematically, this is computing the following:

    y* = argmax_{y ∈ Y} w^T φ(x, y)

where Y is the space of feasible output structures. Note that the part of Y one needs to search over is usually very large. Consider the case of our part-of-speech tagging example: for an input sentence of m words, the number of possible tag sequences is 45^m, since each word can have one of the 45 tags. As a result, naively enumerating all output sequences and checking their scores is not feasible.

There are three options for solving the inference problem in a reasonably efficient way. First, one can transform the problem into an Integer Linear Program (ILP), and then use an off-the-shelf ILP solver. This can still become intractable if the ILP contains a lot of variables. Another option is to make independence assumptions between different parts of the output structure. An example of this in our tagging task is to assume that the part of speech of the i-th word depends only on the i-th word and the part-of-speech tag of the (i−1)-th word. These assumptions are often valid for the problem, and lead to efficient inference algorithms. However, for the problems we address in this thesis, independence assumptions usually do not hold. Hence, we opt for the third option, which is to perform approximate inference using beam search.

The main idea behind beam search is to enumerate all possible assignments to the first output variable and score them, keeping only the top k assignments and discarding the rest. Next, for each retained assignment of the first variable, we enumerate all possible assignments for the second output variable, again keep the k highest scoring partial assignments, and discard the rest, and so on. Here k is referred to as the beam size. Let us see how we can perform beam search for the part-of-speech tagging example. We assign y_1 (the output variable for the first word) each of the 45 tags, and score each assignment. Scores of these partial assignments can be computed using only the features that depend on y_1. We keep the top k, move on to y_2, and so on. Although beam search is an approximate procedure, in practice it is very effective. The beam size k is a parameter, and can be chosen to be a small value if the output space for the problem is very large.
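The following is a minimal illustrative sketch of this beam search procedure for sequence tagging, not the thesis's actual implementation; the scoring function score is assumed to be given, and to evaluate a partial assignment using only features over the positions assigned so far.

    def beam_search(words, tags, score, k=5):
        # Return the highest-scoring tag sequence found with beam size k.
        # words: list of input tokens; tags: the tag inventory (e.g., 45 POS
        # tags); score: a function (words, partial_tags) -> float.
        beam = [([], 0.0)]  # (partial tag sequence, score), initially empty
        for i in range(len(words)):          # one output variable per word
            candidates = []
            for partial, _ in beam:
                for t in tags:               # enumerate all tags for position i
                    extended = partial + [t]
                    candidates.append((extended, score(words, extended)))
            # keep only the k highest-scoring partial assignments
            candidates.sort(key=lambda c: c[1], reverse=True)
            beam = candidates[:k]
        return beam[0][0]

Note that each round scores at most k × |tags| candidates, so the total work grows linearly in the sentence length rather than exponentially as in exhaustive enumeration.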
2.4 Learning Structured Prediction Models

Finally, we discuss how to obtain a good scoring function for output structures. For this problem, assume we have access to a set of n labeled training examples D = {(x_i, y_i)}, i = 1, ..., n. The learning problem is to find a weight vector w that scores the target output structure higher than all other structures. This means that we want the value of w^T φ(x_i, y_i) to be higher than w^T φ(x_i, y), where y is any other feasible output structure. There are several learning algorithms which can be used to obtain such a weight vector. In this thesis, we mostly use structural Support Vector Machines; we introduce them in the following subsection.

2.4.1 Structural Support Vector Machines

Structured SVMs are a generalization of the binary SVM algorithm [Cortes and Vapnik, 1995] to structured outputs [Tsochantaridis et al., 2005]. The goal of learning is to solve the following optimization problem:

    min_w  (1/2) w^T w + C Σ_i L(x_i, y_i, w)

where L(x_i, y_i, w) is the loss function for the structured prediction. The loss function is written as

    L(x_i, y_i, w) = l( max_{y ∈ Y} [ Δ(y_i, y) − w^T φ(x_i, y_i) + w^T φ(x_i, y) ] )

Here Δ signifies the quality of the structure y with respect to the annotated structure y_i. The function l(·) can be instantiated by loss functions like the hinge loss, with l(x) = max(0, x). Different algorithms have been proposed in the literature to solve this optimization problem [Joachims et al., 2009, Shalev-Shwartz et al., 2007, Chang et al., 2010]. In this thesis, we use the Illinois-SL package [Chang et al., 2015] with its dual coordinate descent based algorithms.

2.5 Discussion and Special Case

For learning and inference with structured models, we need to specify the following:

1. The feature representation of a structure y for an input x, denoted φ(x, y).
2. An inference algorithm that finds the best structure for an input x using a weight vector w.
3. A learning algorithm.

Note that both binary and multiclass classification problems are special cases of structured prediction. As a result, a structured predictor can be used for binary and multiclass classification. Here we describe how we can convert a multiclass classification problem into a structured problem. Let us assume a multiclass classification problem with K classes. Feature functions for multiclass problems are usually defined only on the input x; let us denote this as φ(x). To convert this into a structured problem, we consider the output space Y to be the set of class labels, and define the feature function φ(x, y) to be the set of features φ(x) conjoined with the label y. With these definitions, we can use the standard structured learning algorithm to do multiclass classification. In this thesis, we follow this procedure to perform multiclass classification with Illinois-SL.
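The following is a minimal illustrative sketch of this reduction, not Illinois-SL's API: conjoining φ(x) with a label y amounts to placing φ(x) in the block of a larger feature vector indexed by y, after which inference is a trivial enumeration over the K labels.

    import numpy as np

    def joint_feature(phi_x, y, num_classes):
        # phi(x, y): copy phi(x) into the y-th block of a d*K vector,
        # i.e., the features of x conjoined with the label y.
        d = phi_x.shape[0]
        out = np.zeros(d * num_classes)
        out[y * d:(y + 1) * d] = phi_x
        return out

    def predict(phi_x, w, num_classes):
        # Inference: the output space is just the K labels, so the argmax
        # can be computed by direct enumeration.
        scores = [w @ joint_feature(phi_x, y, num_classes)
                  for y in range(num_classes)]
        return int(np.argmax(scores))

A weight vector learned with the structured SVM objective above can then be used directly with predict; each label's score is simply w^T φ(x, y).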
2.6 Structured Prediction with Latent Variables

Until now, our structured prediction framework has had only two types of variables: input variables x and output variables y. Both of these are observed, that is, they are available as part of our training data. However, oftentimes important modeling information is not provided as part of the training data. For example, for the task of machine translation from English to French, the training data comprises English sentences paired with corresponding translated French sentences. There are several intermediate variables which govern the translation of an English sentence to French, like the ones indicating word level or phrase level alignments. These alignment variables are not provided as part of the training data. Even though these variables are not provided, it is important to model them in order to capture the process of translation. In such cases, they are modeled as latent variables h ∈ H, where H is the set of feasible assignments to the hidden variables.

In order to extend the structured prediction formulation to the case of hidden variables, we redefine the feature function to also use the hidden variables (written as φ(x, h, y)). The inference problem is now to compute

    (h*, y*) = argmax_{(h, y) ∈ H × Y} w^T φ(x, h, y)

This remains similar to the earlier case; we simply have to think of the output structure as a tuple comprising both the output and hidden variables. For learning, we again assume access to n labeled examples D = {(x_i, y_i)}, i = 1, ..., n. The learning problem is to find a weight vector w that scores the target output structure higher than other structures, where the score of a structure y is now given by

    max_{h ∈ H} w^T φ(x, h, y)

In this thesis, we mostly use structural SVMs with latent variables [Yu and Joachims, 2009]. This involves solving the following optimization problem:

    min_w [ (1/2) w^T w + C Σ_i max_{(ĥ, ŷ) ∈ H × Y} ( w^T φ(x_i, ĥ, ŷ) + Δ(y_i, ĥ, ŷ) ) ] − [ C Σ_i max_{h ∈ H} w^T φ(x_i, h, y_i) ]

where Δ(·) is again a loss function indicating the dissimilarity between ŷ and y_i. See [Yu and Joachims, 2009] for details of the algorithm used to solve the above optimization.

Chapter 3

Early Work in Quantitative Reasoning

Work on quantitative reasoning problems in NLP dates back to the 1960s. Over the years, several efforts have been made to automatically solve different types of quantitative problems described in natural language text. Most of these systems were essentially rule based, and imposed strong restrictions on the input text. Only recently have statistical methods been used in this context, with a focus on achieving robustness in handling a wide range of natural language text. In this chapter, we give an overview of these early systems, from the 1960s to the present day.

3.1 Early Algebra and Arithmetic Solvers

Several efforts were made to automatically solve arithmetic and algebra word problems. We describe a few of them in the following subsections.

3.1.1 STUDENT Program

The STUDENT program, written by [Bobrow, 1964], was the first system that tried to solve algebraic problems stated in natural language. It accepts input in a restricted subset of English and tries to solve the algebraic problem expressed in it. An example input to the STUDENT system is:

    If the number of customers Tom gets is twice the square of 20% of the number of advertisements he runs, and the number of advertisements he runs is 45, what is the number of customers Tom gets?

The system parses individual sentences to extract mathematical relations among variables. It identifies variables by several keywords, and considers the words around these keywords as identifiers for the variables. For example, “the number of customers Tom gets” is a variable where the keyword is “number”. The system cannot handle coreference; hence, multiple uses of the same variable require the exact same phrase to be mentioned again.
In the above example, both the variable phrases “the number of customers Tom gets” and “the number of advertisements he runs” are repeated verbatim when the same variable is used again. Applying a set of rules recursively, complex sentence constructs are simplified into simpler forms free of conjunctions. Certain substitutions (e.g., “2 times” for “twice”) and truncations (e.g., “square of” to “square”) are performed until only the phrases representing variables remain, and these are related with basic operators (like plus, times, percent, etc.) to form a set of equations representing the problem. After these transformations, the above problem becomes:

    (EQUAL X00001 (NUMBER OF CUSTOMERS TOM GETS))
    (EQUAL (NUMBER OF ADVERTISEMENTS HE RUNS) 45)
    (EQUAL (NUMBER OF CUSTOMERS TOM GETS) (TIMES 2 (EXPT (TIMES 0.2 (NUMBER OF ADVERTISEMENTS HE RUNS)) 2)))

Finally, standard techniques for solving systems of equations are used to get the answer.

3.1.2 Other Arithmetic Word Problem Solvers

Another system, called WORDPRO, was developed by [Fletcher, 1985] for solving arithmetic word problems. WORDPRO was the first to introduce the concept of a set “schema” [Marshall, 1995] as the basis for constructing a problem model. Schemas can be thought of as categories for arithmetic word problems; each schema also has several attributes associated with it. For example, a “change” schema represents a relation among a start-set (“Joe had 3 marbles”), a transfer-set (“Then Tom gave him 5 marbles”) and a result-set (“How many marbles does Joe have now?”). Table 3.1 lists instantiations of the three schemas used in WORDPRO. They also used a categorization of these schemas based on which component of the schema the question asks about. The program solves a problem guided by a list of rules which are applied sequentially. A drawback of WORDPRO was that it did not accept natural language text as input; it assumed that the input problem had already been mapped to a set of propositions.

Two other programs, namely CHIPS [Diane J. Briars, 1984] and ARITHPRO [Dellarosa, 1986], could also solve one-step arithmetic word problems with only one possible operation (addition/subtraction). Both CHIPS and ARITHPRO categorize simple word problems into the same three categories: compare, combine and change. But for both these models, rigid limitations were imposed on the change verb (only “to give”) and on the order of the problem sentences (the first sentence of the problem must describe the number of objects the owner had to begin with, whereas the second sentence must contain the verb “gave”).

Table 3.1: Schemas used in WORDPRO

Schema  | Instantiation
--------|--------------
Change  | John has 3 apples. Then he gave 2 apples to Susan. How many apples does John have now?
Combine | Adam has 3 apples. Sarah has 6 apples. How many apples do they have altogether?
Compare | Dennis has 6 apples. Fred has 5 apples. How many apples does Fred have less than Dennis?

[Bakman, 2007] developed a more advanced arithmetic problem solver called ROBUST, which could understand multi-step arithmetic word problems, even ones containing extraneous information. An example of a problem handled by ROBUST is as follows:

    Ruth had 5 nuts more than Dan had. Ruth gave Dan 3 nuts. Dan gave 2 nuts to David. Now Dan has 4 nuts and David has 6 nuts. How many nuts does Ruth have now?

The concept of “schema” as used in the earlier works is adopted, with a further expansion of the “change” schemas into six distinct categories, as against only two used earlier.
This fine-grained categorization of change helped ROBUST detect irrelevant numbers in a problem. The ROBUST simulation works by first parsing the problem text to split all sentences into propositions, or simple sentences, like “Ruth had 5 nuts more than Dan had”, “Ruth has ? nuts”, etc. for the above problem. The propositions having complex change verbs like “give” are further split into elementary ones (e.g., “Ruth gave Dan 3 nuts” is split into “Ruth forfeited 3 nuts” and “Dan got 3 nuts”). Then the change schema relations are applied to the elementary propositions by substituting the constant values for the corresponding variables. Although it handles irrelevant numbers, ROBUST still restricts the operations to addition and subtraction.

A detailed discussion of these systems can be found in [Mukherjee and Garain, 2008]; indeed, many of the examples above were taken from that survey. All the systems described above have different restrictions on the types of problems they handle. Also, they do not perform evaluation on any benchmark datasets, and some of the systems are not publicly available. As a result, it is impossible to compare these systems.

3.2 Solvers for other Domains

Several systems were also developed for related fields like geometry, physics, chemistry and calculus. Since they are less relevant for this dissertation, we only give a brief overview of these systems.

• The CARPS program, i.e., Calculus Rate Problem Solver [Charniak, 1968], written by Eugene Charniak, reads and solves calculus rate problems stated in English.

• The HAPPINESS system was developed by [Gelb, 1971] to solve simple probability questions.

• The NEWTON program written by [de Kleer, 1977] is an expert problem solver in the domain of mechanics, specifically relating to the kinematics of objects moving on surfaces.

• The MECHO program, developed in Prolog by [Bundy et al., 1979], solves mechanics problems stated in English in the areas of pulley problems, statics problems, motion on smooth complex paths, and motion under constant acceleration.

• The ISAAC program written by [Novak, 1976] can read, understand, solve and draw pictures of physics problems stated in English.

• ALBERT, developed by [Oberem, 1987], is an intelligent tutoring (CAI) system that not only understands and solves physics (more specifically, kinematics) problems but can also teach a student how to solve them.

3.3 Recent Advances in Statistical Solvers

All the methods discussed earlier in this chapter use some form of rule based system to arrive at the output equation or answer. However, hand-engineered rules have an obvious drawback: they do not provide robustness to the variability of input text. Indeed, most of the early systems constrained their input to a few specific forms, which allowed them to develop transformation rules for problem solving. Recently, there has been a renewed interest in solving quantitative reasoning problems. It started in 2014 with the template based system of Nate Kushman [Kushman et al., 2014], which focused on solving algebra word problems. This was followed by the Aries system of the Allen Institute for Artificial Intelligence (AI2), which focused on solving addition and subtraction problems. All recent methods (including the work done in this thesis) use some form of machine learning to achieve robustness, something which the earlier systems lacked.
However, it should be noted that some of the challenges identified and ideas introduced by the early works are still relevant for building robust statistical methods. Many of these ideas are used alongside statistical methods in this thesis.

Chapter 4 Quantity Entailment

4.1 Introduction

In this chapter, we investigate the problem of reasoning with respect to quantities using only their local context. Consider the textual inference in Example 4.1, which we present as a Textual Entailment query [Dagan et al., 2013].

Example 4.1
T: A bomb in a Hebrew University cafeteria killed five Americans and four Israelis.
H: A bombing at Hebrew University in Jerusalem killed nine people, including five Americans.

To determine whether H is implied or entailed by T, we need to identify the quantities and the units in “five Americans” and “four Israelis”, as well as use the fact that “Americans” and “Israelis” are “people”. All these inferences can be made by processing the local context of the quantities. In this chapter, we describe some key steps necessary to facilitate reasoning about quantities with local context. We first describe a system developed to recognize quantities in free-form text, infer the units associated with them, and convert them to a standardized form. For example, in Example 4.2

About six and a half hours later, Mr. Armstrong opened the landing craft’s hatch.

we would like to extract the number 6.5 and the corresponding unit, “hour”, and also determine that the quantity describes an approximate figure, not an exact one. One of the difficulties is that any noun or noun phrase can be a unit, and inferring units requires analyzing contextual cues and local sentence structure. As we show, in some cases deeper NLP techniques are required to support that.

We then develop a reasoning framework for quantities that we believe can play an important role in general-purpose textual inference. As an evaluation of local context based reasoning, we isolate the quantity reasoning component of the RTE task, and formulate Quantity Entailment (QE) – the task of determining whether a given quantity can be inferred from a given text snippet. We then describe our approach to solving it. As an additional evaluation, we also show the effectiveness of our system on an application of QE, a search for ranges of currency values. Given a query range, say from 1 million USD to 3 million USD, we want to find all mentions of money with values in this range. Using standard search engine technology to query all values in the range, in the various forms they could be expressed, is not feasible. Instead, we use our proposed approach to extract monetary mentions from text and normalize them, and then use QE to verify them against the query.

We develop and annotate datasets¹ for evaluation, and show that our approach handles the aforementioned reasoning tasks quite well. The next section presents related work on quantities and reasoning. We then formally define a quantity and describe our knowledge representation. The following sections describe quantity extraction and standardization. We next present the formulation of Quantity Entailment, and describe our reasoning framework for it.

¹The datasets are available for download at http://cogcomp.cs.illinois.edu/page/resource_view/95. The related software is available at http://cogcomp.cs.illinois.edu/page/software_view/Quantifier.
4.2 Related Work

Quantities in Textual Entailment Quantities have been recognized as an important part of a textual entailment system [de Marneffe et al., 2008, Maccartney and Manning, 2008, Sammons et al., 2010], and [de Marneffe et al., 2008] claim that discrepancies in numbers are a common source of contradictions in natural language text. The authors describe a corpus of real-life contradictory pairs from multiple sources such as Wikipedia and Google News, in which they found that 29% of the contradictions were due to numeric discrepancies. In addition, they analyzed several Textual Entailment datasets [Dagan et al., 2006] and found that numeric contradictions constitute 8.8% of contradictory entailment pairs.

Monotonicity in Reasoning The reasoning framework described in this chapter borrows ideas from monotonicity: reasoning about quantities often depends on reasoning about monotonicity. The role of monotonicity in NL reasoning has been described in [Barwise and Cooper, 1981]. The authors categorize noun phrases as upward or downward monotonic, and also detect constructs where monotonicity depends on context. The large role of monotonicity in reasoning motivated attempts to reason directly at the surface level [Purdy, 1991], rather than first converting to logical forms. Our approach advocates this direction too.

Representation of Quantities [Kuehne, 2004a] investigates the various ways in which physical quantities are represented in descriptions of physical processes. Later, in [Kuehne, 2004b], a system extracting Qualitative Process Theory [Forbus, 1984] representations is implemented for a controlled subset of the English language. While these approaches do not scale to unrestricted English, they have influenced the quantity representation that we use.

4.3 Representing Quantities

In general, a quantity refers to anything which is measurable. Our representation of quantities is influenced by the one proposed in [Forbus, 1984], but we propose a simpler version of their Qualitative Process theory:

Definition (Quantity-Value Representation) In Quantity-Value Representation (QVR), a quantity is represented as a triple (v, u, c), where the constituents of the triple correspond, respectively, to:

1. Value: a numeric value, range, or set of values which measure the aspect, e.g., more than 500, one or two, thousands, March 18, 1986. The value can also be described symbolically (e.g., “below the freezing point”). We do not store surface forms explicitly, but convert them to a set or range. For example, “more than 500” is stored as the range (500, +∞). Details of these conversions are given in Section 4.4.2.

2. Units: a noun phrase that describes what the value is associated with, e.g., inches, minutes, bananas. The phrase “US soldiers” in the phrase “Five US soldiers” is a unit.

3. Change: specifies how the parameter is changing, e.g., increasing. This constituent often serves as an indication of whether or not the value is relative to another. For example, “She will receive an [additional 50 cents per hour]”, “The stock [increased 10 percent]”, “Jim has [5 balls more] than Tim”.
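To make the representation concrete, the following is a minimal Python sketch of a QVR triple; the class and field names are ours, not the system’s, and the value is normalized to a numeric range as described above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class QVR:
    """Quantity-Value Representation: the triple (v, u, c)."""
    value: Tuple[float, float]     # normalized to a range; "more than 500" -> (500.0, inf)
    units: str                     # e.g. "hour", "US$", "US soldiers"
    change: Optional[str] = None   # e.g. "+" for "additional 50 cents per hour"

# Example 4.2: "About six and a half hours later ..." -- an approximate value,
# widened to a range using the c = 0.2 rule described in Section 4.4.2.
q = QVR(value=(6.5 * 0.8, 6.5 * 1.2), units="hour")
print(q)   # value is approximately (5.2, 7.8), units "hour"
```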
4.4 Extraction of Quantities

In this section we describe the first component of our approach: identifying quantities and units in text and standardizing their representation. We use a two-step approach to extract quantities from free-form text.

1. Segmentation This step takes raw text and finds segments of contiguous text which describe quantities.

2. Standardization Using the phrases extracted in the previous step, we derive the QVR.

An overview of our method is given in Algorithm 1.

Algorithm 1 QuantityExtraction( T )
Input: Text T
Output: Set of Quantity-Value triples extracted from T
1: Q ← ∅
2: S ← Segmentation( T )
3: for all segments s ∈ S do
4:   q ← Standardization( s )
5:   if unit of q not inferred then
6:     q ← InferUnitFromSemantics( q, s, T )
7:   end if
8:   Q ← Q ∪ { q }
9: end for
10: return Q

We model the segmentation step as a sequence segmentation task because quantities often appear as segments of contiguous text. We adapt and compare two approaches that were found successful in previous sequential segmentation work in NLP:

1. A Semi-CRF model [Sarawagi and Cohen, 2004], trained using a structured Perceptron algorithm [Collins, 2002] with parameter averaging [Freund and Schapire, 1998].

2. A bank of classifiers approach [Punyakanok and Roth, 2001] that we retrain with a new set of features.

The same feature set was used for both approaches. Despite the additional expressive power of CRFs, we found that the bank of classifiers (which is followed by a simple and tractable inference step) performs better for our task, and also requires significantly less computation time.

4.4.1 Features

For each token xi in the input sequence we extract the following features:

1. Word class features: whether xi appears in a list of known scientific units (e.g., meters, Fahrenheit), written numbers (e.g., two, fifteen), names of months, days of the week, miscellaneous temporal words (e.g., today, tomorrow), currency units, etc.

2. Character-based features: whether xi contains a digit, is all digits, or has an ordinal suffix (st, nd, rd, th).

3. Part of speech tags: we use the Illinois POS Tagger [Roth and Zelenko, 1998].

4. Most of the features were generated from a window of [−3, 3] around the current word. Additional features were generated by conjoining them with offset values from the current word.

4.4.2 Mapping Text Segments into QVR

We develop a rule-based standardization step that is informed, as needed, by deeper NL processing, including semantic role labeling (SRL, [Palmer et al., 2010]) and coreference resolution. Some key steps of this procedure are as follows:

1. Convert written numbers to floating point: e.g., three thousand five hundred twenty → 3520.0
2. Convert dates to an internal date type: e.g., March 18th → Date(03/18/XXXX)
3. Replace known names for ranges: e.g., teenage → [13, 19] years-old.
4. Convert all scientific units to a standard base unit: e.g., 1 mile → 1609.344 meters.
5. Replace non-scientific units with WordNet synsets.
6. Rewrite known unit variants to a standard unit: e.g., USD, US$, dollars → US$.
7. Standardize changing quantities: e.g., “additional 10 books” → +10 [book].
8. Extract bounds: we use a list of phrases such as “more than”, “less than”, “roughly”, “nearly”. By default, if no bound keyword is present, we assume the bound is “=”.
9. Modify the value using bounds: we convert values which have a bound to a range of values. Scalar implicature is taken into consideration here. Consider the sentence “John bought 10 books.” Although buying 5 books could be seen as a corollary of buying 10, in this case we make the assumption that exactly 10 books, and not 5, were purchased. See Section 4.5.2 for a discussion of the subject.

We use the following rules, where v is the value extracted before using bound information:
• ≤ v → (−∞, v], and similarly for ≥, <, >.
• = v → {v}
• ≈ v → [v − c·v, v + c·v]; we use c = 0.2.
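To make the standardization concrete, here is a minimal sketch of the bound-to-range conversion; function and variable names are ours, not the system’s, and openness of interval endpoints is not tracked in this sketch.

```python
import math

def bound_to_range(bound: str, v: float, c: float = 0.2):
    """Convert a (bound, value) pair into a numeric range, per the rules above."""
    if bound in ("<=", "<"):
        return (-math.inf, v)        # openness at v ignored here
    if bound in (">=", ">"):
        return (v, math.inf)
    if bound == "=":
        return (v, v)                # the singleton {v}
    if bound == "~":                 # "roughly", "nearly", "about"
        return (v - c * v, v + c * v)
    raise ValueError(f"unknown bound: {bound}")

print(bound_to_range(">", 500))      # "more than 500"  -> (500, inf)
print(bound_to_range("~", 6.5))      # "about six and a half" -> roughly (5.2, 7.8)
```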
4.4.3 Extraction of Units

In most cases, the unit related to a numeric value appears adjacent to it. For example, in the sentence “There are two books on the table”, the unit “book” follows “two”. The sequence segmentation groups these words together, from which it is easy to extract the unit. However, in some cases, a better understanding of the text is needed to infer the unit. Consider the following example:

Example 4.3 A report from UNAIDS, the Joint United Nations Program on HIV/AIDS, released on Tuesday, shows the number of adults and children with HIV/AIDS reached 39.4 million in 2004.

Here, we need to know that “39.4 million” refers to “the number of adults and children with HIV/AIDS”. Also, in:

Example 4.4 The number of member nations was 80 in 2000, and then it increased to 95.

we need to know that the pronoun “it” refers to “the number of member nations”.

Our standardization is a sequential process. In case the first step described above fails to extract units, we make use of deeper processing of the sentence to accomplish that (see an evaluation of the contribution of this step in the experimental section). These steps are denoted by the function InferUnitFromSemantics() in Algorithm 1. We apply coreference resolution to identify pronoun referents, and then apply a semantic role labeler to recognize which terms are associated with the quantity and can be potential units. In the case of Example 4.3, the SRL tells us that for the verb “reached”, the associated subject is “the number of adults and children with HIV/AIDS” and the object is the mention “39.4 million”. Hence, we conclude that the subject is a candidate for the unit of “39.4 million”. For the purpose of entailment, we keep the entire set of possible word chunks which are linked by the SRL to our quantity mention as candidate units.

Since most units are found in positions adjacent to the numeric mention, we optimize runtime by applying the SRL and coreference resolver only when the segmented chunk does not have adequate information to infer the unit. We use the Illinois Coreference Resolver [Bengtson and Roth, 2008, Chang et al., 2013] and the Illinois SRL [Punyakanok et al., 2008] for coreference and semantic role labeling, respectively.

4.5 Quantity Entailment

In this section we describe our approach to quantitative reasoning using the local context of quantities. We first formulate the task of Quantity Entailment, and then describe our reasoning framework.

Definition (Quantity Entailment) Given a text passage T and a Quantity-Value triple h = (vh, uh, ch), Quantity Entailment is a 3-way decision problem:
1. entails: there exists a quantity in T which entails h.
2. contradicts: no quantity in T entails h, but there is a quantity in T which contradicts h.
3. no relation: there exists no quantity in T which is comparable with h.

Since the decision is about a single Quantity-Value triple, we believe this tests the capability to capture local contextual cues of the quantity. The task can also be seen as a building block for a general textual entailment system. The need to identify sub-problems of textual inference, in the context of the RTE task, has been motivated by [Sammons et al., 2010]; Quantity Entailment can be considered one such step.
Since we envision that our QE module will be one module in an RTE system, we expect the RTE system to provide it with some control information. For example, it is often important to know whether the quantity is mentioned in an upward or downward monotonic context. Since we are evaluating our QE approach in isolation, we will always assume upward monotonicity, which is far more common. Monotonicity has been modeled with some success in entailment systems [Maccartney and Manning, 2008], providing a clear and intuitive framework for incorporating an inference resource like the Quantity Entailment module into a full textual entailment system.

4.5.1 Reasoning Framework

Our Quantity Entailment process has two phases: Extraction and Reasoning. In the Extraction phase, we take a text passage T and extract Quantity-Value triples (value, units, change) from it. In the Reasoning phase, we apply a lightweight logical inference procedure to the triples extracted from T to check whether h can be derived. Two types of rules are applied in the Reasoning phase: Implicit Quantity Productions and Quantity Comparisons. The combination of these rules provides good coverage for the QE task.

Quantity Comparison Quantity Comparison compares a quantity t = (vt, ut, ct) extracted from T with the quantity h = (vh, uh, ch), and decides whether h can be derived via some truth-preserving transformation of t. There are three possibilities: (t entails h), (t contradicts h), or (t has no relation with h). An overview is given in Algorithm 2, which is designed under our assumption that entailing quantities should respect upward monotonicity. This requires monotonicity verification of both units and values.

In order for a quantity to contradict or entail another, their units must be comparable. Determining the comparability of scientific units is direct, since they form a closed set. Comparing non-scientific units is more involved. The inference rule used here is as follows: if the syntactic heads of the unit phrases match (i.e., there is an Is-A or synonymy relation in either direction), then the phrases are comparable. These comparisons are encoded in a function comparableUnits(ut, uh), which returns true if the units ut and uh are comparable, and false otherwise. If the units are comparable, the direction of monotonicity (i.e., the direction of the Is-A relation between the heads and the effects of any relevant modifiers) is verified. The function checkMonotonicityOfUnits(ut, uh) returns true if ut is more specific than uh, and false otherwise. To compute the Is-A and synonymy relations we use WordNet [Miller et al., 1990], an ontology of words which contains these relations. We also augment WordNet with two lists from Wikipedia (specifically, lists of nationalities and jobs).

Next, we check whether the values of the compared quantities obey the monotonicity assumption; we say that vt is more specific than vh if vt is a subset of vh. (Note that vt and vh are both represented as sets, and hence checking the subset relation is straightforward.) For example, “more than 50” ⊆ “at least 10”. This rule also applies to dates, e.g., “03/18/1986” ⊆ “March 1986”. Respecting scalar implicature, we assume that “5” is a subset of “less than 10”, but “10” is not. As in the case of units, we use a function checkMonotonicityOfValues(vt, vh) which returns true if vt is more specific than vh, and false otherwise.
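On interval-normalized values, the monotonicity test reduces to interval containment. A minimal sketch, with names that are ours and endpoint openness ignored:

```python
import math

def check_monotonicity_of_values(vt, vh):
    """True iff vt is more specific than vh, i.e., the interval vt ⊆ vh.
    Values are (lo, hi) intervals; an exact value v is stored as (v, v)."""
    (t_lo, t_hi), (h_lo, h_hi) = vt, vh
    return h_lo <= t_lo and t_hi <= h_hi

# "more than 50" ⊆ "at least 10": (50, inf) ⊆ [10, inf)
print(check_monotonicity_of_values((50, math.inf), (10, math.inf)))   # True
# "5" ⊆ "less than 10"
print(check_monotonicity_of_values((5, 5), (-math.inf, 10)))          # True
```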
A quantity which represents some form of change of a quantity cannot be derived from a quantity which does not represent change, and vice versa. We set ct = true if t denotes a change in a quantity, and ct = false otherwise.

Algorithm 2 QuantityComparison( t, h )
Input: Quantity-Value triples t = (vt, ut, ct) and h = (vh, uh, ch)
Output: Returns whether t entails, contradicts or has no relation with h
1: if ct ≠ ch then
2:   return no relation
3: end if
4: if comparableUnits( ut, uh ) = false then
5:   return no relation
6: end if
7: if checkMonotonicityOfUnits( ut, uh ) = true then
8:   if checkMonotonicityOfValues( vt, vh ) = true then
9:     return entails
10:  end if
11: end if
12: return contradicts

Implicit Quantity Production Rules There are many relationships among quantities which can be the source of implicit information. The following is an incomplete, but relatively broad coverage, list of common patterns:

1. Range may imply duration; e.g., “John lived in Miami from 1980 to 2000” implies that John lived in Miami for a duration of 20 years.
2. Compatible terms may be combined and abstracted. The sentence “I bought 3 bananas, 2 oranges, and 1 apple” implies that 6 fruits were purchased.
3. Ratios can imply percentages. The sentence “9 out of the 10 dentists interviewed recommend brushing your teeth” implies that 90% of the dentists interviewed recommend brushing.
4. Composition: quantities and units may sometimes be composed. Consider the following examples: the phrase “six Korean couples” means that there are 12 people; the phrase “John gave six 30-minute speeches” implies that John spoke for 180 minutes.

The rules used for producing implicit quantities employed in our system are the following:
• (a ratio b): if a is a percentage, multiply its value by the value of b to obtain a new quantity with the units of b.
• (a ratio b): if a is not a percentage, divide its value by the value of b to obtain a new quantity with the units of b.
• (a range b): take the difference of the two values to obtain a new quantity with the appropriate change of units; e.g., time-stamp minus time-stamp results in units of time.

Algorithm 3 QuantityEntailment( T, h )
Input: Text T and a Quantity-Value triple h = (vh, uh, ch)
Output: Returns whether T entails, contradicts or has no relation with h
1: Q ← QuantityExtraction( T )
2: Q′ ← GenerateImplicitQuantities( Q )
3: Q ← Q ∪ Q′
4: contradict ← false
5: for all Quantity-Value triples q ∈ Q do
6:   if QuantityComparison( q, h ) = entails then
7:     return entails
8:   end if
9:   if QuantityComparison( q, h ) = contradicts then
10:    contradict ← true
11:  end if
12: end for
13: if contradict = true then
14:   return contradicts
15: else
16:   return no relation
17: end if

Lightweight Logical Inference The QE inference procedure simply applies each of the implicit quantity production rules to the Quantity-Value triples extracted from the passage T, until no more quantities are produced. It then compares each quantity t extracted from T with the quantity h, according to the quantity comparison rules described in Algorithm 2. If any quantity in T entails h, then “entails” is reported; if no quantity in T can explain h, but there exists one which contradicts h, then “contradicts” is reported; otherwise “no relation” is reported. The complete approach to Quantity Entailment is given in Algorithm 3.
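Algorithms 2 and 3 translate directly into code. The following condensed sketch is ours: the toy ISA facts stand in for the WordNet-based helpers, and extraction and implicit-quantity production are assumed to have already produced the list of triples.

```python
# Toy Is-A facts standing in for WordNet (augmented with Wikipedia lists).
ISA = {("American", "person"), ("Israeli", "person")}

def is_a(u1, u2):
    return u1 == u2 or (u1, u2) in ISA

def comparable_units(ut, uh):
    return is_a(ut, uh) or is_a(uh, ut)

def check_monotonicity_of_units(ut, uh):
    return is_a(ut, uh)                           # ut at least as specific as uh

def check_monotonicity_of_values(vt, vh):
    return vh[0] <= vt[0] and vt[1] <= vh[1]      # interval containment

def quantity_comparison(t, h):
    """Algorithm 2: compare triples t = (vt, ut, ct) and h = (vh, uh, ch)."""
    (vt, ut, ct), (vh, uh, ch) = t, h
    if ct != ch or not comparable_units(ut, uh):
        return "no relation"
    if check_monotonicity_of_units(ut, uh) and check_monotonicity_of_values(vt, vh):
        return "entails"
    return "contradicts"

def quantity_entailment(triples, h):
    """Algorithm 3, over pre-extracted (and implicitly produced) triples."""
    contradict = False
    for t in triples:
        d = quantity_comparison(t, h)
        if d == "entails":
            return "entails"
        contradict = contradict or d == "contradicts"
    return "contradicts" if contradict else "no relation"

# Example 4.1: does T entail the hypothesis quantity "five Americans",
# read here with the more general unit "person"?
T = [((5, 5), "American", False), ((4, 4), "Israeli", False)]
print(quantity_entailment(T, ((5, 5), "person", False)))   # entails
```

Checking “nine people” in Example 4.1 additionally requires the implicit-production step (combining 5 Americans and 4 Israelis), which is omitted from this sketch.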
4.5.2 Scope of QE Inference

Our current QE procedure is limited in several ways. In all cases, we attribute these limitations to subtler and deeper language understanding, which we delegate to the application module that will use our QE procedure as a subroutine. Consider the following examples:

T: Adam has exactly 100 dollars in the bank.
H1: Adam has 50 dollars in the bank.
H2: Adam’s bank balance is 50 dollars.

Here, T implies H1 but not H2. However, for both H1 and H2, QE will infer that “50 dollars” is a contradiction to sentence T, since it cannot make the subtle distinction required here.

T: Ten students passed the exam, but six students failed it.
H: At least eight students failed the exam.

Here again, QE will only output that T implies “At least eight students”, despite the second part of T. QE reasons about the quantities, and there needs to be an application-specific module that understands which quantity is related to the predicate “failed”.

There are also limitations regarding inferences with respect to events that occur over a period of time. From “It was raining from 5 pm to 7 pm” one needs to infer that “It was raining at 6 pm”, although “6 pm” is more specific than “5 pm to 7 pm”. There is a need to understand the role of associated verbs and entities, and the monotonicity of the passages, to infer the global entailment decision. Many of these challenges will be handled in later chapters of the thesis.

4.6 Experimental Study

In this section, we seek to validate our proposed modeling. We evaluate our system’s performance on three tasks: Quantity Segmentation, Quantity Entailment and Currency Range Search. We do not directly evaluate our system’s ability to map raw text segments into our representation; instead, we evaluate this capability extrinsically, in the context of the aforementioned tasks, since good standardization is necessary to perform quantitative inference.

4.6.1 Datasets

QE: Since no adequately annotated corpus exists for this task, we used two collections to evaluate our system:

1. Sub-corpus of the RTE Datasets We chose text-hypothesis pairs from the RTE2–RTE4 datasets which have quantity mentions in the hypothesis. Overall, we selected 384 text-hypothesis pairs with quantities in the hypothesis.

2. Newswire Text 600 sentences of newswire text were selected, all containing quantity mentions.

Both datasets were manually annotated with the phrase boundaries of quantity mentions, with an inter-annotator agreement of 0.91. We restricted annotation to contiguous segments of text. No instances of implicit quantities were annotated. We also did not annotate these mentions with QVRs. Limiting the annotations to contiguous spans of text results in a few instances of quantities with missing information, such as missing or ambiguous units, and several range and ratio relationships were not annotated (e.g., we do not annotate the range expressed in “from [5 million] in [1995] to [6 million] in [1996]”, but do so in “[from 5 million to 6 million]”).

In the RTE sub-corpus we also annotated entailment pairs with information about which quantities entail, in addition to the boundary information. Each quantity in the hypothesis was labeled as either “entails”, “no relation”, or “contradicts”, with an inter-annotator agreement of 0.95. There were 309 entailing quantities, 71 contradicting quantities and 56 quantities which were unrelated to the corresponding text.
We also maintained information about general entailment, that is, whether the hypothesis can be explained by the text. An annotated RTE example is shown below.

Annotation Example for RTE sub-corpus
T: A bomb in a Hebrew University cafeteria killed [five Americans] and [four Israelis].
H: A bombing at Hebrew University in Jerusalem killed [nine people], including [five Americans].
“nine people”: entails
“five Americans”: entails
Global entailment decision: entails

Although we limit our scope to inferring the entailment decision for individual quantities mentioned in the hypothesis, we hope to see future approaches use these individual decisions and combine them appropriately to obtain the global entailment decision.

Currency Search We developed a new dataset for evaluating currency search. Queries for various amounts of money, like “1000$”, “USD 2 million”, etc., were made on a search engine, and paragraphs containing monetary mentions were taken from the top search results. We collected 100 paragraphs containing various mentions of monetary values, and labeled them with the amounts mentioned in them. We restricted the denominations to US dollars. The inter-annotator agreement was 0.98.

4.6.2 Quantity Segmentation

We evaluate the phrase boundary recognizer on the annotated RTE and newswire datasets described in the previous section, using the phrase-based F1 score. We compare the accuracy and running times of the Semi-CRF model (SC) [Sarawagi and Cohen, 2004] and the bank of classifiers model (C+I, denoted PR) [Punyakanok and Roth, 2001], using 10-fold cross-validation. Note that the standardizer can often recover from mistakes made at the segmentation level; therefore, this performance does not necessarily upper-bound the performance of the next step in our pipeline. The segmentation we are aiming for does not directly follow from the syntactic structure of a sentence. For example, in the sentence “The unemployment rate increased 10%”, we would like to segment together “increased 10%”, since this tells us that the quantity denotes a rise in value. Also, in the sentence “Apple restores push email in Germany, nearly two years after Motorola shut it down”, we would like to segment together “nearly two years after”. We consider a quantity to be correctly detected only when we have the exact phrase that we want; otherwise we consider the segment to be undetected.

Model          P%    R%    F%    Train Time  Test Time
Semi-CRF (SC)  75.6  77.7  76.6  15.8        1.5
C+I (PR)       80.3  79.3  79.8  1.0         1.0

Table 4.1: 10-fold cross-validation results of segmentation accuracy and time required for segmentation; the runtime columns have been normalized and are expressed as ratios

Table 4.1 reports the segmentation accuracy, as well as the ratio between the times taken by the two approaches. The bank of classifiers approach gives slightly better accuracy than the semi-CRF model, and is also significantly faster.

4.6.3 Quantity Entailment

We evaluate the complete Quantity Entailment system, determining the overall loss due to segmentation, as well as the contribution of the coreference resolver and SRL. We show the performance of four systems:

1. GoldSeg: uses gold segmentation, and does not use the SRL and coreference resolver.
2. GoldSeg+Sem: uses gold segmentation, and also uses the SRL and coreference resolver to infer units.
3. PredSeg: performs segmentation, and does not use the SRL and coreference resolver.
4. PredSeg+Sem: performs segmentation, and uses the SRL and coreference resolver.

The baseline is an exact string matching algorithm.
It answers “entails” if the quantity unit and value are present in the text, and answers “contradicts” if only the unit matches and the value does not. Otherwise, it returns “no relation”. The results are shown in Table 4.2. Note that exact match only supports 43.3% of the entailment decisions. It is also evident that the deeper semantic analysis using SRL and coreference improves the quantitative inference.

Task           System    P%     R%    F%
Entailment     Baseline  100.0  43.3  60.5
               GoldSeg   98.5   88.0  92.9
               +Sem      97.8   88.6  93.0
               PredSeg   94.9   76.2  84.5
               +Sem      95.4   78.3  86.0
Contradiction  Baseline  16.6   48.5  24.8
               GoldSeg   61.6   92.9  74.2
               +Sem      64.3   91.5  75.5
               PredSeg   51.9   79.7  62.8
               +Sem      52.8   81.1  64.0
No Relation    Baseline  41.8   71.9  52.9
               GoldSeg   81.1   76.7  78.8
               +Sem      80.0   78.5  79.3
               PredSeg   54.0   75.4  62.9
               +Sem      56.3   72.7  63.5

Table 4.2: Results of QE. Adding semantics (+Sem) consistently improves performance; only 43.3% of entailing quantities can be recovered by simple string matching

4.6.4 Currency Range Search

Table 4.3 shows the performance of our system in detecting currency phrases. We evaluate our system on the proportion of monetary mentions it recognized and standardized correctly from queried ranges of currency values, and report micro-averaged scores. Note that range search is a direct application of QE, where the quantity is a range of values and the text is the corpus we want to search; all instances of “entails” correspond to search hits. The baseline here is also a string matching algorithm, which searches for numbers in the text.

System       P%    R%    F%
Baseline     72.0  69.2  70.5
PredSeg+Sem  96.0  93.5  94.8

Table 4.3: Micro-averaged accuracy in detecting monetary mentions

4.6.5 Qualitative Analysis

The segmentation module made mistakes in detecting exact boundaries for uncommon phrases, e.g., “hundreds of thousands of people” and “mid-1970’s”. Detection of missing units is problematic in cases like “Three eggs are better than two”: the SRL returns “Three eggs” as a candidate unit, which needs to be pruned appropriately to obtain the correct unit. The primary limitation of the reasoning system in both tasks is the lack of an extensive knowledge base; WordNet-based synsets prove insufficient to infer whether units are compatible. Also, certain reasoning patterns and various implicit relations between quantities are not currently handled by the system, for example, inferring from the sentence “Militants in Rwanda killed an [average of 8,000 people per day] for [100 days]” that “around 800,000 people were killed”. Implications of ratios can also be involved: the sentence “[One out of 100 participating students] will get the award” implies that there were “100 participating students”, whereas “[9 out of 10 dentists] recommend brushing” does not imply there were 10 dentists.

4.7 Conclusion

We studied reasoning about quantities in natural language text with respect to local context. We have identified and defined an interesting and useful slice of the Textual Entailment problem, the Quantity Entailment task. Since this problem involves inferences about individual quantities, it allows us to capture quantitative reasoning with respect to local context. Our ability to support local context based quantitative reasoning builds on a method we proposed for detecting and normalizing quantities in unrestricted English text; we developed a framework to remove variability and ambiguity from unstructured text by mapping it into a representation which makes reasoning more tractable.
Once quantities are mapped into our representation, we can support the reasoning required by Quantity Entailment.

Chapter 5 Equation Parsing

5.1 Introduction

We now expand our focus from the local context of numbers to entire sentences. We investigate how numbers interact with the entities, verbs and other components of a sentence. In particular, we investigate how we can extract numeric relations expressed in sentences, along with the arguments of those relations. Extracting numeric relations from sentences is a key requirement for natural language understanding. As an example, consider the news article statement in Example 5.1. Understanding it requires identifying relevant entities and the mathematical relations expressed among them in text, and determining how to compose them. Similarly, solving a math word problem with a sentence like Example 5.2 requires realizing that it deals with a single number, knowing the meaning of “difference”, and composing the right equation: “25” needs to be subtracted from the number only after it is multiplied by 3.

Example 5.1 Emanuel’s campaign contributions total three times those of his opponents put together.
Example 5.2 Twice a number equals 25 less than triple the same number.
Example 5.3 Flying with the wind, a bird was able to make 150 kilometers per hour.
Example 5.4 The sum of two numbers is 80.
Example 5.5 There are 54 5-dollar and 10-dollar notes.

As a first step towards understanding such relations, we introduce the Equation Parsing task: given a sentence expressing a mathematical relation, the goal is to generate an equation representing the relation, and to map the variables in the equation to their corresponding noun phrases. To keep the problem tractable, we restrict the final output equation form to have at most two (possibly coreferent) variables, and assume that each quantity mentioned in the sentence can be used at most once in the final equation.¹ In Example 5.1, the gold output of an equation parse should be V1 = 3 × V2, with V1 = “Emanuel’s campaign contributions” and V2 = “those of his opponents put together”.

¹We empirically found that around 97% of sentences describing a relation have this property.

The task can be seen as a form of semantic parsing [Goldwasser and Roth, 2011, Kwiatkowski et al., 2013] where, instead of mapping a sentence to a logical form, we want to map it to an equation. However, there are some key differences that make this problem challenging in ways that differ from “standard” semantic parsing. In Equation Parsing, not all components of the sentence are mapped to the final equation; there is a need to identify the noun phrases that correspond to variables in the relation, and to determine that some are irrelevant and can be dropped. Moreover, unlike semantic parsing into logical forms, in Equation Parsing multiple phrases in the text could correspond to the same variable, and identical phrases in the text could correspond to multiple variables. We call the problem of mapping noun phrases to variables the problem of grounding variables. Grounding is challenging for various reasons, key among them being that: (i) the text often does not mention “variables” explicitly; e.g., the sentence in Example 5.3 describes a mathematical relation between the speed of the bird and the speed of the wind, without mentioning “speed” explicitly; (ii) sometimes, multiple noun phrases refer to the same variable.
For instance, in Example 5.2, both “a number” and “the same number” refer to the same variable. On the other hand, the same noun phrase might refer to multiple variables, as in Example 5.4, where the noun phrase “two numbers” refers to two variables. In addition, the task involves deciding which of the quantities identified in the sentence are relevant to the final equation. In Example 5.5, both “5” and “10” are not relevant for the final equation “V1 + V2 = 54”. Finally, the equation needs to be constructed from a list of relevant quantities and grounded variables. Overall, the output space becomes exponential in the number of quantities mentioned in the sentence.

Determining the final equation that corresponds to the text is an inference step over a very large space. To address this, we define the concept of “projectivity” – a condition under which the final equation can be generated by combining only adjacent numbers or variables – and show that most sentences expressing mathematical relations exhibit the projectivity property. We then restrict our inference procedure to search only over equations which have this property.

Our approach builds on a pipeline of structured predictors that identify irrelevant quantities, recognize coreferent variables, and, finally, generate the equation. We also leverage a high-precision lexicon of mathematical expressions and develop a greedy lexicon-matching strategy to guide inference. We discuss and exemplify the advantages of this approach and, in particular, explain where the “standard” NLP pipeline fails to support equation parsing, necessitating the new approach proposed here. Another contribution of this work is the development of a new annotated dataset for the task of equation parsing. We evaluate our method on this dataset and show that it predicts the correct equation in 70% of the cases, and that in 60% of the cases it also grounds all variables correctly.

The next section presents a discussion of related work. Next, we formally describe the task of equation parsing. The following sections describe our equation representation and the concept of projectivity, followed by a description of our algorithm for generating equations and variable groundings from text. We conclude with experimental results.

5.2 Related Work

The work most closely related to ours is [Madaan et al., 2016], which focuses on extracting relation triples where one of the arguments is a number. In contrast, our work deals with multiple variables and complex equations involving them. This work is also conceptually related to work on semantic parsing – mapping natural language text to a formal meaning representation [Wong and Mooney, 2007, Clarke et al., 2010, Cai and Yates, 2013, Kwiatkowski et al., 2013, Goldwasser and Roth, 2011]. However, as mentioned earlier, there are significant differences in the task definition that necessitate the development of a new approach.

5.3 The Equation Parsing Task

Equation parsing takes as input a sentence x describing a single mathematical equation, comprising one or two variables and other quantities mentioned in x. Let N be the set of noun phrases in the sentence x. The output of the task is the mathematical equation described in x, along with a mapping of each variable in the equation to its corresponding noun phrase in N. We refer to this mapping as the “grounding” of the variable; the noun phrase represents what the variable stands for in the equation.
Table 5.1 gives an example of the input and output for the equation parsing of the text in Example 5.2. Since an equation can be written in various forms, we use the form which most agrees with the text as our target output. So, for Example 5.1, we choose V1 = 3 × V2 and not V2 = V1 ÷ 3. In cases where several equation forms seem equally likely to be the target equation, we randomly choose one of them, and keep this choice consistent across the dataset.

The Equation Parsing Task
Input   Twice a number equals 25 less than triple the same number.
Output  2 × V1 = (3 × V1) − 25  (Equation)
        V1 = “a number”  (Grounding)

Table 5.1: Input and output for Equation Parsing

5.3.1 Equation Parse Representation

[Figure 5.1: The sentence “Twice a number equals 25 less than triple the same number.” with its trigger list and equation tree. −r indicates subtraction with order rl.]

In this section, we introduce the equation parse of a sentence. An equation parse of a sentence x is a pair (T, E), where T represents a set of triggers extracted from x, and E represents an equation tree formed with the set T as leaves. We now describe these terms in detail.

Trigger Given a sentence x mentioning a mathematical relation, a trigger can either be a quantity trigger expressed in x, or a variable trigger, which is a noun phrase in x corresponding to a variable. A quantity trigger is a tuple (q, s), where q is the numeric value of the quantity mentioned in text, and s is the span of text from the sentence x which refers to the quantity. A variable trigger is a tuple (l, s), where l represents the label of the variable, and s represents the noun phrase representing the variable. For example, for the sentence in Fig 5.1, the spans “Twice”, “25” and “triple” generate quantity triggers, whereas “a number” and “the same number” generate variable triggers, with label V1.

Trigger List The trigger list T for a sentence x contains one trigger for each variable mention and each numeric value used in the final equation expressed by the sentence x. The trigger list might consist of multiple triggers having the same label, or extracted from the same span of text. For the example sentence in Fig 5.1, the trigger list comprises two triggers having the same label V1; the final trigger list is {(2, “Twice”), (V1, “a number”), (25, “25”), (3, “triple”), (V1, “the same number”)}. Note that there can be multiple valid trigger lists; in our example, we could have chosen both variable triggers to point to the same mention “a number”. The quantity triggers in the trigger list form the quantity trigger list, and the variable triggers form the variable trigger list.
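To make the definitions concrete, the following small sketch builds the trigger list of Fig 5.1; the class names are ours, not the system’s.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class QuantityTrigger:
    value: float      # numeric value q
    span: str         # text span s referring to the quantity

@dataclass
class VariableTrigger:
    label: str        # variable label l
    span: str         # noun phrase s representing the variable

Trigger = Union[QuantityTrigger, VariableTrigger]

# Trigger list for "Twice a number equals 25 less than triple the same number."
trigger_list: List[Trigger] = [
    QuantityTrigger(2, "Twice"),
    VariableTrigger("V1", "a number"),
    QuantityTrigger(25, "25"),
    QuantityTrigger(3, "triple"),
    VariableTrigger("V1", "the same number"),
]
```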
Notation                      Definition
Quantity Trigger              Mention of a quantity in text
Variable Trigger              Noun phrase coupled with a variable label
Trigger                       Quantity or variable trigger
Quantity Trigger List         List of quantity triggers, one for each number mention in the equation
Variable Trigger List         List of variable triggers, one for each variable mention in the equation
Trigger List                  Union of the quantity and variable trigger lists
Equation Tree                 Binary tree representation of the equation
lc(n), rc(n)                  Left and right child of node n
Expr(n)                       Expression represented by node n
⊙(n)                          Operation at node n
Order(n)                      Order of the operation at node n
Location(n)                   Character offset of the trigger representing leaf node n
Span-Start(n), Span-End(n)    Start and end character offsets of the span covered by node n

Table 5.2: Summary of notation used in this chapter

Equation Tree An equation tree of a sentence x is a binary tree whose leaves constitute the trigger list of x, and whose internal nodes (except the root) are labeled with one of the following operations: addition, subtraction, multiplication or division. In addition, for nodes labeled with subtraction or division, we maintain a separate variable determining the order of their children. The root of the tree is always labeled with the operation equal. An equation tree is a natural representation for an equation. Each node n in an equation tree represents an expression Expr(n), and the label of the parent node determines how the expressions of its children are composed to construct its own expression. We denote the label of a non-leaf node n by ⊙(n), where ⊙(n) ∈ {+, −, ×, ÷, =}, and the order of node n’s children by Order(n) (defined only for subtraction and division nodes), which takes the value lr (Left-Right) or rl (Right-Left). For a leaf node n, the expression Expr(n) is the variable label, if n is a variable trigger, and the numeric value of the quantity, if it is a quantity trigger. Finally, we use lc(n) and rc(n) to denote the left and right child of node n, respectively. The equation represented by the tree can be generated as follows. For all non-leaf nodes n, we have

Expr(n) = Expr(lc(n)) ⊙(n) Expr(rc(n))    if ⊙(n) ∈ {+, ×, =}
Expr(n) = Expr(lc(n)) ⊙(n) Expr(rc(n))    if ⊙(n) ∈ {−, ÷} ∧ Order(n) = lr     (5.1)
Expr(n) = Expr(rc(n)) ⊙(n) Expr(lc(n))    if ⊙(n) ∈ {−, ÷} ∧ Order(n) = rl

Given an equation tree T of a sentence, the equation represented by it is the expression generated by the root of T (following Equation 5.1). Referring to the equation tree in Fig 5.1, the node marked “−r” represents (3 × V1) − 25, and the root represents the full equation 2 × V1 = (3 × V1) − 25.

5.4 Projectivity

For each leaf n of an equation tree T, we define a function Location(·) to indicate the position of the corresponding trigger in the text. We also define, for each node n of the equation tree T, functions Span-Start(n) and Span-End(n) denoting the minimum span of text containing the leaves of the subtree rooted at n:

Span-Start(n) = Location(n)   if n is a leaf
Span-Start(n) = min(Span-Start(lc(n)), Span-Start(rc(n)))   otherwise

Span-End(n) = Location(n)   if n is a leaf
Span-End(n) = max(Span-End(lc(n)), Span-End(rc(n)))   otherwise

An equation tree T is called projective iff for every node n of T, either Span-End(lc(n)) ≤ Span-Start(rc(n)) or Span-End(rc(n)) ≤ Span-Start(lc(n)). In other words, the spans of the left child and the right child cannot intersect in a projective equation tree.²

²This is more general than the definition of projective trees used in dependency parsing [McDonald et al., 2005].
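The span recursion and the projectivity test translate directly into a short recursive check. A minimal sketch, with a tree encoding and names that are ours:

```python
def span(node):
    """Return (Span-Start, Span-End) for a node of the equation tree.
    A leaf is a character offset (its trigger's Location); an internal
    node is a tuple (op, left_child, right_child)."""
    if isinstance(node, int):                      # leaf: Location(n)
        return node, node
    _, lc, rc = node
    ls, le = span(lc)
    rs, re = span(rc)
    return min(ls, rs), max(le, re)

def is_projective(node):
    """True iff no node's children have intersecting spans."""
    if isinstance(node, int):
        return True
    _, lc, rc = node
    (ls, le), (rs, re) = span(lc), span(rc)
    return (le <= rs or re <= ls) and is_projective(lc) and is_projective(rc)

# Equation tree of Fig 5.1, with leaves at their character offsets:
# 2("Twice")=0, V1("a number")=6, 25=22, 3("triple")=35, V1("the same number")=42
tree = ("=", ("*", 0, 6), ("-r", ("*", 35, 42), 22))
print(is_projective(tree))   # True
```

Note that grounding the second mention of V1 at offset 42 (rather than reusing “a number” at offset 6) is exactly what makes this tree projective.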
The key observation, as our corpus analysis indicates, is that for most sentences there exists a trigger list such that the equation tree representing the relation in the sentence is projective. However, this might involve mapping two mentions of the same variable to different noun phrases. Figure 5.1 shows an example of a projective equation tree which requires different mentions of V1 to be mapped to different noun phrases; if we had mapped both mentions of V1 to the same noun phrase “a number”, the resulting equation tree would not have been projective.

We collected 385 sentences, each representing an equation with one or two mentions of variables, with each number in the sentence used at most once in the equation. We found that only one sentence among these could not generate a projective equation tree. (See Section 5.6.1 for details on dataset creation.) We therefore develop an algorithmic approach for predicting projective equation trees, and show empirically that it compares favorably with approaches which do not make the projectivity assumption.

5.5 Predicting Equation Parse

Equation parsing of a sentence involves predicting three components: the quantity trigger list, the variable trigger list and the equation tree. We develop three structured prediction modules to predict each of these components. All our prediction modules take a similar form: given input x and output y, we learn a scoring function fw(x, y), which scores how likely the output y is given input x. The scoring function fw(x, y) is linear: fw(x, y) = wᵀφ(x, y), where φ(x, y) is a feature vector extracted from x and y. The inference problem, that is, the prediction y* for an input x, is then: y* = arg max_{y∈Y} fw(x, y), where Y is the set of all allowed values of y.

5.5.1 Predicting Quantity Trigger List

Given the input text and the quantities mentioned in it, the role of this step is to identify, for each quantity in the text, whether it should be part of the final equation. For instance, in Example 5.5 in Section 5.1, both “5” and “10” are not relevant for the final equation “V1 + V2 = 54”. Similarly, in Example 5.4, the number “two” is irrelevant for the equation “V1 + V2 = 80”. We define, for each quantity q in the sentence, a boolean value Relevance(q), which is set to true if q is relevant for the final equation, and to false otherwise. For this structured classification, the input x is the sentence along with the set of recognized quantities mentioned in it, and the output y is the set of relevance values for all quantities in the sentence. We empirically found that predicting all relevance values jointly performs better than having a binary classifier predict each one separately; a sketch of this joint prediction is given after the feature description below. The feature function φ(x, y) used for the classification generates the following features:

1. Neighborhood features: for each quantity q in the input sentence, we add unigrams and bigrams generated from a window around q, and the part of speech tags of q’s neighboring tokens. These features are conjoined with Relevance(q).

2. Quantity features: for each quantity q, we add unigrams and bigrams of the phrase representing the quantity. We also add a feature indicating whether the value of the quantity is one or two, and whether it is the only number present in the sentence. These features are also conjoined with Relevance(q).
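As referenced above, a minimal sketch of the joint relevance prediction as an exhaustive argmax over boolean assignments; the toy scorer stands in for the learned wᵀφ(x, y), and all names are ours.

```python
from itertools import product

def predict_relevance(quantities, score):
    """Jointly pick Relevance(q) for all quantities: argmax over all
    boolean assignments under a scoring function."""
    candidates = product([True, False], repeat=len(quantities))
    return max(candidates, key=lambda y: score(quantities, y))

# Toy scorer standing in for w^T phi(x, y): prefer marking a quantity
# irrelevant when its surrounding phrase names a denomination.
def toy_score(quantities, y):
    return sum(1.0 if rel != ("dollar" in ctx) else 0.0
               for (val, ctx), rel in zip(quantities, y))

# "There are 54 5-dollar and 10-dollar notes." (Example 5.5)
qs = [(54, "notes"), (5, "5-dollar"), (10, "10-dollar")]
print(predict_relevance(qs, toy_score))   # (True, False, False)
```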
5.5.2 Predicting Variable Trigger List

The goal of this step is to predict the variable trigger list for the equation. Our structured classifier takes as input the sentence x, and the output y is either one or two noun phrases representing the variables in the final equation. As we pointed out earlier, multiple groundings might be valid for any given variable, hence there can be multiple valid variable trigger lists. For every sentence x, we therefore construct a set Y of valid outputs, each element of which corresponds to a valid variable trigger list; we aim to output one of the elements of Y.

We modified the standard structured prediction algorithm to consider “superset supervision”, taking into account multiple gold structures for an input x. We assume access to N training examples of the form (x1, Y1), (x2, Y2), ..., (xN, YN), where each Yi is a set of valid outputs for the sentence xi. Since we want to output only one variable trigger list, we want to score at least one y from Yi higher than all other possible outputs, for each xi. We use a modified latent structural SVM to learn the weight vector w. The algorithm treats the best choice among all of Yi as a latent variable. At each iteration, for each xi, the algorithm chooses the best choice y*i from the set Yi according to the current weight vector w; then w is updated by learning on all (xi, y*i) with a standard structured SVM algorithm. The details are given in Algorithm 4, and a code sketch of this training loop is given at the end of this subsection.

Algorithm 4 Structural SVM with Superset Supervision
Input: Training data T = {(x1, Y1), (x2, Y2), ..., (xN, YN)}
Output: Trained weight vector w
1: w ← w0
2: repeat
3:   T′ ← ∅
4:   for all (xi, Yi) ∈ T do
5:     y*i ← arg max_{y∈Yi} wᵀφ(xi, y)
6:     T′ ← T′ ∪ {(xi, y*i)}
7:   end for
8:   Update w by running the standard Structural SVM algorithm on T′
9: until convergence
10: return w

The distinction from the standard latent structural SVM is in line 5 of Algorithm 4: in order to get the best choice y*i for input xi, we search only inside Yi, instead of all of Y. A similar formulation can be found in [Björkelund and Kuhn, 2014]. The features φ(x, y) used for variable trigger prediction are as follows:

1. Variable features: unigrams and bigrams generated from the noun phrases representing the variables, and the part of speech tags of the tokens in those noun phrases.

2. Neighborhood features: unigrams and POS tags from the neighborhood of the variables.

All the above features are conjoined with two labels, one denoting whether y has two variables or one, and the other denoting whether y has two variables represented by the same noun phrase.

If the output of the classifier is a pair of noun phrases, we use a rule-based variable coreference detector to determine whether the two noun phrases should have the same variable label. The rules for variable coreference are as follows:

1. If both noun phrases are the same, and they do not contain the token “two” or “2”, they have the same label.
2. If the noun phrases are different, and the noun phrase appearing later in the sentence contains the tokens “itself” or “the same number”, they have the same label.
3. In all other cases, they have different labels.

Finally, each noun phrase contributes one variable trigger to the variable trigger list.
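Returning to Algorithm 4, a minimal sketch of the training loop; the inner structural SVM pass is abstracted as a hypothetical helper `ssvm_update`, and all names are ours.

```python
def train_superset_ssvm(data, phi, ssvm_update, w0, epochs=10):
    """Structural SVM with superset supervision (Algorithm 4).

    data:        list of (x_i, Y_i) pairs, Y_i a set of valid outputs for x_i
    phi:         feature function phi(x, y) returning a vector (e.g., numpy array)
    ssvm_update: stands in for one run of a standard structural SVM trainer
                 on fully labeled (x, y) pairs
    """
    w = w0
    for _ in range(epochs):                       # "repeat ... until convergence"
        resolved = []
        for x, Y in data:
            # Latent step (line 5): pick the best-scoring output, searching
            # only inside the valid set Y rather than the full output space.
            y_star = max(Y, key=lambda y: w @ phi(x, y))
            resolved.append((x, y_star))
        w = ssvm_update(resolved, w)              # line 8: standard SSVM pass
    return w
```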
5.5.3 Predicting Equation Tree

It is natural to assume that the syntactic parse of the sentence would be very useful for all the predictions we make in the equation parsing task. However, this turns out not to be the case: large portions of the syntactic parse are not part of the equation parse, hence we need the aforementioned modules. Nevertheless, for the next task of predicting the equation tree, we attempted to constrain the output space using guidance from the syntactic tree; we found, though, that even enforcing this weak level of output expectation is not productive. This was due to the poor performance of current syntactic parsers on the equation data (e.g., in 32% of sentences, the Stanford parser made a mistake which does not allow recovering the correct equation).

The tree prediction module receives the trigger list predicted by the previous two modules, and its goal is to create an equation tree using the trigger list as the leaves of that tree. The input x is the sentence and the trigger list, and the output y is the equation tree representing the relation described in the sentence. We assume that the output is a projective equation tree. For the features φ(x, y), we extract, for each non-leaf node n of the equation tree y, the following:

1. Neighborhood features: unigrams, bigrams and POS tags from the neighborhoods of Span-Start(lc(n)), Span-Start(rc(n)), Span-End(lc(n)) and Span-End(rc(n)), conjoined with ⊙(n) and Order(n).

2. Connecting text features: unigrams, bigrams and part of speech tags between min(Span-End(lc(n)), Span-End(rc(n))) and max(Span-Start(lc(n)), Span-Start(rc(n))), conjoined with ⊙(n) and Order(n).

3. Number features: in case we are combining two leaf nodes representing quantity triggers, we add a feature signifying whether one number is larger than the other.

The projectivity assumption implies that the final equation tree can be generated by combining only adjacent nodes, once the set of leaves is sorted by Span-Start(·) values. This allows us to use the CKY algorithm for inference; a sketch is given at the end of this subsection. A natural approach to further reducing the output space is to conform to the projective structure of the syntactic parse of the sentence. However, we found this to adversely affect performance, due to the poor performance of syntactic parsers on equation data.

Lexicon To bootstrap the equation parsing process, we developed a high-precision lexicon that translates mathematical expressions to operations and orders; for example, “sum of A and B” translates to “A+B”, and “A minus B” translates to “A−B” (where A and B denote placeholder numbers or expressions). At each step of CKY, while constructing a node n of the equation tree, we check for a lexicon text expression corresponding to node n. If one is found, we allow only the corresponding operation (and order) for node n, and do not explore other operations or orders. We show empirically that reducing the search space via this greedy lexicon matching helps improve performance; using the lexicon rules as features instead of hard constraints does not help as much. Note that our lexicon comprises only generic math concepts, and around 50% of the sentences in our dataset do not contain any pattern from the lexicon. Details of the lexicon are provided in the appendix.

Finally, given an input sentence, we first predict the quantity trigger list and the variable trigger list. Given the complete trigger list, we predict the equation tree relating the components of the trigger list.
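As referenced above, a condensed sketch of the CKY-style inference over the sorted trigger list; the node scorer and the lexicon lookup are stubbed out, and all names are ours.

```python
OPS = ["+", "-", "-r", "*", "/", "/r"]   # -r, /r: reversed-order subtraction/division

def cky_equation_tree(leaves, score, lexicon_op=None):
    """Combine only adjacent spans (projectivity) bottom-up, keeping the
    best-scoring tree per span. `leaves` are triggers sorted by Span-Start;
    `score(tree, i, k, j)` stands in for the learned node score."""
    n = len(leaves)
    best = {}                                    # (i, j) -> (score, tree)
    for i, leaf in enumerate(leaves):
        best[(i, i + 1)] = (0.0, leaf)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):            # split point
                ls, lt = best[(i, k)]
                rs, rt = best[(k, j)]
                # Greedy lexicon matching: if a lexicon pattern covers this
                # span, only the corresponding operation is explored.
                fixed = lexicon_op(i, j) if lexicon_op else None
                for op in ([fixed] if fixed else OPS):
                    tree = (op, lt, rt)
                    s = ls + rs + score(tree, i, k, j)
                    if (i, j) not in best or s > best[(i, j)][0]:
                        best[(i, j)] = (s, tree)
    # The root operation is fixed to "=" in a final step, omitted here.
    return best[(0, n)][1]
```

Since each cell keeps only its best tree and only adjacent spans are combined, the search respects the projectivity restriction described in Section 5.4.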
5.5.4 Alternatives

A natural alternative would be to jointly learn to predict all three components, to capture the dependencies among them. To investigate this, we developed a structured SVM which predicts all components jointly, using the union of the features of the individual components. We use approximate inference, first enumerating possible trigger lists, and then equation trees, and finding the best scoring structure. However, this method did not outperform the pipeline method. The worse performance of joint learning is due to: (1) the search space being too large for the joint model to do well given our dataset size of 385, and (2) our independently learned classifiers being good enough to support effective inference at prediction time. This tradeoff is strongly supported in the literature [Punyakanok et al., 2005b, Sutton and McCallum, 2007].

Another option is to enforce constraints between the trigger list predictions, for example, that variable triggers should not overlap with quantity triggers. However, we noticed that the noun phrases returned by the Stanford parser were often noisy and would include neighboring numbers within the extracted noun phrases. This prevented us from enforcing such constraints.

5.6 Experimental Results

We now describe the dataset and the annotation procedure used, and then evaluate the system’s performance on predicting the trigger list, the equation tree, and the complete equation parse.

5.6.1 Dataset

We created a new dataset consisting of 385 sentences extracted from algebra word problems and financial news headlines. For algebra word problems, we used the MIT dataset [Kushman et al., 2014], and two high school mathematics textbooks, Elementary Algebra (College of the Redwoods) and Beginning and Intermediate Algebra (Tyler Wallace). Financial news headlines were extracted from The Latest News feed of MarketWatch, over the month of February 2015. All sentences containing a mathematical relation among at most two (possibly coreferent) variables were chosen. Next, we pruned sentences which require multiple uses of a number to create the equation. This only removed a few time-related sentences like “In 10 years, John will be twice as old as his son.”. We empirically found that around 97% of sentences describing a relation fall under the scope of our dataset. Statistics of the dataset are provided in Table 5.3.

Source       #Sentences
MIT          245
Redwoods     25
Wallace      40
MarketWatch  75

Table 5.3: Statistics of the dataset

The annotators were shown each sentence paired with the normalized equation representing the relation in the sentence. For each variable in the equation, the annotators were asked to mark the spans of text which best describe what the variable represents. They were asked to annotate associated entities if an exact variable description was not present. For instance, in Example 5.3 (Section 5.1), the relation holds between the speed of the bird and the speed of the wind. However, “speed” is not explicitly mentioned in the sentence. In such cases, the annotators were asked to annotate the associated entities “the wind” and “a bird” as representing the variables. The guidelines also directed annotators to choose the longest possible mention in case they felt the mention boundary was ambiguous. As a result, in the sentence “City Rentals rent an intermediate-size car for 18.95 dollars plus 0.21 per mile.”, the phrase “City Rentals rent an intermediate-size car” was annotated as representing a variable. We allow multiple mentions to be annotated for the same variable; in Example 5.2 (Section 5.1), both “a number” and “the same number” were annotated as representing the same variable.

We wanted to consider only noun phrase constituents for variable grounding. Therefore, for each annotated span, we extracted the noun phrase with maximum overlap with the span, and used it to represent the variable.
We computed inter-annotator agreement on the final annotations, where only noun phrases represent variables. The agreement (kappa) was 0.668, indicating good agreement. The average number of mention annotations per sentence was 1.74.

5.6.2 Equation Parsing Modules

In this section, we evaluate the performance of the individual modules of the equation parsing process. We report Accuracy, the fraction of correct predictions. Tables 5.4, 5.5 and 5.6 show the 5-fold cross validation accuracy of the various modules. In each case, we also report the accuracy obtained by removing each feature group, one at a time. In addition, for equation tree prediction, we also show the effect of the lexicon, of projectivity, of conforming to syntactic parse constraints, and of using the lexicon as features instead of hard constraints. For all our experiments, we use the Stanford Parser [Chen and Manning, 2014], the Illinois POS tagger [Roth and Zelenko, 1998], the Illinois shallow parser [Punyakanok and Roth, 2001] and the Illinois-SL structured prediction package [Chang et al., 2015].

                               Accuracy
All features                   95.3
No Neighborhood features       42.5
No Quantity features           93.2

Table 5.4: Performance of Quantity Trigger List Prediction

                               Accuracy
All features                   75.5
No Variable features           58.6
No Neighborhood features       70.3

Table 5.5: Performance of Variable Trigger List Prediction

                               Accuracy
All features                   78.9
No Neighborhood features       64.3
No Connecting Text features    70.2
No Number features             77.6
No Lexicon                     72.7
No Projectivity                72.8
Conform with Syntactic Parse   70.2
Lexicon as Features            74.5

Table 5.6: Performance of Equation Tree Prediction

5.6.3 Equation Parsing Results

In this section, we evaluate the performance of our system on the overall equation parsing task. We report Equation Accuracy, the fraction of sentences for which the system got the equation correct, and Equation+Grounding Accuracy, the fraction of sentences for which the system got both the equation and the grounding of the variables correct. Table 5.7 shows the overall performance of our system under 5-fold cross validation. We compare against Joint Learning, a system which jointly learns to predict all relevant components of an equation parse (Section 5.5.4). We also compare with SPF [Artzi and Zettlemoyer, 2013], a publicly available semantic parser which can learn from sentence-logical form pairs. We train SPF with sentence-equation pairs and a seed lexicon for mathematical terms (similar to ours), and report equation accuracy. Our pipeline of structured predictors is superior to both Joint Learning and SPF.

System           Equation Accuracy    Equation + Grounding Accuracy
Our System       71.3 ± 3.7           61.2 ± 5.4
Joint Learning   60.9 ± 9.1           50.0 ± 5.7
SPF              3.1                  N/A

Table 5.7: Performance on equation parsing

SPF gets only a few sentences correct. We attribute this to the inability of SPF to handle overlapping mentions (as in Example 5.4), as well as to its approach of parsing the whole sentence to the final output form. The developers of SPF also confirmed (in private communication) that it is not suitable for equation parsing and that these results are expected. Since equation parsing is a more involved task, a slight adaptation of SPF does not seem possible; it requires a more involved approach, of the type we propose.
Our approach, in contrast to SPF, can handle overlapping mentions, selects triggers from text, and parses the trigger list to form equations.

5.6.4 Error Analysis

For variable trigger list prediction, around 25% of the errors were due to the predictor choosing a span which is contained within the correct span, e.g., when the target noun phrase is "The cost of a child's ticket", our predictor chose only "child's ticket". Although this choice might be sufficient for downstream tasks, we consider it incorrect in our current evaluation. Another 25% of the errors were due to the selection of entities which do not participate in the relation. For example, in "A rancher raises 5 times as many cows as horses.", our predictor chose "A rancher" and "cows" as variables, whereas the relation exists between "cows" and "horses". For the prediction of the equation tree, we found that 35% of the errors were due to rare math concepts expressed in the text. For example, "7 dollars short of the price" means that 7 dollars should be subtracted from the price. These errors can be handled by carefully augmenting the lexicon. Another 15% of the errors were due to a lack of world knowledge, requiring an understanding of time, speed, and distance.

5.7 Conclusion

In this chapter, we investigate methods that identify and understand mathematical relations expressed in text. We introduce the equation parsing task, which involves generating an equation from a sentence and identifying what the variables represent. We define the notion of projectivity, construct a high precision lexicon, and use these to reduce the equation search space. Our experimental results are quite satisfying and raise a few interesting issues. In particular, they suggest that predicting equation parses using a pipeline of structured predictors performs better than jointly trained alternatives. As discussed, they also point out the limitations of current NLP tools in supporting these tasks. Code and dataset are available at http://cogcomp.cs.illinois.edu/page/publication_view/800.

Chapter 6 Arithmetic Word Problems

6.1 Introduction

In this chapter, we expand our focus to quantitative reasoning over multiple numbers mentioned across different sentences. As a test bed, we choose the task of automatically solving arithmetic word problems. We found these problems to be a good abstraction of the challenges faced in quantitative reasoning across multiple sentences. Word problems arise naturally when reading the financial section of a newspaper, or following election coverage. These problems pose an interesting challenge to the NLP community, due to their concise and relatively straightforward text, and seemingly simple semantics. Arithmetic word problems are usually directed towards elementary school students, and can be solved by combining the numbers mentioned in the text with basic operations (addition, subtraction, multiplication, division). They are simpler than algebra word problems, which require students to identify variables and form equations with these variables to solve the problem. Initial methods to address arithmetic word problems have mostly focused on subsets of problems, restricting the number or the type of operations used [Roy et al., 2015, Hosseini et al., 2014], but could not deal with multi-step arithmetic problems involving all four basic operations.
The template-based method of [Kushman et al., 2014], on the other hand, can deal with all types of problems, but implicitly assumes that the solution is generated from a set of predefined equation templates. In this chapter, we present a novel approach which can solve a general class of arithmetic problems without predefined equation templates. In particular, it can handle multi-step arithmetic problems such as Example 6.1.

Example 6.1 Gwen was organizing her book case making sure each of the shelves had exactly 9 books on it. She has 2 types of books - mystery books and picture books. If she had 3 shelves of mystery books and 5 shelves of picture books, how many books did she have in total?

The solution involves understanding that the numbers of shelves need to be summed up, and that the total number of shelves needs to be multiplied by the number of books each shelf can hold. In addition, one has to understand that the number "2" is not a direct part of the solution of the problem. While a solution to these problems eventually requires composing a multi-step numeric expression from the text, we believe that directly predicting this complex expression from the text is not feasible. At the heart of our technical approach is the novel notion of an Expression Tree. We show that the arithmetic expressions we are interested in can always be represented using an Expression Tree that has some unique decomposition properties. This allows us to decompose the problem of mapping the text to the arithmetic expression into a collection of simple prediction problems, each determining the lowest common ancestor operation between a pair of quantities mentioned in the problem. We then formulate the decision problem of composing the final expression tree as a joint inference problem, via an objective function that combines all these decomposed prediction problems, along with legitimacy and background knowledge constraints. Learning to generate the simpler decomposed expressions allows us to effectively support quantitative reasoning over multiple numbers, and allows generalization across problem types. In particular, our system could solve Example 6.1 even though it has never seen a problem that requires both addition and multiplication operations. We also introduce a second concept, that of a quantity schema, which allows us to focus on the information relevant to each quantity mentioned in the text. We show that features extracted from quantity schemas support effective reasoning about the solution. Moreover, quantity schemas help identify unnecessary text snippets in the problem text. For instance, in Example 6.2, the information that "Tom washed cars over the weekend" is irrelevant; he could have performed any activity to earn money. In order to solve the problem, we only need to know that he had $74 last week, and now he has $86.

Example 6.2 Last week Tom had $74. He washed cars over the weekend and now has $86. How much money did he make from the job?

We combine the classifiers' decisions using a constrained inference framework that allows for incorporating world knowledge as constraints. For example, we explicitly incorporate the information that if the problem asks about an "amount", the answer must be positive, and that if the question starts with "how many", the answer will most likely be an integer. Our system is evaluated on two existing datasets of arithmetic word problems, achieving state of the art performance on both.
We also create a new dataset of multi-step arithmetic problems, and show that our system achieves competitive performance in this challenging evaluation setting. The next section describes related work in the area of automated math word problem solving. We then present the theory of expression trees and the decomposition strategy based on it. Section 6.4 presents the overall computational approach, including the way we use quantity schemas to learn the mapping from text to expression tree components. Finally, we discuss our experimental study and conclude.

6.2 Related Work

Work on automated arithmetic problem solvers has focused on restricted subsets of problems. The systems described in [Hosseini et al., 2014, Mitra and Baral, 2016] handle only addition and subtraction problems, and require additional annotated data for verb categories. In contrast, our system does not require any additional annotations and can handle a more general category of problems. The approach in [Roy et al., 2015] supports all four basic operations, and uses a pipeline of classifiers to predict different properties of the problem. However, it makes assumptions on the number of quantities mentioned in the problem text, as well as on the number of arithmetic steps required to solve the problem. In contrast, our system does not have any such restrictions, effectively handling problems mentioning multiple quantities and requiring multiple steps. Kushman's approach to automatically solving algebra word problems [Kushman et al., 2014] is perhaps the most closely related to ours. It tries to map numbers from the problem text to predefined equation templates, implicitly assuming that similar equation forms have been seen in the training data. In contrast, our system can perform competitively even when it has never seen similar expressions in training. There is recent interest in understanding text for the purpose of solving scientific and quantitative problems of various kinds. Our approach is related to work on understanding and solving elementary school standardized tests [Clark, 2015]. The system described in [Berant et al., 2014] attempts to automatically answer biology questions by extracting the structure of biological processes from text. There have also been efforts to solve geometry questions by jointly understanding diagrams and associated text [Seo et al., 2014]. Our constrained inference module falls under the general framework of Constrained Conditional Models (CCMs) [Chang et al., 2012]. In particular, we use the L+I scheme of CCMs, which predicts structured output by independently learning several simple components and combining them at inference time. This scheme has been successfully used to incorporate world knowledge at inference time, as well as to get around the need for large amounts of jointly annotated data for structured prediction [Roth and Yih, 2005, Punyakanok et al., 2005a, Punyakanok et al., 2008, Clarke and Lapata, 2006, Barzilay and Lapata, 2006, Roy et al., 2015].

6.3 Expression Tree and Problem Decomposition

We address the problem of automatically solving arithmetic word problems. The input to our system is the problem text P, which mentions n quantities q1, q2, ..., qn. Our goal is to map this problem to a read-once arithmetic expression E that, when evaluated, provides the problem's solution. We define a read-once arithmetic expression as one that makes use of each quantity at most once.
We say that E is a valid expression if it is such a read-once arithmetic expression; in this work, we only consider problems that can be solved using valid expressions (it is possible that they can also be solved with invalid expressions). An expression tree T for a valid expression E is a binary tree whose leaves represent quantities, and whose internal nodes each represent one of the four basic operations. For a non-leaf node n, we represent the operation associated with it as ⊙(n), and its left and right children as lc(n) and rc(n), respectively. The numeric value of the quantity associated with a leaf node n is denoted as Q(n). Each node n also has a value associated with it, represented as Val(n), which can be computed recursively as follows:

\[
\mathrm{Val}(n) =
\begin{cases}
Q(n) & \text{if } n \text{ is a leaf} \\
\mathrm{Val}(lc(n)) \odot(n)\, \mathrm{Val}(rc(n)) & \text{otherwise}
\end{cases}
\tag{6.1}
\]

For any expression tree T for an expression E with root node n_root, the value Val(n_root) is exactly the numeric value of the expression E. This gives a natural representation of numeric expressions, providing a natural parenthesization of the numeric expression. Fig. 6.1 shows an example of an arithmetic problem, its solution expression, and an expression tree for the solution expression.

[Figure 6.1: An arithmetic word problem (the book case problem of Example 6.1), its solution expression (3 + 5) × 9 = 72, and the corresponding expression tree.]

Definition. An expression tree T for a valid expression E is called monotonic if it satisfies the following conditions:

1. If an addition node is connected to a subtraction node, then the subtraction node is the parent.
2. If a multiplication node is connected to a division node, then the division node is the parent.
3. Two subtraction nodes cannot be connected to each other.
4. Two division nodes cannot be connected to each other.

Fig. 6.2 shows two different expression trees for the same expression; Fig. 6.2b is monotonic, whereas Fig. 6.2a is not.

[Figure 6.2: Two different expression trees for the numeric expression (3 × 5) + 7 − 8 − 9. The right one is monotonic, whereas the left one is not.]

Our decomposition relies on the idea of monotonic expression trees. We try to predict, for each pair of quantities qi, qj, the operation at the lowest common ancestor (LCA) node of the monotonic expression tree for the solution expression. We also predict, for each quantity, whether it is relevant to the solution. Finally, an inference module combines all these predictions. In the rest of the section, we show that for any pair of quantities qi, qj in the solution expression, any monotonic tree for the solution expression has the same LCA operation. Therefore, predicting the LCA operation becomes a multiclass classification problem. The reason we consider the monotonic representation of the expression tree is that different trees could otherwise give different LCA operations for a given pair of quantities. For example, in Fig. 6.2, the LCA operation for the quantities 5 and 8 can be + or −, depending on which tree is considered.
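To make these definitions concrete, here is a small sketch (not the thesis code) of expression tree evaluation following Eq. 6.1, together with a check of the four monotonicity conditions.

import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

class Node:
    def __init__(self, op=None, left=None, right=None, q=None):
        self.op, self.left, self.right, self.q = op, left, right, q

def val(n):
    # Eq. 6.1: a leaf evaluates to its quantity Q(n); an internal node
    # applies its operation to the values of its two children.
    if n.op is None:
        return n.q
    return OPS[n.op](val(n.left), val(n.right))

def monotonic(n):
    # Check the four parent-child conditions of a monotonic tree.
    if n.op is None:
        return True
    for c in (n.left, n.right):
        if c.op is None:
            continue
        if n.op == '+' and c.op == '-':          # condition 1
            return False
        if n.op == '*' and c.op == '/':          # condition 2
            return False
        if n.op == c.op and n.op in ('-', '/'):  # conditions 3 and 4
            return False
    return monotonic(n.left) and monotonic(n.right)

For instance, the monotonic tree of Fig. 6.2b corresponds to Node('-', Node('+', Node('*', Node(q=3), Node(q=5)), Node(q=7)), Node('+', Node(q=8), Node(q=9))); val on it returns 5 and monotonic returns True.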
Definition. We define an addition-subtraction chain of an expression tree to be a maximal connected set of nodes labeled with addition or subtraction. The nodes of an addition-subtraction (AS) chain C represent a set of terms being added or subtracted. These terms are the sub-expressions created by the subtrees rooted at the neighboring nodes of the chain. We call these terms the chain terms of C, and the whole expression, after the node operations have been applied to the chain terms, the chain expression of C. For example, in Fig. 6.2, the shaded nodes form an addition-subtraction chain. The chain expression is (3 × 5) + 7 − 8 − 9, and the chain terms are 3 × 5, 7, 8 and 9. We define a multiplication-division (MD) chain in a similar way.

Theorem 6.3.1. Every valid expression can be represented by a monotonic expression tree.

Proof. The proof is procedural; that is, we provide a method to convert any expression tree into a monotonic expression tree for the same expression. Consider a non-monotonic expression tree, and assume, without loss of generality, that the first condition for monotonicity is violated. Therefore, there exist an addition node ni and a subtraction node nj such that ni is the parent of nj. Consider an addition-subtraction chain C which includes ni and nj. We now replace the nodes of C and its subtrees in the following way. We add a single subtraction node n−. The left subtree of n− has all the addition chain terms connected by addition nodes, and the right subtree of n− has all the subtraction chain terms connected by addition nodes. Both subtrees of n− require only addition nodes; hence the monotonicity condition is satisfied. We can construct the monotonic tree of Fig. 6.2b from the non-monotonic tree of Fig. 6.2a using this procedure. The addition chain terms are 3 × 5 and 7, and the subtraction chain terms are 8 and 9. As described above, we introduce the root subtraction node of Fig. 6.2b and attach the addition chain terms to the left and the subtraction chain terms to the right. The same line of reasoning handles the second condition, with multiplication and division replacing addition and subtraction, respectively.
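The construction used in this proof can be sketched, reusing the Node class from the previous sketch, as follows; this illustrates the procedure and is not the thesis implementation.

def chain_terms(n, sign=+1, terms=None):
    # Collect (sign, subtree) pairs for the AS chain rooted at n; subtrees
    # rooted at non-chain nodes (leaves, × or ÷ nodes) are the chain terms.
    if terms is None:
        terms = []
    if n.op in ('+', '-'):
        chain_terms(n.left, sign, terms)
        chain_terms(n.right, sign if n.op == '+' else -sign, terms)
    else:
        terms.append((sign, n))
    return terms

def sum_tree(nodes):
    # Combine the given subtrees left to right using addition nodes only.
    t = nodes[0]
    for m in nodes[1:]:
        t = Node('+', t, m)
    return t

def monotonize_as_chain(root):
    # Single subtraction node: addition terms left, subtraction terms right.
    terms = chain_terms(root)
    plus = [t for s, t in terms if s > 0]
    minus = [t for s, t in terms if s < 0]
    if not minus:
        return sum_tree(plus)
    return Node('-', sum_tree(plus), sum_tree(minus))

Applied to the tree of Fig. 6.2a, chain_terms collects the signed terms +(3 × 5), +7, −8 and −9, and monotonize_as_chain produces exactly the monotonic tree of Fig. 6.2b.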
Theorem 6.3.2. Consider two valid expression trees T1 and T2 for the same expression E. Let C1, C2 be the chains containing the root nodes of T1 and T2, respectively. Then the chain type (addition-subtraction or multiplication-division) as well as the set of chain terms of C1 and C2 are identical.

Proof. We first prove that the chains containing the roots are both AS chains or both MD chains, and then show that the chain terms are also identical. We prove by contradiction that the chain types are the same. Without loss of generality, let C1's type be addition-subtraction and C2's type be multiplication-division. Since both C1 and C2 generate the same expression E, E can be represented both as a sum (or difference) of two expressions and as a product (or quotient) of two expressions. Transforming a sum (or difference) of expressions into a product (or quotient) requires factoring out common terms from the expressions, which implies that the sum (or difference) contained duplicate quantities. The opposite transformation distributes the same term over several expressions, again leading to multiple uses of the same quantity. Therefore, at least one of C1 and C2 must use the same quantity more than once, violating validity. We now show that the individual chain terms are also identical. Without loss of generality, assume that both C1 and C2 are addition-subtraction chains, and suppose that the chain terms of C1 and C2 are not identical. The chain expressions of both chains must be the same (since they are root chains, the chain expressions have to equal E). Let the chain expression of C1 be $\sum_i t_i - \sum_i t'_i$, where the $t_i$'s are the addition chain terms and the $t'_i$'s are the subtraction chain terms. Similarly, let the chain expression of C2 be $\sum_i s_i - \sum_i s'_i$. We know that $\sum_i t_i - \sum_i t'_i = \sum_i s_i - \sum_i s'_i$, but the set of $t_i$'s and $t'_i$'s is not the same as the set of $s_i$'s and $s'_i$'s. Nevertheless, it must be possible to transform one form into the other using mathematical manipulations. This transformation involves factoring out common terms, or multiplying out terms, or both. As explained above, this forces one of the expressions to use duplicate quantities, violating validity. Hence, the chain terms of C1 and C2 are identical.

Consider an expression tree T for a valid expression E. For a distinct pair of quantities qi, qj participating in the expression E, we denote by ni, nj the leaves of the expression tree T representing qi and qj, respectively. Let nLCA(qi, qj; T) be the lowest common ancestor node of ni and nj. We also define order(qi, qj; T) to be true if ni appears in the left subtree of nLCA(qi, qj; T) and nj appears in the right subtree of nLCA(qi, qj; T), and false otherwise. Finally, we define ⊙LCA(qi, qj; T) for a pair of quantities qi, qj as follows:

\[
\odot_{LCA}(q_i, q_j, T) =
\begin{cases}
+ & \text{if } \odot(n_{LCA}(q_i, q_j; T)) = + \\
\times & \text{if } \odot(n_{LCA}(q_i, q_j; T)) = \times \\
- & \text{if } \odot(n_{LCA}(q_i, q_j; T)) = - \text{ and } \mathrm{order}(q_i, q_j; T) = \text{true} \\
-_{\mathrm{reverse}} & \text{if } \odot(n_{LCA}(q_i, q_j; T)) = - \text{ and } \mathrm{order}(q_i, q_j; T) = \text{false} \\
\div & \text{if } \odot(n_{LCA}(q_i, q_j; T)) = \div \text{ and } \mathrm{order}(q_i, q_j; T) = \text{true} \\
\div_{\mathrm{reverse}} & \text{if } \odot(n_{LCA}(q_i, q_j; T)) = \div \text{ and } \mathrm{order}(q_i, q_j; T) = \text{false}
\end{cases}
\tag{6.2}
\]

Definition. Given two expression trees T1 and T2 for the same expression E, T1 is LCA-equivalent to T2 if for every pair of quantities qi, qj in the expression E, we have ⊙LCA(qi, qj, T1) = ⊙LCA(qi, qj, T2).

Theorem 6.3.3. All monotonic expression trees for an expression are LCA-equivalent to each other.

Proof. We prove this by induction on the number of quantities used in an expression. For any expression E with 2 quantities, there exists only one monotonic expression tree, and hence the statement is trivially true. This establishes our base case. For the inductive case, we assume that the theorem holds for all expressions with k < n quantities, and prove that any expression with n quantities also satisfies the property. Consider a valid expression E with monotonic expression trees T1 and T2. From Theorem 6.3.2, we know that the chains containing the roots of T1 and T2 have identical type and terms. Given two quantities qi, qj of E, their lowest common ancestor will either belong to the root chain in both T1 and T2, or belong to the same chain term in both. If the LCA node is part of the root chain in both T1 and T2, monotonicity ensures that the LCA operation is identical. If the LCA node is part of a chain term (which is an expression tree over fewer than n quantities), the property follows from the induction hypothesis.

The theory just presented shows that it is possible to uniquely decompose the overall problem into simpler steps; this is exploited in the next section.
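For concreteness, the case analysis of Eq. 6.2 can be computed over a Node-based tree as in the following sketch; it assumes quantities appear as distinct leaf values, and the helper names are illustrative.

def path_to(n, q, path=''):
    # Path of 'L'/'R' moves from the root to the leaf carrying quantity q.
    if n.op is None:
        return path if n.q == q else None
    return path_to(n.left, q, path + 'L') or path_to(n.right, q, path + 'R')

def lca_op(root, qi, qj):
    # Walk down the common prefix of the two root-to-leaf paths to reach
    # the lowest common ancestor, then apply the case analysis of Eq. 6.2.
    pi, pj = path_to(root, qi), path_to(root, qj)
    n, k = root, 0
    while pi[k] == pj[k]:
        n = n.left if pi[k] == 'L' else n.right
        k += 1
    if n.op in ('+', '*'):
        return n.op
    order = pi[k] == 'L'            # qi in the left subtree of the LCA
    return n.op if order else n.op + '_reverse'

On the monotonic tree of Fig. 6.2b, lca_op(root, 5, 8) returns '-', matching the discussion above.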
6.4 Mapping Problems to Expression Trees

Given the uniqueness properties proved in Section 6.3, it is sufficient to identify the operation between any two relevant quantities in the text in order to determine the unique valid expression. In fact, identifying the operation between every pair of quantities provides much needed redundancy, given the uncertainty of identifying the operation from text, and we exploit it in our final joint inference. Consequently, our overall method proceeds as follows: given the problem text P, we detect the quantities q1, q2, ..., qn. We then use two classifiers, one to predict relevance and the other to predict the LCA operations of a monotonic expression tree of the solution. Our training makes use of the notion of quantity schemas, which we describe in Section 6.4.2. The distributional output of these classifiers is then used in a joint inference procedure that determines the final expression tree. Our training data consists of problem text paired with a monotonic expression tree for the solution expression, and an alignment of the quantities in the expression to the quantity mentions in the problem text. Both the relevance and the LCA operation classifiers are trained on gold annotations.

6.4.1 Global Inference for Expression Trees

In this subsection, we define the scoring functions corresponding to the decomposed problems, and show how we combine these scores to perform global inference. For a problem P with quantities q1, q2, ..., qn, we define the following scoring functions:

1. Pair(qi, qj, op): scores the likelihood of ⊙LCA(qi, qj, T) = op, where T is a monotonic expression tree of the solution expression of P. A multiclass classifier trained to predict LCA operations (Section 6.4.4) provides these scores.

2. Irr(q): scores the likelihood of quantity q being irrelevant in P, that is, of q not being used to create the solution. A binary classifier trained to predict whether a quantity q is relevant (Section 6.4.3) provides these scores.

For an expression E, let I(E) be the set of all quantities in P which are not used in the expression E, and let T be a monotonic expression tree for E. We define the score Score(E) of an expression E in terms of the above scoring functions and a scaling parameter w_Irr as follows:

\[
\mathrm{Score}(E) = w_{Irr} \sum_{q \in I(E)} \mathrm{Irr}(q) \; + \sum_{q_i, q_j \notin I(E)} \mathrm{Pair}(q_i, q_j, \odot_{LCA}(q_i, q_j, T))
\tag{6.3}
\]

Our final expression tree is the outcome of a constrained optimization process, following [Roth and Yih, 2004, Chang et al., 2012]. Our objective function makes use of the scores returned by Irr(·) and Pair(·) to determine the expression tree, and is constrained by the following legitimacy and background knowledge constraints.

1. Positive Answer: Most arithmetic problems asking for amounts or numbers of objects have a positive number as the answer. Therefore, while searching for the best scoring expression, we reject expressions that generate a negative answer.

2. Integral Answer: Problems with questions such as "how many" usually expect integral solutions. We only consider integral solutions as legitimate outputs for such problems.

Let C be the set of valid expressions that can be formed using the quantities of a problem P and which satisfy the above constraints. The inference problem now becomes:

\[
\arg\max_{E \in C} \mathrm{Score}(E)
\tag{6.4}
\]

The space of possible expressions is large, and we employ a beam search strategy to find the highest scoring constraint-satisfying expression [Chang et al., 2012]. We construct an expression tree bottom-up, first enumerating all possible sets of irrelevant quantities, and then all possible expressions, keeping the top k candidates at each step; a schematic sketch of this search follows the details below.

1. Enumerating Irrelevant Quantities: We generate a state for every possible set of irrelevant quantities, ensuring that each state retains at least two relevant quantities. We refer to each relevant quantity in a state as a term; each state can therefore be represented as a set of terms.

2. Enumerating Expressions: To generate a next state S′ from S, we choose a pair of terms ti and tj in S and one of the four basic operations, and form a new term by combining ti and tj with that operation. Since we do not know which of the possible next states will lead to the optimal goal state, we enumerate all possible next states (that is, all possible pairs of terms and all possible operations), and prune the beam to keep only the top k candidates. We terminate when every state in the beam contains exactly one term.

Once we have a top-k list of candidate expression trees, we choose the highest scoring tree which satisfies the constraints. If no tree in the beam satisfies the constraints, we choose the top candidate in the beam. We use k = 200 in our experiments. To choose the value of w_Irr, we search over the set {10^−6, 10^−4, 10^−2, 1, 10^2, 10^4, 10^6}, and select the setting which gives the highest accuracy on the training data.
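Here is a compact sketch of this beam search; score_state is a hypothetical hook standing in for the combination of the Irr and Pair scores on partial states, and the constraint check on the final candidates is omitted.

from itertools import combinations

OPS = ('+', '-', '*', '/')

def beam_search(quantities, score_state, k=200):
    # A state is a tuple of terms; a term is a quantity index (a leaf) or a
    # nested (op, left_term, right_term) triple.
    beam = []
    idx = range(len(quantities))
    # Step 1: enumerate irrelevant-quantity sets, keeping >= 2 relevant terms.
    for r in range(2, len(quantities) + 1):
        for relevant in combinations(idx, r):
            beam.append(tuple(relevant))
    # Step 2: repeatedly combine pairs of terms, pruning to the top k states.
    while any(len(s) > 1 for s in beam):
        nxt = []
        for s in beam:
            if len(s) == 1:
                nxt.append(s)
                continue
            for i, j in combinations(range(len(s)), 2):
                rest = tuple(t for p, t in enumerate(s) if p not in (i, j))
                for op in OPS:
                    nxt.append(rest + ((op, s[i], s[j]),))
                    nxt.append(rest + ((op, s[j], s[i]),))
        beam = sorted(nxt, key=score_state, reverse=True)[:k]
    return beam  # top-k single-term states, i.e., candidate expression trees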
6.4.2 Quantity Schema

In order to generalize across problem types, as well as across simple manipulations of the text, it is necessary to train our system only with the relevant information from the problem text. E.g., for the problem in Example 6.2, we do not want to make decisions based on how Tom earned money. Therefore, there is a need to extract the relevant information from the problem text. To this end, we introduce the concept of a quantity schema, which we extract for each quantity in the problem's text. Along with the question asked, the quantity schemas provide all the information needed to solve most arithmetic problems. A quantity schema for a quantity q in problem P consists of the following components (a sketch of the resulting data structure follows this list).

1. Associated Verb: For each quantity q, we detect the verb associated with it. We traverse up the dependency tree starting from the quantity mention, and choose the first verb we reach. We use the easy-first dependency parser [Goldberg and Elhadad, 2010].

2. Subject of Associated Verb: We detect the noun phrase which acts as the subject of the associated verb (if one exists).

3. Unit: We use a shallow parser to detect the phrase p in which the quantity q is mentioned. All tokens of the phrase (other than the number itself) are considered unit tokens. Also, if p is followed by the preposition "of" and a noun phrase (according to the shallow parser annotations), we also consider the tokens of this second noun phrase as unit tokens. Finally, if no unit token can be extracted, we assign the unit of the neighboring quantities as the unit of q (following previous work [Hosseini et al., 2014]).

4. Related Noun Phrases: We consider all noun phrases which are connected to the phrase p containing quantity q with an NP-PP-NP attachment. If only one quantity is mentioned in a sentence, we consider all noun phrases in that sentence as related.

5. Rate: We determine whether the quantity q refers to a rate in the text, and extract the two unit components defining the rate. For example, "7 kilometers per hour" has the two components "kilometers" and "hour". Similarly, in sentences describing unit cost, like "Each egg costs 2 dollars", "2" is a rate, with units "dollars" and "egg".
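As a sketch, the extracted components can be collected into a record like the following; the field names are illustrative assumptions, not the thesis implementation.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class QuantitySchema:
    value: float                                     # the quantity itself
    verb: Optional[str] = None                       # associated verb
    subject: Optional[str] = None                    # subject of that verb
    unit: List[str] = field(default_factory=list)    # unit tokens
    related_nps: List[str] = field(default_factory=list)
    rate: Optional[Tuple[str, str]] = None           # (Num Unit, Den Unit)

# E.g., for "Each egg costs 2 dollars":
schema = QuantitySchema(value=2, verb="costs", subject="egg",
                        unit=["dollars"], rate=("dollars", "egg"))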
In addition to extracting a quantity schema for each quantity, we extract the surface form of the text which poses the question. For example, in the question sentence "How much will John have to pay if he wants to buy 7 oranges?", our extractor outputs "How much will John have to pay" as the question.

6.4.3 Relevance Classifier

We train a binary SVM classifier to determine, given the problem text P and a quantity q in it, whether q is needed in the numeric expression generating the solution. We train on gold annotations, and use the score of the classifier as the scoring function Irr(·).

Features. The features are extracted from the quantity schemas and can be broadly categorized into three groups:

1. Unit features: Most questions specifically mention the object whose amount needs to be computed; hence, questions provide a valuable clue as to which quantities may be irrelevant. We add a feature for whether the unit of quantity q is present among the question tokens. We also add a feature based on whether the units of other quantities have better matches with the question tokens (based on the number of tokens matched), and one based on the number of quantities which have the maximum number of matches with the question tokens.

2. Related NP features: Often units are not enough to differentiate between relevant and irrelevant quantities. Consider the following:

Example 6.3
Problem: There are 8 apples in a pile on the desk. Each apple comes in a package of 11. 5 apples are added to the pile. How many apples are there in the pile?
Solution: (8 + 5) = 13

The relevance decision depends on the noun phrase "the pile", which is absent from the second sentence. We add a feature indicating whether a related noun phrase is present in the question, as well as a feature based on whether the related noun phrases of other quantities have a better match with the question. The extraction of related noun phrases is described in Section 6.4.2.

3. Miscellaneous Features: When a problem mentions only two quantities, both of them are usually relevant. Hence, we also add a feature based on the number of quantities mentioned in the text.

We include pairwise conjunctions of the above features.

6.4.4 LCA Operation Classifier

In order to predict LCA operations, we train a multiclass SVM classifier. Given the problem text P and a pair of quantities qi and qj, the classifier predicts one of the six labels described in Eq. 6.2. We use the confidence scores for each label supplied by the classifier as the scoring function Pair(·).

Features. We use the following categories of features:

1. Individual Quantity features: Dependent verbs have been shown to play a significant role in solving addition and subtraction problems [Hosseini et al., 2014]; hence, we add the dependent verb of the quantity as a feature. Multiplication and division problems largely depend on rates described in the text. To capture this, we add a feature based on whether the quantity is a rate, and on whether any component of the rate unit is present in the question. In addition to these quantity schema features, we add selected tokens from the neighborhood of the quantity mention. The neighborhood of a quantity is often highly informative of the LCA operation; for example, in "He got 80 more marbles", the term "more" usually indicates addition. We add as features the adverbs and comparative adjectives mentioned in a window of size 5 around the quantity mention.
2. Quantity Pair features: For a pair (qi, qj), we add features indicating whether the two quantities have the same dependent verb, whether both dependent verbs refer to the same verb mention, and whether the units of qi and qj are the same; if one of them is a rate, we indicate which component of its unit matches the other quantity's unit. Finally, we add a feature indicating whether the value of qi is greater than the value of qj.

3. Question Features: Finally, we add a few features based on the question asked. In particular, for arithmetic problems where only one operation is needed, the question contains signals for the required operation. Specifically, we add indicator features based on whether the question mentions comparison-related tokens (e.g., "more", "less" or "than"), or whether the question asks for a rate (indicated by tokens such as "each" or "one").

We include pairwise conjunctions of the above features. For both classifiers, we use the Illinois-SL package (http://cogcomp.cs.illinois.edu/page/software_view/Illinois-SL) under default settings.

6.5 Experimental Results

In this section, we evaluate the proposed method on publicly available datasets of arithmetic word problems. We evaluate separately the relevance and LCA operation classifiers, and show the contribution of the various features. Lastly, we evaluate the performance of the full system, and quantify the gains achieved by the constraints.

                                    AI2             IL              CC
                                    Relax   Strict  Relax   Strict  Relax   Strict
All features                        88.7    85.1    75.7    75.7    60.0    25.8
No Individual Quantity features     73.6    67.6    52.0    52.0    29.2     0.0
No Quantity Pair features           83.2    79.8    63.6    63.6    49.3    16.5
No Question features                86.8    83.9    73.3    73.3    60.5    28.3

Table 6.1: Performance of the LCA Operation classifier on the AI2, IL and CC datasets.

6.5.1 Datasets

We evaluate our system on three datasets, each comprising a different category of arithmetic word problems.

1. AI2 Dataset: This is a collection of 395 addition and subtraction problems, released by [Hosseini et al., 2014]. They performed a 3-fold cross validation, with every fold containing problems from different sources; this helped them evaluate robustness to domain diversity. We follow the same evaluation setting.

2. IL Dataset: This is a collection of arithmetic problems released by [Roy et al., 2015]. Each of these problems can be solved by performing one operation. However, multiple problems share the same template. To counter this, we perform a few modifications to the dataset (see the clustering sketch after this list). First, for each problem, we replace the numbers and nouns with their part of speech tags, and then we cluster the problems based on the unigrams and bigrams of this modified problem text. In particular, we cluster together problems whose unigram-bigram similarity is over 90%. We next prune each cluster to keep at most 5 problems per cluster. Finally, we create the folds, ensuring that all problems in a cluster are assigned to the same fold, and that each fold has a similar distribution of operations. We obtain a final set of 562 problems, and use 5-fold cross validation to evaluate on this dataset.

3. Commoncore Dataset: In order to test our system's ability to handle multi-step problems, we create a new dataset of multi-step arithmetic problems. The problems were extracted from www.commoncoresheets.com.
In total, there were 600 problems, 100 for each of the following types:

(a) Addition followed by Subtraction
(b) Subtraction followed by Addition
(c) Addition and Multiplication
(d) Addition and Division
(e) Subtraction and Multiplication
(f) Subtraction and Division

This dataset has no irrelevant quantities; therefore, we did not use the relevance classifier in our evaluations on it. In order to test our system's ability to generalize across problem types, we perform a 6-fold cross validation, with each fold containing all the problems of one of the aforementioned categories. This is a more challenging setting than the individual datasets mentioned above, since we evaluate on multi-step problems without ever looking at problems which require the same set of operations.
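For illustration, the near-duplicate clustering used to build the IL folds might be sketched as follows; the POS abstraction and the 90% threshold follow the text above, while the Jaccard-style similarity and the helper names are assumptions.

def ngram_set(tokens):
    # Unigrams and bigrams of the POS-abstracted problem text.
    return set(tokens) | set(zip(tokens, tokens[1:]))

def similarity(a, b):
    # One plausible unigram-bigram similarity: Jaccard overlap of the sets.
    return len(a & b) / max(1, len(a | b))

def cluster_problems(abstracted, threshold=0.9, cap=5):
    # abstracted: token lists with numbers and nouns replaced by POS tags.
    clusters = []
    for toks in abstracted:
        sig = ngram_set(toks)
        for c in clusters:
            if similarity(sig, c[0]) > threshold:
                if len(c[1]) < cap:          # keep at most `cap` per cluster
                    c[1].append(toks)
                break
        else:
            clusters.append((sig, [toks]))
    return [members for _, members in clusters]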
6.5.2 Relevance Classifier

Table 6.2 evaluates the performance of the relevance classifier on the AI2 and IL datasets. We report two accuracy values: Relax, the fraction of quantities which the classifier got correct, and Strict, the fraction of math problems for which all quantities were correctly classified. We report the accuracy using all features, and then removing each feature group, one at a time.

                      AI2             IL
                      Relax   Strict  Relax   Strict
All features          94.7    89.1    95.4    93.2
No Unit features      88.9    71.5    92.8    91.0
No NP features        94.9    89.6    95.0    91.2
No Misc. features     92.0    85.9    93.7    89.8

Table 6.2: Performance of the Relevance classifier on the AI2 and IL datasets.

We see that features related to the units of quantities play the most significant role in determining the relevance of quantities. Also, the related NP features are not helpful for the AI2 dataset.

6.5.3 LCA Operation Classifier

Table 6.1 evaluates the performance of the LCA Operation classifier on the AI2, IL and CC datasets. As before, we report two accuracies: Relax, the fraction of quantity pairs for which the classifier correctly predicted the LCA operation, and Strict, the fraction of math problems for which all quantity pairs were correctly classified. We report the accuracy using all features, and then removing each feature group, one at a time. The strict and relaxed accuracies on the IL dataset are identical, since each problem in the IL dataset requires only one operation. The features related to individual quantities are the most significant; in particular, without the individual quantity features, the strict accuracy on the CC dataset drops to 0.0. The question features are not helpful on the CC dataset. This can be attributed to the fact that all problems in the CC dataset require multiple operations, and questions in multi-step problems usually do not contain information about each of the required operations.

6.5.4 Global Inference Module

Table 6.3 shows the performance of our system in correctly solving arithmetic word problems. We show the impact of the various constraints, and compare against the previously best known results on the AI2 and IL datasets. We also show results using each of the two constraints separately, and using no constraints at all.

                            AI2     IL      CC
All constraints             72.0    73.9    45.2
Positive constraint         78.0    72.5    36.5
Integral constraint         71.8    73.4    39.0
No constraint               77.7    71.9    29.6
[Hosseini et al., 2014]     77.7    -       -
[Roy et al., 2015]          -       52.7    -
[Kushman et al., 2014]      64.0    73.7    2.3

Table 6.3: Accuracy in correctly solving arithmetic problems. The first four rows represent various configurations of our system.

We achieve state of the art results on both the AI2 and IL datasets. The previously known best result on the AI2 dataset is reported in [Hosseini et al., 2014]. Since we follow the exact same evaluation settings, our results are directly comparable. We achieve state of the art results without having access to any additional annotated data, unlike [Hosseini et al., 2014], who use labeled data for verb categorization. For the IL dataset, we acquired the system of [Roy et al., 2015] from the authors, and ran it with the same fold information. We outperform their system by an absolute gain of over 20%. We believe that the improvement is mainly due to the dependence of the system of [Roy et al., 2015] on lexical and quantity-neighborhood features; in contrast, the features from quantity schemas help us generalize across problem types. Finally, we also compare against the template based system of [Kushman et al., 2014]. [Hosseini et al., 2014] mention the result of running the system of [Kushman et al., 2014] on the AI2 dataset, and we report their result here. For the IL and CC datasets, we used the system released by [Kushman et al., 2014]. The integrality constraint is particularly helpful when division is involved, since division can lead to fractional answers. It does not help on the AI2 dataset, which involves only addition and subtraction problems. The role of the constraints becomes more significant for multi-step problems; in particular, they contribute an absolute improvement of over 15% over the unconstrained system on the CC dataset. The template based system of [Kushman et al., 2014] performs on par with our system on the IL dataset. We believe that this is due to the small number of equation templates in the IL dataset. It performs poorly on the CC dataset, since we evaluate on unseen problem types, so there is no guarantee that the equation templates in the test data have been seen in the training data.

6.5.5 Discussion

The leading sources of errors for the classifiers are erroneous quantity schema extraction and a lack of understanding of unknown or rare verbs. For the relevance classifier on the AI2 dataset, 25% of the errors were due to mistakes in extracting the quantity schemas, and 20% could be attributed to rare verbs. For the LCA operation classifier on the same dataset, 16% of the errors were due to unknown verbs and 15% were due to mistakes in extracting the schemas. Erroneous quantity schema extraction is particularly significant for the IL dataset, contributing 57% of the errors for the relevance classifier and 39% of the errors for the LCA operation classifier. For the operation classifier on the CC dataset, 8% of the errors were due to verbs and 16% were due to faulty quantity schema extraction. Quantity schema extraction is challenging due to parsing issues, as well as some non-standard rate patterns, and improving it is one of our targets for future work. For example, in the sentence "How many 4-dollar toys can he buy?", we fail to extract the rate component of the quantity 4.

6.6 Conclusion

In this chapter, we present a novel method for understanding and solving a general class of arithmetic word problems. Our approach can solve all problems whose solution can be expressed by a read-once arithmetic expression, in which each quantity from the problem text appears at most once. We develop a novel theoretical framework, centered around the notion of monotonic expression trees, and show how this representation can be used to obtain a unique decomposition of the problem.
This theory naturally leads to a computational solution that we have shown to uniquely determine the solution: determining the arithmetic operation between any two quantities identified in the text. The theory underlies our algorithmic solution: we develop classifiers and a constrained inference approach that exploits the redundancy in the information, and show that this yields strong performance on several benchmark collections. In particular, our approach achieves state of the art performance on two publicly available arithmetic problem datasets, and supports natural generalization for quantitative reasoning over multiple quantities in text. Specifically, our approach performs competitively on multi-step problems, even when it has never observed the particular problem type before. The datasets used in this work are available for download at http://cogcomp.cs.illinois.edu/page/resource_view/98.

Chapter 7 Unit Dependency Graphs

7.1 Introduction

Most statistical word problem solvers so far (including the one described in the last chapter) have been entirely data-driven: they depend on training examples to learn all the concepts needed for solving math word problems. However, quantitative reasoning often requires domain knowledge which is not easy to learn from training data. Dimensional analysis is an example of this kind of domain knowledge. Applying the knowledge of dimensional analysis to the units of quantities can inform several quantitative inferences, as exemplified below. Consider the arithmetic word problem in Example 7.1. The units of "66" and "10" are both "flower", which indicates that they can be added or subtracted. Although the unit of "8" is also "flower", it is associated with a rate, indicating the number of flowers in each bouquet; as a result, "8" effectively has the unit "flowers per bouquet". Detecting such rate units and applying dimensional analysis helps us understand that "8" will more likely be multiplied or divided to arrive at the solution. Finally, the question asks for the number of "bouquets", indicating that some expression will likely be divided by "8", and not multiplied by it. Knowing such interactions can help a solver understand the situation and perform better quantitative reasoning. In addition, given that unit extraction is a noisy process, this kind of global reasoning can also make extraction more robust. This is exactly the focus of this chapter: integrating domain knowledge about dimensional analysis into statistical methods for word problem solving.

Example 7.1 Isabel picked 66 flowers for her friend's wedding. She was making bouquets with 8 flowers in each one. If 10 of the flowers wilted before the wedding, how many bouquets could she still make?

We introduce the concept of a unit dependency graph (UDG) for math word problems, to represent the relationships among the units of the different numbers and the question being asked. We also introduce a strategy to extract annotations for unit dependency graphs with minimal additional annotation. In particular, we use the answers to math problems, along with rate annotations for a few selected problems, to generate complete annotations for unit dependency graphs. Finally, we develop a decomposed model to predict the UDG of an input math word problem. We augment the arithmetic word problem solver of [Roy and Roth, 2015] to predict a unit dependency graph along with the solution expression of the input arithmetic word problem.
Forcing the solver to respect the dependencies of the unit dependency graph enables us to improve unit extraction, as well as to leverage domain knowledge about unit dependencies in math reasoning. The introduction of unit dependency graphs reduced the error of the solver by over 10%, while also making it more robust to reductions in the lexical and template overlap of the dataset.

7.2 Unit Dependency Graph

We first introduce the idea of a generalized rate and its unit representation. We define a rate to be any quantity which is a measure corresponding to one unit of some other quantity. This includes explicit rates like "40 miles per hour", as well as implicit rates like the one in "Each student has 3 books". Consequently, the units of rate quantities take the form "A per B", where A and B refer to different entities. We refer to A as the Num Unit (short for numerator unit), and to B as the Den Unit (short for denominator unit). Table 7.1 shows examples of Num and Den Units for various rate mentions.

Mention                      Num Unit   Den Unit
40 miles per hour            mile       hour
Each student has 3 books.    book       student

Table 7.1: Units of rate quantities

A unit dependency graph (UDG) of a math word problem is a graph representing the relations among the quantity units and the question asked. Fig. 7.1 shows an example of a math word problem and its unit dependency graph. For each quantity mentioned in the problem, there is a vertex in the unit dependency graph. In addition, there is a vertex representing the question asked. Therefore, if a math problem mentions n quantities, its unit dependency graph has n + 1 vertices. In the example of Fig. 7.1, there is one vertex for each of the quantities 66, 8 and 10, and one vertex representing the question part "how many bouquets could she still make?".

[Figure 7.1: The arithmetic word problem of Example 7.1, its UDG, and a tree representation of the solution (66 − 10)/8. Several dependencies exist between the UDG and the final solution of a problem: here, "66" and "10" are connected via a Same Unit edge, hence they can be added or subtracted, and "8" is connected by a Den Unit edge to the question, indicating that some expression will be divided by "8" to obtain the answer's unit.]

A vertex representing a number is labeled Rate if the corresponding quantity describes a rate relationship (according to the aforementioned definition). In Fig. 7.1, "8" is labeled as a Rate, since it indicates the number of flowers in each bouquet. Similarly, the vertex corresponding to the question is marked Rate if the question asks for a rate. The edges of a UDG can be directed as well as undirected. Each undirected edge has the label Same Unit, indicating that the connected vertices have the same unit. Each directed edge going from a vertex u to a vertex v has one of the following labels:

1. Num Unit: Valid only for directed edges whose source vertex u is labeled Rate; indicates that the Num Unit of u matches the unit of the destination vertex v.

2. Den Unit: Valid only for directed edges whose source vertex is labeled Rate; indicates that the Den Unit of the source vertex u matches the unit of the destination vertex v.

If no edge exists between a pair of vertices, their units are unrelated.
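As a sketch, the UDG of Fig. 7.1 can be represented with a record like the following; the representation, and in particular reading two Num Unit edges out of "8" from the figure, are illustrative assumptions rather than the thesis implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class UDG:
    vertices: List[str]                  # n quantity vertices + the question
    rate: Dict[str, bool] = field(default_factory=dict)
    # Directed edges map (source, destination) to a label; the undirected
    # Same Unit edges are stored once and treated symmetrically.
    edges: Dict[Tuple[str, str], str] = field(default_factory=dict)

udg = UDG(vertices=["66", "8", "10", "question"])
udg.rate = {"66": False, "8": True, "10": False, "question": False}
udg.edges = {("66", "10"): "Same Unit",
             ("8", "66"): "Num Unit",       # Num Unit of "8" is "flower"
             ("8", "10"): "Num Unit",
             ("8", "question"): "Den Unit"} # Den Unit of "8" is "bouquet"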
Several dependencies exist between the vertex and edge labels of the unit dependency graph of a problem and its solution expression. Section 7.4.4 discusses these dependencies and how they can be leveraged to improve math problem solving.

7.3 Learning to Predict UDGs

Predicting the UDG of a math word problem is essentially a structured prediction problem. However, since we have limited training data, we develop a decomposed model which predicts parts of the structure independently, and then performs joint inference to enforce coherent predictions. This has been shown to be an effective method for structured prediction in the presence of limited data [Punyakanok et al., 2005b, Sutton and McCallum, 2007]. Empirically, we found our decomposed model to be superior to jointly trained alternatives (see Section 7.5). Our decomposed model for UDG prediction uses the following two classifiers.

1. Vertex Classifier: a binary classifier which takes a vertex of the UDG as input, and decides whether it denotes a rate.

2. Edge Classifier: a multiclass classifier which takes as input a pair of nodes of the UDG, and predicts the properties of the edge connecting those nodes.

Finally, a constrained inference module combines the outputs of the two classifiers to construct a UDG. We provide details of these components in the following subsections.

7.3.1 Vertex Classifier

In order to detect rate quantities, we train a binary classifier. Given the problem text P and a vertex v of the UDG, the classifier predicts whether v represents a rate, choosing one of two labels: Rate or Not Rate. The vertex v corresponds either to a quantity mentioned in P or to the question of P. The features used for the classification are as follows:

1. Context Features: We add unigrams, bigrams, part of speech tags, and their conjunctions from the neighborhood of v.

2. Rule-based Extraction Features: We add a feature indicating whether a rule-based approach detects v as a rate.

7.3.2 Edge Classifier

We train a multiclass classifier to determine the properties of the edges of the UDG. Given the problem text P and a pair of vertices vi and vj (i < j), the classifier predicts one of six labels:

1. Same Unit: indicates that vi and vj should be connected by an undirected edge labeled Same Unit.

2. No Relation: indicates that no edge exists between vi and vj.

3. Rate→Num: indicates that vi is a rate, and that the Num Unit of vi matches the unit of vj.

4. Rate←Num: indicates that vj is a rate, and that the Num Unit of vj matches the unit of vi.

The remaining two labels, Rate→Den and Rate←Den, are defined analogously, with the Den Unit in place of the Num Unit.

The features used for the classification are:

1. Context Features: For each vertex v in the query, we add the context features described for the Vertex classifier.

2. Rule-based Extraction Features: We add features indicating whether each of the queried vertices is detected as a rate by the rule-based system. In addition, we add features denoting whether there are common tokens in the units of vi and vj.

7.3.3 Constrained Inference

Our constrained inference module takes the scores of the Vertex and Edge classifiers, and combines them to find the most probable unit dependency graph for a problem. We define Vertex(v, l) to be the score predicted by the Vertex classifier for labeling vertex v of a UDG with label l, where l ∈ {Rate, Not Rate}. Similarly, we define Edge(vi, vj, l) to be the score predicted by the Edge classifier for assigning label l to the edge between vi and vj, where l is one of the six labels defined for the edge classifier.
Let G be a UDG with vertex set V. We define the score of G as follows:

\[
\mathrm{Score}(G) = \sum_{\substack{v \in V \\ \mathrm{Label}(G, v) = \mathrm{Rate}}} \mathrm{Vertex}(v, \mathrm{Rate}) \; + \; \lambda \times \sum_{\substack{v_i, v_j \in V \\ i < j}} \mathrm{Edge}(v_i, v_j, \mathrm{Label}(G, v_i, v_j))
\]

where Label(G, v) denotes the label of vertex v in G, Label(G, vi, vj) denotes the label of the edge between vi and vj, and λ is a scaling parameter.
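A sketch of this scoring function, reusing the UDG record from the earlier sketch and hypothetical classifier hooks vertex_score(v, l) and edge_score(vi, vj, l), could look as follows.

from itertools import combinations

def score_udg(udg, vertex_score, edge_score, lam=1.0):
    # Sum the vertex scores of Rate-labeled vertices, plus lambda times the
    # edge scores of the labels assigned in the graph (No Relation if absent).
    s = sum(vertex_score(v, "Rate") for v in udg.vertices if udg.rate.get(v))
    for vi, vj in combinations(udg.vertices, 2):
        label = udg.edges.get((vi, vj), udg.edges.get((vj, vi), "No Relation"))
        s += lam * edge_score(vi, vj, label)
    return s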