Representation Learning for Grounded Spatial Reasoning

Michael Janner, Karthik Narasimhan, and Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
{janner, karthikn, regina}@csail.mit.edu

Abstract

The interpretation of spatial references is highly contextual, requiring joint inference over both language and the environment. We consider the task of spatial reasoning in a simulated environment, where an agent can act and receive rewards. The proposed model learns a representation of the world steered by instruction text. This design allows for precise alignment of local neighborhoods with corresponding verbalizations, while also handling global references in the instructions. We train our model with reinforcement learning using a variant of generalized value iteration. The model outperforms state-of-the-art approaches on several metrics, yielding a 45% reduction in goal localization error.1

1 Introduction

Understanding spatial references in natural language is essential for successful human-robot communication and autonomous navigation. This problem is challenging because the interpretation of spatial references is highly context-dependent. For instance, the instruction "Reach the cell above the westernmost rock" translates into different goal locations in the two environments shown in Figure 1. Therefore, to enable generalization to new, unseen worlds, the model must jointly reason over the instruction text and the environment configuration. Moreover, the richness and flexibility in verbalizing spatial references further complicates the interpretation of such instructions.

1 Code and dataset are available at https://github.com/JannerM/spatial-reasoning

Figure 1: Sample 2D worlds and an instruction describing a goal location ("Reach the cell above the westernmost rock"). The optimal path from a common start position, denoted by a white dashed line, varies considerably with changes in the map layout.

In this paper, we explore the problem of spatial reasoning in the context of interactive worlds. Specifically, we assume access to a simulated environment in which an agent can take actions to interact with the world and is rewarded for reaching the location specified by the language instruction. This feedback is the only source of supervision the model uses for interpreting spatial references.

The key modeling task here is to induce a representation that closely ties environment observations and linguistic expressions. In prior work, this issue was addressed by learning representations for each modality and then combining them, for instance, with concatenation (Misra et al., 2017). While this approach captures high-level correspondences between instructions and maps, it does not encode detailed, lower-level mappings between specific positions on the map and their descriptions. As our experiments demonstrate, combining the language and environment representations in a spatially localized manner yields significant performance gains on the task. To this end, our model uses the instruction text to drive the learning of the environment representation.
We start by converting the instruction text into a real-valued vector using a recurrent neural network with LSTM cells (Hochreiter and Schmidhuber, 1997). Using this vector as a kernel in a convolution operation, we obtain an instruction-conditioned representation of the state. This allows the model to reason about immediate local neighborhoods in references such as "two cells to the left of the triangle". We further augment this design to handle global references that involve information concerning the entire map (e.g., "the westernmost rock"). This is achieved by predicting a global value map using an additional component of the instruction representation. The entire model is trained with reinforcement learning using the environmental reward signal as feedback.

We conducted our experiments using a 2D virtual world as shown in Figure 1. Overall, we created over 3,300 tasks across 200 maps, with instructions sourced from Mechanical Turk. We compare our model against two state-of-the-art systems adapted for our task (Misra et al., 2017; Schaul et al., 2015). The key findings of our experiments are threefold. First, our model interprets instructions more precisely than the baseline models and finds the goal location, yielding a 45% reduction in Manhattan distance error over the closest competitor. Second, the model generalizes robustly across new, unseen map layouts. Finally, we demonstrate that factorizing the instruction representation enables the model to sustain high performance when handling both local and global references.

2 Related Work

Spatial reasoning in text This topic has attracted both theoretical and practical interest. From the linguistic and cognitive perspectives, research has focused on the wide range of mechanisms that speakers use to express spatial relations (Tenbrink, 2007; Viethen and Dale, 2008; Byrne and Johnson-Laird, 1989; Li and Gleitman, 2002). The practical implications of this research are related to autonomous navigation (Moratz and Tenbrink, 2006; Levit and Roy, 2007; Tellex et al., 2011) and human-robot interaction (Skubic et al., 2004).

Previous computational approaches include techniques such as proximity fields (Kelleher et al., 2006), spatial templates (Levit and Roy, 2007), and geometrically defined mappings (Moratz and Tenbrink, 2006; Kollar et al., 2010). More recent work in robotics has integrated text containing position information with spatial models of the environment to obtain accurate maps for navigation (Walter et al., 2013; Hemachandra et al., 2014). Most of these approaches assume access to detailed geometry or other forms of domain knowledge. In contrast to these knowledge-rich approaches, we learn spatial reference via interaction with the environment, acquiring knowledge of the environment in the process.

Instruction following Spatial reasoning is a common element in many papers on instruction following (MacMahon et al., 2006; Vogel and Jurafsky, 2010; Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Kim and Mooney, 2013; Andreas and Klein, 2015). As a source of supervision, these methods assume access to demonstrations, which specify the path corresponding to the provided instructions. In our setup, the agent is driven only by the final reward received when the goal is achieved. This weaker source of supervision motivates the development of new techniques not considered in prior work.
More recently, Misra et al. (2017) proposed a neural architecture for jointly mapping instructions and visual observations (pixels) to actions in the environment. Their model separately induces text and environment representations, which are concatenated into a single vector that is used to output an action policy. While this representation captures coarse correspondences between the modalities, it does not encode mappings at the level of local neighborhoods, negatively impacting performance on our task.

Universal value functions The idea of generalized value functions was explored by Schaul et al. (2015). Their technique, termed UVFA, factorizes the value function over states and goals using singular value decomposition (SVD) and then learns a regression model to predict the low-rank vectors. This results in quick and effective generalization to all goals in the same state space. However, their work stops short of exploring generalization over map layouts, which our model is designed to handle. Furthermore, our setup specifies goals using natural language instructions rather than the coordinate-style specification used in that work.

3 General Framework

Task setup We model our task as a Markov Decision Process (MDP), in which an autonomous agent is placed in an interactive environment and chooses actions that affect the world. A goal is described in text, and rewards are provided to the agent accordingly. The MDP is represented by the tuple ⟨S, A, X, T, R⟩, where S is the set of all possible state configurations, A is the set of actions available to the agent, X is the set of all goal specifications2 in natural language, T(s′|s,a,x) is the transition distribution, and R(s,x) is the reward function. A state s ∈ S includes information such as the locations of different entities along with the agent's own position. In this work, T is deterministic in the environments considered; however, our methods also apply in the stochastic case.

2 We will use the terms goal specifications and instructions interchangeably.

Text instructions Prior work has investigated how humans use different types of referring expressions to describe spatial relations (Levinson, 2003; Viethen and Dale, 2008). In order to build a robust instruction-following system, we examine several categories of spatial expressions that exhibit the wide range of natural language goal descriptions. Specifically, we consider instructions that use objects/entities present in the environment to describe a goal location. These instructions can be categorized into three groups:

(a) Text referring to a specific entity (e.g., "Go to the circle").

(b) Text specifying a location using a single referent entity (e.g., "Reach the cell above the westernmost rock").

(c) Text specifying a location using multiple referent entities (e.g., "Move to the goal two squares to the left of the heart and top right of the house").

These three categories exemplify an increasing level of complexity, with the last having multiple levels of indirection. In each category, we have both local and global references to objects. Local references require an understanding of spatial prepositional phrases such as 'above', 'in between', and 'next to' in order to determine the precise goal location. This comprehension is invariant to the global position of the object landmark(s) provided in the instruction.
A global reference, on the other hand, contains superlatives such as 'easternmost' and 'topmost', which require reasoning over the entire map. For example, in the case of (a) above, a local reference would describe a unique object3 (e.g., "Go to the circle"), whereas a global reference might require comparing the positions of all objects of a specific type (e.g., "Go to the northernmost tree").

Note that we do not assume any access to a mapping from instructions to objects or entities in the worlds, or to a spatial ontology; the system has to learn these entirely through feedback from the environment.

3 A local reference for a non-unique object would be ambiguous, of course.

Figure 2: A schematic depiction of our model. Text instructions are represented as a vector h(x) and states as embeddings φ(s). A portion of the text representation is used as a convolutional kernel on φ(s), giving a text-conditioned local state representation z1. The remaining components are used as coefficients in a linear combination of gradient functions to give a global map-level representation z2. z1 and z2 are concatenated and input to a convolutional neural network to predict the final value map.

Generalized Value Iteration Learning to reach the goal while maximizing cumulative reward can be done using a value function V(s) (Sutton and Barto, 1998), which represents the agent's expected future reward from state s. A popular algorithm for learning an optimal value function is Value Iteration (VI) (Bellman, 1957), which uses dynamic programming. In the standard Bellman equation, the value function depends solely on the state. Schaul et al. (2015) proposed a value function V(s,g) describing the expected reward from being in state s given goal g, capturing the fact that state values are goal-dependent and that a single environment can offer many such goals. We also make use of such a generalized value function, although our goals are not observed directly as coordinate locations or states themselves but are instead described in natural language. With x denoting a textual description of a goal, our VI update equations are:

Q(s,a,x) = R(s,x) + γ ∑_{s′∈S} T(s′|s,a,x) V(s′,x)    (1)
V(s,x) = max_a Q(s,a,x)

where Q is the action-value function, tracking the value of choosing action a in state s. Once an optimal value function is learned, a straightforward action policy is:

π(s,x) = argmax_a Q(s,a,x)    (2)
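To make the goal-conditioned value iteration update above concrete, here is a minimal NumPy sketch for the deterministic grid worlds used in this paper. The array-based reward input and the function name are illustrative assumptions, not the interface of the released code. A procedure of this kind is also used to construct the ground-truth value maps for the supervised comparison in Section 5.

```python
import numpy as np

def value_iteration(reward, gamma=0.95, iters=100):
    """Goal-conditioned value iteration (Eq. 1) on a deterministic grid.

    reward: H x W array giving R(s, x) for a fixed instruction x
            (e.g., 3 at the annotated goal, -1 at puddle cells, 0 elsewhere).
    Returns a value map with one entry V(s, x) per cell s.
    Terminal-state handling at the goal is omitted for brevity.
    """
    H, W = reward.shape
    V = np.zeros((H, W))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    for _ in range(iters):
        V_new = np.empty_like(V)
        for i in range(H):
            for j in range(W):
                q = []
                for di, dj in moves:
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < H and 0 <= nj < W):
                        ni, nj = i, j            # deterministic T: invalid moves stay put
                    q.append(reward[i, j] + gamma * V[ni, nj])   # Q(s,a,x), Eq. 1
                V_new[i, j] = max(q)                             # V(s,x) = max_a Q(s,a,x)
        V = V_new
    return V
```

Given such a value map, the greedy policy of Equation 2 simply moves toward the neighboring cell with the highest value.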
4 Model

Generalization over both environment configurations and text instructions requires a model that meets two desiderata. First, it must have a flexible representation of goals, one which can encode both the local structure and the global spatial attributes inherent to natural language instructions. Second, it must be compositional: the representation of language should generalize even though each unique instruction will only be observed with a single map during training. Namely, the learned representation for a given instruction should still be useful even if the objects on a map are rearranged or the layout is changed entirely.

To that end, our model combines the textual instructions with the map in a spatially localized manner, as opposed to prior work which joins goal representations and environment observations via simpler functions such as an inner product (Schaul et al., 2015). While our approach can more effectively learn local relations specified by language, it cannot naturally capture descriptions at the global environment level. To address this problem, we also use the language representation to predict coefficients for a basis set of gradient functions which can be combined to encode global spatial relations.

More formally, inputs to our model (see Figure 2) consist of an environment observation s and a textual description of a goal x. For simplicity, we assume s to be a 2D matrix, although the model can easily be extended to other input representations. We first convert s to a 3D tensor by projecting each cell to a low-dimensional embedding (φ) as a function of the objects contained in that cell. In parallel, the text instruction x is passed through an LSTM recurrent neural network (Hochreiter and Schmidhuber, 1997) to obtain a continuous vector representation h(x). This vector is then split into local and global components h(x) = [h1(x); h2(x)]. The local component, h2(x), is reshaped into a kernel used to perform a convolution operation on the state embedding φ(s) (similar to Chen et al. (2015)):

z1 = ψ1(φ(s); h2(x))    (3)

Meanwhile, the three-element global component h1(x) is used to form the coefficients for a vertical and a horizontal gradient, along with a corresponding bias term.4 The gradients, denoted G1 and G2 in Figure 2, are matrices of the same dimensionality as the state observation with values increasing down the rows and along the columns, respectively. The axis-aligned gradients are weighted by the elements of h1(x) and summed to give a final global gradient spanning the entire 2D space, analogous to how steerable filters can be constructed for any orientation using a small set of basis filters (Freeman and Adelson, 1991):

z2 = h1(x)[1]·G1 + h1(x)[2]·G2 + h1(x)[3]·J    (4)

in which J is the all-ones matrix, also of the same dimensionality as the observed map.

4 Note that we are referring to gradient filters here, not the gradient calculated during backpropagation in deep learning.

Finally, the local and global information maps are concatenated into a single tensor, which is then processed by a convolutional neural network (CNN) with parameters θ to approximate the generalized value function

V̂(s,x) = ψ2([z1; z2]; θ)    (5)

for every state s in the map.

Reinforcement Learning Given our model's V̂(s,x) predictions, the resulting policy (Equation 2) can be enacted, giving a continuous trajectory of states {s_t, s_{t+1}, ...} on a single map and their associated rewards {r_t, r_{t+1}, ...} at each timestep t. We store entire trajectories (as opposed to state transition pairs) in a replay memory D, as described in Mnih et al. (2015). The model is trained to produce an accurate value estimate by minimizing the following objective:

L(Θ) = E_{s∼D} [ V̂(s,x; Θ) − ( R(s,x) + γ max_a ∑_{s′} T(s′|s,a) V̂(s′,x; Θ−) ) ]²    (6)

where s is a state sampled from D, γ is the discount factor, Θ is the set of parameters of the entire model, and Θ− is the set of parameters of a target network copied periodically from our model. The complete training procedure is shown in Algorithm 1.

Algorithm 1 Training Procedure
1: Initialize experience memory D
2: Initialize model parameters Θ
3: for epoch = 1, M do
4:   Sample instruction x ∈ X and associated environment E
5:   Predict value map V̂(s,x; Θ) for all s ∈ E
6:   Choose start state s0 randomly
7:   for t = 1, N do
8:     Select a_t = argmax_a ∑_s T(s|s_{t−1},a) V̂(s,x; Θ)
9:     Observe next state s_t and reward r_t
10:  Store trajectory (s = s0, s1, ..., r = r0, r1, ...) in D
11:  for j = 1, J do
12:    Sample random trajectory (s, r) from D
13:    Perform gradient descent step on loss L(Θ)
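The loss of Equation 6 (Algorithm 1, line 13) can be written compactly as in the following minimal PyTorch-style sketch. The interfaces here are assumptions for illustration rather than the released implementation: model and target_model map a (state, instruction) pair to a scalar value estimate, and transition(s, a) returns the deterministic successor state.

```python
import torch

def td_loss(model, target_model, trajectory, instruction, transition, gamma=0.95):
    """One evaluation of the objective in Eq. 6 for a stored trajectory.

    trajectory: list of (state, reward, actions) tuples, where reward is R(s, x)
    and actions are the moves available in that state (assumed format).
    """
    loss = 0.0
    for state, reward, actions in trajectory:
        with torch.no_grad():                          # target network Θ⁻ is held fixed
            backup = max(
                target_model(transition(state, a), instruction) for a in actions
            )
        target = reward + gamma * backup               # R(s,x) + γ max_a V̂(s', x; Θ⁻)
        loss = loss + (model(state, instruction) - target) ** 2
    return loss / len(trajectory)
```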
5 Experimental Setup

Puddle world navigation data In order to study generalization across a wide variety of environmental conditions and linguistic inputs, we develop an extension of the puddle world reinforcement learning benchmark (Sutton, 1996; Mankowitz et al., 2016). States in a 10×10 grid are first filled with either grass or water cells, such that the grass forms one connected component. We then populate the grass region with six unique objects which appear only once per map (triangle, star, diamond, circle, heart, and spade) and four non-unique objects (rock, tree, horse, and house) which can appear any number of times on a given map. See Figure 1 for an example visualization.

Split   Local   Global
Train   1566    1071
Test    399     272

Table 1: Overall statistics of our dataset.

Goal positions are chosen uniformly at random from the set of grass cells, encouraging the use of spatial references to describe goal locations which do not themselves contain a unique object. We used the Mechanical Turk crowdsourcing platform (Buhrmester et al., 2011) to collect natural language descriptions of these goals. Human annotators were asked to describe the positions of these goals using surrounding objects. At the end of each trial, we asked the same participants to provide goal locations given their own text instructions. This helped filter out the majority of instructions that were ambiguous or ill-specified. Table 1 provides statistics on the data, and Figure 3 shows example annotations. In total, we collected 3308 instructions, ranging from 2 to 43 words in length, describing over 200 maps. There are 361 unique words in the annotated instructions. We do not perform any preprocessing on the raw annotations.

• Reach the horse below the rock and to the left of the green diamond
• Move to the square two below and one left of the star
• Go to the cell above the bottommost horse
Figure 3: Example goal annotations collected with Mechanical Turk.

It is plausible that a model designed to handle only local references could not handle global ones (consider our own model without the global gradient maps). For clearer interpretation of results, we evaluate our model in two modes: trained and tested on local and global data separately, or on a combined dataset. While local instructions were obtained easily, the global instructions were collected by designing a task in which only non-unique objects were presented to the annotators.5 This precluded simple instructions like "go left of the ⟨object⟩" because there would always be more than one of each object type. Therefore, we obtained text with global properties (e.g., middle rock, leftmost tree) sufficient to pinpoint an object. On average, we collected 31 unique local instructions and 10 unique global instructions per map.

5 The other objects were added back into the map after collecting the instruction.

To quantify the diversity of our dataset, we find the five nearest instructions in the training set for every instruction in the test set, as measured by edit distance (using the word as a unit) normalized by test instruction length. For each of these pairs, we also measure the Manhattan distance between their corresponding goal locations. Figure 4, which visualizes this analysis, underscores the difficulty of this task: even when two instructions are highly similar, they can correspond to entirely different target locations. This is the case in the example in Figure 1, which has a distance of four between the referenced goals.

Figure 4: A heatmap showing the normalized instruction edit distance and goal Manhattan distance corresponding to the most similar instructions between the train and test set. For each instruction in the test set, we find the five most similar instructions in the training set. Even for those instructions which are similar, the goal locations they describe can be far apart.
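A minimal sketch of the diversity measurement above, assuming each instruction is stored as a list of word tokens together with a (row, col) goal coordinate; the helper names are illustrative and do not come from the released code.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[-1]

def nearest_train_instructions(test_item, train_set, k=5):
    """For one test instruction, return the k nearest training instructions
    (by normalized edit distance) paired with the Manhattan distance between goals."""
    tokens, goal = test_item
    scored = []
    for train_tokens, train_goal in train_set:
        dist = edit_distance(tokens, train_tokens) / len(tokens)   # normalize by test length
        manhattan = abs(goal[0] - train_goal[0]) + abs(goal[1] - train_goal[1])
        scored.append((dist, manhattan))
    return sorted(scored)[:k]
```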
Baselines We compare our model to the following baselines:

UVFA (text) is a variant of the model described in Schaul et al. (2015), adapted for our task. The original model uses two MLPs to learn low-dimensional embeddings of states and goals, which are then combined via a dot product to give value estimates. Goals were represented either as (x,y) coordinates or as states themselves. As our goals are not observed directly but described in text, we replace the goal MLP with the same LSTM as in our model. The state MLP has an architecture identical to that of the UVFA: two hidden layers of dimension 128 with ReLU activations. For consistency with the UVFA, we represent states as binary vectors denoting the presence of each type of object at every position.

CNN + LSTM is a variant of the model described in Misra et al. (2017), who developed it for a language-grounded block manipulation task. It first convolves the map layout to a low-dimensional representation (as opposed to the MLP of the UVFA) and concatenates this with the LSTM's instruction embedding (as opposed to a dot product). These concatenated representations are then input to a two-layer MLP.

Figure 5: Reward achieved by our model and the two baselines on the training environments during reinforcement learning on both local and global instructions. Each epoch corresponds to simulation on 500 goals, with a goal simulation terminating either when the agent reaches the goal state or has taken 75 actions.

We also perform an analysis to study the representational power of our model, introducing two more comparison models:

UVFA (pos) is the original UVFA model from Schaul et al. (2015), which we evaluate on our modified puddle worlds to determine the difficulty of environment generalization independently of instruction interpretation.

Our model (w/o gradient) is an ablation of our model without the global gradient maps, which allows us to determine the gradients' role in representation building.

In addition to our reinforcement learning experiments, we train these models in a supervised setting to isolate the effects of architecture choices from other concerns inherent to reinforcement learning algorithms. For this purpose, we constructed a dataset of ground-truth value maps for all human-annotated goals using value iteration. We use the models to predict value maps for the entire grid and minimize the mean squared error (MSE) relative to the ground-truth values:

L′(Θ) = ∑_{(s,x)} [V̂(s,x; Θ) − V̄(s,x)]²    (7)

5.1 Implementation details

Our model implementation uses an LSTM with a learnable 15-dimensional embedding layer, 30 hidden units, 8-dimensional embeddings φ(s), and a 3×3 kernel applied to the embeddings, giving a dimension of 72 for h2(x). The final CNN has layers of {3, 6, 12, 6, 3, 1} channels, all with 3×3 kernels and padding of length 1 such that the output value map prediction is equal in size to the input observation.
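For concreteness, the following PyTorch sketch instantiates these sizes end to end. Several details are assumptions rather than specification from the paper: the linear projection from the 30-dimensional LSTM state to the 3 global coefficients plus the 72-dimensional (1×8×3×3) kernel, the padding on the instruction-conditioned convolution, the [0,1] scaling of the gradient basis, reading {3, 6, 12, 6, 3, 1} as the output widths of six convolutional layers, and cells simplified to hold a single object id.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialReasoningNet(nn.Module):
    """Sketch of the value-map model (Eqs. 3-5) with the Section 5.1 sizes."""

    def __init__(self, vocab_size, num_cell_types, grid=10):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, 15)      # 15-dim word embeddings
        self.lstm = nn.LSTM(15, 30, batch_first=True)        # 30 hidden units
        self.cell_embed = nn.Embedding(num_cell_types, 8)    # 8-dim cell embeddings φ(s)
        self.to_h = nn.Linear(30, 3 + 72)                    # [h1(x); h2(x)] (assumed projection)
        # Final CNN with {3, 6, 12, 6, 3, 1} output channels, 3x3 kernels, padding 1
        chans = [2, 3, 6, 12, 6, 3, 1]
        layers = []
        for i in range(6):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*layers[:-1])                # no ReLU after the last layer
        # Axis-aligned gradient basis G1, G2 and all-ones map J (Eq. 4)
        rows = torch.linspace(0, 1, grid).view(grid, 1).expand(grid, grid)
        cols = torch.linspace(0, 1, grid).view(1, grid).expand(grid, grid)
        self.register_buffer("basis", torch.stack([rows, cols, torch.ones(grid, grid)]))

    def forward(self, instruction, cell_ids):
        # instruction: (1, T) word ids; cell_ids: (grid, grid) object ids
        _, (h, _) = self.lstm(self.word_embed(instruction))
        h1, h2 = self.to_h(h[-1, 0]).split([3, 72])
        phi = self.cell_embed(cell_ids).permute(2, 0, 1).unsqueeze(0)   # (1, 8, H, W)
        kernel = h2.view(1, 8, 3, 3)                                    # text-derived kernel
        z1 = F.conv2d(phi, kernel, padding=1)                           # Eq. 3 (padding assumed)
        z2 = (h1.view(3, 1, 1) * self.basis).sum(0, keepdim=True).unsqueeze(0)  # Eq. 4
        return self.cnn(torch.cat([z1, z2], dim=1)).squeeze()           # Eq. 5: V̂(s, x)
```

A value map for a new (instruction, map) pair is obtained with a single forward pass; the greedy policy then moves toward the neighboring cell with the highest predicted value, as in Algorithm 1.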
For each map, a reward of 3 is given for reaching the correct goal specified by the human annotation, and a reward of −1 is given for falling into a puddle cell. The only terminal state is the agent being at the goal. Rewards are discounted by a factor of 0.95. We use Adam optimization (Kingma and Ba, 2015) for training all models.

6 Results

We present empirical results on two different datasets: our annotated puddle world and an existing block navigation task (Bisk et al., 2016).

6.1 Puddle world navigation

Comparison with the state of the art We first investigate the ability of our model to learn solely from environment simulation. Figure 5 shows the discounted reward achieved by our model as well as the two baselines for both instruction types. In both experiments, our model is the only one of the three to achieve an average nonnegative reward after convergence (0.88 for local instructions and 0.49 for global instructions), signifying that the baselines do not fully learn how to navigate through these environments.

Following Schaul et al. (2015), we also evaluated our model using the metric of policy quality. This is defined as the expected discounted reward achieved by following a softmax policy over the value predictions, normalized such that an optimal policy has a score of 1 and a uniform random policy has a score of 0. Intuitively, policy quality is the true normalized expectation of score over all maps in the dataset, instructions per map, and start states per map-instruction pair. Our model outperforms both baselines on this metric on the test maps as well (Table 2). We also note that the relative performance of the two baselines flips compared to their performance on the training maps (Figure 5): while the UVFA variant learned a better policy on the train set, it did not generalize to new environments as well as the CNN + LSTM.

              Local                     Global                    Combined
              Policy Quality  Distance  Policy Quality  Distance  Policy Quality  Distance
UVFA (text)   0.56            4.71      0.62            6.28      0.61            5.06
CNN + LSTM    0.80            5.73      0.82            6.13      0.82            5.67
Our model     0.87            2.18      0.90            3.35      0.89            3.04

Table 2: Performance of models trained via reinforcement learning on a held-out set of environments and instructions. Policy quality is the true expected normalized reward, and distance denotes the Manhattan distance from the predicted goal location to the true goal position. We show results from training on the local and global instructions both separately and jointly.

Finally, given the nature of our environments, we can use the predicted value maps to infer a goal location by taking the position of the maximum value. We use the Manhattan distance from this predicted position to the actual goal location as a third metric.
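A minimal sketch of this goal-localization metric, assuming the predicted value map is a NumPy array and goals are (row, col) coordinates; the function name is illustrative.

```python
import numpy as np

def goal_distance(value_map, true_goal):
    """Manhattan distance between the argmax of a predicted value map
    and the annotated goal position."""
    pred = np.unravel_index(np.argmax(value_map), value_map.shape)  # predicted goal cell
    return abs(pred[0] - true_goal[0]) + abs(pred[1] - true_goal[1])
```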
The accuracy of our model's goal predictions is more than twice that of the baselines on local references and roughly 45% better on global references.

Analysis of learned representations For the representation analysis in a supervised setting, we compared the predicted value maps of all models on the unseen test split of maps. Table 3 shows the results of this study.

                          Local                            Global
                          MSE   Policy Quality  Distance   MSE   Policy Quality  Distance
UVFA (pos)                1.30  0.48            4.96       1.04  0.52            5.39
UVFA (text)               3.23  0.57            4.97       1.90  0.62            5.31
CNN + LSTM                0.42  0.86            4.08       0.43  0.83            4.18
Our model (w/o gradient)  0.25  0.94            2.39       0.61  0.87            5.15
Our model                 0.25  0.94            2.34       0.41  0.89            3.81

Table 3: Performance on a test set of environments and instructions after supervised training. Lower is better for MSE and Manhattan distance; higher is better for policy quality. The gradient basis significantly improves the reconstruction error and goal localization of our model on global instructions and, as expected, does not affect its performance on local instructions.

As expected, our model without the global gradient performs no differently from the full model on local references, but has higher MSE and average distance to the true goal than the full model on global references. We also note that UVFA (pos) performs much worse than both CNN + LSTM and our model, showing the difficulty of environment generalization even when the goals are observed directly. (The original UVFA paper (Schaul et al., 2015) demonstrated effective generalization over goal states within a single environment.)

Surprisingly, our model trained via reinforcement learning makes more precise goal location predictions (as measured by Manhattan distance) than when trained on true state values in a supervised manner. However, the MSE of the value predictions is much higher in the RL setting (e.g., 0.80 vs. 0.25 for supervised training on local instructions). This shows that, despite the comparative stability of the supervised setting, minimizing value prediction error does not necessarily yield the best policy or goal localization. Conversely, a higher MSE does not always imply a worse policy, as seen in the performance of the two UVFA variants in Table 3.

Figure 6: Value map predictions for two environments paired with two instructions each. Despite the difference in instructions, with one being global and the other local in nature and sharing no objects in their descriptions, they refer to the same goal location in the environment in (a). However, in (b), the descriptions correspond to different locations on the map. The vertical axis considers variance in goal location for the same instruction, depending on the map configuration.

Generalization One of the criteria laid out for our model was its ability to construct language representations and produce accurate value maps independent of layouts and linguistic variation. Figure 6 provides examples of two layouts, each with two different instructions. In the first map (top), both instructions refer to the same location. Our model mimics the optimal value map accurately, while the baselines are not as precise, either producing a large field of possible goal locations (CNN + LSTM) or completely missing the goal (UVFA-text). On the vertical axis, we observe generalization across different maps with the same instruction. Our model precisely identifies the goals in each scenario in spite of significant variation in their locations. This proves harder for the other representations.

Figure 7: Examples of failure cases for our model. Multiple levels of indirection in (a) and a long instruction filled with redundant information in (b) make the instruction difficult to interpret. Intended goal locations are outlined in red for clarity.

Although our model is compositional in the sense that it transfers knowledge of spatial references between different environments, some types of instructions do prove challenging. We identify two of the poorest predictions in Figure 7.
We see that multiple levels of indirection (as in 7a, which references a location relative to an object that is itself located relative to another object) or unnecessarily long instructions (as in 7b, which uniquely identifies a position by the eighth token but then continues with redundant information) are still a challenge.

Learning curve Due to the manual effort that comes with constructing a dataset of human annotations, it is also important to consider the sample efficiency of a model. Figure 8 shows the policy quality and prediction error on local instructions as a function of training set size. Our model reaches 0.90 policy quality with only 400 samples, demonstrating efficient generalization.

Figure 8: Effect of training set size on held-out predictions. The curves show the mean of ten training runs and the shaded regions show the standard deviation. Our model's policy quality is greater than 0.90 with as few as 400 training goal annotations.

              Policy Quality  Distance
UVFA (text)   0.15            4.61
CNN + LSTM    0.17            4.11
Our model     0.74            3.94

Table 4: Performance of our model and the two baselines on the ISI Language Grounding dataset (Bisk et al., 2016). Our model once again outperforms the baselines, although all models have a lower policy quality on this dataset than on our own.

6.2 ISI Grounding Dataset

We also evaluate our model on the ISI Language Grounding dataset (Bisk et al., 2016), which contains human-annotated instructions describing how to arrange blocks identified by numbers and logos. Although it does not contain variable environment maps as in our dataset, it has a larger action space and vocabulary. The caveat is that the task as posed in the original dataset is not compatible with our model: for a policy to be derived from a value map with the same dimensions as the state observation, it is implicitly assumed that there is a single controllable agent, whereas the ISI set allows multiple blocks to be moved. We therefore modify the ISI setup using an oracle to determine which block is given agency during each step. This allows us to retain the linguistic variability of the dataset while overcoming the mismatch in task setup. The states are discretized to a 13×13 map and the instructions are lemmatized.

Figure 9: (a-c) Visualizations of tasks from the ISI Language Grounding dataset (Bisk et al., 2016) and our model's value map predictions. The agentive block and goal location are outlined in red for visibility. (d) The MSE of the value map prediction as a function of a subgoal's ordering in the overall task. The model performs better on subgoals later in a task despite the subgoals being treated completely independently during both training and testing.

Performance on the modified ISI dataset is reported in Table 4, and representative visualizations are shown in Figure 9. Our model outperforms both baselines by a greater margin in policy quality than on our own dataset. Misra et al. (2017) also use this dataset and report results in part by determining the minimum distance between an agent and a goal during an evaluation lasting N steps. This evaluation metric is therefore dependent on the timeout parameter N.
Because we discretized the state space so as to represent it as a grid of embeddings, the notion of a single step has changed, and a direct comparison limited to N steps is ill-defined.6 Hence, due to these modifications in the task setup, we cannot compare directly to the results in Misra et al. (2017).

6 When a model is available and the states are not overwhelmingly high-dimensional, policy quality is a useful metric that is independent of this type of parameter. As such, it is our default metric here. However, estimating policy quality for environments substantially larger than those investigated here is a challenge in itself.

Understanding grounding evaluation An interesting finding of our analysis is that the difficulty of the language interpretation task is a function of the stage of task execution (Figure 9(d)). In the ISI Language Grounding set (Bisk et al., 2016), each individual instruction (describing where to move a particular block) is a subgoal in a larger task (such as constructing a circle with all of the blocks). The value maps predicted for subgoals occurring later in a task are more accurate than those occurring early in the task. It is likely that the language plays a less crucial role in specifying the subgoal position in the final steps of a task. As shown in Figure 9(a), it may be possible to narrow down candidate subgoal positions just by looking at a nearly constructed high-level shape. In contrast, this would not be possible early in a task because most of the blocks will be randomly positioned. This finding is consistent with a result from Branavan et al. (2011), who reported that strategy game manuals were useful early in the game but became less essential further into play. It appears to be part of a larger trend in which the marginal benefit of language in such grounding tasks varies predictably between individual instructions.

7 Conclusions

We have described a novel approach for grounded spatial reasoning. Combining the language and environment representations in a spatially localized manner allows for increased precision of goal identification and improved performance on unseen environment configurations. Alongside our models, we present Puddle World Navigation, a new grounding dataset for testing the generalization capacity of instruction-following algorithms in varied environments.

Acknowledgements

We thank the members of the MIT NLP group, the TACL reviewers, and the action editor for helpful feedback. We gratefully acknowledge support from the MIT Lincoln Laboratory and the MIT SuperUROP program.

References

Jacob Andreas and Dan Klein. 2015. Alignment-based compositional semantics for instruction following. In EMNLP.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL, 1(1):49–62.

Richard Bellman. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition.

Yonatan Bisk, Deniz Yuret, and Daniel Marcu. 2016. Natural language communication with robots. In NAACL HLT, pages 751–761.

S.R.K. Branavan, David Silver, and Regina Barzilay. 2011. Learning to win by reading manuals in a Monte-Carlo framework. In ACL HLT, pages 268–277.

Michael Buhrmester, Tracy Kwang, and Samuel D. Gosling. 2011. Amazon's Mechanical Turk. Perspectives on Psychological Science, 6(1):3–5. PMID: 26162106.

Ruth M.J. Byrne and Philip N. Johnson-Laird. 1989. Spatial reasoning. Journal of Memory and Language, 28(5):564–575.
David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), pages 859–865, San Francisco, CA, USA, August.

Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2015. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960.

William T. Freeman and Edward H. Adelson. 1991. The design and use of steerable filters. IEEE TPAMI, 13(9):891–906.

Sachithra Hemachandra, Matthew R. Walter, Stefanie Tellex, and Seth Teller. 2014. Learning spatial-semantic representations from natural language descriptions and scene classifications. In ICRA, pages 2623–2630. IEEE.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

John D. Kelleher, Geert-Jan M. Kruijff, and Fintan J. Costello. 2006. Proximity in context: An empirically grounded computational model of proximity for processing topological spatial expressions. In ACL, pages 745–752.

Joohyun Kim and Raymond J. Mooney. 2013. Adapting discriminative reranking to grounded language learning. In ACL (1), pages 218–227.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Thomas Kollar, Stefanie Tellex, Deb Roy, and Nicholas Roy. 2010. Toward understanding natural language directions. In Human-Robot Interaction, pages 259–266. IEEE.

Stephen C. Levinson. 2003. Space in Language and Cognition: Explorations in Cognitive Diversity, volume 5. Cambridge University Press.

Michael Levit and Deb Roy. 2007. Interpretation of spatial language in a map navigation task. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(3):667–679.

Peggy Li and Lila Gleitman. 2002. Turning the tables: Language and spatial reasoning. Cognition, 83(3):265–294.

Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: Connecting language, knowledge, and action in route instructions. AAAI, 2(6):4.

Daniel J. Mankowitz, Timothy Arthur Mann, and Shie Mannor. 2016. Iterative hierarchical optimization for misspecified problems (IHOMP). CoRR, abs/1602.03348.

Dipendra K. Misra, John Langford, and Yoav Artzi. 2017. Mapping instructions and visual observations to actions with reinforcement learning. In EMNLP.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Reinhard Moratz and Thora Tenbrink. 2006. Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations. Spatial Cognition and Computation, 6(1):63–107.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. 2015. Universal value function approximators. In ICML, pages 1312–1320.

M. Skubic, D. Perzanowski, S. Blisard, A. Schultz, W. Adams, M. Bugajska, and D. Brock. 2004. Spatial language for human-robot dialogs. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 34(2):154–167, May.

Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning. MIT Press.
Richard S. Sutton. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In NIPS.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth J. Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI.

Thora Tenbrink. 2007. Space, Time, and the Use of Language: An Investigation of Relationships, volume 36. Walter de Gruyter.

Jette Viethen and Robert Dale. 2008. The use of spatial relations in referring expression generation. In Proceedings of the Fifth International Natural Language Generation Conference, pages 59–67. ACL.

Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In ACL, pages 806–814.

Matthew R. Walter, Sachithra Hemachandra, Bianca Homberg, Stefanie Tellex, and Seth Teller. 2013. Learning semantic maps from natural language descriptions. In Proceedings of the 2013 Robotics: Science and Systems IX Conference.