title: Learning Space Partitions for Path Planning
authors: Yang, Kevin; Zhang, Tianjun; Cummins, Chris; Cui, Brandon; Steiner, Benoit; Wang, Linnan; Gonzalez, Joseph E.; Klein, Dan; Tian, Yuandong
date: 2021-06-19

Path planning, the problem of efficiently discovering high-reward trajectories, often requires optimizing a high-dimensional and multimodal reward function. Popular approaches like CEM and CMA-ES greedily focus on promising regions of the search space and may get trapped in local maxima. DOO and VOOT balance exploration and exploitation, but use space partitioning strategies independent of the reward function to be optimized. Recently, LaMCTS empirically learns to partition the search space in a reward-sensitive manner for black-box optimization. In this paper, we develop a novel formal regret analysis for when and why such an adaptive region partitioning scheme works. We also propose a new path planning method, LaP3, which improves the function value estimation within each sub-region and uses a latent representation of the search space. Empirically, LaP3 outperforms existing path planning methods in 2D navigation tasks, especially in the presence of difficult-to-escape local optima, and shows benefits when plugged into the planning components of model-based RL such as PETS. These gains transfer to highly multimodal real-world tasks, where we outperform strong baselines in compiler phase ordering by up to 39% on average across 9 tasks, and in molecular design by up to 0.4 on properties measured on a 0-1 scale. Code is available at https://github.com/yangkevin2/neurips2021-lap3.

Optimistic planning methods such as DOO and VOOT select which region to expand via an upper confidence bound (UCB). While such exploration-exploitation procedures adaptively focus on promising sub-regions and lead to sub-linear regret and optimality guarantees, their region partitioning procedure is manually designed and remains non-adaptive. For example, DOO partitions the space with uniform axis-aligned grids and VOOT with Voronoi cells, both independent of the reward f to be optimized. Recently, Wang et al. proposed LaNAS [46] and LaMCTS [45], which adaptively partition the search regions based on sampled function values and focus on good regions. They achieve strong empirical performance on Neural Architecture Search (NAS) and black-box optimization, outperforming many existing methods including evolutionary algorithms and BO. Notably, in the recent NeurIPS'20 black-box optimization challenge, two teams that used variants of LaMCTS ranked 3rd [38] and 8th [23].

In this paper, we provide a simple theoretical analysis of LaMCTS to reveal the underlying principles of adaptive region partitioning, an analysis missing in the original work. Based on this analysis, we propose Latent Space Partitions for Path Planning (LaP3), a novel optimization technique for path planning. Unlike LaMCTS, LaP3 uses a latent representation of the search space. Additionally, we use the maximum (instead of the mean) as the node score to improve sample efficiency, verified empirically in Sec. 5.3. Both changes are motivated by our theoretical analysis. We verify LaP3 on several challenging path-planning tasks, including 2D navigation environments from past work with difficult-to-escape local optima, and real-world planning problems in compiler optimization and molecular design.
In all tasks, LaP3 demonstrates substantially stronger ability to explore and escape from local optima compared to several baselines including CEM, CMA-ES, and VOOT. On compiler phase ordering, we achieve on average 39% and 31% speedup in execution cycles compared to -O3 optimization and OpenTuner [1], two widely used optimization techniques in compilers. On molecular design, LaP3 outperforms all of our baselines in generating molecules with high values of desirable properties, beating the best baseline in average property value by up to 0.4 on properties in a [0, 1] range. Additionally, extensive ablation studies show factors that affect the quality of planning and verify the theoretical analysis. LaP3 is a general planning technique and can be readily plugged into existing algorithms with path planning components. For example, we apply LaP3 to PETS [7] in model-based RL and observe substantially improved performance for high-dimensional continuous control and navigation, compared to CEM as used in the original PETS framework.

LaMCTS [45] was recently proposed to solve black-box optimization problems x* = arg max_x f(x) via recursively learning f-dependent region partitions. Fig. 1 and Alg. 1 show the details of LaMCTS as well as our proposed approach LaP3 (formally introduced in Sec. 4) for comparison.

Algorithm 1 (excerpt):
  7:  if t divides N_par then
  8:      Train/fine-tune latent model h(·) using samples S_t ∪ D (Eqn. ??).
  9:      Re-learn region partition V_t ← Partition(Ω, S_t, N_thres, s(·)) in latent space Φ_s of s(·).
  10: end if
  11: for k := root, k ∉ V_leaf do

Figure 1: (a) LaMCTS first draws a few samples x ∈ Ω, then learns to partition Ω into a sub-region Ω_1 with good samples (high f(x)) and a sub-region Ω_2 with bad samples (low f(x)). Compared to LaMCTS, LaP3 uses a latent space and reduces the dimensionality of the search space. (b) Sampling follows the learned recursive space partition, focusing on good regions while still exploring bad regions using UCB. LaP3 uses the maximum of the sampled values in a region (max_{x_i ∈ Ω} f(x_i)) as the node value, while LaMCTS uses the mean. (c) Upon reaching a leaf, new data points are sampled within the region and the space partition is relearned.

LaMCTS starts with N_init random samples of the entire search space Ω (line 5 in Alg. 1). For a region Ω_k, let n(Ω_k) be the number of samples within it. LaMCTS dictates that if n(Ω_k) ≥ N_thres, then Ω_k is partitioned into disjoint sub-regions Ω_k = Ω_good ∪ Ω_bad as its children (Fig. 1(a)-(b); line 9 in Alg. 1, the function Partition). Intuitively, Ω_good contains promising samples with high f, while Ω_bad contains samples with low f. Unlike DOO and VOOT, such a partition is learned using S_t ∩ Ω_k, our samples so far in the region, and is thus dependent on the function f to be optimized. Given tree-structured sub-regions, new samples are mostly drawn from promising regions and occasionally from other regions for exploration. This is achieved by Monte Carlo Tree Search (MCTS) [4] (lines 11-13): at each tree branching, the UCB score b is computed to balance exploration and exploitation (line 12), and the sub-region with the highest UCB score is selected (e.g., because it has high f and/or low n). This is done recursively until a leaf sub-region Ω is reached. Then a new sample x is drawn from Ω (line 15), either uniformly or from a local model constructed by an existing optimizer (e.g., TuRBO [10], CMA-ES [16]), in which case LaMCTS becomes a meta-algorithm.
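The selection step can be made concrete with a minimal sketch (ours, not the released implementation); the exact UCB constant follows common MCTS practice and is an assumption:

```python
import numpy as np

class Node:
    """A node in the region tree; holds the f-values of samples inside its region."""
    def __init__(self, values):
        self.values = list(values)   # f-values of samples in this region
        self.children = []           # [good_child, bad_child] after a split

def ucb_score(child, n_parent, cp):
    """UCB with LaP3's node value: the max sampled f in the region.
    LaMCTS would instead use np.mean(child.values) as the exploit term."""
    exploit = max(child.values)
    explore = 2 * cp * np.sqrt(2 * np.log(n_parent) / len(child.values))
    return exploit + explore

def select_leaf(root, cp=1.0):
    """Descend from the root, at each branching following the child with the
    highest UCB score, until reaching a leaf region to sample from."""
    node = root
    while node.children:
        n_parent = len(node.values)
        node = max(node.children, key=lambda c: ucb_score(c, n_parent, cp))
    return node
```

On each iteration, one would call select_leaf, draw a new x inside the returned region (uniformly, or via CMA-ES/TuRBO constrained to the SVM boundaries along the path), evaluate f, and periodically re-learn the partition.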
When more samples are collected, regions are further partitioned and the tree gets deeper. Finally, the function Partition in Alg. 1 is defined as follows: first, a 2-class K-means on (x, f(x)) is used to create positive/negative sample groups. Next, an SVM classifier is used to learn the decision boundary (hence the partition), so that samples with high f(x) fall into Ω_good and samples with low f(x) fall into Ω_bad (Fig. 1(a)). See Appendix A for the pseudocode. The partition boundary can also be re-learned after more samples are collected (line 9).

While LaMCTS [45] shows strong empirical performance, it contains several components with no clear theoretical justification. Here we attempt to give a formal regret analysis for the case where the sub-regions {Ω_k} are fixed and all at the same tree level, and the function f is deterministic. We leave further analysis of tree node splitting and the evolution of the hierarchical structure to future work. Despite the drastic simplification, our regret bound still shows why an f-dependent region partition is helpful. By showing that a better regret bound can be achieved by a clever region partition as empirically used in the Partition function in Alg. 1, we justify the design of LaMCTS. Furthermore, our analysis suggests several empirical improvements over LaMCTS and motivates the design of LaP3, which outperforms multiple classic approaches on hard path planning problems.

We consider the following setting. Suppose we have K d-dimensional regions {Ω_k}_{k=1}^K, and n_t(Ω_k) is the visitation count at iteration t. The global optimum x* resides in some unknown region Ω_{k*}. At each iteration t, we visit a region Ω_k, sample (uniformly or otherwise) a data point x_t ∈ Ω_k, and retrieve its deterministic function value f_t = f(x_t). In each region Ω_k, define x*_k := arg max_{x ∈ Ω_k} f(x) and the maximal value g*(Ω_k) = f(x*_k). The maximal value obtained so far at iteration t is g_t(Ω_k) = max_{t' ≤ t, x_{t'} ∈ Ω_k} f(x_{t'}). It is clear that g_t ≤ g* and g_t → g* as t → +∞.

Figure 2: Theoretical understanding of space partitioning. (a) Definition of a (z_k, c_k)-diluted region Ω_k (Def. 1). (b) Partition of region Ω_k into good region Ω_{k1} and bad region Ω_{k2}; the optimal solution x* ∈ Ω_{k1}. (c) After space partitioning, F_k is split into F_{k1} and F_{k2}. The good region F_{k1} has a much smaller c_{k1}, while the bad region has a much larger best-to-optimality gap ∆_{k2}. As a result, the expected total regret decreases.

We define the confidence bound r_t = r_t(Ω_k) so that with high probability, the following holds:

    g*(Ω_k) ≤ g_t(Ω_k) + r_t(Ω_k).    (1)

At iteration t, we pick region k_t to sample based on the upper confidence bound: k_t = arg max_k g_t(Ω_k) + r_t(Ω_k). Many different confidence bounds can be applied; for convenience in this analysis, we use the "ground truth" bound from the cumulative density function (CDF) of f within the region Ω_k (see Appendix B for all proofs):

Lemma 1. Let F_k(y) := P_{x ∼ Ω_k}[f(x) ≤ g*(Ω_k) − y] be a strictly decreasing function, and let r_t(Ω_k) := F_k^{-1}(δ^{1/n_t(Ω_k)}). Then Eqn. 1 holds with probability 1 − δ.

Here F_k^{-1} is the inverse function of F_k, and the randomness arises from sampling within Ω_k. Since F_k is a strictly decreasing function, F_k^{-1} exists and is also strictly decreasing. By definition, F_k ∈ [0, 1], F_k(0) = 1 and F_k^{-1}(1) = 0. We then define the dilution of each region as follows:

Definition 1 ((z_k, c_k)-diluted region). A region Ω_k is (z_k, c_k)-diluted if

    F_k^{-1}(u) ≤ c_k (1 − u)^{1/d}   for all u ∈ [z_k, 1],

where z_k is the smallest F_k(y) that makes the inequality hold.
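To make Lemma 1 and Definition 1 concrete, consider a small worked example (ours, not from the paper) with d = 1:

```latex
% f(x) = x on \Omega_k = [0,1], so g^*(\Omega_k) = 1 and, under uniform sampling,
% F_k(y) = \Pr[f(x) \le 1 - y] = 1 - y for y \in [0,1], hence F_k^{-1}(u) = 1 - u.
% Lemma 1's confidence radius after n_t samples:
r_t(\Omega_k) = F_k^{-1}\!\left(\delta^{1/n_t}\right) = 1 - \delta^{1/n_t}
              \le \frac{\ln(1/\delta)}{n_t},
% using \delta^{1/n_t} \ge 1 + \ln(\delta)/n_t. So the best sample seen is within
% O(\ln(1/\delta)/n_t) of the optimum with probability 1 - \delta. Moreover,
% F_k^{-1}(u) = 1 - u = c_k (1-u)^{1/d} with c_k = 1 and z_k = 0, so this region
% is (0, 1)-diluted.
```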
The intuition behind dilution, as depicted in Fig. 2(a), is that all but a z_k fraction of the region has function value close to the maximum, with "close" defined based on c_k (smaller c_k implies a stricter definition of "close"). Obviously, if Ω_k is (z_k, c_k)-diluted then it is (z'_k, c'_k)-diluted for any c'_k ≥ c_k and z'_k ≥ z_k. Therefore, we often look for the smallest z_k and c_k that satisfy the condition. If a region Ω_k has small c_k and z_k, we say it is highly concentrated. For example, if f(x) is mostly constant within a region, then c_k is very small since F_k(y) drops to 0 very quickly. In such a case, most of the region's function values are concentrated near the maximum, making it easier to optimize. While the definition of concentration may be abstract, we show it is implied by Lipschitz continuity:

Corollary 1. If a region Ω_k is L_k-Lipschitz continuous, i.e., |f(x) − f(x')| ≤ L_k ||x − x'||_2, and there exists an ε_0-ball B(x*_k, ε_0) ⊆ Ω_k, then with uniform sampling, Ω_k is (z_k, c_k)-diluted with

    c_k = L_k Ṽ_k^{1/d},   z_k = 1 − ε_0^d / Ṽ_k.

Here Ṽ_k := V_k / V_0 is the relative volume with respect to the unit-sphere volume V_0. Typically, a smoother function (with small L_k) and a large ε_0 yield a less diluted (and more concentrated) region. However, the concept of dilution (Def. 1) is much broader. For example, if we shuffle function values within Ω_k, Lipschitz continuity is likely to break but Def. 1 still holds.

Now we bound the total regret. Let R_t(a_t) := f* − g_t(Ω_{a_t}) ≥ 0 be the regret of picking Ω_{a_t} and R(T) := Σ_{t=1}^T R_t(a_t) be the total regret, where T is the total number of samples (queries to f). Define the gap of each region ∆_k := f* − g*(Ω_k), and split the region indices into K_good := {k : ∆_k ≤ ∆_0} and K_bad := {k : ∆_k > ∆_0} by a threshold ∆_0. Treating each region Ω_k as an arm and applying a regret analysis similar to multi-armed bandits [41], we obtain the following theorem:

Theorem 1 (informal). Let C_good and C_bad be the ℓ^d-norms of {c_k} over the good and bad regions respectively, M the maximal gap between the highest and lowest function values, and δ = η/T^3 ≥ max_k z_k. Then the expected total regret satisfies

    E[R(T)] ≤ O(C_good · T^{(d−1)/d} · ln^{1/d}(1/δ)) + O(M (C_bad/∆_0)^d · ln(1/δ)) + M K η / T.

The effect of space partitioning. Reducing {c_k} results in a smaller regret R(T). Thus, if we can partition Ω_k into two sub-regions Ω_{k1} and Ω_{k2} such that the good partition Ω_{k1} has smaller c_{k1} < c_k and the bad partition Ω_{k2} has larger ∆_{k2} > ∆_0 and falls into K_bad, then we can improve the regret bound (Fig. 2(b)-(c)). This coincides with the Partition function of LaMCTS very well: it samples a few points in Ω_k and trains a classifier to separate high f from low f. On the other hand, if we partition a region Ω_k randomly, e.g., each f(x) is assigned to either Ω_{k1} or Ω_{k2} at random, then statistically F_{k1} = F_{k2} = F_k and c_{k1} = c_{k2} = c_k, which increases the regret bound. Therefore, the partition needs to be informed by data that have already been sampled within the region Ω_k.

Recursive region partitioning. In Theorem 1, we assume all regions {Ω_k} have fixed c_k and z_k, so the bound breaks for large enough T (as η/T^3 eventually becomes smaller than any fixed z_k). However, as LaMCTS conducts further internal partitioning within Ω_k, its c_k and z_k keep shrinking as more samples T are collected. If each split leads to slightly fewer bad f values (i.e., a lighter "tail"), with the ratio being γ < 1, then by the definition of the CDF, z_k is the probability mass of the tail and thus z_k ∼ γ^{T/N_par}. This would yield z_k ≤ η/T^3 for all T, since γ^T decays faster than 1/T^3, and Theorem 1 would hold for all T. See Appendix F.2 for empirical verification of the decaying z_k. While related to Lipschitz bandits [28] and coarse-to-fine deterministic function optimization like DOO and SOO [32], our analysis is fundamentally different.
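The following toy simulation (our illustration, under the simplified setting of the analysis: fixed regions, deterministic f, uniform within-region sampling) contrasts a split that isolates a narrow peak with an arbitrary split that leaves the peak inside a wide, diluted region. The quantile-based plug-in for r_t is an assumption standing in for the "ground truth" bound of Lemma 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Deterministic 1-D objective on [0, 1]: a narrow global peak at x = 0.9."""
    return float(np.exp(-200.0 * (x - 0.9) ** 2))

F_STAR = f(0.9)  # global optimum value

def plugin_radius(vals, delta):
    """Plug-in estimate of r_t = F^{-1}(delta^(1/n)) from the empirical gap CDF."""
    g, n = vals.max(), len(vals)
    gaps = g - vals                                   # empirical version of g* - f(x)
    return float(np.quantile(gaps, 1.0 - delta ** (1.0 / n)))

def ucb_over_regions(regions, T=400, delta=1e-3):
    """Pick the region maximizing g_t + r_t, sample uniformly inside, track regret."""
    xs = [list(rng.uniform(lo, hi, size=2)) for lo, hi in regions]  # warm-up samples
    regret = 0.0
    for _ in range(T):
        vals = [np.array([f(x) for x in s]) for s in xs]
        scores = [v.max() + plugin_radius(v, delta) for v in vals]
        k = int(np.argmax(scores))
        lo, hi = regions[k]
        xs[k].append(float(rng.uniform(lo, hi)))
        regret += F_STAR - max(f(x) for x in xs[k])   # R_t = f* - g_t(region picked)
    return regret

# A split that isolates the peak (less diluted good region) vs. an arbitrary split
# that leaves the peak inside a wide region; the former typically incurs less regret.
print("informed:", ucb_over_regions([(0.0, 0.8), (0.8, 1.0)]))
print("random:  ", ucb_over_regions([(0.0, 0.45), (0.45, 1.0)]))
```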
We have discussed how f-dependent region partitioning and a data-driven learning procedure affect the regret bound, which to our knowledge has not been previously addressed. See Appendix B.5 for further remarks on Theorem 1. There is more work to be done to fully understand how LaMCTS works. In particular, we did not analyze when to split a node (e.g., how many samples we need to collect before making a decision) or the effect of relearning the space partition. We also have not considered stochastic reward functions, where the maximum function value in the sub-region may no longer be the best metric of goodness. We leave these to future work.

Based on our analysis, we propose LaP3, which extends LaMCTS to path planning, a problem with temporal structure. LaP3 outperforms baseline path planning approaches in both continuous and discrete path planning problems. Here we represent trajectories as action sequences x = (a_0, a_1, ..., a_{n−1}) and treat them as high-dimensional vectors x in the trajectory space Ω. Thus, LaP3 searches over the space Ω, recursively partitioning Ω into sub-regions based on trajectory reward, and sampling from sub-regions using CMA-ES [16] (which is faster than TuRBO [10], used in the original LaMCTS). We emphasize again that LaP3's region partitioning procedure is fully adaptive, in contrast to traditional MCTS approaches such as VOOT, which only partition the trajectory space based on one action at a time.

Additionally, we have made several improvements over the original LaMCTS, as detailed in Algorithm 1. First, we use the maximal value max_{x_i ∈ Ω_k} f(x_i) rather than the mean value (1/n(Ω_k)) Σ_{x_i ∈ Ω_k} f(x_i) as the metric of goodness for each node k (and its associated region Ω_k). This is driven by Theorem 1, which gives a regret bound based on maximum values. Intuitively, using the mean value would cause the algorithm to be slow to respond to newly discovered territory: it takes many samples for the mean metric to rise, and we may miss important leaves. We show the difference empirically in Sec. 5.3. Second, Theorem 1 suggests that a lower-dimensional (smaller d) and smoother (smaller c_k) representation leads to lower regret. Therefore, LaP3 employs a latent space, as described below.

LaP3 leverages a latent space Φ_s for the partition space, by passing Ω through some encoder s. That is, we disentangle the sampling space Ω, from which we sample new candidate trajectories, from the partition space Φ_s, on which we construct the search space partition. Critically, we do not need s^{-1}: we never decode from Φ_s back to Ω. Thus s can dramatically reduce the dimension of the partition space, which may improve regularization given the small number of samples, without suffering a large reconstruction loss. In this case, s is fixed rather than learned. Once the partition has been constructed on Φ_s and we select a leaf region to propose from, we sample new x from Ω as before.¹

¹ In principle, the sampling space can itself be a latent space Φ_h, with an encoder h and decoder h^{-1}. That is, one runs the inner solver in Φ_h to propose samples before decoding back to Ω. h could be a principal component analysis (PCA) [49], a random network encoding [43], or a reversible flow [9], depending on the environment's particular Ω and state/action structure.
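A minimal sketch of this disentanglement (ours; the random projection stands in for a generic fixed encoder s):

```python
import numpy as np

class FixedRandomEncoder:
    """A fixed (never trained) encoder s: Omega -> Phi_s. No inverse s^{-1} is
    needed, since we only partition in Phi_s and keep sampling candidates in Omega."""
    def __init__(self, in_dim, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim, latent_dim)) / np.sqrt(in_dim)

    def __call__(self, X):
        return np.atleast_2d(X) @ self.W

# Example: 500-dimensional trajectories partitioned in a 16-dimensional latent space.
s = FixedRandomEncoder(in_dim=500, latent_dim=16)
X = np.random.default_rng(1).standard_normal((40, 500))  # sampled trajectories in Omega
Z = s(X)                                                 # their images in Phi_s
# Z (paired with the trajectories' f-values) feeds the k-means/SVM partition of
# Sec. 3; the inner solver (e.g., CMA-ES) still proposes new trajectories in Omega.
```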
While some latent representations can be fixed by specifying an inductive bias (e.g., a random network encoding), others can be learned from data by optimizing a reconstruction loss

    min_h Σ_x w(x) ||h^{-1}(h(x)) − x||^2,

where w(x) is a weighting function emphasizing trajectories with high cumulative reward f(x). In this case, h and h^{-1} may be fine-tuned using each new (x, f(x)) pair when LaP3 proposes and queries a new trajectory x, or they may be pre-trained using a set of unlabeled x with w(x) ≡ 1. For consistency in our main experiments, we do not use a latent Φ_h, although we observe that using this second latent space can yield a slight performance gain in some environments (Appendix F.7).

We test LaP3 on a diverse set of environments to evaluate its performance in different settings.

Baselines. We compare LaP3 to several baselines. LaMCTS is the original LaMCTS algorithm using CMA-ES as an inner solver, like LaP3. Random Shooting (RS) [36] samples random trajectories and returns the best one. The Cross-Entropy Method (CEM) [2] uses the top-k samples to fit a local model to guide future sampling. A related approach, the covariance matrix adaptation evolution strategy (CMA-ES) [16], tracks additional variables for improved local model fitting. Voronoi optimistic optimization applied to trees (VOOT) [22] is a "traditional" MCTS method for continuous action spaces that builds a tree on actions at each timestep. iLQR [26] is a seminal gradient-based local optimization approach used extensively in controls. Finally, proximal policy optimization (PPO) [39] is a standard reinforcement learning algorithm.

LaP3 does not require substantially more tuning effort than CEM or CMA-ES, the best-performing among our baselines experimentally. The only additional hyperparameter tuned in LaP3 is C_p, which controls exploration when selecting regions to sample from and depends on the scale of the reward function. However, our C_p only varies by a factor of up to 10 across our diverse environments, and performance is not overly sensitive to small changes (Appendix F.5).

We use MiniWorld [5] for continuous path planning and MiniGrid [6] for discrete path planning. In the SelectObj task, high reward requires reaching the farther goal, while a distance-based reward misleadingly points to the closer goal. For full environment specifics, see Appendix H.1. We modify the original setup to use a continuous action space (∆x and ∆y), and provide a sparse reward (proximity to goal, with an additional bonus for reaching the goal) at end-of-episode. We use a high-dimensional top-down image view as the state. We featurize this image using a randomly initialized convolutional neural network, a reasonable feature extractor as shown in [43]. LaP3 uses periodic snapshots of the featurized state as the partition space Φ_s. That is, we collect all the observed states over the course of the full trajectory, and then form the latent space by concatenating every n-th state (here n = 20), while discarding the rest to reduce overall dimensionality. Success is defined using a binary indicator for reaching the goal (the far goal for SelectObj).

Results. LaP3 substantially outperforms all baselines on all three tasks, despite heavy tuning of the baselines' hyperparameters (Appendix G), showing that LaP3 works for challenging tasks containing suboptimal local maxima. In MazeS3, LaP3 succeeds where CEM gets stuck (Figure 3). VOOT, which builds an MCTS tree on actions at each timestep, struggles on all environments; LaP3 can be viewed as an extension of MCTS that performs better on such long-horizon tasks.
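Concretely, the partition-space features just described could be assembled as follows (a sketch; the CNN architecture and feature dimension are assumptions):

```python
import torch
import torch.nn as nn

class RandomCNN(nn.Module):
    """Randomly initialized, frozen featurizer for top-down image states [43]."""
    def __init__(self, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, out_dim),
        )
        for p in self.parameters():
            p.requires_grad_(False)  # fixed encoder: never trained

    @torch.no_grad()
    def forward(self, x):
        return self.net(x)

def partition_features(states, cnn, every=20):
    """Concatenate every `every`-th featurized state of a trajectory into one
    vector, i.e., one point in the partition space Phi_s."""
    snaps = states[::every]              # periodic snapshots; rest are discarded
    feats = cnn(torch.stack(snaps))      # (num_snaps, out_dim)
    return feats.flatten()
```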
PPO also performs poorly, perhaps due to the sparse reward given only at the end of an episode and the relatively small (for RL) number of episodes. In the most difficult SelectObj task, LaP3 solves nearly half of the environment seeds within 4,000 queries of the oracle, whereas most baselines, including the original LaMCTS, quickly reach the near goal but struggle to escape this local optimum.

We also evaluate LaP3 when combined with a model-based approach, PETS [7], on FourRooms and SelectObj (omitting MazeS3 because the changing maze walls for each seed make it difficult to learn a world model). Following PETS' setting, and due to the difficulty of learning image-based world models [12, 14], we use the 2D agent position as the state. As shown in Fig. 5, LaP3 substantially outperforms the authors' original CEM implementation in the PETS framework, demonstrating that it is not reliant on access to the oracle model but can work with learned models as well.

MiniGrid [6] is a popular sparse-reward symbolic environment for benchmarking RL algorithms. It contains tasks with discrete states and actions such as DoorKey (DK): pick up a key and open the door connecting two rooms; MultiRoom (MR): traverse several rooms by opening doors; and KeyCorridor (KC), a combination of MR and DK: some doors are locked and require a key. As in MiniWorld, we add proximity to the goal to the final sparse reward. In discrete action spaces, LaP3 optimizes the vector of all action probabilities over all timesteps, and takes the highest-probability action at each step (see the sketch at the end of this section). As in MiniWorld, we use periodic state snapshots featurized by a randomly initialized CNN as the partition space Φ_s. We compare LaP3 to the same baselines as in MiniWorld, except VOOT and iLQR, which are designed for continuous tasks.

Results. LaP3 is equal to or better than baselines on all six tasks (Table 1). Especially in the hardest tasks with the most rooms (MR-N4S5, MR-N6), LaP3 improves substantially over baselines.

Table 1: Results for LaP3 in MiniGrid. LaP3 is equal or better on all tasks (higher is better).

We run several ablations on LaP3 in MiniWorld to justify our methodological choices. See Appendix F for further analysis on hyperparameter sensitivity, the UCB metric, and latent spaces.

Region selection in LaP3. We consider four alternative region selection methods: (1) LaP3-mean, using the mean function value rather than the max for UCB, as in LaMCTS [45]; (2) LaP3-nolatent, not using a latent space for partitioning; (3) LaP3-notree, directly selecting the leaf with the highest UCB score; and (4) LaP3-noUCB, using only the node value rather than UCB. LaP3 greatly outperforms all variations in MiniWorld, justifying our design.

Data-driven space partitioning in LaP3 vs. random partitioning. We examine c_k from Def. 1 and the Lipschitz constant L_k from Corollary 1 to verify the theory. We conduct a preliminary analysis on LaP3's tree after the full 2,000 queries (4,000 for SelectObj). At each intermediate node, we estimate L_k and c_k of its children from the LaP3 partition, against a random partition that divides the node's samples with the same ratio (see Appendix F.1 for estimation details). We then average the values for both LaP3 and random partitions over all nodes in the tree. We find that LaP3 does yield lower average L_k and c_k (Table 2), indicating that our data-driven space partitioning is effective.
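The discrete-action handling described in the MiniGrid setup above amounts to the following (a sketch; the horizon and MiniGrid's 7 primitive actions are used for illustration):

```python
import numpy as np

def decode_actions(theta, horizon, n_actions):
    """Map the flat optimization variable to a discrete action sequence.

    theta has length horizon * n_actions and is interpreted as per-timestep
    action scores; LaP3's inner solver optimizes theta, and the rollout takes
    the highest-scoring action at each step."""
    scores = theta.reshape(horizon, n_actions)
    return scores.argmax(axis=1)

# Example: a 50-step MiniGrid episode with 7 primitive actions.
theta = np.random.default_rng(0).standard_normal(50 * 7)
actions = decode_actions(theta, horizon=50, n_actions=7)
```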
6 LaP3 on Real-World Applications

Compiler optimization applies a series of program transformations from a set of predefined optimizations (e.g., loop-invariant code motion, function inlining [30]) to improve code performance. Since these optimizations are not commutative, the order in which they are applied is extremely important. This problem, known as phase ordering, is a core challenge in the compiler community. Current solutions to this NP-hard problem rely heavily on heuristics: groups of optimizations are often packed into "optimization levels" (such as -O3 or -O0) hand-picked by developers [34, 42]. We apply LaP3 to the standard CHStone benchmarks [17], using periodic snapshots of states as Φ_s and the identity map as Φ_h. See Appendix H.2 for full environment details.

Results. LaP3 is 31% faster on average compared to OpenTuner, and 39% compared to -O3 (not shown in the figure). Compared to a stronger PPO baseline using 50 samples (PPO_50) and to CMA-ES, we achieve up to 10% and 7% speedup, respectively. Finally, compared to final PPO results at convergence after 4,000 samples (PPO_4000) as an oracle, LaP3 does similarly on most tasks, despite being much more sample efficient (only 50 samples). Full results are in Appendix E.

Finally, we evaluate LaP3 on molecular design. Given an oracle for a desired molecular property, the goal is to generate molecules with a high property score after the fewest trials. This is critical to pharmaceutical drug development [44], as property evaluations require expensive wet-lab assays. Similar to [18], we fix a query budget and optimize several properties. QED is a synthetic measure of drug-likeness, relatively simple to optimize; DRD2 is a measure of binding affinity to a human dopamine receptor; HIV is the probability of inhibition potential for HIV; and SARS is the same probability for a variant of the SARS virus, related to the SARS-CoV-2 virus responsible for COVID-19. All four properties have a range of [0, 1]; higher is better. For DRD2, HIV, and SARS, we evaluate using computational predictors from [33] (DRD2) and [51] (HIV, SARS) in lieu of wet-lab assays.

To run LaP3 on molecular design, we view the molecular string representation (a SMILES string [48]) as the action sequence, similar to how many generative models generate molecules autoregressively [11, 25, 8, 50]. Following the state-of-the-art HierG2G model from [19], we learn a latent representation from a subset of ChEMBL [29], a dataset of 1.8 million drug-like molecules, without using any of its property labels (e.g., effectiveness in binding to a particular receptor). During this unsupervised training, we use only the 500k molecules with the lowest property scores, to ensure that a good molecule is discovered by search rather than by simple retrieval from the dataset. Our setting differs from many existing methods for molecular design, which assume a large preexisting set of molecules with the desired property for training the generator [33, 20, 52, 50]. On this task only, the latent space is trained on additional unlabeled data, and is used as both the partition space Φ_s and the sampling space Φ_h for LaP3. All baselines operate in the same space for fair comparison; otherwise, all methods struggle to generate well-formed molecules of reasonable length.

Results. Figure 8 shows the highest property score discovered by each method for each property. The absolute difference is small in the relatively simple synthetic QED task.
However, LaP3 outperforms all baselines by a much greater margin (up to 0.4 in DRD2) in the more challenging and realistic DRD2, HIV, and SARS tasks, where CEM and CMA-ES quickly plateau but LaP3 continues to improve with more function evaluations.

We propose LaP3, a novel meta-algorithm for path planning that learns to partition the search space so that subsequent sampling focuses more on promising regions. We provide a formal regret analysis of region partitioning, motivating improvements that yield large empirical gains. LaP3 particularly excels in environments with many difficult-to-escape local optima, substantially outperforming strong baselines on 2D navigation tasks as well as real-world compiler optimization and molecular design.

Algorithm 2 details the pseudocode for the partition function used in LaMCTS, which we use in LaP3 as well.

Algorithm 2 Partition Function
  1: Input: input space Ω, samples S_t, node partition threshold N_thres, partitioning latent model s(·)
  ...
  6: if n(Ω_p) ≥ N_thres then
  7:     S_good, S_bad ← samples from S_t corresponding to the indices of k-means(s(Ω_p ∩ S_t))
  8:     Fit an SVM on S_good, S_bad
  9:     Use the SVM to split Ω_p into Ω_good, Ω_bad
  10: end if
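A runnable sketch of lines 6-9 of Algorithm 2 for a single node (our translation; appending the f-values to the latent features before clustering follows the main-text description of clustering on (x, f(x)), and the RBF kernel and thresholds are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def partition_region(X, y, s, n_thres=20):
    """X: samples in the node's region Omega_p; y: their f-values; s: latent encoder.
    Returns (svm, good_mask) defining Omega_good/Omega_bad, or None to stay a leaf."""
    if len(X) < n_thres:
        return None                                     # line 6: not enough samples
    Z = s(X)                                            # latent features s(Omega_p ∩ S_t)
    feats = np.hstack([Z, y[:, None]])                  # append f so the split is reward-aware
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    if y[labels == 1].mean() < y[labels == 0].mean():   # make cluster 1 the "good" side
        labels = 1 - labels
    svm = SVC(kernel="rbf").fit(Z, labels)              # lines 8-9: boundary in Phi_s
    good_mask = svm.predict(Z) == 1
    if good_mask.all() or (~good_mask).all():           # degenerate split: keep as leaf
        return None
    return svm, good_mask
```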
Proof of Lemma 1. Let δ < 1. Define the following cumulative density function (CDF):

    F_k(y) := P_{x ∼ Ω_k}[f(x) ≤ g*_k − y],

where g*_k := sup_{x∈Ω_k} f(x). It is clear that F_k(y) is a monotonically decreasing function with F_k(0) = 1 and lim_{y→+∞} F_k(y) = 0. Here we assume it is strictly decreasing, so that F_k(y) has a well-defined inverse function F_k^{-1}. In the following, we omit the subscript k for brevity. Let us bound P[g_t ≥ g* − y]:

    P[g_t ≥ g* − y] = 1 − P[g_t < g* − y] =(1) 1 − F(y)^{n_t}.

Note that (1) is due to the fact that all samples x_1, ..., x_{n_t} are independently drawn within the region Ω_k. Given δ, let r_t := F^{-1}(δ^{1/n_t}) and we have:

    P[g_t ≥ g* − r_t] = 1 − δ.

Proof of Corollary 1. Since f is L_k-Lipschitz over the region Ω_k, we have:

    f(x) ≥ g*_k − L_k ||x − x*_k||_2.

Since the optimal solution x*_k ∈ Ω_k is in the interior of Ω_k, there exists ε_0 so that B(x*_k, ε_0) ⊆ Ω_k. From the Lipschitz condition, we know that in the ball B(x*_k, ε) with ε ≤ ε_0, the function values are also quite good:

    f(x) ≥ g*_k − L_k ε   for all x ∈ B(x*_k, ε).

Therefore, at least in the ball B(x*_k, ε), all function values are larger than the threshold g*_k − L_k ε. This means that for ε ≤ ε_0:

    P_{x ∼ Ω_k}[f(x) > g*_k − L_k ε] ≥ V_0 ε^d / V_k,

where V_0 is the volume of the unit d-dimensional sphere. Letting Ṽ_k := V_k / V_0 be the relative volume with respect to the unit sphere, we have:

    F_k(L_k ε) ≤ 1 − ε^d / Ṽ_k,

which is Def. 1 with c_k = L_k Ṽ_k^{1/d} and z_k = 1 − ε_0^d / Ṽ_k.

Lemma 2. If Ω_k is (z_k, c_k)-diluted, then for any δ ∈ [z_k, 1] and j ≥ 1, we have:

    F_k^{-1}(δ^{1/j}) ≤ c_k (ln(1/δ) / j)^{1/d}.

Proof. Def. 1 can also be written as:

    F_k^{-1}(u) ≤ c_k (1 − u)^{1/d}   for all u ∈ [z_k, 1].    (12)

Since we have z_k ≤ δ ≤ δ^{1/j} ≤ 1 for any j ≥ 1, following Eqn. 12 we have:

    F_k^{-1}(δ^{1/j}) ≤ c_k (1 − δ^{1/j})^{1/d}.

Due to the inequality that for a < 1 and x > 0, a^x ≥ 1 + x ln a (which can be proven by simply showing the derivative is non-negative), taking a = δ and x = 1/j we have:

    δ^{1/j} ≥ 1 − ln(1/δ)/j,

which gives:

    F_k^{-1}(δ^{1/j}) ≤ c_k (ln(1/δ)/j)^{1/d}.

Proof of Theorem 1. Take δ = η/T^3 so that δ ≥ z_k for all regions Ω_k. Then Eqn. 1 holds for all T iterations and all K arms with probability at least 1 − KTδ by the union bound, which we consider a "good event". For brevity, define g*_k := g*(Ω_k) as the optimal function value within the region Ω_k and n_{k,t} := n_t(Ω_k) as the visitation count of region Ω_k. Define ∆_k := f* − g*_k as the minimal regret of each arm and r_{k,t} := r_t(Ω_k) as the confidence bound. At iteration t, since we pick k = a_t as the region to explore, it must be the case that:

    g*_k + r_{k,t} ≥(2) g_{k,t} + r_{k,t} ≥(3) g_{k*,t} + r_{k*,t} ≥(4) g*_{k*} =(1) f*,

where k* is the index of the optimal region Ω_{k*}, whose maximum g*_{k*} is the global optimal value f*. Here (1) is due to the global optimality of f*, (2) is due to the global optimality of g*_k within region Ω_k (g*_k ≥ g_{k,t}), (3) is due to the fact that we pick a_t = k at iteration t, and (4) applies Eqn. 1 under the good event together with the non-negativity of the confidence bound, r_{k*,t} ≥ 0. Therefore, since g*_k + r_{k,t} ≥ f*, we have:

    r_{k,t} ≥ f* − g*_k = ∆_k.

Now we bound the total regret. Note that for R_t(a_t) := f* − g_{a_t,t}, we have:

    R_t(a_t) = (f* − g*_{a_t}) + (g*_{a_t} − g_{a_t,t}) ≤ ∆_{a_t} + r_{a_t,t} ≤ 2 r_{a_t,t},

due to the fact that ∆_k = f* − g*_k ≤ r_{k,t} and the property of the confidence bound that g_{k,t} ≥ g*_k − r_{k,t} with k = a_t. On the other hand, using Lemma 2, we also have

    r_{k,t} = F_k^{-1}(δ^{1/n_{k,t}}) ≤ c_k (ln(1/δ) / n_{k,t})^{1/d},

which, combined with ∆_k ≤ r_{k,t}, means that

    n_{k,t} ≤ (c_k / ∆_k)^d ln(1/δ).

So if the region Ω_k has a large gap ∆_k, then n_{k,t} has a small upper bound; as a result, we never visit that region after a fixed number of visits. This also helps bound the regret. Summing R_t(a_t) over T iterations gives R(T), which we reorganize into two kinds of regions, the good regions K_good := {k : ∆_k ≤ ∆_0} and the bad regions K_bad := {k : ∆_k > ∆_0}:

    R(T) = R_good(T) + R_bad(T).

Let M := sup_{x∈Ω} f(x) − inf_{x∈Ω} f(x) be the maximal gap between the highest and lowest function values. Note that M is also the largest regret for a single move at any iteration. Letting C_bad := (Σ_{k∈K_bad} c_k^d)^{1/d} be the ℓ^d-norm of c_k over bad regions, we then have:

    R_bad(T) ≤ M Σ_{k∈K_bad} n_{k,T} ≤ M (C_bad / ∆_0)^d ln(1/δ).

For R_good(T), for each region k we visit it n_{k,T} times, and each time we pay a price proportional to (1/n_{k,t})^{1/d} for n_{k,t} = 1, ..., n_{k,T}. Using Lemma 2, since all Ω_k are (z_k, c_k)-diluted and z_k ≤ δ, this leads to:

    R_good(T) ≤ 2 ln^{1/d}(1/δ) Σ_{k∈K_good} c_k Σ_{n=1}^{n_{k,T}} n^{-1/d}.

Assuming d > 1 (the high-dimensional case), we use the bound

    Σ_{n=1}^{N} n^{-1/d} ≤ (d/(d−1)) N^{(d−1)/d},

and we have:

    R_good(T) ≤ (2d/(d−1)) ln^{1/d}(1/δ) Σ_{k∈K_good} c_k n_{k,T}^{(d−1)/d}.

Hölder's inequality says that if 1/p + 1/q = 1, then Σ_k |x_k y_k| ≤ (Σ_k |x_k|^p)^{1/p} (Σ_k |y_k|^q)^{1/q}. Using it with p = d and q = d/(d−1), we get

    Σ_{k∈K_good} c_k n_{k,T}^{(d−1)/d} ≤ C_good T^{(d−1)/d},

where C_good := (Σ_{k∈K_good} c_k^d)^{1/d} is the ℓ^d-norm of c_k over good regions. Finally, if the good event does not happen (with probability at most KTδ), we pay a regret of at most M at each iteration t, yielding a bound of M K T^2 δ over T iterations. Since δ = η/T^3, we finally have

    E[R(T)] ≤ (2d/(d−1)) C_good T^{(d−1)/d} ln^{1/d}(1/δ) + M (C_bad/∆_0)^d ln(1/δ) + M K η / T,

with ln(1/δ) = 3 ln T + ln(1/η). The sample complexity T needed to achieve the global optimum within an ε-ball is then ∼ 1/ε^d, which is the best we can achieve without structured information on f. Previous papers [41] show a slightly worse bound O(T^{(d+1)/(d+2)}), since they also consider stochastic functions and discretization error.

Which region to split? Since C_good := (Σ_{k∈K_good} c_k^d)^{1/d} is an ℓ^d-norm, when d is large (i.e., high-dimensional), C_good ≈ max_{k∈K_good} c_k, so ideally we should split the region with the highest c_k to reduce C_good the most. Intuitively, this is the most diluted/scattered region.

LaP3 can escape local minima and achieve significantly better results in various RL tasks using a simulated environment. In MiniWorld, we showed that we could also plug LaP3 into the PETS [7] framework, replacing the CEM method that was originally used as a planner (Sec. 5). Here we additionally use MuJoCo, a commonly used benchmark, to validate performance. Note that the MuJoCo tasks are quite smooth and do not contain many local optima, so traditional methods work reasonably well in this domain. In Tab. 3, we see that in easier tasks like Reacher and Pusher, LaP3 is slightly worse than CEM. However, in hard tasks like HalfCheetah and Walker, LaP3 gains over 1000 reward relative to the baseline methods.

Table 3: Results for MuJoCo with a replanning frequency of 5. LaP3 performs substantially better than CEM and RS in hard tasks like HalfCheetah and Walker.
We additionally evaluate LaP3 on synthetic functions (the Ackley and Levy functions, both 20-dimensional and 100-dimensional) used in the original LaMCTS paper [45]. For these tasks we compare to just the original LaMCTS method. Both LaP3 and LaMCTS use the TuRBO inner solver following [45] for the 20-dimensional version of both functions, and the CMA-ES inner solver for the 100-dimensional version for computational efficiency. LaP3 performs equal or better on these tasks (Table 4; note lower is better).

    LaMCTS: 0.48 ± 0.03, 0.51 ± 0.09, 0.65 ± 0.25, 14.24 ± 4.87
    LaP3:   0.49 ± 0.04, 0.34 ± 0.07, 0.46 ± 0.15, 11.95 ± 3.56

Table 4: LaP3 vs. the original LaMCTS method on synthetic functions evaluated in the original LaMCTS work. Note lower is better. LaP3 performs equal or better on these tasks.

We provide in Tables 5 through 9 the numerical final rewards for our tasks, corresponding to the plots in the main text.

Table 6: Results for MiniWorld tasks for different methods using a learned PETS transition model. The oracle model is only used for final evaluation on each environment seed, and the resulting trajectory becomes future training data for the PETS model. We report the total fraction of environment seeds solved by each method, out of 256 total, averaged across 5 trials. LaP3 substantially outperforms the original PETS implementation. This table corresponds to Figure 5.

Table 8: Results for compiler phase ordering, in execution cycles for the program after a series of transformations, following the setup of [15]. 50 oracle accesses per method. This table corresponds to Figure 7.

To loosely approximate the Lipschitz constant in our analysis from Sec. 5.3, we simply check all pairwise Lipschitz constants between existing samples (candidate trajectories) in the tree node (region). Similarly, to loosely approximate c_k, we take the highest-scoring sample in the region as the "optimum" and estimate c_k for z_k = 0.5, following our definition, using the remaining samples in the region.

We estimate z_k over time in our MiniWorld tasks, fixing several different values of c_k in intervals of 1 reward (Figure 9). z_k is estimated using 50 samples (in between each dynamic re-partitioning of the space) at each timestep in intervals of 50, with 32 trial runs. In all cases z_k initially drops very quickly, and then somewhat plateaus after finding the initial local optimum (whether global or not), especially in SelectObj. However, in most cases it still continues to decrease over time. While this is consistent with our qualitative analysis in Sec. B.5 about how z_k changes with recursive splitting, in some cases z_k seems to stop decreasing over time. Upon inspection, we find that the regions whose z_k remain high correspond to low-performing regions which do not receive many samples under UCB exploration. Therefore, such regions do not improve over time and their z_k remains high.

We show a t-SNE visualization (Figure 10) of the latent space of trajectories at the end of a sample MazeS3 run of LaP3. The first sampled trajectories are colored red, with a gradient toward blue for the later-sampled trajectories. The later trajectories are clearly separated in the latent space.
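A sketch of the loose L_k and c_k estimators described in Appendix F.1 above (our reading of the procedure; the empirical tail-CDF plug-in is an assumption):

```python
import numpy as np

def estimate_lipschitz(X, y):
    """Max pairwise ratio |f(xi) - f(xj)| / ||xi - xj|| over samples in a region."""
    n, best = len(X), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(X[i] - X[j])
            if dist > 1e-12:
                best = max(best, abs(y[i] - y[j]) / dist)
    return best

def estimate_c(y, d, z=0.5):
    """Loose c_k at z_k = 0.5: treat the best sample as the optimum g*, then find
    the smallest c with F^{-1}(u) <= c (1-u)^(1/d) on the upper half (u >= z) of
    the empirical tail CDF built from the remaining samples."""
    gaps = np.sort(np.max(y) - np.asarray(y))     # empirical g* - f(x), ascending
    n = len(gaps)
    cs = []
    for i, gap in enumerate(gaps[: n // 2]):      # the u >= 0.5 half of the CDF
        u = 1.0 - i / n                           # empirical tail mass at this gap
        if u < 1.0:
            cs.append(gap / (1.0 - u) ** (1.0 / d))
    return max(cs) if cs else 0.0
```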
It is of course possible to optimize the parameters of a policy which outputs an action given the current state, as in the original LaMCTS formulation, or in PPO. Accordingly, we tune and run a parameter-space version of LaMCTS in the MiniWorld tasks, which is essentially the original LaMCTS adapted to path planning, only with TuRBO replaced by CMA-ES, as in LaP3, due to speed considerations. Specifically, following LaMCTS, we learn the parameters of a linear policy for outputting actions given states.

    LaMCTS-parameter: FourRooms 10.9 ± 1.9, SelectObj 9.4 ± 1.

Working in the parameter space can be clearly advantageous when states are relatively simple and low-dimensional, as in the MuJoCo environments evaluated in LaMCTS and POPLIN [47], or when the policy barely needs to depend on the state at all, as in our SelectObj task (Table 10). We designed the SelectObj task as a challenge for path planning algorithms operating in the action space, which struggle to escape the local optimum, but in truth this environment can be trivially solved by simply moving in the same correct direction at every step (toward the far goal). On the other hand, more complex policies may be more challenging to learn when the state representation is higher-dimensional, which may be the case in practical tasks. This is the case in our MazeS3 and FourRooms environments, where the state is represented as a top-down image rather than a vector of position and velocity information. Unlike SelectObj, these tasks require navigation around obstacles rather than just moving in a straight line. Despite featurizing with the same randomly initialized CNN as LaP3, LaMCTS-parameter performs very poorly on MazeS3 and FourRooms in comparison. Additionally, methods like LaMCTS-parameter which use a parameter space depend critically on the specific parametric form of the policy to be learned (e.g., whether it is a linear policy, a nonlinear policy parameterized by neural networks, etc.); therefore, it is not obvious how to take advantage of a latent space which encodes a sequence of states and/or actions, which is critical in environments such as our molecular design tasks.

Since the C_p parameter is the only additional parameter we tune in LaP3, we analyze the sensitivity of LaP3's performance with respect to C_p on the MiniWorld tasks. Our main results use C_p = 2, except for SelectObj, where we used C_p = 4; here we run C_p = 1, 2, 4 for all three tasks and show the results in Table 11. Even with poorly tuned C_p values, LaP3 still substantially outperforms CEM on these tasks with difficult-to-escape local optima.

Our theory suggests that the UCB metric for MCTS should be based on the max function value rather than the mean for the deterministic functions that we consider in this work (Figure 6).

We conduct additional analysis on the use of a latent partition space Φ_s in the MiniWorld, MiniGrid, and compiler phase ordering tasks in Table 16 (reproduced from Table 7).

We additionally ablate the latent space (used for both partitioning and sampling) in the easiest of our molecular design tasks, the QED property. Specifically, we build the molecular SMILES string autoregressively, using a discrete action space with 10 choices: the 9 most common characters in molecular SMILES strings, plus an end token. (We limit the space of possible characters in order to increase the chances of generating well-formed SMILES strings.) We optimize in a continuous space of action probabilities as in MiniGrid, and allow a maximum length of 50 characters. The poor results demonstrate the absolute necessity of a latent space in the molecular design task (Table 20).
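The no-latent action space just described can be sketched as follows (the specific 9-character vocabulary shown is an illustrative assumption):

```python
import numpy as np

# 9 frequent SMILES characters plus an end token (illustrative choice).
VOCAB = list("CNOc1()=n") + ["<end>"]

def decode_smiles(theta, max_len=50):
    """Greedy decoding: take the highest-scoring character at each position,
    stopping at the end token."""
    scores = theta.reshape(max_len, len(VOCAB))
    chars = []
    for row in scores:
        tok = VOCAB[int(row.argmax())]
        if tok == "<end>":
            break
        chars.append(tok)
    return "".join(chars)

theta = np.random.default_rng(0).standard_normal(50 * len(VOCAB))
print(decode_smiles(theta))  # candidate string, scored only if it parses as a molecule
```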
While typical molecular SMILES strings for this task are 30 to 50 characters long, both LaP3 and baselines struggle to generate well-formed strings even of length 3 to 5 without the pre-trained latent space. Accordingly, performance is drastically lower for all methods.

Table 20: Mean and standard deviation across 128 random seeds for LaP3 and baselines on QED, with and without the pre-trained latent space.

In this work we have used CMA-ES as the inner solver due to its speed and acceptable performance. The original LaMCTS work used the TuRBO solver [10], which is prohibitively slow for many of our experiments. Nevertheless, we have run experiments on the MiniWorld tasks using a smaller number of trials to check performance with this alternate inner solver, both for LaP3 and for the LaMCTS baseline.

LaP3. For our method, we try C_p in {0.5, 1, 2, 4}. If the search space is not explicitly bounded, we sample the first N_init points used to initialize the partition tree using the same σ as CEM. Note that N_init is not tuned; we use 5 for compiler phase ordering, where our query budget is only 50, and 50 elsewhere. No other hyperparameters are tuned.

LaMCTS. Detailed in Algorithm 1, where we summarize the changes made in LaP3 compared to LaMCTS. It is tuned similarly to our own method LaP3.

RS. The simplest baseline, in which one simply samples random trajectories and in the end returns the best-performing among them. We do not tune this baseline.

CEM. An evolutionary method which tracks a population of N samples. At each step, it selects the best N_e samples from its population to initialize the mean µ for the next generation of N samples, drawn from a Gaussian distribution with standard deviation σ. While a too-small σ may prevent CEM from escaping local optima, a too-large σ may yield results little better than random shooting. We find that the choice of σ is critical to CEM's performance in our test environments. Therefore, we systematically tune σ when running CEM in all environments (checking {1, 2, 4, 8}). While other parameters such as N and N_e are also tunable, we find that these make a smaller difference, so we did not tune them extensively.

CMA-ES. A more complex evolutionary method which can be viewed as a variant of CEM. After an initial µ and σ are provided for the first generation, CMA-ES determines its own σ automatically, while also fitting additional parameters. Even so, we find that its performance is highly sensitive to the initial σ, and we tune this parameter in the same way as for CEM.

VOOT. An MCTS method which builds a tree on actions. We tune the exploration parameter in the VOO submodule, trying values in {0.1, 0.3, 0.5}.

RandDOOT. An MCTS method which builds a tree on actions similar to VOOT, but which splits using axis-aligned boundaries rather than splitting into Voronoi regions; used as a baseline in the original VOOT paper [22]. We did not tune hyperparameters.

iLQR. A gradient-based optimization method for continuously optimizing the planned trajectory, which we run to convergence. It cannot easily escape local optima. As performance was relatively insensitive to hyperparameters, we did not systematically tune.

PPO. A standard reinforcement learning algorithm which operates in the parameter space, unlike our other baselines. Since PPO is relatively robust to hyperparameters [39], we did not systematically tune.
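For concreteness, a minimal sketch of the CEM baseline as described above (population N, elite set of size N_e, fixed σ); this is an illustration, not the tuned implementation used in our experiments:

```python
import numpy as np

def cem(f, dim, n=64, n_elite=8, sigma=2.0, iters=30, seed=0):
    """Cross-entropy method: refit the sampling mean to the top-n_elite samples;
    sigma is held fixed (the parameter we tune in the experiments)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)
    best_x, best_val = None, -np.inf
    for _ in range(iters):
        X = mu + sigma * rng.standard_normal((n, dim))
        vals = np.array([f(x) for x in X])
        elite = X[np.argsort(vals)[-n_elite:]]
        mu = elite.mean(axis=0)            # next generation centers on the elites
        i = int(vals.argmax())
        if vals[i] > best_val:
            best_x, best_val = X[i], vals[i]
    return best_x, best_val
```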
We modified the original MiniWorld environments to have continuous action spaces and more consistent difficulty across random seeds.

For compiler phase ordering, the action space consists of 46 different program transformations, and a trajectory consists of 45 transformations (quite short, considering many transformations have no effect unless applied in a specific order). The reward is the difference between the original and final number of execution cycles. Since the environment is deterministic, we only run 1 trial for each method. Thus far we have followed the setup of [15]; however, unlike [15], we allow a budget of only 50 trajectory queries.

References

OpenTuner: An extensible framework for program autotuning
The cross-entropy method for optimization
A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning
A survey of Monte Carlo tree search methods
Minimalistic gridworld environment for OpenAI Gym
Deep reinforcement learning in a handful of trials using probabilistic dynamics models
Syntax-directed variational autoencoder for structured data
Density estimation using real NVP
Scalable global optimization via local Bayesian optimization
Automatic chemical design using a data-driven continuous representation of molecules
Dream to control: Learning behaviors by latent imagination
Learning latent dynamics for planning from pixels
Mastering Atari with discrete world models
AutoPhase: Juggling HLS phase orderings in random forests with deep reinforcement learning
The CMA evolution strategy: A tutorial
CHStone: A benchmark program suite for practical C-based high-level synthesis
Junction tree variational autoencoder for molecular graph generation
Hierarchical generation of molecular graphs using structural motifs
Learning multimodal graph-to-graph translation for molecular optimization
Autonomous molecular design by Monte Carlo tree search and rapid evaluations using molecular dynamics simulations
Monte Carlo tree search in continuous spaces using Voronoi optimistic optimization with regret bounds
Adaptive local Bayesian optimization over multiple discrete variables
Sequential planning for steering immune system adaptation
Grammar variational autoencoder
Iterative linear quadratic regulator design for nonlinear biological movement systems
Heuristic approaches in robot path planning: A survey
Lipschitz bandits: Regret lower bound and optimal algorithms
ChEMBL: Towards direct deposition of bioassay data
Advanced compiler design implementation
Optimistic optimization of a deterministic function without the knowledge of its smoothness
From bandits to Monte Carlo tree search: The optimistic principle applied to optimization and planning
Molecular de-novo design through deep reinforcement learning
Fast and effective orchestration of compiler optimizations for automatic performance tuning
CHOMP: Gradient optimization techniques for efficient motion planning
Robust constrained model predictive control
The cross-entropy method for combinatorial and continuous optimization
Solving black-box optimization challenge via learning search space partition for local Bayesian optimization. Workshop at NeurIPS 2020 Competition Track on Black-Box Optimization Challenge
Proximal policy optimization algorithms
Planning chemical syntheses with deep neural networks and symbolic AI
Introduction to multi-armed bandits
Compiler optimization-space exploration
Deep image prior
Applications of machine learning in drug discovery and development
Learning search space partition for black-box optimization using Monte Carlo tree search
Sample-efficient neural architecture search by learning actions for Monte Carlo tree search
Exploring model-based planning with policy networks
SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules
Principal component analysis. Chemometrics and Intelligent Laboratory Systems
Improving molecular design by stochastic iterative target augmentation

We thank the members of the Berkeley NLP group as well as our four anonymous reviewers for their helpful feedback. This work was supported by Berkeley AI Research, and the NSF through a fellowship to the first author.