title: Equivariant $Q$ Learning in Spatial Action Spaces authors: Wang, Dian; Walters, Robin; Zhu, Xupeng; Platt, Robert date: 2021-10-28

Recently, a variety of new equivariant neural network model architectures have been proposed that generalize better over rotational and reflectional symmetries than standard models. These models are relevant to robotics because many robotics problems can be expressed in a rotationally symmetric way. This paper focuses on equivariance over a visual state space and a spatial action space -- the setting where the robot action space includes a subset of $\mathrm{SE}(2)$. In this situation, we know a priori that rotations and translations in the state image should result in the same rotations and translations in the spatial action dimensions of the optimal policy. Therefore, we can use equivariant model architectures to make $Q$ learning more sample efficient. This paper identifies when the optimal $Q$ function is equivariant and proposes $Q$ network architectures for this setting. We show experimentally that this approach outperforms standard methods in a set of challenging manipulation problems.

A key question in policy learning for robotics is how to leverage structure present in the robot and the world to improve learning. This paper focuses on a fundamental type of structure present in visuo-motor policy learning for most robotics problems: translational and rotational invariance with respect to camera viewpoint. Specifically, the reward and transition dynamics of most robotics problems can be expressed in a way that is invariant with respect to the camera viewpoint from which the agent observes the scene. In spite of the above, most visuo-motor policy learning agents do not leverage this invariance in camera viewpoint. The agent's value function or policy typically considers different perspectives on the same scene to be different world states. A popular way to combat this problem is through visual data augmentation, i.e., creating additional samples or experiences by randomly translating and rotating observed images [1] while keeping the same labels. This can be used in conjunction with a contrastive term in the loss function, which helps the system learn an invariant latent representation [2, 3]. While these methods can improve generalization, they require the neural network to learn translational and rotational invariance from the augmented data. Our key idea in this paper is to model rotational and translational invariance in policy learning using neural network model architectures that are equivariant over finite subgroups of SE(2). These equivariant model architectures reduce the number of free parameters using steerable convolutional layers [4]. Compared with traditional methods, this approach creates an inductive bias that can significantly improve the sample efficiency of the model, i.e., the number of environment steps needed to learn a policy. Moreover, it enables us to generalize in a very precise way: everything learned with respect to one camera viewpoint is automatically also represented in other camera perspectives via selectively tied parameters in the model architecture. We focus our work on Q learning in spatial action spaces, where the agent's action space spans SE(2) or SE(3). We make the following contributions. First, we identify the conditions under which the optimal Q function is SE(2) equivariant.
Second, we propose neural network model architectures that encode SE(2) equivariance in the Q function. Third, since most policy learning problems are only equivariant in some of the state variables, we propose partially equivariant model architectures that can accommodate this. Finally, we compare equivariant models against non-equivariant counterparts in the context of several robotic manipulation problems. The results show that equivariant models are more sample efficient than non-equivariant models, often by a significant margin. Supplementary video and code are available at https://pointw.github.io/equi_q_page.

Data Augmentation: Data augmentation techniques have long been employed in computer vision to encode the invariance property of translation and reflection into neural networks [5, 6]. Recent work demonstrates that data augmentation improves the data efficiency and the policy's performance in reinforcement learning [7, 8, 9]. In the context of robotics, data augmentation is often used to generate additional samples [1, 10, 11]. In contrast to learning the equivariance property using data augmentation, our work uses an equivariant network to hard-code the symmetries into the structure of the network to achieve better sample efficiency. Contrastive Learning: Another approach to learning a representation that is invariant to translation and rotation is to add a contrastive learning term to the loss function [2]. This idea has been applied to reinforcement learning in general [3] and robotic manipulation in particular [12]. While this approach can help the agent learn an invariant encoding of the data, it does not necessarily improve the sample efficiency of policy learning. Equivariant Learning: Equivariant model architectures hard-code E(2) symmetries into the structure of the neural network and have been shown to be useful in computer vision [13, 4, 14]. In reinforcement learning, some recent work applies equivariant models to structure-finding problems involving MDP homomorphisms [15, 16]. In addition, Mondal et al. [17] recently applied an E(2)-equivariant model to Q learning in an Atari game domain, but showed limited improvement. To our knowledge, equivariant model architectures have not been explored in the context of robotics applications. Spatial Action Representations: Several researchers have applied policy learning in spatial action spaces to robotic manipulation. A popular approach is to do Q learning with a dense pixel action space using a fully convolutional neural network (this is the FCN approach we describe and extend in Section 4.2) [18, 19, 20, 21]. Variations on this approach have been explored in [22, 23]. The FCN approach has been adapted to a variety of different manipulation tasks with different action primitives [24, 25, 26, 27, 28, 1, 29, 30, 31]. In this paper, we extend the work above by proposing new equivariant architectures for the spatial action space setting.

We are interested in solving complex robotic manipulation problems such as the packing and construction problems shown in Fig 1. We focus on problems expressed in a spatial action space. This section identifies conditions under which the Q function is SE(2)-invariant. The next section describes how these invariance properties translate into equivariance properties in the neural network. Manipulation as an MDP over a visual state space and a spatial action space: We assume that the manipulation problem is formulated as a Markov decision process (MDP): M = (S, A, T, R, γ).
We focus on MDPs in visual state spaces and spatial action spaces [29, 20, 31]. The state space is factored into the state of the objects in the world, expressed as an n-channel h × w image I ∈ S_world = R^{n×h×w}, and the state of the robot (including objects held by the robot) s_rbt ∈ S_rbt, expressed arbitrarily. The total state space is S = S_world × S_rbt. The action space is expressed as a cross product of SE(2) (hence it is spatial) and a set of additional arbitrary action variables: A = SE(2) × A_arb. The spatial component of action expresses where the robot hand is to move and the additional action variables express how it should move or what it should do. For example, in the pick/place domains shown in Fig 1, A_arb = {PICK, PLACE}, giving the agent the ability to move to a pose and close the fingers (pick) or move and open the fingers (place). We will sometimes decompose the spatial component of action a_sp ∈ SE(2) into its translation and rotation components, a_sp = (x, θ). The goal of manipulation is to achieve a desired configuration of objects in the world, as expressed by a reward function R : S × A → R.

Translation and Rotation in SE(2): We are interested in learning policies that are invariant to translation and rotation of the state and action. To do that, we define rotation and translation of state and action as follows. Let g ∈ SE(2) be an arbitrary rotation and translation in the plane and let s = (I, s_rbt) ∈ S_world × S_rbt be a state. g operates on s by rotating and translating the image I, but leaving s_rbt unchanged: gs = (gI, s_rbt), where gI denotes the image I translated and rotated by g. For an action a = (a_sp, a_arb), g rotates and translates a_sp but not a_arb: ga = (ga_sp, a_arb). Notice that both S and A are closed under g ∈ SE(2), i.e., ∀g ∈ SE(2), a ∈ A ⟹ ga ∈ A and s ∈ S ⟹ gs ∈ S.

Assumptions: We assume that the reward and transition dynamics of the system are invariant with respect to translation and rotation of state and action as defined above, and that the translation and rotation operations on state and action are invertible. Assumption 3.1 (Goal Invariance). The manipulation objective is to achieve a desired configuration of objects in the world without regard to the position and orientation of the scene. That is, R(s, a) = R(gs, ga) for all g ∈ SE(2). Assumption 3.2 (Transition Invariance). The outcome of robot actions is invariant to translations and rotations of both the scene and the action. Specifically, T(s, a, s') = T(gs, ga, gs') for all g ∈ SE(2). Assumption 3.3 (Invertibility). Translations and rotations in state and action are invertible. That is, ∀g ∈ SE(2), g^{-1}(gs) = s and g^{-1}(ga) = a. Assumptions 3.1 and 3.2 are satisfied in problem settings where the objective and the transition dynamics can be expressed intrinsically to the world without reference to an external coordinate frame imposed by the system designer. These assumptions are satisfied in many manipulation domains including all those shown in Fig 1. In House Building, for example, the reward and transition dynamics of the system are independent of the coordinate frame of the image or the action space. Assumption 3.3 is needed to guarantee the Q function invariance described in the next section. Assumptions 3.1, 3.2, and 3.3 imply that the optimal Q function is invariant to translations and rotations in SE(2).
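As a concrete illustration of this group action, the following minimal sketch (our own illustration: the function names are hypothetical, g is restricted to 90-degree rotations, and the image is assumed square) rotates the state image while leaving s_rbt unchanged, and rotates the spatial action while leaving a_arb unchanged:

import math
import torch

def apply_g_to_state(I, s_rbt, k):
    # g rotates the image by k * 90 degrees; the robot state s_rbt is unchanged
    return torch.rot90(I, k, dims=(-2, -1)), s_rbt

def apply_g_to_action(a_sp, a_arb, k, w):
    # g acts only on the spatial part a_sp = ((row, col), theta);
    # the arbitrary part a_arb (e.g. PICK/PLACE) is unchanged
    (row, col), theta = a_sp
    for _ in range(k % 4):
        row, col = w - 1 - col, row          # pixel index under a 90-degree rotation
    theta = (theta + k * math.pi / 2) % (2 * math.pi)
    return ((row, col), theta), a_arb

# example: rotate a 1 x 128 x 128 heightmap and a PICK action by g = 90 degrees
gI, g_s_rbt = apply_g_to_state(torch.zeros(1, 128, 128), s_rbt=None, k=1)
g_a_sp, g_a_arb = apply_g_to_action(((10, 20), 0.0), "PICK", k=1, w=128)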
Our key idea is to use the invariance property of Proposition 4.1 to structure Q learning (and make it more sample efficient) by defining a neural network that is hard-wired to encode only invariant Q functions. However, in order to accomplish this in the context of DQN, we must allow for the fact that state is an input to the neural network while action values are an output. This neural network is therefore a function q : S → R^A, where R^A denotes the space of functions {A → R}. The invariance property of Proposition 4.1 now becomes an equivariance property, q(gs)(a) = q(s)(g^{-1}a) (Eq. 1), where q(s)(a) denotes the Q value of action a in state s. We implement this constraint using equivariant convolutional layers as described below.

Equivariance over a finite group: In order to implement the equivariance constraint, it is standard in the literature to approximate SE(2) by a finite subgroup [4, 14]. Recall that the spatial component of an action is a_sp = (x, θ) ∈ SE(2). We constrain position to be a discrete pair of positive integers x ∈ {1 . . . h} × {1 . . . w} ⊂ Z^2, corresponding to a pixel in the input image I. We constrain orientation to be a member of a finite cyclic group θ ∈ C_u, i.e., one of u discrete orientations. For example, if u = 8, then θ ∈ {0, π/4, π/2, . . . , 7π/4}. Our finite approximation of a_sp ∈ SE(2) is â_sp in the subgroup ŜE(2) generated by translations Z^2 and rotations C_u.

Figure 2: Illustration of Q-map equivariance when C_u = C_4. The output Q map rotates and translates with the input image. The 4-vector at each pixel does a circular shift, i.e., the optimal rotation changes from 0 (the 1st element of C_4) to π/2 (the 2nd element of C_4).

Input and output of an equivariant convolutional layer: A standard convolutional layer h takes as input an n-channel feature map and produces an m-channel map as output, h_standard : R^{n×h×w} → R^{m×h×w}. We can construct an equivariant convolutional layer by adding an additional dimension to the feature map that encodes the values for each element of a group (C_u in our case). The equivariant mapping therefore becomes h_equiv : R^{u×n×h×w} → R^{u×m×h×w} for all layers except the first. The first layer of the network generally takes a "flat" image as input: h_equiv^in : R^{1×n×h×w} → R^{u×m×h×w}.

Equivariance constraint: Let h_i(I)(x) denote the output of convolutional layer h at channel i and pixel x given input I. For an equivariant layer, h_i(I)(x) ∈ R^{C_u} describes feature values for each element of C_u. For an element g ∈ SE(2), denote the rotational part by g_θ ∈ C_u. If we identify functions R^{C_u} with vectors R^u, then the group action of g_θ ∈ C_u on R^{C_u} becomes left multiplication by a permutation matrix ρ(g_θ) that performs a circular shift on the vector in R^u. Then the group action of SE(2) on a feature map h(I) ∈ R^{u×m×h×w} can be expressed as g(h_i(I))(x) = ρ(g_θ) h_i(I)(g^{-1}x) for each i ∈ {1 . . . m} (Eq. 2). This is illustrated in Fig 2: we can calculate the output feature map in the lower right corner by transforming the input by g and then doing the convolution (left side of Eq. 2), or by doing the convolution first and then taking the value at g^{-1}x and circular-shifting the output vector (right side of Eq. 2). In order to create a network that enforces the constraint of Eq. 1, we can simply stack equivariant convolution layers that each satisfy Eq. 2. Kernel constraint: The equivariance constraint of Eq. 2 can be implemented by strategically tying weights together in the convolutional kernel [4].
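As an illustration of the equivariance constraint of Eq. 2, the following sketch builds a single trivial-to-regular steerable layer with the e2cnn library (which the appendix says is used for the steerable layers) and checks Eq. 2 numerically for a 90-degree rotation. The layer sizes here are hypothetical, and exact e2cnn call signatures may vary slightly between library versions:

import torch
from e2cnn import gspaces
from e2cnn import nn as enn

r2_act = gspaces.Rot2dOnR2(N=4)                              # C_4: rotations by multiples of 90 degrees
in_type = enn.FieldType(r2_act, 1 * [r2_act.trivial_repr])   # "flat" 1-channel input image
out_type = enn.FieldType(r2_act, 8 * [r2_act.regular_repr])  # 8 regular-representation channels
conv = enn.R2Conv(in_type, out_type, kernel_size=3, padding=1)

x = enn.GeometricTensor(torch.randn(1, 1, 65, 65), in_type)
g = 1                                                        # the 90-degree rotation element of C_4

lhs = conv(x.transform(g)).tensor            # rotate the input, then convolve (left side of Eq. 2)
rhs = conv(x).transform(g).tensor            # convolve, then rotate the output spatially and
                                             # circularly shift the group dimension (right side of Eq. 2)
print(torch.allclose(lhs, rhs, atol=1e-5))   # True up to numerical error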
Since the standard convolutional kernel is already translation equivariant [13], we must only enforce rotational (C_u) equivariance [33]: K(g_θ y) = ρ_out(g_θ) K(y) ρ_in(g_θ)^{-1} (Eq. 3), where ρ_in(g_θ) and ρ_out(g_θ) are the permutation matrices of the group element g_θ acting on the input and output fibers (note that for the first layer, K(y) will be a 1 × u matrix, and ρ_in(g_θ) will be 1). More details are in Appendix B.

A baseline approach to encoding the Q function over a spatial action space is to use a fully convolutional network (FCN) that stacks convolutional layers to produce an output Q map with the same resolution as the input image. If we ignore the non-image state variables s_rbt and the non-spatial action variables a_arb, then we have all the tools we need -- we simply replace all convolutional layers with equivariant convolutions and the Q network becomes fully equivariant.

Partial Equivariance: Unfortunately, in realistic robotics problems, the Q function is generally not equivariant with respect to all state and action variables. For example, the non-equivariant parts of state and action in Section 3 are s_rbt and a_arb. We encode a_arb by simply having a separate output head for each of its values. However, to encode s_rbt, we need a mechanism for inserting the non-equivariant information into the neural network model without "breaking" the equivariance property. We explored two approaches: the lift expansion approach and the dynamic filter approach. In the lift expansion, we tile the non-equivariant information across the equivariant dimensions of the feature map as additional channels (Fig 3a). In the dynamic filter approach [34], the non-equivariant data is passed through a separate pathway that outputs the weights of an equivariant kernel that is convolved with the main equivariant backbone; we constrain this filter to be equivariant by enforcing the kernel constraint of Eq. 3 (Fig 3b). We empirically find that both methods have similar performance (Appendix G.2). In the remainder of this paper, we use the dynamic filter approach because it is more memory efficient; a minimal sketch of this mechanism appears at the end of this section.

Encoding Gripper Symmetry Using Quotient Groups: Another symmetry that we want to leverage is the bilateral symmetry of the gripper. The outcome of a pick action performed using a two-finger gripper in orientation θ is the same as for the gripper in orientation θ + kπ for any integer k. Similarly, it is often valid to assume that the outcome of place actions is invariant under such a flip. We model this invariance using the quotient group C_u/C_2, where quotienting by C_2 = {0, π} equates rotations which differ by multiples of π. The steerable layer defined under the quotient group is applied with the same constraint as in Eq. 3, except that the output space will be in C_u/C_2.

Experimental Domains: We evaluate the equivariant FCN approach in the Block Stacking and Bottle Arrangement tasks shown in Fig 1. Both environments have sparse rewards (+1 at the goal and 0 otherwise). The world state is encoded by a 1-channel heightmap I ∈ R^{1×h×w} and the robot state is encoded by an image patch H that describes the contents of the robotic hand. The non-spatial action variable a_arb ∈ {PICK, PLACE} is selected by the gripper state, i.e., a_arb = PLACE if the gripper is holding an object, and PICK otherwise. The equivariant layers of the FCN are defined over the group C_12, where the output is with respect to the quotient group C_12/C_2 to encode the gripper symmetry. See Appendix C and D for details on the experimental domains and the FCN architecture, respectively.
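Before turning to the experiments, here is a minimal plain-PyTorch sketch of the dynamic filter mechanism for partial equivariance described above. It is a simplified stand-in under our own naming and sizing assumptions: it generates a per-sample convolution kernel from the non-equivariant robot-state encoding, but it does not tie the generated weights to satisfy the steerable kernel constraint of Eq. 3 (the paper's implementation generates only the free weights of an e2cnn steerable kernel):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterHead(nn.Module):
    # Sketch: the non-equivariant robot-state encoding is mapped to per-sample
    # convolution weights that are applied to the equivariant backbone features.
    def __init__(self, state_dim, in_channels, out_channels, k=3):
        super().__init__()
        self.cin, self.cout, self.k = in_channels, out_channels, k
        self.filter_gen = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, out_channels * in_channels * k * k))

    def forward(self, feat, s_rbt):
        # feat: (B, in_channels, H, W) backbone output; s_rbt: (B, state_dim)
        B, _, H, W = feat.shape
        w = self.filter_gen(s_rbt).view(B * self.cout, self.cin, self.k, self.k)
        # grouped convolution applies each sample's own kernel to its own feature map
        out = F.conv2d(feat.reshape(1, B * self.cin, H, W), w,
                       padding=self.k // 2, groups=B)
        return out.view(B, self.cout, H, W)

# example usage with made-up sizes: 6 output Q maps conditioned on a 64-d state vector
head = DynamicFilterHead(state_dim=64, in_channels=16, out_channels=6)
q_maps = head(torch.randn(4, 16, 128, 128), torch.randn(4, 64))  # (4, 6, 128, 128)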
Experimental Comparison With Baselines: We evaluate against the following baselines: 1) Conventional FCN: an FCN with 1-channel input and 6-channel output, where each output channel corresponds to a Q map for one rotation in the action space (similar to Satish et al. [19] but without the z dimension). 2) RAD [7] FCN: same architecture as 1), but at each training step we augment each transition in the minibatch with a rotation randomly sampled from C_12. 3) DrQ [8] FCN: same architecture as 1), but at each training step the Q targets and Q outputs are calculated by averaging over multiple augmented versions of the sampled transitions; random rotations sampled from C_12 are used for the augmentation. 4) Rot FCN: an FCN with 1-channel input and 1-channel output, where rotation is encoded by rotating the input and output for each θ [21]. 5) Transporter Network [1]: an FCN-based architecture whose last layer is a dynamic kernel generated by a separate FCN with an input of an image crop at the pick location. Baselines 2) and 3) are data augmentation methods that aim to learn the symmetry encoded in our equivariant network, using rotational data augmentation sampled from the same symmetry group (C_12) as used by our equivariant model. All baselines have the same action space as our proposed method. More detail on the baselines is in Appendix E.1. All methods except the Transporter Network use SDQfD, an approach to imitation learning in spatial action spaces that combines a TD loss term with penalties on non-expert actions [29]. (Transporter Network is a behavior cloning method.) Table 1 shows the number of demonstration steps; those expert transitions are each augmented with 9 random transformations. Fig 4 shows the results. Our equivariant FCN outperforms all baselines in the Block Stacking task. Notice that in the Bottle Arrangement task, the equivariant network learns faster than the baselines but converges to a similar level as RAD and DrQ. This is because the bottles are cylindrical, so the domain itself is already partially invariant to rotation and our network has less of an advantage.

The FCN approach does not scale well to challenging manipulation problems. Therefore, we design an equivariant version of the augmented state representation (ASR) method of [29], which has been shown to be faster and to have better performance. The ASR method transforms the original MDP with a high dimensional action space into a new MDP with an augmented state space but a lower dimensional action space. Instead of encoding the value of all dimensions of action in a single neural network, this model encodes the value of different factorized parts of the action space, such as position and orientation, using separate neural networks conditioned on prior action choices. The spatial action is still a member of ŜE(2), the finite approximation of SE(2) as in Section 4.1. However, the Q function is now computed using two separate functions, the position function Q_1(s, x) = max_θ Q(s, (x, θ)) and the orientation function Q_2((s, x), θ) = Q(s, (x, θ)). Q_1 is encoded using a fully convolutional network q_1 : R^{n×h×w} → R^{1×h×w} that takes an n-channel image I as input and produces a 1-channel Q map that describes Q_1(s, x) for all x. We evaluate Q_2 on the "augmented state" (s, x), which contains the state s and the chosen x. The augmented state is encoded using the image patch P = CROP(I, x) ∈ R^{n×h'×w'} cropped from I and centered at x. We model Q_2 using the network q_2 : R^{n×h'×w'} → R^u that takes P as input and outputs Q_2((s, x), θ) for all u different orientations θ.
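The following minimal sketch (our illustration; the function handles, the patch size, and the padding-based crop are assumptions, not the paper's code) shows how q_1 and q_2 could be combined for greedy action selection, which the next paragraph describes in detail:

import torch
import torch.nn.functional as F

def select_action(q1, q2, I, patch_size=24):
    # q1: (1, n, h, w) -> (1, 1, h, w) positional Q map
    # q2: (1, n, p, p) -> (1, u) rotational Q values for the cropped patch
    with torch.no_grad():
        q1_map = q1(I.unsqueeze(0))[0, 0]                    # (h, w)
        idx = q1_map.flatten().argmax()
        r, c = int(idx // q1_map.shape[1]), int(idx % q1_map.shape[1])
        pad = patch_size // 2
        I_pad = F.pad(I, (pad, pad, pad, pad))               # pad H and W so the crop stays in bounds
        P = I_pad[:, r:r + patch_size, c:c + patch_size]     # patch centered at the chosen pixel
        q2_vals = q2(P.unsqueeze(0))[0]                      # (u,)
        theta_idx = int(q2_vals.argmax())
    return (r, c), theta_idx, float(q2_vals[theta_idx])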
These two networks are used together for both action selection and evaluation of target values during learning. We evaluate x* = argmax_x q_1(I), calculate P = CROP(I, x*), and then evaluate θ* = argmax_θ q_2(P) and Q* = max_θ q_2(P). Note that the Q maps produced by q_1 and q_2 have a combined size of u + hw, significantly smaller than the Q map in the FCN approach, which has size uhw. Essentially, the ASR method takes advantage of the fact that, given an optimal position x, the optimal θ depends only on the local patch P.

Equivariant architecture for ASR in SE(2): We decompose the SE(2) equivariance property of Eq. 1 into two equivariance properties, one for q_1 and one for q_2. The equivariance property of q_1 is similar to that of Eq. 2 except that the output of q_1 has only one channel, which is invariant to rotations (since it is a maximum over all rotations). This means we can rewrite the q_1 equivariance property as q_1(gI) = g q_1(I), where g on the RHS of this equation translates and rotates the output Q map. In practice, we obtained the best performance for q_1 by enforcing equivariance to the dihedral group D_4, which is generated by 90-degree rotations and reflections over the coordinate axes. For q_2, we used an equivariant feature map that outputs a single u-dimensional vector of Q values corresponding to the finite cyclic group C_u used; rotating the input patch by g_θ circularly shifts this output vector, i.e., q_2(g_θ P) = ρ(g_θ) q_2(P). (We use C_12/C_2 and C_32/C_2 in our experiments below.) We handle the partial equivariance using the same strategies as earlier. Appendix D describes the model details. The network q_2 is defined using C_12 and its quotient group C_12/C_2 to match Section 4.2. The ASR method surpasses the FCN method in both tasks.

More Challenging Experimental Domains: The equivariant ASR method is able to solve more challenging manipulation tasks than the equivariant FCN can. In particular, we could not run the FCN with as large a rotation space because it requires more GPU memory. We evaluate on the following four additional domains: House Building, Covid Test, Box Palletizing (introduced in [1]), and Bin Packing (Fig 1(c-f)). All domains except Bin Packing have sparse rewards. In Bin Packing, the agent obtains a positive reward inversely proportional to the highest point in the pile after packing all objects. See Appendix C for more details about the environments. We now define q_2 using the group C_32 and its quotient group C_32/C_2, i.e., we now encode 16 orientations ranging from 0 to π. As in Section 4.2, we use the SDQfD loss term to incorporate expert demonstrations (except for Transporter Net, which uses standard behavior cloning exclusively). The number of expert transitions provided for each domain is shown in Table 1.

We also evaluated the learned policies on a physical robot; Table 2 shows the results. In Bottle Arrangement, the robot shows a 90% success rate. In one of the two failures, the arrangement is not compact enough, leaving not enough space for the last bottle. In the other failure, the robot arranges the bottles outside of the tray. In the House Building task, the robot succeeds in all 20 episodes. In the Box Palletizing task, the robot demonstrates a 95% success rate. In the failure, the robot correctly stacks 16 of 18 boxes, but the 17th box's placement is offset slightly from the rest of the stack and there is no room to place the last box. The same problem happens in another, ultimately successful episode, where the fingers squeeze the boxes and make room for the last box.
A strength of the ASR method is that it can be extended to SE(3) by adding networks similar to q_2 that encode Q values for additional dimensions of the action space [29]. Specifically, we add three networks to the SE(2)-equivariant architecture described in Section 4.3: q_3, q_4, and q_5, encoding Q values for the Z (height above the plane), Φ (rotation in the XZ plane), and Ψ (rotation in the YZ plane) dimensions of SE(3). Each of these networks takes as input a stack of three orthographic projections of a point cloud along the coordinate planes. The point cloud is re-centered and rotated to encode the partial SE(3) action (see [29] for details). Unfortunately, we cannot easily make these additional networks equivariant using the same methods we have proposed so far because they encode variation outside of SE(2). Instead, we create an encoding that is approximately equivariant by explicitly transforming the input to these networks in a way that corresponds to a set of candidate robot positions or orientations (called a "deictic" encoding in [22]). We will describe this idea using q_2 as an example. Define q̄_2(P) ∈ R to output the single Q value of the rotation encoded in the input image P. Then q_2 can be defined as a vector-valued function: q_2(P) = (q̄_2(g_1^{-1}P), . . . , q̄_2(g_u^{-1}P)), where g_1, . . . , g_u ∈ C_u. This q_2 is approximately equivariant because everything learned by q̄_2 is automatically replicated in all orientations. We design the deictic q_3, q_4, and q_5 similarly by selecting a finite subset {g_1, . . . , g_K} ⊂ SE(3) corresponding to the dimension of the action space encoded by each q_i; q_i can then be defined by evaluating a network q̄_i over the input P transformed by each g_k, k = 1, . . . , K. For q_3, we evaluate over 36 translations g_k(z) = z + k(0.18/36) + 0.02 where 0 ≤ k ≤ 35. For q_4 and q_5, we use rotations g_k ∈ {−π/8, −π/12, −π/24, 0, π/24, π/12, π/8}. Note that we use q_2 only for explanation; our model uses the equivariant q_1 and q_2 and the deictic q_3–q_5. For an ablated version using a deictic q_2, see Appendix G.4.3 and G.4.4.

Comparison to Non-Equivariant Approaches: We evaluate ASR in SE(3) in the House Building and Box Palletizing domains. We modified those environments so that objects are presented randomly on a bumpy surface with a maximum out-of-plane orientation of 15 degrees (Fig 9). In order to succeed, the agent must correctly perform pick and place actions with the needed height and out-of-plane orientation. We evaluated the Equivariant ASR in comparison with a baseline Conventional ASR (same as [29]). Both methods use SDQfD with 2000 expert demonstration steps. The results are shown in Fig 10. Our proposed approach outperforms the baseline by a significant margin.

In this paper, we show that equivariant neural network architectures can be used to improve Q learning in spatial action spaces. We propose multiple approaches and model architectures that accomplish this and demonstrate improved sample efficiency and performance on several robotic manipulation applications, both in simulation and on a physical system. This work has several limitations and directions for future research. First, our methods apply directly only to problems in spatial action spaces. While many robotics problems can be expressed this way, it would clearly be useful to develop equivariant models for policy learning that can be used in other settings.
Second, although we extend our ASR approach from SE(2) to SE(3) in the last section of this paper, this solution is not fully equivariant in SE(3), and it may be possible to do better by exploiting methods that are directly equivariant in SE(3).

We need the following lemma regarding the visual state space S and spatial action space A described in Section 3. We use the following notation: gS = {gs | s ∈ S} and gA = {ga | a ∈ A}. Lemma A.1. Let S be a visual state space and let A be a spatial action space. Then, ∀g ∈ SE(2), we have that S = gS and A = gA. Proof. First, consider the claim that S = gS. We will show 1) S ⊆ gS and 2) gS ⊆ S. 1) S ⊆ gS: This follows from the closure of state under g ∈ SE(2). 2) gS ⊆ S: Let s' ∈ gS. By the definition of gS, ∃s ∈ S such that gs = s'. Multiplying both sides by g^{-1}, we have g^{-1}(gs) = g^{-1}s'. Using Assumption 3.3, we have g^{-1}s' = s ∈ S. Using the closure of state under g, we have gs ∈ S, i.e., s' ∈ S. A parallel argument can be used to show A = gA. Notice that Eq. 9 and Eq. 4 are the same Bellman equation. Since solutions to the Bellman equation are unique, we have that ∀(s, a) ∈ S × A, Q*(s, a) = Q̄*(s, a) = Q*(gs, ga).

Consider a standard convolutional layer that takes an n × h × w feature map as input and produces an m × h × w map as output. It computes h_i(x) = Σ_{y,j} K_ij(y) I_j(x + y), where j ∈ {1 . . . n}, i ∈ {1 . . . m}, I_j(x) is the value of the input at pixel x and channel j, h_i(x) is the output at pixel x and channel i, and K_ij(y) is the kernel value at y for input channel j and output channel i. For a standard convolutional layer, I_j(x), h_i(x), and K_ij(y) are all scalars. However, for an equivariant network over C_u, h_i(x) becomes a u-element vector and K_ij(y) becomes a u × u matrix. The u elements of h_i(x) encode the feature values of pixel x at channel i at each orientation in C_u. The kernel constraint is [33]: K_ij(g_θ y) = ρ_out(g_θ) K_ij(y) ρ_in(g_θ)^{-1}, where ρ_in(g_θ) and ρ_out(g_θ) are the permutation matrices of the group element g_θ (note that for the first layer, K_ij(y) will be a 1 × u matrix, and ρ_in(g_θ) will be 1).

Figure 11: (a) All eight bottle models in the Bottle Arrangement task. (b) All seven object models in the Bin Packing task.

C Experimental Domains. In the Block Stacking task (Fig 1a), there are four cubic blocks with a fixed size of 3cm × 3cm × 3cm randomly placed in the workspace. The goal is to stack all four blocks into a single stack. An optimal policy requires six steps to finish this task, and the maximal number of steps per episode is 10. In the Bottle Arrangement task (Fig 1b), six bottles with random shapes (sampled from the 8 different shapes shown in Fig 11a; the bottle shapes are generated from the 3DNet dataset [35], and each bottle is around 5cm × 5cm × 14cm) and a tray with a size of 24cm × 16cm × 5cm are randomly placed in the workspace. The agent needs to arrange all six bottles in the tray. An optimal policy requires 12 steps to finish this task, and the maximal number of steps per episode is 20. In the House Building task (Fig 1c), there are four cubes with a size of 3cm × 3cm × 3cm, a brick with a size of 12cm × 3cm × 3cm, and a triangle block with a bounding box of around 12cm × 3cm × 3cm. The agent needs to stack those blocks in a specific way to build a house-like block structure as shown in Fig 1c. An optimal policy requires 10 steps to finish this task, and the maximal number of steps per episode is 20.
In the Covid Test task (Fig 1d), there is a new tube box (purple), a test area (gray), and a used tube box (yellow) placed arbitrarily in the workspace but adjacent to one another. Three swabs with a size of 7cm × 1cm × 1cm and three tubes with a size of 8cm × 1.7cm × 1.7cm are initialized in the new tube box. To supervise a COVID test, the robot needs to present a pair of a new swab and a new tube from the new tube box to the test area (see the middle figure in Fig 1d). The simulator simulates the user taking the COVID test by putting the swab into the tube and randomly placing the used tube in the test area. Then the robot needs to re-collect the used tube into the used tube box. Each episode includes three rounds of COVID testing. An optimal policy requires 18 steps to finish this task, and the maximal number of steps per episode is 30. In the Box Palletizing task (Fig 1e) (some object models are derived from Zeng et al. [1]), a pallet with a size of 23.2cm × 19.2cm × 3cm is randomly placed in the workspace. The agent needs to stack 18 boxes with a size of 7.2cm × 4.5cm × 4.5cm as shown in Fig 1e. At the beginning of each episode and after the agent correctly places a box on the pallet, a new box is randomly placed in the empty workspace. An optimal policy requires 36 steps to finish this task, and the maximal number of steps per episode is 40. In the Bin Packing task (Fig 1f), eight objects (the shape of each is randomly sampled from the seven different objects in Fig 11b; object models are derived from Zeng et al. [21]) with a maximum size of 8cm × 4cm × 4cm and a minimum size of 4cm × 4cm × 2cm and a bin with a size of 17.6cm × 14.4cm × 8cm are randomly placed in the workspace. The agent needs to pack all eight objects in the bin while minimizing the highest point (h_max cm) of all objects in the bin. The Bin Packing task has real-valued sparse rewards: a reward of 8 − h_max is given when all objects are placed in the bin. An optimal policy requires 16 steps to finish this task, and the maximal number of steps per episode is 20. In the SE(3) House Building (Fig 9a) and Box Palletizing (Fig 9b) tasks, a bumpy surface is generated by nine pyramid shapes with a random angle sampled from 0 to 15 degrees. The orientation of the bumpy surface along the z axis is randomly sampled at the beginning of each episode. In the Bumpy House Building task, a flat platform with a size of 13cm × 13cm and a height equal to the highest bump is randomly placed in the workspace. The agent needs to build the house on top of the platform. In the Bumpy Box Palletizing task, the pallet is raised by the same height as the highest bump (so that it sits horizontally). All other parameters mirror the original House Building task and the original Box Palletizing task.

All of our network architectures are implemented using PyTorch [36]. We use the e2cnn [14] library to implement the steerable convolutional layers. Appendix D.1 and Appendix D.2 respectively show the network architectures of the equivariant FCN and equivariant ASR using the dynamic filter for partial equivariance. Appendix D.3 shows the architecture of the lift expansion partial equivariance. Appendix D.4 shows the architecture of the deictic encoding. In the Equivariant FCN architecture (Fig 12a), we use a 16-stride UNet [37] backbone where all layers are steerable layers. The input is viewed as a trivial representation and is turned into a 16-channel regular representation feature map after the first layer.
Every layer afterward in the UNet uses the regular representation, and the output of the UNet is a 16-channel regular representation feature map. This feature map is sent to a quotient representation layer to generate the pick Q value maps for each θ. For the place Q values, the non-equivariant information from H must be incorporated. H is sent to 4 conventional convolutional layers followed by 2 FC layers. The output is a vector with the same size as the number of free weights in a 16-channel regular representation steerable layer with a kernel size of 3 × 3. This output vector is expanded into a steerable convolutional kernel and is convolved with the output of the UNet. The result is sent to a quotient representation layer to generate the place Q value maps for each θ. Fig 12b shows the Equivariant ASR network architecture. The q_1 architecture is very similar to the Equivariant FCN network. Its output is a trivial representation instead of a regular representation, to generate only one Q map for the x, y positions. The bottleneck feature map is passed through a group pooling layer (a max pooling over the group dimension) to form e(s), a state encoding that is used by q_2. q_2 uses e(s) and the feature vector from H to generate the weights for a steerable dynamic filter. q_2 processes P using a set of steerable convolution layers in the regular representation, then convolves the feature map with the dynamic filter. The result of the dynamic filter layer is sent to two separate quotient representation layers to generate pick and place values for each θ.

(In Fig 12, a convolutional layer with a suffix of R indicates a regular representation layer (e.g., 16R is a 16-channel regular representation layer); a suffix of Q indicates a quotient representation layer (e.g., 1Q is a 1-channel quotient representation layer); a suffix of T indicates a trivial representation layer (e.g., 1T is a 1-channel trivial representation layer); a suffix with a number only indicates a conventional convolutional layer. The convolutional layer colored in cyan is the dynamic filter layer, whose weights come from the FC layer pointing to it.) In the lift expansion architecture, the non-equivariant feature vector is tiled and concatenated with the regular representation feature map (the output of the rightmost convolutional layer in the middle row). In ASR, the same lift expansion network can be used in q_1.

The RAD [7] baseline uses the same baseline architecture, but during training each transition in the minibatch is augmented with a rotation randomly sampled from C_12. The DrQ [8] baseline uses the same baseline architecture, but the Q targets are calculated by averaging over K augmented versions of the sampled transitions and the Q estimates are calculated by averaging over M augmented versions of the sampled transitions. Random rotations sampled from C_12 are used for the augmentation, and we use K = M = 2 as in [8]. Note that in RAD and DrQ, since we are learning an equivariant Q network instead of an invariant Q network, we apply the rotational augmentation to both the state and the action, rather than only augmenting the state as in the prior works. The Rot FCN baseline uses the same network backbone, but the number of output channels is n = 2 (for pick and place, respectively). Rotations are encoded by rotating the input and output accordingly for each θ in the action space [21]. The Transporter baseline uses three FCNs (one for picking and two for placing) with the same FCN backbone shown in Fig 15.
For placing, there are two networks with the same architecture, one for features (with an input of I) and one for filters (with an input of H), and the outputs of both are 3-channel feature maps. The correlation between them forms the 1-channel output. Rotations are encoded by rotating the input H for each θ in the action space. The pick network is the same as in the Rot FCN baseline. Fig 12b shows the network architecture for the Conventional ASR baseline. The RAD [7] baseline uses the same baseline architecture, but during training each transition in the minibatch is augmented with a rotation randomly sampled from C_32. The DrQ [8] baseline uses the same baseline architecture, but the Q targets are calculated by averaging over K augmented versions of the sampled transitions and the Q estimates are calculated by averaging over M augmented versions of the sampled transitions. Random rotations sampled from C_32 are used for the augmentation, and we use K = M = 2 as in [8]. The Transporter network baseline uses the same architecture as in Appendix E.1. The margin loss L_SLM is combined with the TD loss L_TD: L = L_TD + w L_SLM, where w is the weight for the margin loss. Note that L_SLM is only applied to expert transitions, while on-policy transitions only use the TD loss term.

We implement our experimental environments using the PyBullet simulator [32]. The workspace has a size of 0.4m × 0.4m. In Section 4.2, I covers the workspace with a size of 90 × 90 pixels, and is padded with 0 to 128 × 128 pixels (this padding is required for the Rot FCN baseline because it needs to rotate the image to encode θ; to ensure a fair comparison, we apply the same padding to all methods). In Section 4.3, I covers the workspace with a size of 128 × 128 pixels. The in-hand image H is a 24 × 24 image crop centered and aligned with the previous pick in the SE(2) experiments. We train our models using PyTorch [36] with the Adam optimizer [39] with a learning rate of 10^{-4} and weight decay of 10^{-5}. We use the Huber loss [40] for the TD loss and the cross entropy loss for the behavior cloning loss. The discount factor γ is 0.95. The batch size is 16 for SDQfD agents and 8 for behavior cloning agents. In SDQfD, we use the prioritized replay buffer [41] with prioritized replay exponent α = 0.6 and prioritized importance sampling exponent β_0 = 0.4 as in Schaul et al. [41]. The expert transitions are given a priority bonus of ε_d = 1 as in Hester et al. [38]. The buffer has a size of 100,000 transitions. The weight w for the margin loss term is 0.1, and the margin is l = 0.1.

In Section 4.2 and Section 4.3, we compare the equivariant architectures with RAD [7] and DrQ [8] with rotational data augmentation. In this experiment, we run the comparison with more data augmentation operators: 1) Rotation: random rotation in C_12 and C_32, same as in Section 4.2 and Section 4.3. 2) Translation: random translation. 3) SE(2): the combination of 1) and 2). 4) Shift: random shift of ±4 pixels as in [8]. Note that only 1) is a fair comparison because our equivariant models do not inject extra translational knowledge into the network. Even so, the equivariant networks outperform all data augmentation methods in five out of the six environments. In this experiment, we compare the Dynamic Filter and Lift Expansion methods for encoding the partial equivariance property. We evaluated both the equivariant FCN architecture and the equivariant ASR architecture (note that we only test this variation in q_1;
q_2 uses the Dynamic Filter regardless of the architecture of q_1). The results are shown in Fig 18. Both methods generally perform equally well. In this experiment, we evaluate the performance of our equivariant network in a behavior cloning setting compared with the Transporter network [1]. Both methods use the same cross entropy loss function and the same data augmentation strategy. The experimental parameters mirror Section 4.2. The results are shown in Table 3. The equivariant network outperforms the Transporter network in both environments.

G.4.1 Only Using the Equivariant Network in q_1 or q_2: In this ablation study, we evaluate the effect of the equivariant network by only applying it in q_1 or q_2. There are four variations: 1) Equivariant q_1 + Equivariant q_2: both q_1 and q_2 use the equivariant network; 2) Equivariant q_1 + Conventional q_2: q_1 uses the equivariant network, q_2 uses the conventional convolutional network; 3) Conventional q_1 + Equivariant q_2: q_1 uses the conventional convolutional network, q_2 uses the equivariant network; 4) Conventional q_1 + Conventional q_2: both q_1 and q_2 use the conventional convolutional network. The results are shown in Fig 19, where using the equivariant network in both q_1 and q_2 (blue) always shows the best performance. Note that only applying the equivariant network in q_2 (red) demonstrates a greater improvement than only applying the equivariant network in q_1 (green) in three out of four environments. This is because q_2 is responsible for providing the TD target for both q_1 and q_2 [29], which raises its importance in the whole system.

In this experiment, we evaluate two different symmetry groups that q_1 can be defined upon: the cyclic group C_8, which encodes eight rotations every 45 degrees, and the dihedral group D_4, which encodes four rotations every 90 degrees plus reflection. Both groups have order 8, i.e., the networks are equally heavy. As shown in Fig 20, D_4 has a minor advantage over C_8.

This experiment compares the deictic encoding equipped with the equivariant ASR and the conventional ASR in SE(2). The comparison is conducted with the following four variations: 1) Equivariant q_1 + Equivariant q_2: both q_1 and q_2 use the equivariant network; 2) Equivariant q_1 + Deictic q_2: q_1 uses the equivariant network, q_2 uses the deictic encoding; 3) Conventional q_1 + Conventional q_2: both q_1 and q_2 use the conventional convolutional network; 4) Conventional q_1 + Deictic q_2: q_1 uses the conventional convolutional network, q_2 uses the deictic encoding. The results are shown in Fig 21. When q_1 uses the equivariant network, using the deictic encoding in q_2 (green) outperforms using the equivariant network in q_2 (blue) in Bin Packing, while the equivariant q_2 outperforms in House Building. In Covid Test and Box Palletizing, they tend to have similar performance. When q_1 uses the conventional CNN, using the deictic encoding in q_2 (red) generally provides a significant performance improvement compared with using the conventional CNN in q_2 (purple). In Covid Test, the use of the deictic encoding does not make a big difference; we suspect that this is because in Covid Test the bottleneck of the whole system is q_1. This experiment studies the different network choices for q_1 (equivariant network, conventional network), q_2 (equivariant network, deictic encoding, conventional network), and q_3–q_5 (deictic encoding, conventional network).
We evaluate two proposed approaches: 1) Equi+Equi+Equi uses the equivariant network in q_1 and q_2 and the deictic encoding in q_3 through q_5 (the three components of the name refer to the architecture of q_1, q_2, and q_3–q_5); 2) Equi+Deic+Deic uses the equivariant network in q_1 and the deictic encoding in q_2 through q_5. We compare these with the following baselines: 1) Equi+Conv+Conv uses the equivariant network in q_1 and the conventional convolutional network in q_2 through q_5; 2) Conv+Equi+Deic uses the conventional convolutional network in q_1, the equivariant network in q_2, and the deictic encoding in q_3 through q_5; 3) Conv+Deic+Deic uses the conventional convolutional network in q_1 and the deictic encoding in q_2 through q_5; 4) Conv+Conv+Conv uses the conventional convolutional network in q_1 through q_5. (Table 5 reports the average time for each training step in a rotation space of C_32/C_2.) The results are shown in Fig 22, where our two proposed approaches outperform the baseline architectures in both environments. Note that swapping the conventional convolutional network for the equivariant network or the deictic encoding generally improves performance, except that Equi+Conv+Conv underperforms Conv+Conv+Conv in Bumpy Box Palletizing. We suspect that this is because the target of q_1 given by the conventional convolutional networks is less stable.

Table 4 and Table 5 show the average runtime in the settings of the experiments in Section 4.2 and Section 4.3, respectively. The runtime is calculated by averaging over 500 training steps on a single Nvidia RTX 2080 Ti GPU. Both the Equivariant FCN and the Equivariant ASR require a longer time for each training step. However, the Equivariant ASR is faster than the Equivariant FCN, with a training-step time similar to that of the best performing data augmentation method, DrQ.

To ensure better sim-to-real transfer, we train the models used in the real world with Perlin noise [42] (with a maximum magnitude of 7mm) applied to the observations. Bottle Arrangement: In the Bottle Arrangement task, we use the bottles and tray shown in Fig 23 for testing. House Building: In the House Building task, we train the model with object size randomization within ±8.3%. A Gaussian filter is applied after the Perlin noise during training to make the observation noisier. The model is trained for 20k episodes instead of 10k as in the simulation experiment. Box Palletizing: In the Box Palletizing task, we add an object size randomization within ±3.75% and increase the size of H and P from 24 × 24 to 40 × 40.
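As a rough illustration of the observation-noise augmentation used for sim-to-real transfer, the sketch below adds smooth, low-frequency height noise to a heightmap by bilinearly upsampling a coarse random grid. This is a simple stand-in we wrote for illustration, not the paper's actual Perlin noise [42] implementation; only the 7 mm magnitude comes from the text above:

import torch
import torch.nn.functional as F

def smooth_height_noise(h, w, max_mag=0.007, grid=8):
    # coarse random grid, bilinearly upsampled to h x w -> smooth low-frequency noise
    coarse = (torch.rand(1, 1, grid, grid) * 2 - 1) * max_mag
    return F.interpolate(coarse, size=(h, w), mode='bilinear', align_corners=True)[0, 0]

# example: perturb a 1 x h x w heightmap observation I (units: meters, max 7 mm)
# I_noisy = I + smooth_height_noise(I.shape[-2], I.shape[-1])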
References:
Transporter networks: Rearranging the visual world for robotic manipulation
Representation learning with contrastive predictive coding
Curl: Contrastive unsupervised representations for reinforcement learning
Gradient-based learning applied to document recognition
Imagenet classification with deep convolutional neural networks
Reinforcement learning with augmented data
Image augmentation is all you need: Regularizing deep reinforcement learning from pixels
Generalization in reinforcement learning by soft data augmentation
Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation
Invariant transform experience replay: Data augmentation for deep reinforcement learning
A framework for efficient robotic manipulation
Group equivariant convolutional networks
General e(2)-equivariant steerable cnns
Plannable approximations to mdp homomorphisms: Equivariance under actions
Mdp homomorphic networks: Group symmetries in reinforcement learning
Group equivariant deep reinforcement learning
Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach
On-policy dataset synthesis for learning robot grasping policies using fully convolutional deep networks
Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching
Learning synergies between pushing and grasping with self-supervised deep reinforcement learning (IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
Deictic image mapping: An abstraction for learning pose invariant manipulation policies
Learning manipulation skills via hierarchical spatial attention
Robot learning of shifting objects for grasping in cluttered environments
Knowledge induced deep q-network for a slide-to-wall object grasping
Learning to throw arbitrary objects with residual physics
Form2fit: Learning shape priors for generalizable assembly from disassembly
Self-supervised learning for precise pick-and-place without object model
Policy learning in se(3) action spaces
Action priors for large action spaces in robotics
Spatial action maps for mobile manipulation
Pybullet, a python module for physics simulation for games, robotics and machine learning
A general theory of equivariant cnns on homogeneous spaces
Dynamic filter networks
Large-Scale Object Class Recognition from CAD Models
Automatic differentiation in PyTorch
U-net: Convolutional networks for biomedical image segmentation
Deep q-learning from demonstrations
Adam: A method for stochastic optimization
Robust estimation of a location parameter
Improving noise

Acknowledgments: This work is supported in part by NSF 1724257, NSF 1724191, NSF 1763878, NSF 1750649, and NASA 80NSSC19K1474. R. Walters is supported by a Postdoctoral Fellowship from the Roux Institute and NSF grants 2107256 and 2134178.