key: cord-0502071-61bo1za0
authors: Biswas, Ansuman; Gupta, Ashutosh; Missula, Meghana; Thattai, Mukund
title: Automated inference of production rules for glycans
date: 2021-07-05
journal: nan
DOI: nan
sha: 16a9b20edc89e438959d434b435ff1a959b07f57
doc_id: 502071
cord_uid: 61bo1za0

Glycans are tree-like polymers made up of sugar monomer building blocks. They are found on the surface of all living cells, and distinct glycan trees act as identity markers for distinct cell types. Proteins called GTase enzymes assemble glycans via the successive addition of monomer building blocks. The rules by which the enzymes operate are not fully understood. In this paper, we present the first SMT-solver-based iterative method that infers the assembly process of the glycans by analyzing the set of glycans from a cell. We have built a tool based on the method and applied it to infer rules based on published glycan data.

The ability to control the assembly of small building blocks into large structures is of fundamental importance in biology and engineering. Familiar examples of this process from biology include the synthesis of linear DNA from nucleotide building blocks and the synthesis of linear proteins from amino-acid building blocks. In both these examples, the synthesis is templated: the new DNA or protein molecule is essentially copied from an existing molecule. However, most biological assembly proceeds without a template. For example, when an adult animal is grown from a fertilized egg, the genome within the egg contains a dynamical recipe encoding the animal rather than a template. The genome restricts and controls the set of events that can take place subsequent to fertilization.

While the process of animal development is too complex to study comprehensively, the same themes arise in the synthesis of complex tree-like sugar polymers known as glycans [1] that are covalently attached to proteins. Unlike linear proteins and DNA, glycans are tree-like structures: their nodes are sugar monomers, and their edges are covalent carbon-carbon bonds. The tree-like structure of a glycan is a direct consequence of the fact that a sugar monomer can directly bond to at least three neighboring sugar monomers (in contrast to nucleotides or amino acids, which can only bind to two neighbors and are constrained to make a chain).

A given cell produces a specific set of glycan molecules that are present in the cell. Since different cells produce different sets of molecules, the assembly process must be programmable: the assembly process includes a set of production rules. The reactions that underlie glycan production are carried out by enzymes known as GTases [1] . There are hundreds of enzymes present in a given cell: each enzyme is a protein encoded by a distinct gene, which carries out a distinct biochemical reaction. The enzymes thus execute the production rules. A glycan tree is assembled piece by piece in successive steps. At each step, a production rule adds a small piece of a tree at the leaves or internal nodes of the current tree. Not all the rules are applicable at all the leaves. The monomer at a leaf and current surroundings of the leaf controls the applicability of a rule on the leaf. Identifying the exact set of the production rules by extensive biochemical experiments are costly and often needs an initial hypothesis for the rules to test. Biologists must identify the production rules and their control conditions by manually analyzing the set of observed glycans in a cell and using prior knowledge of biochemistry. We estimate that there may be more then 10 70 possible rule sets if we consider all biological variations. 1 This is an error-prone process since the production rules must generate exactly the set of molecules in the cells and nothing else; and moreover, the data sets about which glycans are present in which cells are often incomplete. Manually comprehending all possible tree generation rules is difficult and ad-hoc. It would be useful to automate the process of learning which rules are operating in a given cell, based on incomplete data.

In this paper, we are presenting the first automated synthesis method for the production rules. Our method takes the observed glycan molecules in a cell as input and synthesizes the possible production rules that may explain the observation. To our knowledge, our work is the first to consider the computational problem. Our method of synthesis is similar to counterexample guided inductive synthesis(CEGIS) [2] . Several methods for solving problems of searching in a complex combinatorial space use templates to define and limit their search space, such as learning invariants of programs [3] , and synthesizing missing components in programs [4, 5] . We also use templates to model the production rules.

Our method is iterative. We first construct constraints encoding that a set of unknown rules defined by templates can assemble the input set of molecules. The generated constraints involve Boolean variables, finite range variables, and integer ordering constraints. We solve the constraints using an off-the-shelf SMT solver. We call the query to the solver synthesis query. If the constraints are unsatisfiable, there are no production rules with the search space defined by the templates. Otherwise, a solution of the constraints gives a set of rules.

However, there is also an additional requirement that a molecule that is not in the input set must not be producible by the synthesized rules. Therefore, the method looks for the producible molecules that are not in the input set. Again the search of the molecule is assisted by a template, which bounds the height of searched molecules. We generate another set of constraints using the templates for the unknown molecule. We again solve the constraints using an SMT solver. We call the query to the solver counterexample query. If there is no such molecule, our method reports the synthesized rules. Otherwise, we have found a producible molecule that is not in the input set.

We append our synthesis constraints with additional constraints stating that no matter how we apply the synthesized rules, they will not produce the extra molecule. Since there is a requirement that all possible applications of rules must satisfy a condition, we have quantifiers in the constraints. We use a solver that handles quantifiers over finite range variable in the synthesis query. We go to the next iteration of the method. The method always terminates because the search space of rules is finite. The set of rules synthesized need not be minimal or unique. The solver reports the first set which satisfies the constraints. However, our method is adoptable. We can add various optimization criteria to find optimal rules for the given objectives, e.g., smallest rule sizes, number of rules, etc.

Our encoding to constraints depends on the model of execution of the rules. The current biological information is not sufficient to make a precise and definite model, and do the synthesis. We have also explored the variations of the models. For example, all rules may apply simultaneously. They apply in batches because they stay in different compartments. The molecule under assembly may flow through the compartments. The distribution of the stay of the molecules in a compartment also affects the execution model. Furthermore, we may have variations in the type and quality of data available to us. For example, we may have missed a produced molecule in experiments. We support the variations.

We have implemented the method in our tool GlySynth. We have applied the tool on data sets from published sources (available in the database UniCarbKB [6] ). The output rules are within the expectations of biological intuition.

We organize the paper as follows. In Section 2, we introduce the biological background. In Section 3, we present a motivating example to illustrate our method. In Section 4, we present the formal model of the glycans and their production rules. In Section 5, we present our method for the synthesis problem. In Section 6-7, we present our experiments and conclude the paper. We present the extended related work in Appendix A.

DNA and RNA are made by copying template DNA, in a process called transcription carried out by an enzyme called RNA polymerase. Proteins are made by copying a template messenger RNA, in a process called translation carried out by a molecular machine called a ribosome [7] . In contrast, glycans are grown without a template, in a process called glycosylation. Glycosylation is carried out, not by a single enzyme, but by a large collection of so-called GTase enzymes that assemble one sugar monomer at a time into a final glycan tree. This process involves an ordered series of reactions, in which an enzyme first recruits the correct monomer, the enzyme-monomer complex binds to the target glycan at the appropriate motif, and finally a chemical reaction occurs which serves to bind the new monomer at the correct place on the glycan. The enzyme's binding motif can corresponding to a single monomer, or a large sub-structure of the entire glycan several nodes deep [8] . This process is reminiscent of a factory assembly line to make a car [9] . However, the assembly process operates without a blueprint: the final glycan structure is determined by the behavior of the enzymes themselves.

The process of glycosylation is stochastic, governed by the Poisson statistics of single-step chemical reactions. One result of this stochasticity is that the enzymes can operate in different time orders [10] . It is as if factory workers could operate in many different orders while building the car, first adding doors and later windows. Moreover, the enzymes are promiscuous: they can add new monomers to many different places on the growing tree. This is as if the factory workers could add headlights at many different points on the car. Since there is no template, the existing tree determines where new monomers are added. Given the stochastic and promiscuous nature of the GTase enzymes, it is not surprising that the final product is highly variable [11] . The same set of enzymes can build many different glycan trees.

This variability is evident in the glycans observed to be produced by living cells. In a typical experiment, a protein is purified from a cell and the glycans attached to it are separated and their structure is characterized. Such an experiment produces a spectrum of glycan trees termed the protein's glycan profile [11] . A single glycan profile typically contains ten to twenty trees in measurable abundance, each tree being a tree of depth two to ten bonds.

In [12] , the authors had reported a method to infer the production rules when a single glycan is produced. However, the biologically interesting case is when the data set contains many glycan trees. This raises the following question: given a set of glycan trees produced by a cell, can we infer the set of enzymes that produce the glycans? This is the problem we tackle here.

In Figure 1 , we present details of glycan production. A glycan is a tree-like sugar tree (nodes linked by edges) attached to a substrate protein at the root (labeled 'R'). Distinct edge orientations correspond to covalent bonds of distinct carbons on the sugar monomer. Curved boxes represent reaction compartments within cells, which are the site of glycan production. Each step of glycan growth (black arrows) represents the addition of a single new monomer to a specific attachment point on the tree. Each such step is catalyzed by an enzyme, labeled E i . At any stage of growth, the tree can exit the reaction compartment as an output. Alternatively, it can be passed to a subsequent reaction compartment for further growth driven by different enzymes. Note that the enzymatic rule is sensitive to the two monomers being linked by a bond, as well as any branches. For example, enzyme E 2 will add a Galactose to a GalNAc only if the GlcNAc branch is present; otherwise, the reaction will not proceed ('X'). The structures, reactions, and enzymes shown here are illustrative, and they do not correspond to any measured data set; see the following section for a real example. In biological experiments, the combined outputs of every compartment are measured; the underlying reactions must be inferred.

In this section, we first present a motivating example to illustrate our method. In Figure 2 , we consider the glycan oligomers associated with human chorionic gonadotropin [13] . The data set has four glycan oligomers (shown in boxes and numbered). We assume that all these oligomers are built by starting from a root GalNAc (yellow square) by adding one monomer at a time (lines between glycans represent monomer addition reactions). At the top of the figure, we illustrate if all enzymes (rules) operate in a single compartment, a large number of glycans can potentially be made in addition to the measured ones. In the lower part of the Figure 1 : Biological details of glycan production. There are many types of sugar monomer building blocks; for example, GalNAc (yellow square), GlcNAc (blue square), Galactose (yellow circle), Sialic Acid (purple diamond), Fucose (red triangle) and so on [1] .

figure, we illustrate if the enzymes are split into three compartments (separated by dotted lines), then certain reactions are prevented from occurring. Thus, reducing the set of structures. In this case, we assume that only the terminal (bottom-most) structures will be produced as outputs. Here we have assumed certain rules of operation that are most consistent with the observed glycan data set. The goal of this paper is to infer the rules. Now we consider the synthesis problem. In Figure 3 (a), we present a set of glycan molecules present in a cell consists of three molecules, which are structurally similar to the three glycan molecules in Figure 2 . To keep illustration simple, we have dropped the third glycan molecule from the earlier set. The molecules contain four types of monomers. As we are considering the abstract case, we have named them A, B, C, D. Each monomer is associated with an arity, i.e., the maximum number of potential children. The arities of the monomers are 2, 1, 1, and 0, respectively.

Let us first consider six rules in Figure 3 (b) that produce the molecules. All the rules are in the same compartment, i.e., they can be applied in arbitrary order. The rules have two kinds of nodes. If the circular nodes are present around a node, the rule is enabled and may append the molecule at the node with the square nodes. In Figure 3 (c), we show the steps of generating the last glycan molecule. The first two steps add two nodes at a time. The last step looks at the two ancestors before adding a single node.

The second last rule in Figure 3 (b) has a non-trivial condition on the sibling of the anchor leaf node. It requires, the parent of the new node should be A and the right sibling must be B. If we do not have the sibling condition, we may be able to construct the molecule in Figure 3 (d) using the fourth and the modified fifth rule. The molecule is not in a subtree of any of the three input glycan molecules. Therefore, there are scenarios where rules must look into the context before applying themselves.

Our method for synthesis takes the three glycan molecules as input. It also needs the budget of resources to search for the production rules. If we allow an arbitrary number of rules, and the rules to look at their context up to an arbitrary depth, then there is a trivial solution. Therefore, our method limits the number and size of rules. For this illustration, we searched for the seven production rules with three as the limit on the rule heights. All rules are in a single compartment. GlySynth, the tool that implements the method, reported the synthesized rules from Figure 3 (b) in 0.85 seconds. In our tool, we first construct a synthesis query using the templates to encode that a set of rules produces the input molecules. We call a solver to solve the synthesis query. After the first query, we obtain the rules presented in Figure 3 (e). The rule set can produce molecules that are not in input. We need to iterate further. After 8 iterations, our tool synthesizes a set of rules that satisfies the requirements.

In this section, we present the formal model for the synthesis problem. We model glycan molecules and production rules as labeled trees. The glycan molecules are assembled by applying the production rules repeatedly. Our synthesis problem reduces into finding the pieces of trees that represent the production rules.

Let S be the set of sugar monomers that builds glycans, the oligomer molecules. Each s ∈ S is associated with arity m (written arity(s) = m). The children of the monomers are indexed. We refer to the kth child of s for some k ≤ arity(s). They correspond to bonds at specific positions in the monomers where children are connected. Now we define the glycan molecules as labeled trees. Now onward we refer to the glycans simply as molecules. 

Let height(m) be the length of the longest branch in m. We define ancestor relation recursively as follows. Let

Since we will be matching the parts of the trees and applying rules to expand them, let us introduce notations for matching. Let recursively-defined Now we define the model of the production process of the molecules. A production rule expands a molecule m by attaching a new piece of tree at a node that has a vacant spot among its children if the surroundings of the node satisfy some condition. The rule is modeled as a tree that has two parts. One part should already be there in m and the other part will be appended to m.

and v e ∈ V is the root of expanding part of the rule.

If we apply, a rule r on a molecule m, then it is extended at some node v ∈ m. A copy of the descendants of v e will be attached to v in m, and the rest of the nodes in the rule have to match v and above. We call the descendants of v e as expanding nodes and all the other nodes as matching nodes. Example: In Figure 4 (a), we present a rule. It has two kinds of nodes. The rule adds the square node (v e ). The circular nodes are the pattern, which must be present in the molecule to apply the rule. In Figure 4 (b), we present an application of the rule. The solid tree with three nodes is the initial molecule. The middle node A and its right child B form a pattern, where the rule is applicable. The rule adds a left child with the label A to the middle node. The rule is not applicable at the root A due to pattern mismatch. We naturally extend the definitions related to molecules, including M atch and Copy, to the production rules. Let us formally define the molecule productions using the rules. Let m = (V, M, C, v 0 ) be a molecule and r = (V r , M r , C r , v 0r , v e ) be a production rule. Let d be such that v 0r = ancestor(r, v e , d), i.e., v e is at the depth d in r. Let i be such that C r (v , i) = v 0r for some v ∈ r. We apply r on m at node v ∈ m such that C(v, i) = ⊥. We obtain an expanded molecule as follows.

where is the disjoint union. The match condition states that after attaching the new nodes V the rule tree must be embedded in m at the dth ancestor of v 0 . We write m = Apply(m, v, r) to indicate the application of r on molecule m at node v that results in m . If r is not applicable at v, we write Apply(m, v, r) = ⊥. We write m = Apply(m, r) if there is a v ∈ m such that m = Apply(m, v, r).

Let R be a set of rules. A molecule m is producible by R from a set of molecules Q if there is sequence of molecules m 0 , ..., m k such that m 0 ∈ Q, m k = m, and for each 0 < i ≤ k, m i = Apply(m i−1 , r) for some r ∈ R. Let P (Q, R) denote the set of molecules that are producible from rules R from a set of molecules Q. We have discussed in Section 2 that all the production rules are not applied at the same time. The rules may live in compartments and the rule sets of the compartments are applied one after another. To model compartments for the rules, let us suppose we have a sequence R 1 , ..., R k of set of rules. Let P (Q, R 1 , .., R k ) = P (..P (P (Q, R 1 ), R 2 ), .., R k ) denoting the trees obtained after applying the rule sets one after another.

In nature, we observe a set of glycan molecules µ present in a cell. However, we do not necessarily know the production rules to produce the molecules. We will be developing a method to find the production rules. The synthesis problem is to find a set R of production rules such that µ = P (S, R), where S is the set of monomers.

In this section, we present a method to solve the synthesis problem of finding a set of production rules that produce a given set µ of molecules. Our synthesis method SugarSynth is presented in Algorithm 1. Here, we are considering only the single compartment case. The generalizations are discussed in Appendix B.

The method assumes that the input set µ is finite. This is a reasonable assumption because even if a set of rules can produce an unbounded number of molecules, no biology will exhibit an infinite set in a cell. Our method bounds the search space of production rules. The method also takes two numbers as input: d is the maximum height of the learned rules and n is the maximum number of them. If the method fails to find production rules, the user may call the method with larger parameters. First, the method initializes S with the set of monomers occurring in µ and sets w to be the maximum arity of any monomer in S.

We use templates to model the search space of rules. A template is also a tree that has a depth and the internal nodes of the tree have the same number of children. Two variables label each node of the template. One variable is for choosing the sugar at the node and the other is for describing the 'situation' of the node. The domain of the first variables is S∪{⊥}. Let SV ars be the unbounded set of variables with the domain. We will use the pool of SV ars to add variables to the templates. A node in a production rule can be in four situations. In Figure 5 , we illustrate the situations. The first situation is when a node position is in the expanding part. The dark gray nodes are the expanding nodes. The second situation is when a node position is not in the rule. The dashed area are the absent nodes. Among the matching nodes, we have two cases. The third situation is when a node position is in the matching part and has expanding descendants. The nodes on the solid path from v 0 to the root v e of the expanding part are in the third situation. Finally, the fourth situation is the rest of the node positions in the matching part, which is the light gray area. A variable is mapped to a node to encode the four situations. Let K = {Expand, Absent, M atchAns, M atch} be the set of symbols to indicate the set of four situations. Let KV ars be the unbounded set of variables with domain K. Our templates are sufficiently expressive to cover all aspects of biology. The templates are defined as follows.

Definition 3 For given integers d and w, a rule template t = (V, ν, κ, C, v 0r ) is a labeled full tree with depth d and each internal node has w children, where V is a set of nodes, ν : V → SV ars maps nodes to distinct sugar choice variables, κ : V → KV ars maps nodes to distinct situation variables, C : V × N → V maps the indexed children of nodes, and v 0r ∈ V is the root of the tree.

For a node v in a template if we assign κ(v) = Absent, we call the node absent. Otherwise, we call the node present. We will also be searching for the molecules that may be produced by the learned rules. Therefore, we need to define the search space for the molecules. We use templates for defining the search space. We limit the template size using a parameter, namely the height of the template.

Definition 4 For given integers h and w, a molecule templatem = (V, ν, C, v 0m ) is a labeled full tree with height h and each internal node has w children, where V is a set of nodes, ν : V → SV ars maps nodes to sugar choice variables, C : V × N → V maps the indexed children of nodes, and v 0m ∈ V is the root.

In the Algorithm at line 2, we call MakeTemplatesRule( S, d, w, n) to create n templates for height d and children width w. Since w is the maximum arity of any sugar, we can map any node to any sugar. Next at line 3, we will construct constraints that encode the set of valid rules. A valid assignment to the variables in a template t = (V, ν, κ, C, r) must satisfy the following six conditions.

1. If a node is present, then it is labeled with a sugar. 

If a node is present, then the parent of the node is also present.

internal node v∈V i∈ [1,w] (κ(C(v, i)) = Absent ⇒ κ(v) = Absent) 4. If a node is Expand, then its children are also Expand if present.

i∈ [1,w] internal node v∈V (κ(v) = Expand ⇒ κ(C(v, i)) ∈ {Expand, Absent}) 5. If a node is M atch, then its children are also M atch if present. i∈ [1,w] internal node v∈V (κ(v) = M atch ⇒ κ(C(v, i)) ∈ {M atch, Absent}) 6. If a node is M atchAns, then exactly one child is M atchAns or Expand. v∈V (κ(v) = M atchAns ⇒ i∈ [1,w] (κ(C(v, i)) ∈ {M atchAns, Expand}) = 1

The call to RuleTemplateCorrectness at line 3 creates the above constraints and stores them in tCons. At line 5, we use the call to MakeTemplate-Mol(S, h, w) to create a molecule template of height h and children width w. Our method searches for unwanted producible molecules up to the height of h, which we set to the maximum of the heights of molecules in µ. The choice of h is arbitrary. Similar to the rule templates, not all assignments to molecule template variables are valid. We add the following conditions for valid assignments for molecule templatem = (V, ν, C,v 0 ).

i∈ [1,w] internal node v∈V (ν(C(v, i)) = ⊥ ⇒ ν(v) = ⊥) 2. The children count of a node matches with the arity of the labeled sugar. 3. We find a molecule that is not in µ. We encode (_,_,_,v0)∈µ N eq(v 0 , v 0 ), where predicate N eq be recursively defined as follows.

In our method, the call to MolTemplateCorrectness at line 6 generates the above constraints and stores them in mCons.

We need to encode that the rules do generate the molecules in µ and do not generate any other. Using procedure EncodeProduce, we generate the corresponding constraints. We will discuss the procedure in-depth shortly. Let us continue with SugarSynth. At line 7, we call EncodeProduce for each molecule in µ and generate constraints pCons stating that the solutions of the templates will produce the molecules in µ.

After producing the constraints tCons, mCons, and pCons, the method enters in the loop. It first checks the satisfiability of conjunction tCons∧pCons∧nCons, where nCons is tt in the first iteration and will encode constraints related to counterexample molecules. If the conjunction is unsatisfiable, there are no rules of the input number and height, and the method returns failure of synthesis. If it is satisfiable, it constructs the rules at line 11 from and stores them in R.

At line 15, we construct constraints rCons again using EnocdeProduce that says that template moleculem is generated by rules R. We check the satisfiability of rCons ∧ mCons. If it is not satisfiable, we have found the rules that generate exactly the molecules in µ and the loop terminates. Otherwise, we use the satisfying assignment to create a counterexample molecule m . At line 18, we add constraints to nCons stating that all possible applications of template rules in T must not produce m . We use shorthand F :∧= G for F := F ∧ G. As we will see that EncodeProduce introduces fresh variable maps τ and cuts in the encoding. Since we negate the returned formula by EncodeProduce and then we check the satisfiability. Therefore, we need to introduce universal quantifiers for the fresh variables. We introduce universal quantifiers over τ and cuts variables before adding to nCons. Afterwards, the loop goes to the next iteration. Since the domain of all the variables is finite, eventually the loop must terminate.

Now let us look at the encoding generated by EncodeProduce. The process of production of molecules adds pieces of trees one after another. In order to show that a molecule is producible by a set of production rules, we need to find the nodes where the production rules are applied, the rules that are applied on the nodes, and the order of the application of the rules. To model the production due to the application of the rules, we attach three maps to the molecule nodes.

-Let cuts map each node to a Boolean variable indicating the node is the point where a rule is applied to expand the molecule. We say points of the applications of the production rules as cuts of the tree. -Let rmatch map each node to a rule indicating the rule that is applied to expand at the node. We need to match a rule to a node if it is a cut point.

-Let τ map each node to an integer variable indicating the time point when the node was added to the molecule. Since already added nodes in a molecule decide what can be added later, we need to record the order of the addition.

Algorithm 2 EncodeProduce( m : molecule (template), T : rule (template) ) Algorithm 2 presents function EncodeProduce that returns the encoding. It takes a molecule m and a set of rules T . Both the inputs can be template or concrete. Our presentation assumes that the molecule is concrete and the rules are templates. This will cover the case when the EncodeProduce is called at line 7 in Algorithm 1. However, at line 15 the molecule is a template and the rules are concrete, which we will discuss later. EncodeProduce uses the help of three other supporting functions, EncodeP, MatchTree, and MatchCut.

EncodeProduce returns constraints stating for each node v and rule template t, if v is at a cut and t is applied at v, then t must match at the node v. We require rmatch(v) to be equal to some rule. Since we are matching with a template rule, we do not know the position of the root of expanding nodes. We enumerate to all possible depths from 1 to d − 1 for finding the root. For each ∈ [1, d), we call EncodeP(v, t, ) to construct the constraints encoding that t is applied at v and the depth of the root of the expanding nodes is at depth .

In EncodeP, we can traverse up from v for steps to find the node to match the root of t. Naturally, it returns ff if there is no th ancestor of v. The variable mark is the timestamp for node v. The local variable c collects the conjunction of constraints as they are generated. The local variable v r is initially equal to the root v 0r of t and traverses the nodes in t. The while loop at line 3 in EncodeP starts with the th ancestor from v, matches all the ancestors up to the parent of v. As the loop traverses down, it also traverses t along the matching using variable v r . In each iteration, v r is updated to the jth child due to lines 8 and 13 if the ancestors of v also traverse along with the jth child.

Let v be the ith ancestor at some iteration of the loop. At line 5, the loop adds constraints stating that the corresponding node v r in the template rule is M atchAns, sugar matches in v r and v , and node v is added before mark. The loop at line 6, iterates over children of v . If there is a child of v that is not along the path to v, it is matched at line 10 with the corresponding child of v r by calling MatchTree, which traverses rule template and molecule and generate constraints to encode that their nodes match. We will discuss MatchTree shortly. If the jth child of v is along the path to v, we get node v r for updating v r for the next iteration of the while loop. After the while loop in EncodeP at line 15, we declare v r is Expand, match the node v with the corresponding node v r in the template rule by calling MatchTree, and also call to match the cut pattern at the subtree of v with the template rule.

MatchTree is a recursive procedure and matches sugar assignments between the molecule and the rule template. If the rule template node v r does not exist, then there is nothing to match and it returns tt at line 1. At line 2, we encode if the molecule node v does not exist, then the rule node must also be flagged absent. Otherwise, we add constraints that if v r is not absent in the template, then the labels of v and v r must match at line 4. At the same line, MatchTree also adds constraints tCons that nodes are added in the molecule in the correct order. The last two inputs of the function are timestamp mark and a bit isExpand, which tells us that the matching nodes should be added before or after mark. The calls to MatchTree from EncodeP use the isExpand appropriately. Afterwords at line 5, the procedure calls itself for the children of v and v r .

Once a rule is applied to a molecule, all the nodes inside the expanding part are added together. Therefore, there should be no cuts within the set of added nodes. Furthermore, if there are children nodes below the added nodes, they must be added due to the application of some other rule. Therefore, the children are at the cut points. We call the above requirement cut pattern. MatchCuts is a recursive function over the trees, and matches cut patterns between the molecule node v and the rule template node v r . The cuts must occur whenever nodes of the rule template transition from present to absent. For helping to detect the transition, the third parameter is a constraint that encodes that the parent of v r is absent or not. The call to MatchCuts in EncodeP passes ff as the third input. Even if the parent of v r is present, v r is a cut point adding cut pattern constraints for the node will create inconsistency. Therefore, we are passing ff .

In the case when we call EncodeProduce with template molecule and concrete rules. We need to swap the roles of ν and M at their occurrences at line 5 in EncodeP and line 4 in MatchTree. Furthermore, κ(v r ) is a concrete value depending on the situation of v r in the rule. In the functions, the variable is to be replaced by the evaluated value of κ(v r ). Theorem 1. If SugarSynth(µ, d, n) returns a set of rules R, then if R produces a molecule that has depth less than h + 1, the it is in µ (soundness). If it fails to synthesize rules, there are no rule set from the budget of depth d and number n (completeness).

The soundness is true due to the construction of constraints. Our algorithm always terminates. At each iteration, at least one solution, i.e, a set of rules is discarded. In fact, we discard many because in each iteration we reject production of a counterexample molecule, but this may discard whole range of rule sets. Since the set of all possible rules is finite, we guarantee termination. Therefore, the completeness holds. In the worst case, the entire search space is explored.

In this section, we present our implementation of the proposed method.

Implementation: We have implemented method SugarSynth in a prototype tool GlySynth. The tool, written in C++, uses Z3 [14] as the SMT solver to discharge the generated satisfiability queries. All the experiments have been conducted on a laptop with 8GB (1x8GB) DDR4 at 2400MHz.

Benchmark: We have applied our tool to three sets of real data (D1, D2, D3, D4) and two sets of synthetic data (D5, D6). The molecules have been obtained from respiratory mucins of a cystic fibrosis patient (D1), horse chorionic gonadotropin (D2), SARS-CoV-2 spike protein T323/S325 (D3), and human chorionic gonadotropin from a cancer cell line (D4) [9, 15] . The availability of clean data, where we are clear about the source , limits our choices.

Results: We have applied GlySynth on the data set. For each data set, we choose several parameter combinations to illustrate the relative performance of the tool. If we did not budget large enough parameters such as the size of unknown molecules, number of rules etc, then the tool fails to synthesize the rules. We present the rules learned after giving minimum resources. However, the set of rules reported need not be either unique or minimal.

For D1, we synthesize the rules in 1.60 seconds. Even by reducing the first two parameters, we were able to synthesize the rules but it took longer time. Giving an extra compartment in the third row did not impact the performance. We also observe the trade-off between the number of compartments and the depth of the rules at the second and third row of D1 and how it impacts performance. We synthesize the rules for D2, involving runaway reactions, in 7.97 seconds. For D2, we also learned large rules, i.e., they are adding many nodes at a single time. We synthesize the rules for D3 in 0.57 seconds. We synthesize the rules for D4, which is also our motivating example, in 0.85 seconds. However, if we use too many resources or too little, the tool runs for a long time as the search in combinatorial space is highly sensitive to the parameters. The synthesis for D5 takes 0.72 seconds. We can observe that by reducing the number of rules to learn, the tool fails to learn rules as it required minimum 7 rules. Our synthesized production is in line with the reported rules in the literature [9] . More experiments on the variations of the problem are included in the Appendix C.

We have presented a novel method for synthesizing production rules for glycans. We have applied our method to real-world data sets. The viability of the synthesized rules can only be verified by conducting wet experiments. We are planning to work in biological labs to check the viability of the solutions found by our method. Our method is the first to apply formal methods for the synthesis problem. We have opened a new direction for the application of formal methods in biology. a program that exhibits a certain behavior into a satisfiability problem. The solvers attempt to find a solution to the satisfiability problem. The solution is the synthesized program. In [22, 5] , the method has been successfully applied to find programs that satisfy quantified specifications like sorting programs, and have missing "holes" and need proper implementations for them. In [23] , a set of pairs of input and output examples is the specification of a program, and the synthesis method searches for programs in a space defined by a template [4] .

Since the formal methods have been exhibiting effectiveness in the vast range of problems, the systems biology community has been applying the methods for various biological problems [24] . The formal methods approach is distinctly different from the statistical approach traditionally used in biology. In formal methods, we aim to match precisely the expected behavior rather than learn approximate artifacts from biological data. The use of formal methods belongs to two broad categories: the analysis of biological models and the synthesis of the models.

The key focus has been to model gene regulatory networks (GRNs), since they are the core of the central dogma of biology. Boolean networks are often used to model GRNs to find stable points or attractors, i.e., nodes in the networks where a system eventually reaches no matter where it starts [25] . The Boolean networks are not often sufficient to adequately model the behavior of GRNs. The continuous-time Markov chains (CTMCs) are used to bring in the aspects of timing and probabilistic constraints to GRNs. The transient behaviors of GRNs are also crucial for various aspects of biology. In [26] , a method based on modelchecking is used to estimate the time evolution of the probability distributions over states of GRNs.

There are many biological processes where we can observe the system behavior, but the exact mechanisms of the processes are not known. Recently, the methods for formal synthesis are finding their applications in biology [27, 28, 29, 30] . In [31] , GRNs with unknown interactions are modeled using an automata-like model with missing components and the specification is the variations of the behavior of the system under mutations. Their approach uses a counterexample-guided inductive synthesis (CEGIS) based algorithm [2] . CEGIS is a framework for synthesis. It first finds a program that may satisfy 'samples' from the specification. If the synthesized program satisfies the specifications, the CEGIS terminates. Otherwise, the method learns a new sample where the program violates the specifications. It adds the sample to the set and goes to the next iteration. Typically, the constraint solvers find the programs during intermediate steps of CEGIS by solving the generated queries. Our method follows a similar pattern, where a set of sampled observations is a specification, i.e., output molecules, and we need to find the governing programs, i.e., production rules. In a subsequent work [32] , a Boolean network model is used, the functions attached to the nodes are considered unknown, and a different kind of data sets provides the behavior specification. In this approach, their earlier method is adopted to work in the new situation.

As far as we know, there has been no similar computation based analysis of glycan production rules. However, there has been a theoretical analysis of the properties of the production rules in [9] . The work identifies the conditions for the production of a finite set of molecules and the cases of unique production methods. In our work, we are taking the computational approach of synthesis from formal methods instead of looking for the theoretical conditions.

We have presented a simplified version of the biological problem. However, in real settings, we often encounter many more variants which require us to support additional constraints to model the set of molecules and possible rules in that particular problem. Hence, we developed the following variations of the method described in the main text to support more realistic problems.

Compartments : The production rules may live in different compartments. A molecule moves from one compartment to the next. To support the compartments, we take an additional integer input k in SugarSynth to indicate the maximum number of compartments. We construct each template rule with a new variable with domain [1, k] . Let v r be a node of some template rule. We write compart(v r ) for the compartment variable for the template.

We will alter EncodeProduce and its sub-procedures to ensure that they enforce the compartment order. We need to encode that a rule is applied when all the pattern nodes were added in the current or earlier compartment. We modify the function MatchTree by replacing tCons assignment by the following code.

3: tCons := isExpand ? (mark ≤ τ (v) ∧ compart(v r ) = compart(v)) : (τ (v) < mark ∧ compart(v r ) ≥ compart(v)) Similarly, we modify the function EncodeP by inserting the following line after 5.

6: c :∧= compart(v r ) ≥ compart(v )

Fast and slow reactions: There is a rate associated with chemical reactions. We abstract this by defining slow and fast rules. The fast rules dominate the slow rule. A slow rule can occur only when no other fast rule is able to extend the molecule in that compartment. Let us define function EncodeP − , which generates constraints as EncodeP expect line 15 is missing, i.e., we do not analyze expand part. We now modify EncodeP after 15 to support fast reactions: 16: 

Here, Fast sets the constraint on the rule to be fast. A negative molecule which is a proper subtree of an input molecule will cease to be negative if there is any fast reaction that is able to extend it as fast reactions happen aggressively and can make partial molecules complete. Due to limited space, we will not present the exact constraints. Unbounded molecules: We only observe a finite set of molecules in cells. However, a set of production rules may be capable of producing an unboundedly large number of molecules. In such cases, rules produce molecules that have repeating patterns of a subtree while rest of the tree being exactly same as one of the input molecules. The rules may be acceptable in some biological settings. We modify constraints to not declare such molecules as negative. Let us suppose we have a repeat pattern of depth d with r repetitions. Let v be the node in template molecule where repetition has begin. We define RepeatHeads(v, r, d) that returns nodes v 0 , ....v r such that v 1 = v, v i is the dth ancestor of v i+1 . Let us define Repeat and Exact, which encodes that the trees rooted at v i s repeat.

We modify constraints of MolTemplateCorrectness(m, µ), which encodes that negative molecules are not in µ. We replace the definition of N eq as follows. [1,r0] ,d∈ [1,d0] Repeat (RepeatHeads(v, r, d) , v )) ∨ i∈ [1,w] N eq(C(v, i), C(v , i)), where d 0 and r 0 are limits on the depth of the repeating subtrees and the number of repetitions respectively. The change will accept molecules with repeating patterns as positive samples.

Non-monotonic rules : Some production rules can not be applied if another node is present in a sibling. We call such rules non-monotonic because it may get disabled as the molecule grows. This feature of rules helps in producing an exact set of desired molecules. We add an extra bit on each node of rule template called HardEnds. If the node is absent, its parent is present, and HardEnds bit is true, then no node must be present in the matching pattern at the time of the application of the rule. We modify the function MatchTree by inserting the following constraints after 4: 5: c :∧= HardEnds(v r ) =⇒ (mark ≤ τ (v)) The constraints state that if the applicable rule has HardEnds, then it has to be added at a time earlier than the current time of the molecule, effectively restricting the addition of further rules.

We conducted separate experiments for the variations of the synthesis problemnon-monotonic rules and unbounded molecules using synthetic data. We applied our tool on the synthetic datasets N1, N2 and N3 which required some rules to have HardEnds. There is a clear trend of increasing time as the size of our dataset increases. Between N1 and N2 which have the same number of molecules in the dataset, N1 has bulkier molecules and hence the time taken is higher. Reducing the number of compartments increases the time slightly however is successful in both N1 and N2. By Reducing the number of rules to be learnt, the tool fails in N1 as it required a minimum of 6 rules. Rules produced in these datasets have HardEnds in the appropriate places to ensure that the exact set of molecules can be made.

To demonstrate the working of our tool in case of runaway reactions or datasets with unbounded molecules, we applied our tool on the synthetic datasets R1, R2 and R3. The molecules in these datasets have repeating patterns and we observe the synthesized rules to capture this pattern. The constraints added as part of unbounded molecules take care not to declare the molecules created by the repeated application of these rules to be negative. The time taken for Yes 2.09 6 3 1 No 0.77 synthesizing the rules is particularly high for R1 as the molecules in the dataset are relatively bulky. Increasing the number of compartments or Increasing the number of rules to learn increases the time in R2. The search in the combinatorial space is expensive, hence the time taken is more.

We have incorporated several optimizations to gain efficiency or reduce the number of iterations in the method. Let us discuss some of the remarkable optimizations.

Ordered templates Our method constructs a list of templates. Since the templates are symmetric, there is a potential for unnecessary iterations. Let us suppose the method rejects a set of learned rules. In the subsequent queries include the constraints that reject the set of rules, the solvers may choose the same set of rules again by permuting the mapping from the templates to the learned rule.

To avoid these iterations, we define an arbitrary total order over the rules. We add ordering constraints stating that any assignments to the list of templates should produce ordered rules with respect to the total order.

The constraints for rejecting the counterexample molecule is universally quantified at line 18 in Algorithm 1. For any assignment, the solver needs to try sufficiently many instantiations of the quantifiers to ensure that the query is not unsatisfiable. The instantiations sufficiently slow down the solving process. We assist the solver by also adding an instantiation of the quantifiers that is equal to the values of the corresponding variables in the assignment a at line 16. We observe that for some inputs only adding the instantiations and not the universally quantified formula is more effective.

Constraints for counterexample molecules In the presentation of SugarSynth, we construct constraints of counterexample molecules at line 15 in each iteration.

The repeated work is impractical because the formula management system of Z3 will be overwhelmed by the construction of many terms repeatedly. In our implementation, we construct the constraints once outside the while loop. We pass the templates as the second parameter instead of concrete rules to Encode-Produce. Later at the solving time in line 16, constraints are added to assign values of the template variables that were returned by the solver at line 10. Due to the assignments, the templates become concrete rules in the context of solving.

E Features to support in future E.1 Full organism data vs single cell data

The experiments that observe the set of glycan molecules are of two kinds. In one kind, we isolate a single cell type organism and identify all the glycans present in the cells. In our presentation, we have assumed this kind of source of data. However, the experiments are difficult to conduct. In another and more convenient way, we smash the whole organism and identify all the glycans present in all the cells. So the information that which glycans are coming from which cells is unknown. In this situation, the synthesis has another task to map the molecules in µ to several cell types. A glycan may be present in multiple cell types.

Our method is easily adapted to consider this kind of data. We search for a small covering set of subsets of 2 µ such that each set in the cover allows a small set of production rules. The cover sets indicate the different types of cells in terms of the presence of glycans. This variation makes the problem particularly hard. There can be a potentially large number of covering subsets. So far, we have not encountered a large enough data set such that we apply the variation effectively.

The experiments are imperfect. They may not detect all possible glycans in a cell. We may need to leave the possibility of allowing a few more molecules to be producible beyond µ. We replace the no extra molecule constraint by a constraint that allows molecules that are 'similar' to molecules in µ. However, we could not imagine any measure of similarity that naturally stems from biological intuition. For now, our tool supports the count of the number of monomer differences as a measure of similarity.

Essentials of Glycobiology

Sketching concurrent data structures

From tests to proofs

Syntax-guided synthesis

Programming by sketching for bit-streaming programs

ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05

UniCarbKB: building a knowledge platform for glycoproteomics

Promiscuity and specificity of eukaryotic glycosyltransferases

Algorithmic biosynthesis of eukaryotic glycans

A markov chain model for n-linked protein glycosylation -towards a low-parameter tool for model-driven glycoengineering

Systems glycobiology for glycoengineering

Algorithmic biosynthesis of eukaryotic glycans. bioRxiv

The carbohydrate chains of the beta subunit of human chorionic gonadotropin produced by the choriocarcinoma cell line BeWo. novel o-linked and novel bisecting-GlcNAc-containing n-linked carbohydrates

Z3: An efficient smt solver

Deducing the N-and O-glycosylation profile of the spike protein of novel coronavirus SARS-CoV-2

Symbolic model checking without bdds

Bounded model checking

Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints

Counterexample-guided abstraction refinement

On the synthesis of a reactive module

From program verification to program synthesis

Automating string processing in spreadsheets using input-output examples

See also correspondence in

Boolean modeling in systems biology: An overview of methodology and applications

Delayed continuous-time markov chains for genetic regulatory circuits

Defining an essential transcription factor program for naive pluripotency

Construction and validation of a regulatory network for pluripotency and self-renewal of mouse embryonic stem cells

Reconstructing boolean models of signaling

Analyzing and synthesizing genomic logic functions

Synthesis of biological models from mutation experiments

analysis of systems. In particular, the verification of hardware designs using SAT solver based model-checking has been applied to industrial-scale problems [17]. The verification of software systems is a much harder problem, and many methods like bounded model-checking