key: cord-0673297-o1tncz6m authors: Meng, Zaiqiao; Liu, Fangyu; Clark, Thomas Hikaru; Shareghi, Ehsan; Collier, Nigel title: Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT date: 2021-09-10 journal: nan DOI: nan sha: 8cdb9f975aaff5adb51cfa164199010bb9b9b6d1 doc_id: 673297 cord_uid: o1tncz6m Infusing factual knowledge into pre-trained models is fundamental for many knowledge-intensive tasks. In this paper, we propose Mixture-of-Partitions (MoP), an infusion approach that can handle a very large knowledge graph (KG) by partitioning it into smaller sub-graphs and infusing their specific knowledge into various BERT models using lightweight adapters. To leverage the overall factual knowledge for a target task, these sub-graph adapters are further fine-tuned along with the underlying BERT through a mixture layer. We evaluate our MoP with three biomedical BERTs (SciBERT, BioBERT, PubMedBERT) on six downstream tasks (inc. NLI, QA, Classification), and the results show that our MoP consistently enhances the underlying BERTs in task performance, and achieves new SOTA performances on five evaluated datasets. Leveraging factual knowledge to augment pretrained language models is of paramount importance for knowledge-intensive tasks, such as question answering and fact checking (Petroni et al., 2021). Especially in the biomedical domain, where public training corpora are limited and noisy, trusted biomedical KGs are crucial for deriving accurate inferences (Li et al., 2020). However, infusing knowledge from real-world biomedical KGs, whose entity sets are very large (e.g. UMLS (Bodenreider, 2004) contains ∼4M entities), demands highly scalable solutions. Although many general knowledge-enhanced language models have been proposed, most of them rely on a computationally expensive joint training of an underlying masked language model (MLM) along with a knowledge-infusion objective function to minimize the risk of catastrophic forgetting (Xiong et al., 2019; Wang et al., 2021; Peters et al., 2019; Yuan et al., 2021). Alternatively, entity masking (or entity prediction) has emerged as one of the most popular self-supervised training objectives for infusing entity-level knowledge into pretrained models (Yu et al., 2020; He et al., 2020). However, due to the large number of entities in biomedical KGs, computing an exact softmax over all entities is very expensive for both training and prediction (De Cao et al., 2021). Although negative sampling techniques could alleviate the computational issue (Sun et al., 2020), tuning an appropriately hard set of negative instances can be challenging, and predicting a very large number of labels may generalize poorly (Hinton et al., 2015). To address the aforementioned challenges, we propose a novel knowledge infusion approach, named Mixture-of-Partitions (MoP), to infuse factual knowledge based on partitioned KGs into pretrained models (BioBERT, Lee et al. 2020; SciBERT, Beltagy et al. 2019; and PubMedBERT, Gu et al. 2020). More concretely, we first partition a KG into several sub-graphs, each containing a disjoint subset of its entities, by using the METIS algorithm (Karypis and Kumar, 1998), and then the Transformer ADAPTER module (Houlsby et al., 2019; Pfeiffer et al., 2020b) is applied to learn portable knowledge parameters from each sub-graph.
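To put this scale in perspective, the back-of-the-envelope calculation below contrasts a single exact-softmax entity head with per-partition heads. Only the ∼4M entity count comes from the text above; the BERT-base hidden size of 768 and the choice of 20 partitions are assumed values used for illustration.

```python
# Rough size of an exact entity-prediction softmax head versus per-partition heads.
# Assumed: BERT-base hidden size 768; ~4M UMLS entities (from the text); K = 20 partitions.
hidden, num_entities, K = 768, 4_000_000, 20
full_head = hidden * num_entities                 # ~3.1e9 weights in one output projection
per_partition = hidden * (num_entities // K)      # ~1.5e8 weights per sub-graph head
print(f"full softmax head: {full_head / 1e9:.1f}B params; "
      f"per-partition head: {per_partition / 1e6:.0f}M params")
```

Each sub-graph adapter only ever needs the prediction head for its own entity subset, which is what makes a partitioned entity-prediction objective tractable.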
In particular, using the ADAPTER module to infuse knowledge does not require fine-tuning the parameters of the underlying BERTs, which is more flexible and efficient while avoiding the catastrophic forgetting issue. To utilise the independently learned knowledge from sub-graph adapters, we introduce mixture layers to automatically route useful knowledge from these adapters to downstream tasks. Figure 1 illustrates our approach. Our results and analyses indicate that our "divide and conquer" partitioning strategy effectively preserves the rich information present in two biomedical KGs from UMLS while enabling us to scale up training on these very large graphs. Additionally, we observe that while individual adapters specialize towards sub-graph specific knowledge, MoP can effectively utilise their individual expertise to enhance the performance of our tested biomedical BERTs on six downstream tasks, where five of them achieve new SOTA performances. We denote a KG as a collection of ordered triples (h, r, t), where the head and tail entities h, t belong to an entity set E and the relation r belongs to a relation set R. All the entities and relations are associated with their textual surface forms, which can be a single word (e.g. fever), a compound (e.g. sars-cov-2), or a short phrase (e.g. has finding site). Given a pretrained model Θ_0, our task is to learn parameters Φ_G based on an input knowledge graph G, such that they encapsulate the knowledge from G. The training objective, L_G, can be implemented in many ways, such as relation classification, entity linking (Peters et al., 2019), next sentence prediction (Goodwin and Demner-Fushman, 2020), or entity prediction. In this paper, we focus on entity prediction, one of the most widely used objectives, and leave the exploration of other objectives for future work. As mentioned earlier, an exact softmax over all entities is extremely expensive for large-scale KGs (Mikolov et al., 2013; De Cao et al., 2021), hence we resort to the principle of "divide and conquer" and propose a novel approach called Mixture-of-Partitions (MoP). Specifically, our MoP first partitions a large KG into smaller sub-graphs (i.e., G → {G_1, G_2, ..., G_K}, §2.1), and learns sub-graph specific parameters on each sub-graph separately (i.e., {Φ_{G_1}, Φ_{G_2}, ..., Φ_{G_K}}, §2.2). Then these sub-graph parameters are fine-tuned through mixture layers to route the sub-graph specific knowledge into a target task (§2.3). Graph partitioning (i.e., partitioning the node set into mutually exclusive groups) is a critical step of our approach, since we need to properly and automatically cluster knowledge triples to support data parallelism and control computation. In particular, the partitioning must satisfy the following goals: (1) maximize the number of resulting knowledge triples to retain as much factual knowledge as possible; (2) balance nodes over partitions to reduce the overall parameters across different entity prediction heads; (3) scale efficiently to large KGs. In fact, an exact solution to (1) and (2) is the balanced graph partition problem, which is NP-complete. We use the METIS algorithm (Karypis and Kumar, 1998) as an approximation that simultaneously meets all three requirements.
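To make the partitioning step concrete, the following is a minimal sketch of splitting a KG into K entity-disjoint sub-graphs, assuming the pymetis Python bindings to METIS. The rule of keeping only triples whose head and tail fall into the same partition is our illustrative choice for turning an entity partition into triple sets, not necessarily the authors' exact procedure.

```python
import pymetis  # Python bindings for the METIS graph partitioner

def partition_kg(triples, num_entities, k=20):
    """Split (head, relation, tail) id-triples into k entity-disjoint sub-graphs."""
    # Treat the KG as an undirected entity graph; deduplicate parallel edges, drop self-loops.
    neighbours = [set() for _ in range(num_entities)]
    for h, _, t in triples:
        if h != t:
            neighbours[h].add(t)
            neighbours[t].add(h)
    adjacency = [sorted(ns) for ns in neighbours]

    # METIS assigns every entity to exactly one of k balanced partitions,
    # while minimising the number of edges cut between partitions.
    _edge_cuts, membership = pymetis.part_graph(k, adjacency=adjacency)

    # Keep a triple only if both of its entities land in the same partition;
    # cross-partition triples are lost, which METIS keeps to a minimum.
    subgraphs = [[] for _ in range(k)]
    for h, r, t in triples:
        if membership[h] == membership[t]:
            subgraphs[membership[h]].append((h, r, t))
    return subgraphs, membership
```

Because METIS minimises the edge cut while balancing partition sizes, this directly serves goals (1) and (2) above.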
METIS can handle billion-scale graphs by successively coarsening a large graph into smaller graphs, processing them quickly, and then projecting the partitions back onto the larger graph; it has been used in many tasks (Chiang et al., 2019; Defferrard et al., 2016). Once the large knowledge graph is partitioned, we use ADAPTER modules to infuse the factual knowledge into a pretrained Transformer model by training an entity prediction objective for each sub-graph. ADAPTERs (Houlsby et al., 2019; Pfeiffer et al., 2020b) are newly initialized modules inserted between the Transformer layers of a pretrained model. Training an ADAPTER does not require fine-tuning the existing parameters of the pretrained model; instead, only the parameters within the ADAPTER modules are updated. In this paper, we use the ADAPTER module configured by Pfeiffer et al. (2020a), which is shown in Figure 1(b). In particular, given a sub-graph G_k, we remove the tail entity name from each triple (h, r, t) ∈ G_k and transform the remaining head entity and relation surface forms into a list of input tokens. The sub-graph specific ADAPTER module is trained to predict the tail entity using the representation of the [CLS] token, and the parameters Φ_{G_k} are optimized by minimizing the cross-entropy loss. During fine-tuning on downstream tasks, both the ADAPTER parameters and the pre-trained LM are updated. Given a set of knowledge-encapsulated adapters, we use AdapterFusion mixture layers to combine knowledge from different adapters for downstream tasks. AdapterFusion is a recently proposed model (Pfeiffer et al., 2020a) that learns to combine the information from a set of task adapters with a softmax attention layer. It learns a contextual mixture weight s_{l,k} over the adapters at layer l via softmax attention; s_{l,k} is used to mix the adapter outputs passed into the next layer, and the mixed output of the final layer L is fed to the target task prediction head f to predict a task label y. Closely related to ours is the sparsely-gated Mixture-of-Experts layer (Shazeer et al., 2017). Alternatively, a more flexible mechanism such as Gumbel-Softmax (Jang et al., 2017) can be used for obtaining more discrete/continuous mixture weights. However, we found that both alternatives underperform AdapterFusion (see the Appendix for a comparison). We evaluate our proposed MoP on two KGs, named SFull and S20Rel, which are extracted from the large biomedical knowledge graph UMLS (Bodenreider, 2004) under the SNOMED CT, US Edition vocabulary. The SFull KG contains the full set of relations and entities of SNOMED CT, while the S20Rel KG is a subset of SFull that only contains the top 20 most frequent relations. Note that some relations in SFull are reversed mappings over the same entity pairs (e.g. "A has causative agent B" and "B causative agent of A"); for S20Rel we therefore exclude such reversed relations when selecting the top 20 relations. Table 2 shows the statistics of the two KGs; the 20 relations used in S20Rel are listed in the appendix.
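For concreteness, the sub-graph entity-prediction training described above can be sketched as follows. This is a simplified, self-contained approximation: in the actual setup the Pfeiffer-style adapters sit inside every Transformer layer (as provided by AdapterHub), whereas here a single bottleneck adapter is placed on top of a frozen encoder; the model name, input text and entity count are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BottleneckAdapter(nn.Module):
    """Pfeiffer-style bottleneck: down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden_size, compression_rate=8):
        super().__init__()
        bottleneck = hidden_size // compression_rate
        self.down, self.up, self.act = nn.Linear(hidden_size, bottleneck), nn.Linear(bottleneck, hidden_size), nn.ReLU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))

class SubgraphEntityPredictor(nn.Module):
    """Frozen biomedical BERT + one adapter + an entity head restricted to one sub-graph."""
    def __init__(self, bert_name, num_subgraph_entities, compression_rate=8):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        for p in self.bert.parameters():          # only adapter/head parameters are trained
            p.requires_grad = False
        hidden = self.bert.config.hidden_size
        self.adapter = BottleneckAdapter(hidden, compression_rate)
        self.entity_head = nn.Linear(hidden, num_subgraph_entities)

    def forward(self, input_ids, attention_mask):
        states = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        cls = self.adapter(states)[:, 0]          # [CLS] representation after the adapter
        return self.entity_head(cls)              # logits over the sub-graph's entities

# One training step on a (head, relation) text with its tail-entity label.
name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = SubgraphEntityPredictor(name, num_subgraph_entities=15000)
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)

batch = tokenizer(["fever has finding site"], return_tensors="pt", padding=True)  # head + relation only
tail_labels = torch.tensor([42])                  # index of the gold tail entity within the sub-graph
loss = nn.functional.cross_entropy(model(batch["input_ids"], batch["attention_mask"]), tail_labels)
loss.backward()
optimizer.step()
```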
We evaluate our MoP on six datasets over various downstream tasks, including four question answering datasets (PubMedQA, Jin et al. 2019; BioASQ7b, Nentidis et al. 2019; BioASQ8b, Nentidis et al. 2020; MedQA, Jin et al. 2020), one document classification dataset (HoC, Baker and Korhonen 2017), and one natural language inference dataset (MedNLI, Romanov and Shivade 2018). While HoC is a multi-label classification task and MedQA is a multiple-choice prediction task, the rest can be formulated as binary/multiclass classification tasks. See the Appendix for a detailed description of these tasks and their datasets. We experiment with three biomedical pretrained models, namely BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019) and PubMedBERT (Gu et al., 2020), as our base models, which have shown strong progress in biomedical text mining tasks. We first partition our KGs into different numbers of sub-graphs (i.e. {5, 10, 20, 40}); then, for each sub-graph, we train the base models loaded with the newly initialized ADAPTER modules (with a compression rate CRate = 8) for 1-2 epochs by minimizing the cross-entropy loss. AdamW (Loshchilov and Hutter, 2018) is used as our training optimizer, and the learning rates for all the sub-graphs are fixed to 1e-4, as suggested by Pfeiffer et al. (2020b). Unless specified otherwise, all the reported performances are based on a partition of 20 sub-graphs, since this was optimal for task performance (see Section 3.7 for performance over different numbers of partitions). In Figure 2 we report the average performance (10 runs) of the knowledge-infused PubMedBERT on two QA datasets over the partitioned SFull. We can see that partitions contribute to various degrees, while some (e.g. #5) provide negligible benefit. However, the least contributing partitions cannot simply be discarded: when we repeated our downstream tasks keeping only the top-10 performing partitions, the results were still worse than the model trained on all 20 partitions (i.e., an accuracy drop of 2.7 on PubMedQA and 1.4 on BioASQ7b). This highlights the importance of automatically learning the contribution weights of partitions. Table 1 shows the overall performance of our MoP deployed on the SciBERT, BioBERT and PubMedBERT pretrained models. We see that MoP pretrained on the SFull KG improves both the BioBERT and PubMedBERT models on all the tasks, while the SciBERT model is also improved on 4 out of 6 tasks. The results also show that MoP pretrained with the S20Rel KG achieves new SOTA performances on four tasks. This suggests that further pruning of the knowledge triples helps task performance by reducing noise, and is a promising direction to explore in the future. We design a controlled random partitioning scheme to test whether METIS can produce high quality partitions for training. We fix the entity sizes of a 20-partition result produced by METIS, and randomly shuffle a percentage (ranging from 0%-100%) of entities across all the sub-graphs. Table 3 shows the number of training triples under different shuffling ratios. In Figure 3 we report the results on BioASQ7b and PubMedQA under different shuffling rates. We can see that the performance of MoP on both datasets degrades significantly as the shuffling rate increases, which highlights the quality of the produced partitions. Table 4 shows the performance of PubMedBERT+MoP trained on the SFull knowledge graph over different numbers of partitions. We can clearly see that with 20 partitions, PubMedBERT+MoP performs best on both the BioASQ7b and PubMedQA datasets, and an average entity size of 15k-30k per sub-graph usually yields better performance. Figure 4 visualizes the mixture weights that MoP assigns to the sub-graph adapters for several examples, together with the key feature of these sub-graphs. We can observe that MoP identifies the most related sub-graphs for each example (e.g. Q2 has more weight on sub-graphs [1, 13, 14], which specialise in 'tumor' knowledge). This validates the effectiveness of our MoP in balancing useful knowledge across adapters.
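The mixture weights analysed above come from AdapterFusion's attention over the adapter outputs. The following is a from-scratch approximation of one such mixture layer; the real implementation is Pfeiffer et al.'s (2020a) AdapterFusion as released in AdapterHub, and the tensor shapes and per-token dot-product attention here are assumptions based on its description.

```python
import torch
import torch.nn as nn

class AdapterMixtureLayer(nn.Module):
    """Softmax attention over K sub-graph adapter outputs at one Transformer layer.

    The query comes from the layer's hidden state; keys and values come from the
    adapter outputs. The attention weights s_{l,k} are the per-token mixture
    weights that can be inspected to see which sub-graph an example relies on.
    """
    def __init__(self, hidden_size):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden, adapter_outputs):
        # hidden:          (batch, seq, hidden)
        # adapter_outputs: (batch, seq, K, hidden) -- one slice per sub-graph adapter
        q = self.query(hidden).unsqueeze(2)          # (B, S, 1, H)
        k = self.key(adapter_outputs)                # (B, S, K, H)
        v = self.value(adapter_outputs)              # (B, S, K, H)
        scores = (q * k).sum(-1)                     # (B, S, K) dot-product scores
        s = torch.softmax(scores, dim=-1)            # mixture weights s_{l,k}
        mixed = (s.unsqueeze(-1) * v).sum(2)         # (B, S, H) mixed adapter output
        return mixed, s
```

Averaging s over the tokens of an input gives per-example sub-graph weights of the kind visualised in Figure 4.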
In this paper, we proposed MoP, a novel approach for infusing knowledge by partitioning knowledge graphs into smaller sub-graphs. We show that, while the knowledge-encapsulated adapters perform very differently over different sub-graphs, our proposed MoP can automatically leverage and balance the useful knowledge across those adapters to enhance various downstream tasks. In the future, we will evaluate our approach on general-domain KGs and general-domain tasks. To further validate that the performance improvements of the evaluated BERTs using our MoP are gained from the knowledge infused through the sub-graph adapters, rather than merely from the newly added adapter parameters, we split the partitioned sub-graphs into two groups according to their test performance ranking, and use our MoP to fine-tune the adapters of each group. Tables 7 and 8 show the performances of all the adapters, and of our MoP combining the grouped adapters, over the train/dev/test sets of the BioASQ7b and PubMedQA datasets, respectively. As we can see from the two tables, our MoP fine-tuned on the group of higher-performing adapters consistently obtains better performance than on the group of lower-performing adapters. Note that, as shown in Figure 2, the individual sub-graph adapters already contribute to downstream performance to very different degrees.
Initializing neural networks for hierarchical multi-label text classification
SciBERT: A pretrained language model for scientific text
The unified medical language system (UMLS): integrating biomedical terminology
Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks
Autoregressive entity retrieval
Convolutional neural networks on graphs with fast localized spectral filtering
Enhancing question answering by injecting ontological knowledge through regularization
Domain-specific language model pretraining for biomedical natural language processing
Infusing disease knowledge into BERT for health question answering, medical inference and disease name recognition
Distilling the knowledge in a neural network
Parameter-efficient transfer learning for NLP
Categorical reparameterization with Gumbel-Softmax
What disease does this patient have? A large-scale open domain question answering dataset from medical exams
PubMedQA: A dataset for biomedical research question answering
A fast and high quality multilevel scheme for partitioning irregular graphs
BioBERT: a pretrained biomedical language representation model for biomedical text mining
Towards medical machine reading comprehension with structural knowledge and plain text
Self-alignment pre-training for biomedical entity representations
Decoupled weight decay regularization
Distributed representations of words and phrases and their compositionality
Results of the seventh edition of the BioASQ challenge
Overview of BioASQ 2020: The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering
Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets
Knowledge enhanced contextual word representations
AdapterFusion: Non-destructive task composition for transfer learning
AdapterHub: A framework for adapting transformers
Lessons from natural language inference in the clinical domain
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer
CoLAKE: Contextualized language and knowledge embedding
ERNIE: Enhanced representation through knowledge integration
KEPLER: A unified model for knowledge embedding and pre-trained language representation
Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model
JAKET: Joint pre-training of knowledge graph and language understanding
Improving biomedical pretrained language models with knowledge
ERNIE: Enhanced language representation with informative entities
DistDGL: Distributed graph neural network training for billion-scale graphs
Nigel Collier and Zaiqiao Meng kindly acknowledge grant-in-aid funding from ESRC (grant number ES/T012277/1). We evaluate our MoP on six datasets over various downstream tasks, including four question answering datasets (PubMedQA, Jin et al. 2019; BioASQ7b, Nentidis et al. 2019; BioASQ8b, Nentidis et al. 2020; MedQA, Jin et al. 2020), one document classification dataset (HoC, Baker and Korhonen 2017), and one natural language inference dataset (MedNLI, Romanov and Shivade 2018). While HoC is a multi-label classification task and MedQA is a multiple-choice prediction task, the rest can be formulated as binary/multiclass classification tasks.
• HoC (Baker and Korhonen, 2017): The Hallmarks of Cancer corpus was extracted from 1852 PubMed publication abstracts by Baker and Korhonen (2017), and the class labels were manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy, but in this paper we only consider the ten top-level ones. We use the publicly available train/dev/test split created by Gu et al. (2020) and report the average performance over five runs, measured by the micro F1 averaged across the ten cancer hallmarks.
• PubMedQA (Jin et al., 2019): This is a question answering dataset that contains a set of research questions, each with a reference text from a PubMed abstract as well as an annotated label of whether the text contains the answer to the research question (yes/maybe/no). We use the original train/dev/test split with 450/50/500 questions, respectively. The reported performance is the average accuracy over ten runs.
• BioASQ7b, BioASQ8b (Nentidis et al., 2019, 2020): Both BioASQ datasets are yes/no question answering tasks annotated by biomedical experts.
Each question is paired with a reference text containing multiple sentences from a PubMed abstract and a yes/no answer. We use the official train/dev/test splits, i.e. 670/75/140 and 729/152/152 for BioASQ7b and BioASQ8b respectively, and the reported performances are the average accuracy over ten runs.
• MedNLI (Romanov and Shivade, 2018): MedNLI is a Natural Language Inference (NLI) collection of sentence pairs extracted from MIMIC-III, a large clinical database. The objective of the NLI task is to determine whether a given hypothesis can be inferred from a given premise. This task is formulated as a classification task over three labels: {entailment, contradiction, neutral}. We use the same train/dev/test split generated by Romanov and Shivade (2018), and report the average accuracy over three runs.
• MedQA (Jin et al., 2020): MedQA is a publicly available large-scale multiple-choice question answering dataset extracted from professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, but in this paper we only adopt the English set, using the split of Jin et al. (2020). Following Jin et al. (2020), we use the Elasticsearch system to retrieve the top 25 sentences for each question+choice pair as the context for that choice, and concatenate them to obtain the normalized log probability over the five choices. Since this dataset is very large, we only report the average accuracy over three runs for all the models.
Our MoP approach first infuses the factual knowledge of all the partitioned sub-graphs using their respective adapters with newly initialized parameters; these knowledge-encapsulated adapters are then further fine-tuned along with the underlying BERT model through mixture layers. In this paper, we explored three approaches for implementing the mixture layers, which are described as follows:
• Softmax. As the default mixture layer deployed in our MoP, AdapterFusion is a recently proposed model (Pfeiffer et al., 2020a) that learns to combine the information from a set of task adapters by a softmax attention layer. In particular, the outputs from different adapters at layer l are combined using a contextual mixture weight calculated by a softmax attention over these adapters. For brevity, we denote our MoP with the original AdapterFusion mixture layers as Softmax.
• Gumbel. We also extend AdapterFusion by replacing the softmax layer with a Gumbel-Softmax (Jang et al., 2017) layer to obtain more discrete mixture weights; here g_1, ..., g_K are i.i.d. samples drawn from the Gumbel(0, 1) distribution that perturb the attention scores before the softmax, and τ is a hyper-parameter controlling the discreteness. For brevity, we denote our MoP with the Gumbel-Softmax AdapterFusion mixture layers as Gumbel.
• MoE. Mixture-of-Experts (MoE) is a general-purpose neural network component for selecting a combination of experts to process each input. In particular, we use the sparsely-gated mixture-of-experts introduced by Shazeer et al. (2017) to obtain a top-K sparse mixture of these adapters. The mixture weights are calculated by s_{l,k} = Softmax(TopK(H(Φ_{l,G_k}), K)), (5) where H(Φ_{l,G_k}) is a function that transforms hidden variables into scalars with tunable Gaussian noise, and TopK(·) is a function that keeps only the top K values. We denote our MoP with this mixture approach as MoE.
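To summarise the three mixture variants, here is a minimal sketch of the corresponding gating functions over per-example adapter scores. The computation of the scores themselves (the attention logits, i.e. H(·)) is elided and assumed given, and the noise scale and top-k value are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def softmax_gate(scores):
    """Default AdapterFusion-style mixing: dense softmax over adapter scores."""
    return F.softmax(scores, dim=-1)

def gumbel_gate(scores, tau=1.0, hard=False):
    """Gumbel-Softmax mixing: add Gumbel(0, 1) noise and sharpen with temperature tau."""
    return F.gumbel_softmax(scores, tau=tau, hard=hard, dim=-1)

def moe_gate(scores, top_k=4, noise_std=1.0):
    """Sparsely-gated MoE mixing (Shazeer et al., 2017): perturb scores with tunable
    Gaussian noise, keep the top-k, and softmax over the survivors."""
    noisy = scores + noise_std * torch.randn_like(scores)       # H(.) with Gaussian noise
    topk_vals, topk_idx = noisy.topk(top_k, dim=-1)             # TopK(., K)
    masked = torch.full_like(noisy, float("-inf")).scatter(-1, topk_idx, topk_vals)
    return F.softmax(masked, dim=-1)                            # zero weight outside the top-k

# Example: per-example scores over K = 20 sub-graph adapters.
scores = torch.randn(2, 20)
print(softmax_gate(scores).sum(-1))              # dense weights summing to 1
print(moe_gate(scores).count_nonzero(dim=-1))    # only top_k adapters receive weight
```

In the paper's comparison, the dense Softmax gate (AdapterFusion) outperformed both the Gumbel and MoE variants.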