Multi-Task Neural Processes

Authors: Kim, Donggyun; Cho, Seongwoong; Lee, Wonkwang; Hong, Seunghoon
Date: 2021-10-28

Neural Processes (NPs) consider a task as a function realized from a stochastic process and flexibly adapt to unseen tasks through inference on functions. However, naive NPs can model data from only a single stochastic process and are designed to infer each task independently. Since many real-world data represent a set of correlated tasks from multiple sources (e.g., multiple attributes and multi-sensor data), it is beneficial to infer them jointly and exploit the underlying correlation to improve the predictive performance. To this end, we propose Multi-Task Neural Processes (MTNPs), an extension of NPs designed to jointly infer tasks realized from multiple stochastic processes. We build MTNPs in a hierarchical way such that inter-task correlation is considered by conditioning all per-task latent variables on a single global latent variable. In addition, we design our MTNPs so that they can address multi-task settings with incomplete data (i.e., not all tasks share the same set of input points), which is in high practical demand in various applications. Experiments demonstrate that MTNPs can successfully model multiple tasks jointly by discovering and exploiting their correlations in various real-world data such as time series of weather attributes and pixel-aligned visual modalities. We release our code at https://github.com/GitGyun/multi_task_neural_processes.

Neural Processes (NPs) (Garnelo et al., 2018b) are a class of meta-learning methods that model a distribution of functions (i.e., a stochastic process). By considering a task as a function realized from the underlying stochastic process, they can flexibly adapt to various unseen tasks through inference on functions.
The adaptation requires only one forward step of a trained neural network without any costly retraining or fine-tuning, and has linear complexity to the data size. NPs can also quantify their prediction uncertainty, which is essential in risk-sensitive applications (Gal & Ghahramani, 2016) . Thanks to such appealing properties, there have been increasing attempts to improve NPs in various domains, such as image regression (Kim et al., 2019; Gordon et al., 2020) , image classification (Requeima et al., 2019; Wang & Van Hoof, 2020) , time series regression (Qin et al., 2019; Norcliffe et al., 2021) , and spatio-temporal regression (Singh et al., 2019) . In this paper, we explore extending NPs to a multi-task setting where correlated tasks are realized simultaneously from multiple stochastic processes. Many real-world data represent multiple correlated functions, such as different attributes or modalities. For instance, medical data (Johnson et al., 2016; Harutyunyan et al., 2019) or climate data (Wang et al., 2016) contain various correlated attributes on a patient or a region that need to be inferred simultaneously. Similarly, in multi-task vision data (Lin et al., 2014; Zhou et al., 2017; Zamir et al., 2018) , multiple labels of different visual modalities are associated with an image. In such scenarios, it is beneficial to exploit functional correlation by modeling the functions jointly rather than independently, in terms of performance and efficiency (Caruana, 1997) . Unfortunately, naive NPs lack mechanisms to jointly handle a set of multiple functions and cannot capture their correlations either. This motivates us to extend NPs to model multiple tasks jointly by exploiting the inter-task correlation. In addition to extending NPs to multi-task settings, we note that handling multi-task data often faces a practical challenge where observations can be incomplete (i.e. not all the functions share the common sample locations). 
For example, when we collect multi-modal signals from different sensors, the sensors may have asynchronous sampling rates, in which case we can observe signals from only an arbitrary subset of sensors at a time. To fully utilize such incomplete observations, the model should be able to associate functions observed at different inputs such that it can improve the predictive performance of all functions using their correlation. A multivariate extension of Gaussian Processes (GPs) (Álvarez et al., 2012) can handle incomplete observations to infer multiple functions jointly. However, naive GPs suffer from cubic complexity in the data size and need approximations to reduce the complexity. Also, their behaviour depends heavily on the kernel choice (Kim et al., 2019).

To address these challenges, we introduce Multi-Task Neural Processes (MTNPs), a new family of stochastic processes that jointly models multiple tasks given possibly incomplete data. We first design a combined space of multiple functions, which allows not only joint inference on the functions but also handling incomplete data. Then we define a Latent Variable Model (LVM) of MTNP that theoretically induces a stochastic process over the combined function space. To exploit the inter-task correlation, we introduce a hierarchical LVM consisting of (1) a global latent variable that captures knowledge about all tasks and (2) task-specific latent variables that additionally capture knowledge specific to each task conditioned on the global latent variable. By inducing each task conditioned on the global latent, the hierarchical LVM allows MTNP to effectively learn and exploit functional correlation in multi-task inference. MTNP also inherits the advantages of NP, such as flexible adaptation, scalable inference, and uncertainty-aware prediction.
Experiments in synthetic and real-world datasets show that MTNPs effectively utilize incomplete observations from multiple tasks and outperform several NP variants in terms of accuracy, uncertainty estimation, and prediction coherency.

2.1 BACKGROUND: NEURAL PROCESSES

We consider a task $f^t : \mathcal{X} \to \mathcal{Y}^t$ as a realization of a stochastic process over a function space $(\mathcal{Y}^t)^{\mathcal{X}}$ that generates data $D^t = (X_D, Y_D^t) = \{(x_i, y_i^t)\}_{i \in I(D^t)}$, where $I(D^t)$ denotes a set of data indices. Neural Processes (NPs) use a conditional latent variable model to learn the stochastic process. Given a set of observations $C^t = (X_C, Y_C^t) = \{(x_i, y_i^t)\}_{i \in I(C^t)}$, NP infers the target task $f^t$ through a latent variable $z$ and models the data $D^t$ by a factorized conditional distribution $p(Y_D^t | X_D, z)$:

$$p(Y_D^t | X_D, C^t) = \int p(Y_D^t | X_D, z)\, p(z | C^t)\, dz, \qquad p(Y_D^t | X_D, z) = \prod_{i \in I(D^t)} p(y_i^t | x_i, z). \tag{1}$$

We refer to the set of observations $C^t$ as the context data and to the modeled data $D^t$ as the target data. NP models the generative model $p(Y_D^t | X_D, z)$ and the conditional prior $p(z | C^t)$ by two neural networks, a decoder $p_\theta$ and an encoder $q_\phi$, respectively. Since the direct optimization of Eq. 1 is intractable, the networks are trained by maximizing the following variational lower bound:

$$\log p_\theta(Y_D^t | X_D, C^t) \geq \mathbb{E}_{q_\phi(z | D^t)}\left[\log p_\theta(Y_D^t | X_D, z)\right] - D_{\mathrm{KL}}\left(q_\phi(z | D^t) \,\|\, q_\phi(z | C^t)\right). \tag{2}$$

Note that the encoder network $q_\phi$ is also used as the variational posterior $q_\phi(z | D^t)$. The parameter sharing between the model prior and the variational posterior gives us an intuitive interpretation of the loss function: the KL term acts as a regularizer for the encoder $q_\phi$ such that the summary of the context is close to the summary of the target. This reflects the assumption that the context and target are generated by the same underlying data-generating process and aids effective test-time adaptation. After training, NP infers the target function according to the latent variable model (Eq. 1).
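To make the two terms of the bound in Eq. 2 concrete, the following is a minimal sketch of a single-sample estimate. It assumes diagonal Gaussian forms for both the latent distributions and the predictive distribution, which the text above does not prescribe; the function names (`gaussian_kl`, `gaussian_log_lik`, `np_elbo`) are illustrative and are not taken from the released code.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over latent dimensions."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )

def gaussian_log_lik(y, mean, sigma):
    """Factorized Gaussian log-likelihood, summed over target points."""
    return sum(
        -0.5 * math.log(2 * math.pi * s ** 2) - (yi - m) ** 2 / (2 * s ** 2)
        for yi, m, s in zip(y, mean, sigma)
    )

def np_elbo(y_target, pred_mean, pred_sigma, posterior, prior):
    """One-sample estimate of the lower bound in Eq. 2 (Gaussian assumption).

    posterior: (mu, sigma) of q(z | D^t), computed from the target set.
    prior:     (mu, sigma) of q(z | C^t), computed from the context set.
    pred_mean / pred_sigma: decoder outputs for one z drawn from the posterior.
    """
    reconstruction = gaussian_log_lik(y_target, pred_mean, pred_sigma)
    regularizer = gaussian_kl(*posterior, *prior)
    return reconstruction - regularizer
```

When the context summary and the target summary coincide, the KL term vanishes and the bound reduces to the reconstruction term, which matches the regularizer interpretation given above.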
Now we extend the setting to multi-task learning problems where multiple tasks $f^1, \cdots, f^T$ are realized from $T$ stochastic processes simultaneously, each of which has its own function space $(\mathcal{Y}^t)^{\mathcal{X}}$, $\forall t \in \mathcal{T} = \{1, 2, \cdots, T\}$. Let $D = (X_D, Y_D^{1:T}) = \bigcup_{t \in \mathcal{T}} D^t$ be the multi-task target data, where each $D^t$ corresponds to the data of task $f^t$. Then the learning objective for the set of $T$ realized tasks is to model the conditional probability $p(Y_D^{1:T} | X_D, C)$ given the multi-task context $C = (X_C, Y_C^{1:T}) = \bigcup_{t \in \mathcal{T}} C^t$, where each $C^t$ is a set of observations of task $f^t$. The sets $C$ and $D$ can be arbitrarily chosen, but we assume $C \subset D$ for simplicity.

However, the assumption of a complete context $C$ for all tasks is often challenged by practical issues, such as asynchronous sampling across multiple sensors or missing labels in multi-attribute data. To address such challenges, we relax the assumptions on the context $C$ and let $I(C^t)$ be different across $t \in \mathcal{T}$. In this case, an input point $x_i$ can be associated with a partial set of output values $\{y_i^t\}_{t \in \mathcal{T}_i}$, $\mathcal{T}_i \subsetneq \mathcal{T}$, which we refer to as an incomplete observation. Next, we present two ways to use NPs to model the multi-task data and discuss their limitations.

Single-Task Neural Processes (STNPs) A straightforward application of NPs to the multi-task setting is to assume independence across tasks and define independent NPs over the function spaces $(\mathcal{Y}^1)^{\mathcal{X}}, \cdots, (\mathcal{Y}^T)^{\mathcal{X}}$. We refer to this approach as Single-Task Neural Processes (STNPs). Specifically, an STNP has $T$ independent latent variables $v^1, \cdots, v^T$, where each $v^t$ implicitly represents a task $f^t$. Thanks to the independence assumption, STNPs can handle incomplete context by conditioning on each task-specific data $C^t$ independently. However, this approach can only model the marginal distribution of each task, ignoring complex inter-task correlation within the joint distribution of the tasks.
Note that this is especially impractical for multi-task settings with incomplete data, since each task $f^t$ can be learned only from $C^t$, ignoring rich contexts available in the other data $C^{t'}$, $\forall t' \neq t$.

Joint-Task Neural Processes (JTNPs) An alternative approach is to combine the output spaces into a product space $\mathcal{Y}^{1:T} = \prod_{t \in \mathcal{T}} \mathcal{Y}^t$ and define a single NP over the function space $(\mathcal{Y}^{1:T})^{\mathcal{X}}$. We refer to this approach as Joint-Task Neural Processes (JTNPs). In this case, a single latent variable $z$ governs all $T$ tasks jointly:

$$p(Y_D^{1:T} | X_D, C) = \int p(Y_D^{1:T} | X_D, z)\, p(z | C)\, dz. \tag{4}$$

JTNPs are amenable to incorporating correlation across tasks through the shared variable $z$. However, by definition, they require complete context and target data for both training and inference. This is because any incomplete set of output values $\{y_i^t\}_{t \in \mathcal{T}_i}$ for an input point $x_i$ such that $\mathcal{T}_i \neq \mathcal{T}$ is not a valid element of the product space $\mathcal{Y}^{1:T}$. In addition, JTNPs rely solely on a single latent variable to explain all tasks, ignoring per-task stochastic factors in each function $f^t$. In what follows, we propose an alternative formulation for jointly handling multiple tasks on incomplete data, which (1) enables probabilistic inference on the incomplete data and (2) is more amenable to learning both task-specific and task-agnostic functional representations.

In this section, we describe Multi-Task Neural Processes (MTNPs), a family of stochastic processes to model multiple functions jointly and handle incomplete data. We first formulate MTNPs using a hierarchical LVM. Then we propose the training objective and a neural network model.

Our objective is to extend NPs to jointly infer multiple tasks from incomplete context. Discussions in Section 2.2 suggest that direct modeling of a distribution over functions of form $f : \mathcal{X} \to \prod_{t \in \mathcal{T}} \mathcal{Y}^t$ is achievable via JTNP (Eq. 4), yet it requires complete data in both training and inference.
To circumvent this problem, we reformulate the functional form as $h : \mathcal{X} \times \mathcal{T} \to \bigcup_{t \in \mathcal{T}} \mathcal{Y}^t$. Note that this functional form allows us to model the same set of functions as JTNP by $f(x_i) = (h(x_i, 1), \cdots, h(x_i, T))$. However, by using the union form we can exploit incomplete data, since any partial set of output values $\{y_i^t\}_{t \in \mathcal{T}_i}$ now becomes a set of valid output values at different input points $(x_i, t)$, $t \in \mathcal{T}_i$. For notational convenience, we denote $x_i^t = (x_i, t)$ and assume input points in the context $C$ and the target $D$ are embedded by the task indices, i.e., $C = (X_C^{1:T}, Y_C^{1:T}) = \bigcup_{t \in \mathcal{T}} C^t$ with $C^t = (X_C^t, Y_C^t) = \{(x_i^t, y_i^t)\}_{i \in I(C^t)}$, and the same for $D$. Next, we present a latent variable model that induces a stochastic process over functions of form $h$.

To make use of both task-agnostic and task-specific knowledge, we define a hierarchical latent variable model (Figure 1(c)). In this model, the global latent variable $z$ captures shared stochastic factors across tasks using the whole context $C$, while per-task stochastic factors are captured by the task-specific latent variable $v^t$ using $C^t$ and $z$. It induces the predictive distribution on the target by:

$$p(Y_D^{1:T} | X_D^{1:T}, C) = \int\!\!\int \left[\prod_{t=1}^{T} p(Y_D^t | X_D^t, v^t)\, p(v^t | z, C^t)\right] p(z | C)\, dv^{1:T}\, dz, \tag{5}$$

where $v^{1:T} := (v^1, \cdots, v^T)$. Similar to Eq. 1, we assume conditional independence of the target outputs given the task-specific latent variable, i.e., $p(Y_D^t | X_D^t, v^t) = \prod_{i \in I(D^t)} p(y_i^t | x_i^t, v^t)$. Note that this hierarchical model can capture and leverage the inter-task correlation by sharing the same $z$ across $v^{1:T}$. It can also fully utilize the incomplete data: since the global variable $z$ is inferred from the entire context data $C = \bigcup_{t \in \mathcal{T}} C^t$ and conditions the inference of each task-specific latent variable $v^t$, each function $f^t$ induced by $v^t$ exploits the observations available not only for itself, $C^t$, but also for the other tasks, $C^{t'}$, $\forall t' \neq t$. Next, we show that Eq. 5 induces a stochastic process over the functions of form $h : \mathcal{X} \times \mathcal{T} \to \bigcup_{t \in \mathcal{T}} \mathcal{Y}^t$.

Proposition 1. Consider the following generative process on data $D$ and context $C$, which is a generalized form of Eq. 5.
Then under some mild assumptions, there exists a stochastic process over functions of form $h : \mathcal{X} \times \mathcal{T} \to \bigcup_{t \in \mathcal{T}} \mathcal{Y}^t$, from which the data $D$ is generated.

Proof. We leave the proof to Appendix A.2.

We refer to the resulting stochastic processes as Multi-Task Neural Processes (MTNPs). From the perspective of stochastic processes, Eq. 5 allows us to learn a functional posterior not only on each task via $v^t$, but also across the tasks via $z$. Optimizing Eq. 5 can then be interpreted as learning to learn each task captured by $v^t$ together with the functional correlation captured by $z$.

We use an encoder network $q_\phi$ and a decoder network $p_\theta$ to approximate the conditional prior and the generative model in Eq. 5, respectively. Since the direct optimization of Eq. 5 is intractable, we train the networks via the following variational lower bound, where we use the same network $q_\phi$ for both the conditional prior and the variational posterior, as in NP:

$$\log p_\theta(Y_D^{1:T} | X_D^{1:T}, C) \geq \mathbb{E}_{q_\phi(z | D)}\left[\sum_{t=1}^{T} \mathbb{E}_{q_\phi(v^t | z, D^t)}\left[\log p_\theta(Y_D^t | X_D^t, v^t)\right] - D_{\mathrm{KL}}\left(q_\phi(v^t | z, D^t) \,\|\, q_\phi(v^t | z, C^t)\right)\right] - D_{\mathrm{KL}}\left(q_\phi(z | D) \,\|\, q_\phi(z | C)\right). \tag{7}$$

We leave the derivation to Appendix A.3. The above objective reflects several desirable behaviors for our model. Similar to NP, the KL divergences encourage both latent variables $z$ and $v^t$ inferred from the context data to be consistent with those inferred from the entire target data. In addition, minimizing the KL divergence on the task-specific variables forces the global latent $z$ to be informative across all tasks, such that it can induce the task-specific factors $v^t$ from the limited context $C^t$. This makes the model encode correlated information across tasks in $z$ and use it for inferring each task with $v^t$, which is critically important for joint inference with incomplete context data. After training, MTNP infers the target functions according to the latent variable model (Eq. 5).

This section presents an implementation of MTNPs composed of an encoder $q_\phi$ and a decoder $p_\theta$ (Eq. 7).
While our MTNP formulation is not restricted to a specific architecture, we adopt ANP (Kim et al., 2019) as our backbone, which implements the encoder by attention layers (Vaswani et al., 2017) and the decoder by an MLP. Figure 2 illustrates the overall architecture. In the following, we denote a stacked multi-head attention block (Parmar et al., 2018) by $\mathrm{Attn}(Q, K, V)$ and an MLP by $\psi(x)$. Also, we denote by $e^t$ a learnable task embedding for $t \in \mathcal{T}$, which is used to condition on the task index $t$.

The latent encoder samples global and per-task latent variables by aggregating the context $C$. For each context example $(x_i^t, y_i^t) \in C^t$, we first project it to a hidden representation $s_i^t = \psi_s(x_i, y_i^t) + e^t$. Then we aggregate the hidden representations into a task-specific representation $s^t$ via self-attention followed by a pooling operation, and the task-specific representations are further aggregated into a global representation $s$. Note that the first attention is applied along the example axis (per-task) to encode the information of each task, while the second one is applied along the task axis (across-task) to aggregate the information across the tasks. Then, we get the global and task-specific latent variables via ancestral sampling: first $z \sim q_\phi(z | C) = \mathcal{N}(\psi_{(z,1)}(s), \psi_{(z,2)}(s))$, then $v^t \sim q_\phi(v^t | z, C^t)$ for each task.

Deterministic Encoder To further improve the expressiveness of the model, we extend the deterministic encoder of Kim et al. (2019), which produces a local representation specific to both the target example and the task via an attention mechanism. As in the latent encoder, we first project each context example $(x_i^t, y_i^t) \in C^t$ to a hidden representation $d_i^t = \psi_d(x_i, y_i^t) + e^t$ that serves as the value embedding in cross-attention. Also, we use the context and target inputs $x_i^t$ as key and query embeddings for the cross-attention, respectively. Then we apply cross-attention along the example axis (per-task) followed by self-attention along the task axis (across-task).
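The two-level aggregation and ancestral sampling of the latent encoder can be sketched as follows. This is a deliberately simplified stand-in, not the architecture of Figure 2: mean pooling replaces both attention blocks, random linear maps replace the MLPs $\psi$, and conditioning $v^t$ on $z$ by concatenation is an assumption. All names and sizes (`W_s`, `task_emb`, `gaussian_head`, `H`, `Z`, `T`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H, Z, T = 8, 4, 3  # hidden size, latent size, number of tasks (illustrative)

# Hypothetical stand-ins for learned parameters.
W_s = rng.normal(size=(2, H))          # plays the role of psi_s on (x_i, y_i^t)
task_emb = rng.normal(size=(T, H))     # learnable task embeddings e^t
W_z = rng.normal(size=(H, 2 * Z))      # head for the global latent z
W_v = rng.normal(size=(H + Z, 2 * Z))  # head for per-task latents, fed [s^t, z]

def gaussian_head(h, W):
    """Map a summary vector to the mean and positive std of a diagonal Gaussian."""
    mean, raw = np.split(h @ W, 2)
    sigma = np.log1p(np.exp(raw)) + 1e-3  # softplus keeps the scale positive
    return mean, sigma

def sample_latents(context):
    """context: list of T arrays of shape (n_t, 2) holding (x, y) pairs.

    n_t may differ per task (incomplete context) but must be at least 1 here.
    """
    # Per-task axis: embed each example, add e^t, pool over examples -> s^t.
    s_task = np.stack([
        (c @ W_s + task_emb[t]).mean(axis=0) for t, c in enumerate(context)
    ])
    # Across-task axis: pool the task summaries -> global summary s.
    s_global = s_task.mean(axis=0)

    # Ancestral sampling: z first, then each v^t conditioned on z and s^t.
    mu_z, sig_z = gaussian_head(s_global, W_z)
    z = mu_z + sig_z * rng.normal(size=Z)
    v = []
    for t in range(T):
        mu_v, sig_v = gaussian_head(np.concatenate([s_task[t], z]), W_v)
        v.append(mu_v + sig_v * rng.normal(size=Z))
    return z, np.stack(v)

# Three tasks observed at 5, 2 and 7 points respectively.
context = [rng.normal(size=(n, 2)) for n in (5, 2, 7)]
z, v = sample_latents(context)
```

Because pooling happens per task before the across-task step, the per-task context sizes are free to differ, which is what lets this structure accept incomplete observations.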
Decoder Finally, the decoder produces predictive distributions for the target output $y_i^t \in Y_D^t$ for each target input $x_i^t$. We first project the input to $w_i^t = \psi_w(x_i) + e^t$, then concatenate it with the corresponding latent variable $v^t$ and deterministic representation $r_i^t$. The output distribution is computed by MLPs, whose output depends on the type of the task.

However, like NPs, they infer each task independently and do not explicitly consider inter-task correlation during inference. Also, CNAPs are designed specifically for classification tasks, while MTNPs are generally applicable to various tasks including classification and regression.

Hierarchical Models in Neural Process Family Since the pioneering work by Garnelo et al. (2018b), several variants of NP have introduced the concept of hierarchical modeling. Attentive Neural Processes (ANPs) (Kim et al., 2019) incorporate an attention mechanism into a deterministic variable, which provides additional context information for each target example to improve the expressive power of the model and prevent the underfitting issues of vanilla NPs. Similarly, Wang & Van Hoof (2020) introduce local latent variables to incorporate example-specific stochasticity, which extends the graphical model of NPs to a hierarchical one. Our MTNP formulation also involves a hierarchical latent variable model but has a different structure that is orthogonal to the prior works. MTNPs use a global latent variable to jointly model multiple task functions while using per-task latent variables to capture task-specific stochasticity. Although extending the model to contain example-level local latent variables is possible, we adopt the deterministic local representation as in ANPs for simplicity.

We evaluate MTNP on three datasets, including both synthetic and real-world tasks.
In all experiments, we construct incomplete context data by selecting a complete subset C ⊂ D of size m = |I(C)| from the target, then randomly drop the output points independently according to the missing rate γ ∈ [0, 1] (γ = 0 means complete data). We repeat the procedure with five different random seeds and report the mean values of each evaluation metric. Baselines In each experiment, we compare MTNP with two NP variants, STNP and JTNP. We adopt ANP (Kim et al., 2019) as a backbone architecture for STNP and JTNP, which is a strong NP baseline. Since JTNP cannot handle incomplete data, we build a stronger baseline by the combination of STNP and JTNP (S+JTNP), where missing labels are imputed by STNP and then used to jointly infer the tasks by JTNP. In 1D regression tasks, we additionally compare two Multi-Output Gaussian Processes baselines, CSM (Ulrich et al., 2015) and MOSM (Parra & Tobar, 2017) , and two metalearning baselines, MAML (Finn et al., 2017) and Reptile (Nichol et al., 2018) , where we slightly modify the meta-learning baselines to learn multiple tasks jointly from incomplete data. At training time, we set γ = 0.5 for all models but keeping γ = 0 for JTNP. At test time, we evaluate the models in various missing rates γ ∈ {0, 0.25, 0.5, 0.75}. We provide architectural and training details in Appendix B. We also provide ablation studies on architectural designs such as self-attentions and pooling, parameter sharing and task embedding, latent and deterministic encoders in Appendix H. Dataset and Metric We begin with 1D synthetic regression tasks where the target functions are correlated by shared parameters (e.g., scale, bias, phase) but have different shapes. Inspired by Guo et al. (2020b) , we first randomly sample global parameters a, b, c, w ∈ R shared across the tasks, then generate four correlated tasks using different activation functions as follows. 
To simulate task-specific stochasticity, we perturb the parameters (a, b, c, w) with small i.i.d. Gaussian noises per task. In this setting, the model has to learn per-task functional characteristics imposed by different activation functions and per-task noises, as well as how to share the underlying parameters unseen during training among the tasks. For evaluation, we generate training and testing sets via non-overlapping splits of the parameters, then measure mean squared error (MSE) normalized by the scale parameter a to aggregate results on functions with different scales. See Appendix C for details. Results Table 1 shows the quantitative results with γ = 0.5. More comprehensive results with different missing rates and standard deviations for the metrics are provided in Appendix D. As it shows, MTNP outperforms all baselines in all tasks and context sizes. This can be attributed to the ability of MTNP to (1) exploit all available context examples to infer inter-task general knowledge (i.e. a, b, c, w) and (2) translate it back to functional representations for each task. In contrast, STNP fails to predict multiple tasks accurately due to the independent task assumption. Although JTNP is designed to discover and utilize inter-task correlations, its performances do not show dramatic improvement over STNP since its observations are largely based on noisy imputations from STNP. We also observe that GP baselines (MOSM, CSM) perform even worse than STNP when the context size is small, despite their inherent ability to joint inference on incomplete data. We conjecture that it is because GPs lack a meta-training mechanism that allows NPs (and MTNPs) to quickly learn the tasks using a few examples. Gradient-based meta-learning baselines (MAML, Reptile) are also comparable to STNP and JTNP but perform worse than MTNP. This could be due to the lack of global inference on function space, which leads them to overfit the context points. 
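The incomplete-context protocol used throughout these experiments (pick a complete subset of $m$ input points, then drop each output independently with missing rate γ) can be sketched as below. This is an illustration of the procedure as described in the text, not the authors' data pipeline; the function name `make_incomplete_context` and its return format are assumptions. How a task that ends up with no observations is treated is not specified here, so the sketch simply returns an empty index list for it.

```python
import random

def make_incomplete_context(n_points, n_tasks, m, gamma, seed=0):
    """Return the chosen input indices and, per task, the indices that stay observed.

    n_points: number of target inputs to choose from
    m:        size of the complete context subset before dropping
    gamma:    missing rate in [0, 1]; gamma = 0 keeps every output
    """
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(n_points), m))  # complete subset of size m
    context = {t: [] for t in range(n_tasks)}        # I(C^t) for each task t
    for i in chosen:
        for t in range(n_tasks):
            # Each (input, task) output is dropped independently with prob. gamma.
            if rng.random() >= gamma:
                context[t].append(i)
    return chosen, context

chosen, context = make_incomplete_context(n_points=200, n_tasks=4, m=10, gamma=0.5)
# Flattened (input index, task index) pairs, i.e. inputs of the form (x_i, t).
pairs = [(i, t) for t, idx in context.items() for i in idx]
```

The flattened `pairs` list mirrors the $(x_i, t)$ input form of the MTNP formulation: every surviving output is a valid observation on its own, regardless of which other tasks were observed at the same input.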
As an illustrative example, we also plot predicted distributions from the models in a highly incomplete scenario (m = 10 and γ = 0.5) in Figure 3(a). We observe that STNP generally suffers from inaccurate predictions due to limited context, while MTNP successfully exploits incomplete observations from different tasks to improve the predictive performance. The qualitative results for all baselines are provided in Appendix D.

We also perform an ablation study on the latent variable model to justify the effectiveness of our hierarchical formulation. We consider two variants of MTNP that consist of the global latent variable only (MTNP-G) and the task-specific latent variables only (MTNP-T). Then we evaluate the models on three synthetic datasets generated with different levels of inter-task correlation. Specifically, we construct partially correlated tasks as described before, totally correlated tasks by removing the task-specific noises, and independent tasks by sampling the parameters a, b, c, w independently for each task.

On the other hand, MTNP and MTNP-T do not suffer from such a negative transfer since each of the independent tasks can be addressed by per-task latent variables separately. The overall results demonstrate that incorporating both global and task-specific information is the most effective and robust against various levels of inter-task correlation.

Dataset and Metric To demonstrate our method in a practical, real-world domain, we perform an experiment on weather data. Weather attributes are physically correlated with each other, and the observations are often incomplete due to different sensor configurations or coverage per station. Also, the observed attributes are highly stochastic, which makes MTNP's stochastic process formulation a good fit. We use a dataset gathered through the Dark Sky API, consisting of 12 daily weather attributes collected at 266 cities for 258 days.
We choose six attributes, namely low and high temperatures (TempMin, TempMax), humidity (Humidity), precipitation probability (Precip), cloud cover (Cloud), and dew point (Dew), which form six correlated tasks. We normalize each attribute to be standard Gaussian and the time to be in [0, 1]. We divide the data into 200 training, 30 validation, and 33 test sets of time series, where each set corresponds to a unique city. We evaluate the prediction performance by MSE. Since the data is noisy, we also report negative log-likelihood as a metric of uncertainty estimation.

Results Table 2 summarizes the quantitative results. More comprehensive results with different missing rates, context sizes, and standard deviations for the metrics are provided in Appendix E. MTNP outperforms all baselines in both accuracy and uncertainty estimation, which demonstrates that it generalizes well to real-world stochastic data. More interestingly, Figure 4 illustrates how MTNP transfers its knowledge from one task (Cloud) to another (Precip) given the incomplete observations. When the observation is sparse (Figure 4(a)), the model produces an inaccurate prediction with high uncertainty for unobserved input domains. However, when additional observations are available for the other attribute (Figure 4(b),(c)), MTNP successfully transfers the knowledge to improve the prediction. This shows that MTNP can effectively learn to exploit the incomplete observation by transferring knowledge across tasks.

Figure 4: Visualization of MTNP's internal knowledge transfer. By observing additional data from the Cloud task (at red triangles) on top of a few context points (at blue dots), the predicted mean and variance of the Precip task improve in the additionally observed region.

Dataset and Metric We further demonstrate our approach on more challenging 2D structured function regression tasks. Following Garnelo et al.
(2018a) , we interpret an RGB image as a function that maps a 2D pixel location x i ∈ [0, 1] 2 to its RGB values y i ∈ [0, 1] 3 , and extend its concept to pixel-aligned 2D spatial data for the multi-task setting. Specifically, we consider four pixel-aligned visual modalities with a resolution of 32 × 32 on celebrity faces as a set of tasks, namely RGB image (RGB) (Liu et al., 2015) , semantic segmentation map (Segment) (Lee et al., 2020), Sobel edge (Edge) (Kanopoulos et al., 1988) , and Projected Normalized Coordinate Code (PNCC) (Zhu et al., 2016) . We then construct training and testing sets with non-overlapping splits of face images. To evaluate the Segment task, we report mean Intersection-over-Union (mIoU). For the other tasks, we report MSE. We also measure prediction coherency across tasks to evaluate the task correlation captured by models. To measure the coherency between the predictions, we generate pseudo-labels by translating the RGB prediction into the other three modalities using image-to-image translation methods (Kanopoulos et al., 1988; Guo et al., 2020a; Chen et al., 2018) , then measure errors (MSE or 1 -mIoU) between the pseudo-labels and predictions. Additional details are provided in Appendix F. Results Table 3 summarizes the quantitative comparison results. More comprehensive results with different missing rates are provided in Appendix G. Overall, we observe similar results with the 1D regression experiments where MTNP generates more accurate predictions over STNP and S+JTNP by effectively exploiting the incomplete data. We also observe that the MTNP produces more coherent predictions over the baselines, which shows that it indeed learns to exploit the correlation across tasks effectively. To further validate the results, we present qualitative comparison results in Figure 5 . We observe that STNP and S+JTNP produce inaccurate (red boxes) or incoherent (green box) outputs when the number of contexts is extremely small. 
On the other hand, MTNP (1) consistently regresses coherent functions regardless of the number of observable contexts, and (2) produces more accurate predictions than the baselines given the same number of contexts (green box).

Finally, we investigate the discovery and exploitation of task correlations achieved by MTNP. We first partition tasks into source and target tasks. Then, we measure the relative performance improvement on the target tasks before and after the model observes data from the source tasks. We summarize the results in Table 4, where we average the performance gains coming from all possible combinations of source tasks for each target task. By checking which task is the most beneficial to each of the other tasks, we observe that there are two groups of highly correlated tasks, (RGB-Edge) and (Segment-PNCC). These results demonstrate that MTNP successfully captured the dependence among tasks considering that (1) RGB and Edge are composed of two correlated low-level signals (e.g., color intensity and its gradients)

We propose Multi-Task Neural Processes (MTNPs), a new family of stochastic processes designed to infer multiple functions jointly from incomplete data, along with a hierarchical latent variable model. Through extensive experiments, we demonstrate that the proposed MTNPs can leverage incomplete data to solve multiple heterogeneous tasks by learning to discover and exploit task-agnostic and task-specific knowledge. Scaling up our method to large-scale datasets will be a promising research direction. To this end, our method can be improved in several aspects by (1) generalizing to an unseen task space T and (2) allowing empty context data for some tasks, such that we can generalize MTNPs to more diverse real-world scenarios such as zero-shot inference and semi-supervised learning.

Ethics Statement Recently, detecting and removing data bias have become essential problems towards producing fair machine learning models.
We believe that our work can contribute to detect unintentional data bias present in multi-attribute data. MTNP can be seen as a universal correlation learner who learns arbitrary correlation across tasks purely data-driven way. Therefore, given potentially biased multi-attribute data (e.g., multiple personal attributes), MTNP may detect any biased relationship by learning the correlation between them. For example, we may perform the task-to-task transfer analysis on a trained MTNP as discussed in Section 5.2 and Section 5.3, then see which task (or attribute) has a high correlation with another task (or attribute). In this work, we present two major theoretical results (Proposition 1 and Eq. 7), a neural network model (Section 3.3), and experiments on three datasets (Section 5). We give a complete proof of Proposition 1 in Appendix A.2 and the ELBO derivation for Eq. 7 in Appendix A.3. We provide architectural details and training hyper-parameters of the models used in the experiments in Appendix B. Finally, details on the experimental settings and datasets are provided in Appendix C and Appendix F. In this section, we give a proof of Proposition 1 with a brief introduction of the Kolmogorov Extension Theorem (Itô et al., 1984) , and derive training objectives of STNP, JTNP, and MTNP. A stochastic process F : X × Ω → Y is a collection of random variables {Y x : Ω → Y} x∈X which is indexed by an index set X . Also, all the random variables are defined on a single probability space (Ω, F, P) and a value space Y. This can be interpreted as a distribution over a function space Y X , such that sampling a function f corresponds to Suppose we have observed input and output sequences X = (x 1 , x 2 , · · · , x n ) and Y = (y x1 , y x2 , · · · , y xn ) of a function f : X → Y. With a slight abuse of a notation, let p( Then by the Kolmogorov Extension Theorem, the data (X, Y ) induces a stochastic process F such that ∃ ω ∈ Ω s.t. 
$y_{x_i} = F(x_i, \omega)$ for all $i = 1, 2, \cdots, n$, if the distribution $p(Y|X)$ satisfies two conditions: consistency and exchangeability.

Here $X_{1:n} = \{x_1, \cdots, x_n\}$, $\pi \circ X = (\pi(x_1), \pi(x_2), \cdots, \pi(x_n))$, and $\pi \circ Y = (y_{\pi(x_1)}, y_{\pi(x_2)}, \cdots, y_{\pi(x_n)})$.

In the case of MTNP, we observe input and output sequences $X = ((x_1, t_1), (x_2, t_2), \cdots, (x_n, t_n))$ and $Y = (y_{(x_1,t_1)}, y_{(x_2,t_2)}, \cdots, y_{(x_n,t_n)})$ of a function $h : \mathcal{X} \times \mathcal{T} \to \bigcup_{t \in \mathcal{T}} \mathcal{Y}^t$. Note that in the main text, we abbreviate $x_i^t = (x_i, t)$ and $y_i^t = y_{(x_i,t)}$ for visibility. Now we want to show the existence of a stochastic process $H : \mathcal{X} \times \mathcal{T} \times \Omega \to \bigcup_{t \in \mathcal{T}} \mathcal{Y}^t$ from which the data $D = (X, Y)$ is generated. This can be done by showing the following conditions.

where $X_{i_1:i_2} = ((x_{i_1}, t_{i_1}), \cdots, (x_{i_2}, t_{i_2}))$ and $Y_{i_1:i_2} = (y_{(x_{i_1},t_{i_1})}, \cdots, y_{(x_{i_2},t_{i_2})})$.

For any permutation $\pi$ on $X_{1:n} \times \mathcal{T}$, where $X_{1:n} = \{x_1, \cdots, x_n\}$, $\pi \circ X = (\pi((x_1, t_1)), \cdots, \pi((x_n, t_n)))$, and $\pi \circ Y = (y_{\pi((x_1,t_1))}, \cdots, y_{\pi((x_n,t_n))})$.

Here $p(Y|X, C) = \rho_X(Y|C)$ is the conditional distribution of Y given any context C. Note that we condition on C since we are modeling the functional posterior of h rather than the prior.

Now we provide the proof of Proposition 1, which states that the following generative model defines a stochastic process.

To show the conditions of the Kolmogorov Extension Theorem, we need two assumptions on the data generating process (Eq. 20). First, we assume the distribution defined by the data generating process is finite so that the order of integration can be swapped. Also, we assume that the conditional distribution $p(y_{(x_i,t)} | x_i, t, v^t)$ can implicitly select the per-task latent variable $v^t$ among $v^{1:T}$ using the given task index t, i.e., there exists a distribution p such that p(y (xi,t) This means no more than that the latent variables $v^{1:T}$ are indeed task-specific, such that each $v^t$ corresponds to task $f^t$.
Note that our neural-network model of MTNP (Figure 2) indeed satisfies the second assumption, since the decoder selects the corresponding per-task latent variable $v^t$ given the task index $t \in \mathcal{T}$.

Proof. We first show the consistency condition. From the data generating process (Eq. 20),

Next, we show the exchangeability condition. Let $\pi_1, \pi_2$ be the values of the first and second coordinates of $\pi$, such that $\pi((x_i, t_i)) = (\pi_1((x_i, t_i)), \pi_2((x_i, t_i)))$. Then

Here we used the assumption about p(y (xi,t) , v π2((xi,ti)) ). Since $1 \le m < n$ and $\pi$ are arbitrarily chosen, the data generating process (Eq. 20) satisfies the conditions of the Kolmogorov Extension Theorem. Thus there exists a stochastic process $H : \mathcal{X} \times \mathcal{T} \times \Omega \to \bigcup_{t \in \mathcal{T}} \mathcal{Y}^t$, whose realizations are functions of the form $h : \mathcal{X} \times \mathcal{T} \to \bigcup_{t \in \mathcal{T}} \mathcal{Y}^t$.

Note that the latent variable model of MTNP (Eq. 5) is a special case of the data generating process (Eq. 20), where $p(v^t | z, t, C) = p(v^t | z, C^t)$. Thus MTNP is a stochastic process over functions of the form $h : \mathcal{X} \times \mathcal{T} \to \bigcup_{t \in \mathcal{T}} \mathcal{Y}^t$.

We derive the evidence lower bound (ELBO) for $\log p_\theta(Y_D^{1:T} | X_D^{1:T}, C)$. For simplicity, we assume $C^t \subset D^t$ for all t so that $C \subset D$ as well. Also, to avoid confusion, for this derivation we denote the conditional prior networks as $p_\theta(z|C)$ and $p_\theta(v^t|z, C^t)$ and then replace them with $q_\phi(z|C)$ and $q_\phi(v^t|z, C^t)$, respectively, when we introduce parameter sharing between the prior and variational posterior networks. First, the conditional log-likelihood has a lower bound

where $p_\theta(Y_D^{1:T} | X_D^{1:T}, C, z)$ in Eq. 36 can be further expanded by

In Eq. 34 and Eq. 40, we use the conditional independence relations that follow from the latent variable model (Eq. 5). By combining Eq. 36 and Eq. 43, and by sharing the parameters of the conditional priors $p_\theta(z|C)$ and $p_\theta(v^t|z, C^t)$ with the variational posteriors $q_\phi(z|C)$ and $q_\phi(v^t|z, C^t)$, we get the following lower bound.
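For orientation, a bound of the following form is what this combination would be expected to yield. It is a sketch reconstructed from the quantities named in this appendix, not a verbatim copy of the paper's equation. It assumes that the variational posteriors are the shared networks evaluated on the full data, $q_\phi(z|D)$ and $q_\phi(v^t|z, D^t)$, and that the likelihood factorizes over tasks given the per-task latents $v^t$.

```latex
\log p_\theta(Y_D^{1:T} \mid X_D^{1:T}, C)
\;\ge\;
\mathbb{E}_{q_\phi(z \mid D)}\Bigg[
  \sum_{t=1}^{T} \Big(
    \mathbb{E}_{q_\phi(v^t \mid z, D^t)}\big[ \log p_\theta(Y_D^t \mid X_D^t, v^t) \big]
    - D_{\mathrm{KL}}\big( q_\phi(v^t \mid z, D^t) \,\|\, q_\phi(v^t \mid z, C^t) \big)
  \Big)
\Bigg]
- D_{\mathrm{KL}}\big( q_\phi(z \mid D) \,\|\, q_\phi(z \mid C) \big)
```

The first term is a per-task reconstruction term, and the two KL terms pull the posteriors conditioned on the full data towards the conditional priors computed from the context alone, once per task and once for the global latent.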
A.4 ELBO FOR STNP AND JTNP

STNP is no more than a collection of independent Neural Processes (NPs), where each NP corresponds to one task. Using T encoders and decoders $\{(p_{\theta_t}, q_{\phi_t})\}_{t=1}^{T}$, the objective for STNP can be derived by summing up the NP objectives (Eq. 2). We omit the ELBO derivation for NP. Note that parameter sharing is used for the conditional prior network $p_{\theta_t}(v^t|C^t)$ and the variational posterior network $q_{\phi_t}(v^t|D^t)$. On the other hand, JTNP is a single NP that models all tasks jointly by concatenating the output variables into a single vector. Using an encoder $q_\phi$ and a decoder $p_\theta$, the objective for JTNP is the same as the NP objective. Again, the encoder $q_\phi$ serves as both conditional prior and variational posterior. Note that STNP and JTNP model functions with input space $\mathcal{X}$, so there is no superscript t in the input variables.

In this section, we provide architectural and training details about the models used in the experiments (Section 5).

As a strong NP baseline, we adopt the Attentive Neural Process (ANP) (Kim et al., 2019) architecture for STNP and JTNP. The encoder of ANP consists of a latent path and a deterministic path, which compute a latent variable z and a deterministic representation $r_i$ specific to each target example $x_i$, respectively. Then the decoder produces a distribution for the target output $y_i$, which is assumed to be Normal.

The latent encoder samples a global latent z. For each context example $(x_i, y_i^{1:T}) \in C$, we first project it to a hidden representation $s_i^{1:T} = \psi_s(x_i, y_i^{1:T})$ using a single MLP $\psi_s$. Then we aggregate them into a global representation s via self-attention followed by a pooling operation: $s = \mathrm{pool}(\mathrm{Attn}(\{s_i^{1:T}\}_{i \in I(C)}, \{s_i^{1:T}\}_{i \in I(C)}, \{s_i^{1:T}\}_{i \in I(C)}))$. Then the global latent is sampled via two MLPs: $z \sim q_\phi(z|C) = \mathcal{N}(\psi_{(z,1)}(s), \psi_{(z,2)}(s))$.

Deterministic Encoder The deterministic encoder produces a local representation $r_i$ for each $i \in D$.
We first project each context example $(x_i, y_i^{1:T}) \in C^t$ to a hidden representation $d_i^{1:T} = \psi_d(x_i, y_i^{1:T})$ that serves as the value embedding in cross-attention. Then, using the context and target inputs $x_i$ as key and query embeddings, we apply cross-attention along the example axis (per-task).

Decoders Finally, the decoder produces a predictive distribution for each joint target output $y_i^{1:T}$. We first project the input to $w_i = \psi_w(x_i)$, then concatenate it with the global latent variable z and the deterministic representation $r_i^{1:T}$. To compute the output distribution, we first apply two MLPs on the triple $(w_i, z, r_i^{1:T})$. Then for each dimension, we construct the predictive distributions as Normal or Categorical, depending on the corresponding task type.

Here $(\mu_i)_{\mathcal{Y}^t}$ (or $(\sigma_i^2)_{\mathcal{Y}^t}$) denotes the projection of $\mu_i$ (or $\sigma_i^2$) onto the task-specific output space $\mathcal{Y}^t$, obtained by indexing the corresponding dimensions of $\mu_i$. For example, if all tasks are one-dimensional, then this corresponds to selecting the t-th coordinate of $\mu_i$.

This section presents a detailed description of the STNP architecture used in the experiments. STNP consists of T independent ANPs, which together comprise T latent encoders, T deterministic encoders, and T decoders. Then STNP produces the target output distribution by conditioning on the context

In the following, we denote a stacked multi-head attention block (Parmar et al., 2018) by Attn(Q, K, V) and an MLP by $\psi(x)$, as in Section 3.3.

The latent encoders sample per-task latents $v^{1:T} = (v^1, \cdots, v^T)$. For each context example $(x_i, y_i^t) \in C^t$, we first project it to a hidden representation $s_i^t = \psi_s^t(x_i, y_i^t)$ using an MLP $\psi_s^t$ specific to task $f^t$. Then we aggregate them into a task-specific representation $s^t$ via self-attention followed by a pooling operation. Note that each attention is applied along the example axis (per-task) and independently for each task.
Then the per-task latent variables are sampled independently, via MLPs.

Deterministic Encoders The deterministic encoders produce a local representation $r_i^t$ for each $i \in D$ and $t \in \mathcal{T}$. We first project each context example $(x_i, y_i^t) \in C^t$ to a hidden representation $d_i^t = \psi_d^t(x_i, y_i^t)$ that serves as the value embedding in cross-attention. Then, using the context and target inputs $x_i$ as key and query embeddings, we apply T independent cross-attentions along the example axis (per-task).

Decoders Finally, the decoders produce predictive distributions for each target output $y_i^t$. We first project the input to $w_i^t = \psi_w^t(x_i)$, then concatenate it with the corresponding latent variable $v^t$ and deterministic representation $r_i^t$. The output distribution is computed similarly to MTNP, as described in Section 3.3.

We use the hidden dimension d = 128 for all models in the synthetic and CelebA experiments and d = 64 in the weather experiments.

Module                             STNP   JTNP   MTNP
$\psi_s$ (or $\psi_s^t$)           3      3      3
$\psi_d$ (or $\psi_d^t$)           3      3      3
$\psi_w$ (or $\psi_w^t$)           1      1      1
$\psi_y$ (or $\psi_y^t$)           5      5      5
per-task Attn (in STNP, MTNP)      3      -      3
global Attn (in JTNP)              -      3      -
across-task Attn (in MTNP)         -      -      2

For all three models, we schedule the learning rate lr by $\mathrm{lr} = \mathrm{base\_lr} \times 1000^{0.5} \times \min(\mathrm{n\_iters} \times 1000^{-1.5}, \mathrm{n\_iters}^{-0.5})$, where n_iters is the number of total iterations and base_lr is the base learning rate. We also introduce a beta coefficient on the ELBO objective following Higgins et al. (2017), which multiplies each KL term. The beta coefficient is linearly increased from 0 to 1 during the first 10,000 iterations, then fixed to 1. We summarize the training hyper-parameters of the models used in the experiments in Table 7.

The overall description of our neural network model for MTNP is provided in Section 3.3. We use different parameter sharing techniques across the datasets, depending on whether the tasks are homogeneous or not. In synthetic and weather tasks, all output values are one-dimensional.
Thus we tie the parameters of the per-task paths in the encoder and decoder, which gives a more efficient parametrization compared to per-task encoders and decoders. In visual tasks, however, the tasks have different output dimensionalities. Thus in this case, we separate the parameters of all per-task paths. As task identity is implicitly encoded by the separation of task-specific paths, we do not employ task embeddings.

We include two Multi-Output Gaussian Process (MOGP) baselines, MOSM (Parra & Tobar, 2017) and CSM (Ulrich et al., 2015). To make use of the training set of tasks, we pretrain the MOGPs with respect to the kernel parameters using the same meta-training dataset as MTNP, and transfer the learned kernel parameters as a prior in meta-testing. This allows both MOGPs and MTNPs to be trained and evaluated under the same setting. To prevent overfitting, we early-stopped the pretraining based on NLL. We observe that such pretraining is effective in synthetic tasks but not in weather tasks, thus we report the pretrained version for results on synthetic tasks and the non-pretrained version for results on weather tasks.

We also include two gradient-based meta-learning baselines, MAML (Finn et al., 2017) and Reptile (Nichol et al., 2018), that use the same meta-train/meta-test data as our method. We chose these models as they are model-agnostic meta-learning methods that can be applied to our multi-task regression setting with incomplete data. Applied to our problem, the meta-training involves bi-level optimization where the inner loop optimizes the loss for context data and the outer loop optimizes the loss for target data. For these baselines we employ an architecture similar to MTNP, consisting of a 4-layer MLP encoder network shared by all tasks and task-specific 4-layer MLP decoder networks. For fair comparisons, we kept the total number of parameters of these models similar to that of the NP baselines (STNP, JTNP, MTNP).
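The learning-rate and beta schedules given with the training hyper-parameters above can be written out as a short sketch. It follows the stated formulas directly, under the assumption that `n_iters` is the number of iterations completed so far (starting from 1); the function names `learning_rate` and `kl_beta` are illustrative and not taken from the released code.

```python
import math

WARMUP_ITERS = 1000         # the constant 1000 appearing in the schedule formula
BETA_ANNEAL_ITERS = 10000   # iterations over which beta rises linearly from 0 to 1


def learning_rate(n_iters: int, base_lr: float) -> float:
    """lr = base_lr * 1000^0.5 * min(n_iters * 1000^-1.5, n_iters^-0.5)."""
    if n_iters < 1:
        raise ValueError("n_iters must be >= 1")
    warmup = n_iters * WARMUP_ITERS ** -1.5   # grows linearly with n_iters
    decay = n_iters ** -0.5                   # inverse square-root decay
    return base_lr * math.sqrt(WARMUP_ITERS) * min(warmup, decay)


def kl_beta(n_iters: int) -> float:
    """Weight on each KL term: linear from 0 to 1, then fixed to 1."""
    return min(1.0, max(0.0, n_iters / BETA_ANNEAL_ITERS))
```

Under this reading the two branches of the `min` meet at iteration 1000, where the learning rate equals `base_lr`; before that it ramps up linearly, and afterwards it decays with the inverse square root of the iteration count.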
In this section, we describe details of the data generating process and experimental settings of 1D function regression on synthetic tasks. As discussed in the paper, we simulate synthetic tasks which are correlated by a set of parameters $a, b, c, w \in \mathbb{R}$ as follows:

where Sine, Tanh, Sigmoid are the sine, hyperbolic tangent, and logistic sigmoid functions, respectively, and Gaussian(x) is defined as $\exp(-x^2)$. Rather than sharing exactly the same parameters a, b, c, w across tasks, we add task-specific noise to each parameter to control the amount of correlation across tasks as follows:

Thus the input-output pairs of each task are in fact generated as follows:

We split the 1,000 functions into 800 training, 100 validation, and 100 test sets of four correlated tasks. Then we construct a training dataset, a validation dataset, and a test dataset using the corresponding set of generated tasks. For each training and validation datum, we sample 200 input points uniformly within the interval [−5, 5], and apply the corresponding tasks to generate multi-task output values. For each test datum, we choose 1000 input points on a uniform grid over the interval [−5, 5], and generate the multi-task output values similarly. Finally, incomplete data is simulated by randomly dropping each output value $y_i^t$ with probability $\gamma \in [0, 1]$.

For evaluation, we average the normalized MSE, $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i^t - \hat{y}_i^t)^2 / a^2$, on the test dataset. For the prediction $\hat{Y}^{1:4}$, we approximate the predictive posterior mean with Monte Carlo sampling. For example, in MTNP,

We use N = M = 5, resulting in 25 samples in total. For STNP (or JTNP), we sample each latent $v^t$ (or z) 5 times, since there is no hierarchy. Since all the output distributions are Gaussian, the posterior predictive mean can be computed by averaging the means of each sample distribution $p(y_i^t | x_i, v^t_{k,l})$.
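The hierarchical Monte Carlo averaging and the normalized MSE described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_z`, `sample_v`, and `decoder_mean` are hypothetical stand-ins for the trained global-latent sampler, per-task-latent sampler, and decoder mean, and latents are scalars here purely for brevity.

```python
from typing import Callable, List, Sequence


def mc_predictive_mean(
    sample_z: Callable[[], float],
    sample_v: Callable[[float], float],
    decoder_mean: Callable[[Sequence[float], float], List[float]],
    xs: Sequence[float],
    n_z: int = 5,
    n_v: int = 5,
) -> List[float]:
    """Average decoder means over n_z global and n_v per-task latent samples."""
    total = [0.0] * len(xs)
    for _ in range(n_z):
        z = sample_z()                      # one sample of the global latent
        for _ in range(n_v):
            v = sample_v(z)                 # one per-task latent sample given z
            for i, mu in enumerate(decoder_mean(xs, v)):
                total[i] += mu              # accumulate the Gaussian mean
    return [s / (n_z * n_v) for s in total]


def normalized_mse(y: Sequence[float], y_hat: Sequence[float], a: float) -> float:
    """Mean squared error divided by the squared parameter a."""
    n = len(y)
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / n / a ** 2
```

With the default `n_z = n_v = 5` the decoder is evaluated 25 times per task, matching the sample count stated above; for a model without the hierarchy, one would loop over a single latent only.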
To plot the predictions in Figure 3(a), we use the posterior means for both z and $v^t$ (which corresponds to Maximum A Posteriori estimation) and plot the mean and variance of the resulting $p(y_i^t | x_i, v^t)$.

In this section, we provide additional results on the synthetic experiment, with various missing rates γ and also with standard deviations over 5 different random seeds. When the data is incomplete and missing some task labels (i.e., γ = 0.25, 0.5, 0.75), MTNP clearly outperforms the baselines in almost all cases. When complete data (γ = 0) is given, MTNP still outperforms almost all baselines while achieving performance at least competitive with JTNP. Figures 6 and 7 show that MTNP is the most robust against both context size and quality (incompleteness).

Published as a conference paper at ICLR 2022

In this section, we explore the totally incomplete setting where no two task output values are observed at the same input point ($I(C^t) \cap I(C^{t'}) = \emptyset, \forall t \neq t'$), to further validate the practical effectiveness of our method. To generate the totally incomplete dataset, we randomly sample context input points for each task independently, then compute the corresponding output points according to Eq. 65 and Eq. 66. Note that in this case we do not drop any output points, so that the context size is $m = |I(C)| = \sum_{t \in \mathcal{T}} |I(C^t)|$. We evaluate the baselines and our method in two different scenarios: (1) training on a partially incomplete or complete dataset, then testing on a totally incomplete dataset, and (2) both training and testing on a totally incomplete dataset. We observe that in both scenarios, MTNP outperforms the baselines.

Scenario (1), with varying context size (m). All models are trained on partially incomplete data with γ = 0.5, except JTNP, which is trained with γ = 0.

Scenario (2), with varying context size (m). All models are trained on totally incomplete data, except JTNP, which is trained with γ = 0.
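The construction of a totally incomplete context can be illustrated with a small sketch. It assumes continuous inputs drawn uniformly from [−5, 5] (the interval used for the synthetic tasks) and rejects exact repeats so that the per-task input sets are disjoint by construction; the function names and the per-task sizes in the usage note are hypothetical.

```python
import random
from typing import Dict, List


def sample_totally_incomplete_inputs(
    sizes: Dict[str, int], low: float = -5.0, high: float = 5.0, seed: int = 0
) -> Dict[str, List[float]]:
    """Sample context inputs per task so that no input is shared between tasks."""
    rng = random.Random(seed)
    used = set()
    contexts: Dict[str, List[float]] = {}
    for task, size in sizes.items():
        points: List[float] = []
        while len(points) < size:
            x = rng.uniform(low, high)
            if x not in used:          # guard against an (unlikely) exact repeat
                used.add(x)
                points.append(x)
        contexts[task] = points
    return contexts


def total_context_size(contexts: Dict[str, List[float]]) -> int:
    """m = sum over tasks of the number of context inputs of that task."""
    return sum(len(points) for points in contexts.values())
```

For example, `sample_totally_incomplete_inputs({"Sine": 3, "Tanh": 4, "Sigmoid": 2, "Gaussian": 1})` yields four disjoint input sets with a total context size of 10, and the outputs would then be computed per task at its own inputs only.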
In this section, we provide additional results on the time-series regression experiment on weather data, with various missing rates γ and also with standard deviations over 5 different random seeds. When the data is highly incomplete (i.e., γ = 0.5, 0.75), MTNP clearly outperforms the baselines in general. When complete or less incomplete data (γ = 0, 0.25) is given, MTNP still outperforms the gradient-based meta-learning baselines and MOSM while achieving performance at least competitive with CSM, STNP, and JTNP.

(Kanopoulos et al., 1988) on the RGB images to generate continuous-valued edges. This corresponds to the Canny edge detector (Canny, 1986) without non-maximum suppression, which is also used in Zamir et al. (2018). For the PNCC task, we apply a pretrained 3D face reconstruction model on the RGB images to generate PNCC label maps (Guo et al., 2020a). At each pixel, the PNCC label consists of the (x, y, z) coordinate of the facial keypoint located at the pixel. In summary, labels for RGB are 3-dimensional, Edge are 1-dimensional, Segment are 19-dimensional, and PNCC are 3-dimensional vectors. We split the 30,000 images into 27,000 train, 1,500 valid, and 1,500 test images.

To evaluate the accuracy of the multi-task prediction, we average the MSE, $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i^t - \hat{y}_i^t)^2$, on the test images for continuous tasks (RGB, Edge, PNCC), and average the mean IoU on the test images for the discrete task (Segment). The predictive posterior mean is computed by Monte Carlo sampling, the same as in the 1D experiment. For categorical outputs, we discretize the prediction with the argmax operator, $\hat{y}_i^t = \arg\max_k p(y_i^t = k | x_i, v^t)$. To evaluate the consistency of predictions across tasks (coherency), we translate each RGB prediction to other task labels.
For Edge and PNCC, we use the ground-truth label generation algorithm and the pretrained model used to generate the ground-truth labels for the translation, respectively. For Segment, we fine-tuned DeeplabV3+ (Chen et al., 2018) with an ImageNet (Krizhevsky et al., 2012) pretrained ResNet-50 (He et al., 2016) backbone. We refer to the GitHub repository of Yakubovskiy (2019) (Yakubovskiy, 2020) for the DeeplabV3+ model. After the translation, we measure MSE and 1 − mIoU for continuous and discrete tasks respectively, to evaluate the disagreement (as opposed to the coherency) between the predictions.

To examine the correlation across tasks learned by MTNP (Table 3 in the main paper), we compare the performance before and after MTNP observes a set of source data. The source data consist of all examples labeled with the source tasks. For example, if the target task is RGB and the source tasks are Edge and Segment, we give Edge and Segment labels for all pixels, while no RGB or PNCC labels are given. Since MTNP requires at least one labeled example for each task, we give a single completely labeled example, chosen randomly, to MTNP as a base context before MTNP observes the source data. In total there are $\binom{4}{1} + \binom{4}{2} + \binom{4}{3} = 14$ different combinations of source tasks. By excluding the cases where the target task is in the set of source tasks, a total of 7 different combinations of source tasks remain for each target task. To measure the performance gain from task $f^1$ to $f^2$, we average the performance gain of $f^2$ over all sets of source tasks that contain $f^1$. For example, the performance gain from Edge to RGB is computed by averaging the performance gains $\delta_{\text{Edge} \to \text{RGB}}$, $\delta_{\text{Edge,Segment} \to \text{RGB}}$, $\delta_{\text{Edge,PNCC} \to \text{RGB}}$, and $\delta_{\text{Edge,Segment,PNCC} \to \text{RGB}}$, where $\delta_{A \to B}$ denotes the performance gain from source tasks A to target task B.
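The counting and averaging in this transfer analysis can be checked with a short sketch. The `gain` callback is a hypothetical placeholder for the measured improvement $\delta_{A \to B}$ of a trained model; only the enumeration of source sets and the averaging rule are taken from the description above.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, FrozenSet, List, Sequence

TASKS = ("RGB", "Edge", "Segment", "PNCC")


def source_sets(tasks: Sequence[str] = TASKS) -> List[FrozenSet[str]]:
    """All non-empty proper subsets of the task set (sizes 1 to len(tasks) - 1)."""
    return [frozenset(c) for k in range(1, len(tasks)) for c in combinations(tasks, k)]


def source_sets_for_target(target: str, tasks: Sequence[str] = TASKS) -> List[FrozenSet[str]]:
    """Source sets that do not contain the target task."""
    return [s for s in source_sets(tasks) if target not in s]


def pairwise_gain(
    source: str,
    target: str,
    gain: Callable[[FrozenSet[str], str], float],
    tasks: Sequence[str] = TASKS,
) -> float:
    """Average gain on `target` over all valid source sets containing `source`."""
    relevant = [s for s in source_sets_for_target(target, tasks) if source in s]
    return mean(gain(s, target) for s in relevant)
```

With four tasks, `source_sets()` has 14 elements and `source_sets_for_target("RGB")` has 7, of which 4 contain Edge, which are exactly the four gains averaged in the Edge to RGB example.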
In this section, we provide additional results on the image regression experiment, with various missing rates γ and also with standard deviations over 5 different random seeds.

In this study, we explore the effect of using self-attention layers before pooling operations and of using a PMA layer for pooling in MTNP. The variants of MTNP are as follows. (1) MTNP-A: MTNP without self-attention and using average pooling, (2) MTNP-P: MTNP without self-attention and using PMA, (3) MTNP-SA: MTNP with self-attention and using average pooling, (4) MTNP-SP: MTNP with self-attention and using PMA. The results are summarized below. Note that we consistently use the MTNP-SP architecture for the experiments in Section 5. As shown in the results, MTNP with self-attention outperforms the one without self-attention by a large margin, implying that self-attention is critical in MTNP.

To investigate whether the self-attention modules operate as desired in the per-task attention stage, we visualize the attention weights assigned to each context point. For each task, we compute an averaged attention weight placed on a set of context points by averaging the dimensions of the attention map of shape (nL, nH, nQ, nK) down to (1, 1, 1, nK), where nL and nH denote the number of self-attention layers and the number of heads in each layer, and nQ = nK refer to the number of query (Q) and key (K) vectors, with Q = K. Note that this averaged attention weight can be interpreted as the averaged importance put on each of the context points by the self-attention module. Results are visualized in Figure 17. As shown in the figure, points located at representative positions or in sparse regions are attended to more than others. This shows that self-attention in the per-task paths operates as desired.

In this study, we explore the effect of parameter sharing in the per-task encoder networks and decoder networks of STNP and MTNP. The variants of STNP and MTNP are as follows.
(1) STNP-S: STNP with shared encoder and decoder, (2) STNP-TS: STNP with task-specific encoders and decoders, (3) MTNP-S: MTNP with shared encoder and decoder in per-task branches, (4) MTNP-TS: MTNP with task-specific encoders and decoders in per-task branches. Note that we consistently use the STNP-TS and MTNP-S architectures for the 1D experiments (synthetic and weather) and the STNP-TS and MTNP-TS architectures for the 2D experiments (CelebA) in Section 5.

As can be seen in the tables, in the weather dataset the models with parameter sharing (STNP-S & MTNP-S) show performance comparable to their respective non-sharing counterparts (STNP-TS & MTNP-TS). On the other hand, parameter sharing slightly improves the performance of the models in the 1D synthetic case. We conjecture that using the same architecture and parameters for all tasks acts as a good inductive bias for the models, considering that all tasks share the same global parameters a, b, c, w. Nonetheless, we still find that MTNPs consistently outperform STNPs regardless of whether the parameters are shared or not, validating the effectiveness of MTNP in capturing and exploiting functional correlation for multi-task learning problems.
STNP-S     0.0049 ± 0.0003   -1.1381 ± 0.0223   0.0066 ± 0.0006   -1.0527 ± 0.0241   0.0621 ± 0.0069    0.0240 ± 0.1210
STNP-TS    0.0046 ± 0.0004   -1.1514 ± 0.0181   0.0069 ± 0.0004   -1.0390 ± 0.0106   0.0632 ± 0.0072    0.1273 ± 0.1898
MTNP-S     0.0037 ± 0.0001   -1.1832 ± 0.0165   0.0054 ± 0.0001   -1.1049 ± 0.0154   0.0546 ± 0.0021   -0.1006 ± 0.0696
MTNP-TS    0.0036 ± 0.0002   -1.1818 ± 0.0076   0.0053 ± 0.0003   -1.0885 ± 0.0246   0.0519 ± 0.0013   -0.0662 ± 0.0828

STNP-S     0.2675 ± 0.0104    0.9537 ± 0.1347   0.2629 ± 0.0043    0.7847 ± 0.0503   0.0084 ± 0.0007   -0.9877 ± 0.0312
STNP-TS    0.2607 ± 0.0082    1.1242 ± 0.2362   0.2631 ± 0.0044    0.8563 ± 0.0637   0.0086 ± 0.0008   -0.9815 ± 0.0283
MTNP-S     0.2276 ± 0.0028    0.6557 ± 0.0433   0.2215 ± 0.0043    0.6660 ± 0.0141   0.0073 ± 0.0003   -1.0331 ± 0.0147
MTNP-TS    0.2187 ± 0.0043    0.7213 ± 0.0953   0.2253 ± 0.0100    0.6663 ± 0.0283   0.0071 ± 0.0003   -1.0232 ± 0.0144

In this study, we compare STNP, JTNP, and MTNP without the deterministic encoder, which correspond to STNP-L, JTNP-L, and MTNP-L in the tables, respectively. Note that these variants correspond to direct NP implementations of STNP/JTNP/MTNP rather than ANP. The results are provided in the tables below. The overall trends are the same as for the models with a deterministic encoder, which demonstrates that the effectiveness of MTNP does not depend on a specific choice of architecture (vanilla NP or ANP).

STNP-L      0.0213 ± 0.0025   0.0086 ± 0.0008   0.0045 ± 0.0006   0.0809 ± 0.0101   0.0405 ± 0.0063   0.0234 ± 0.0026
S+JTNP-L    0.0201 ± 0.0030   0.0106 ± 0.0008   0.0061 ± 0.0003   0.0687 ± 0.0056   0.0376 ± 0.0014   0.0227 ± 0.0019
MTNP-L      0.0096 ± 0.0022   0.0027 ± 0.0005   0.0012 ± 0.0002   0.0417 ± 0.0037   0.0169 ± 0.0010   0.0091 ± 0.0007

Table 32: Average MSE and NLL on weather tasks, with m = 10 and γ = 0.5.
STNP-L      0.3381 ± 0.0026   0.8792 ± 0.0008   0.3411 ± 0.0049   0.8798 ± 0.0066   0.0106 ± 0.0005   -0.8804 ± 0.0118
S+JTNP-L    0.2832 ± 0.0095   0.7347 ± 0.0179   0.2851 ± 0.0123   0.7692 ± 0.0111   0.0127 ± 0.0007   -0.8827 ± 0.0220
MTNP-L      0.2432 ± 0.0046   0.6267 ± 0.0220   0.2372 ± 0.0056   0.6659 ± 0.0163   0.0085 ± 0.0003   -1.0006 ± 0.0065

We also compare MTNP with a variant that uses the deterministic encoder only, which corresponds to MTNP-D in the table below. To emphasize the benefits of generative modeling in MTNPs, we include MTNP evaluated on the best sample among 25 predictive samples, which corresponds to MTNP-best in the table. We observe that MTNP and MTNP-D are comparable on the synthetic and weather datasets, which seems reasonable as we designed the deterministic encoder to mimic the latent encoder of MTNP (e.g., they employ both per-task and across-task inferences). However, MTNP-best clearly outperforms MTNP-D, which implies that MTNP can generate more accurate samples while MTNP-D cannot.

Table 33: Average normalized MSE on synthetic tasks, with varying context size (m) and γ = 0.5.

In this study, we compare different types of task embeddings for MTNP. MTNP-Onehot uses a one-hot encoded vector for the task embedding $e^t$, while MTNP-learnable uses a learnable vector for the task embedding. Note that we consistently use MTNP-learnable for the 1D experiments in Section 5.

Table 35: Average normalized MSE on synthetic tasks, with varying context size (m) and γ = 0.5.

We observe that MTNP with one-hot embedding is comparable to MTNP with learnable embedding. To further investigate the effect of the learnable embedding, we visualize the task embeddings learned by MTNP using the t-SNE algorithm. We include the visualization results in Figures 18 and 19. As shown in the figures, we find that the learned task embeddings are well-separated from each other and uniformly distributed in the embedding space.
From these observations, we conjecture that well-separated task embeddings provide sufficient task information for MTNP.

Figure 19: t-SNE plot (with 2 components) of the learned task embeddings of MTNP in weather tasks.

Kernels for vector-valued functions: A review
Improved few-shot visual classification
A computational approach to edge detection. PAMI, 1986.
Rich Caruana. Multitask learning. Machine learning
Encoder-decoder with atrous separable convolution for semantic image segmentation
Model-agnostic meta-learning for fast adaptation of deep networks
Meta-learning mean functions for gaussian processes
Dropout as a bayesian approximation: Representing model uncertainty in deep learning
Conditional neural processes. In ICML
Neural processes. In ICML Workshop
Convolutional conditional neural processes. In ICLR
Towards fast, accurate and stable 3d dense face alignment
Learning to branch for multi-task learning
Multitask learning and benchmarking with clinical time series data. Scientific data
Deep residual learning for image recognition
Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework
An Introduction to Probability Theory
Mimic-iii, a freely accessible critical care database
Design of an image edge detection filter using the sobel operator
Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. In ICLR
Imagenet classification with deep convolutional neural networks