key: cord-0949404-7pmngy6d authors: Wang, Chao; Lu, Xuantao; Wang, Wei title: A theoretical analysis based on causal inference and single-instance learning date: 2022-02-28 journal: Appl Intell (Dordr) DOI: 10.1007/s10489-022-03193-0 sha: 84ab7b3073a69d0773b4b9f4902a28fc10683606 doc_id: 949404 cord_uid: 7pmngy6d Although using single-instance learning methods to solve multi-instance problems has achieved excellent performance in many tasks, the reasons for this success still lack a rigorous theoretical explanation. In particular, the potential relation between the number of causal factors (also called causal instances) in a bag and the model performance is not transparent. The goal of our study is to use the causal relationship between instances and bags to enhance the interpretability of multi-instance learning. First, we provide a lower bound on the number of instances required to determine causal factors in a real multi-instance learning task. Then, we provide a lower bound on the single-instance learning loss function when testing instances and training instances follow the same distribution and extend this conclusion to the situation where the distribution changes. Thus, theoretically, we demonstrate that the number of causal factors in the bag is an important parameter that affects the performance of the model when using single-instance learning methods to solve multi-instance learning problems. Finally, combining with a specific classification task, we experimentally validate our theoretical analysis. Multi-instance learning (MIL) was originally used for the field of hand-printed numerals identification [1] and drug activity prediction [2] . Instead of considering a series of individually labeled instances, MIL focuses on the labels of sets (or called bags) of instances and demonstrate strong capabilities in many areas [3] , e.g., speech localization [4] , entity classification [5] , protein structure determination [6] , biometric authentication system [7] [8] [9] [10] , human pose estimation [11] , medical image analysis [12] , understanding chest CT imaging of Wei [13] , and clinical outcome prediction of COVID-19 [14] . However, the theoretical research of MIL still seriously lags behind the actual application speed. In other words, we are not very clear about some potential relationships between parameters (e.g., the number of instances in the bag) and the performance of the model. For example, users always neglect the influence of the number of instances in the bag. This makes the parameter settings of some experiments depend on the subjective intuition of the experimenters rather than interpretable principles. In most MIL tasks, we often overlook the following two issues: (1) how does the number of positive instances in the bag affect the value of the loss function? Most of the MIL tasks are based on a constraint premise, that the label of a bag is negative if and only if the bag does not contain any positive instance. However, this assumption ignores that the influence of the number of positive instances in the bag on the performance of the model. Another issue is that (2) testing instances tend to be assumed to follow the same distribution as the training instances (for the brevity of description, we refer to this assumption as the TTD). However, the TTD assumption is often violated in many real tasks [15] , and whether the TTD assumption holds directly affects the performance of the model. For example, when some real task scenarios cannot support the TTD assumption, the performance of some models will degrade [16] . Therefore, how to effectively deal with a series of MIL problems caused by the above issues has become a new research hotspot. To this end, many researchers put forward some methods in recent research, such as the approach based on the covariate shift setting [17] , and test-distribution-based methods [18] . Unfortunately, the above methods rely on the prior distribution to improve the performance of the models and do not provide a rigorous theoretical explanation of the above two issues [19] . Besides, many researchers pay attention to the relation between MIL and causal inference. The advantage of using causal inference theory to study MIL is that causal inference can describe and explain the complex internal mechanism of the system by the causal relationship between data [20] . Moreover, causality helps the model make stable predictions in unknown environments [21] , such as rain causes slippery roads and excessive release of carbon dioxide is one of the causes of global warming. This can be viewed as a stable mapping from cause to effect, therefore, a causalitybased classifier is more stable than an association-based classifier [22] . Although causal inference theory opens up new opportunities for MIL problems, the mathematical principles behind some tasks have not been well clarified. Therefore, in a MIL task, we should not only focus on which instances cause changes in the bag labels (we denote the instances that affect the bag label change as causal factors), but also understand the possible connections between the causal factors and the performance of the model. Recently, Zhang et al. [23] propose a novel MIL framework towards robust classification in distributional biased data. However, they neglect important metrics such as theoretical guarantees for obtaining causal factors. Feng et al. [24] attempt MIL from similar and dissimilar bags and obtain a series of performance analyses on MI classifiers and SI classifiers. However, they do not explicitly answer how the number of instances in the bag affects the model performance during the SIL method to solve MIL problems. In this paper, we combine the single-instance learning (SIL) method with the potential outcome framework (POF) [25] in causal inference to solve the two issues. We obtain a series of rigorous theoretical analysis results and verified our conclusions in a specific experimental task. In brief, the contributions in this paper are summarized as follows: -In the MIL task, we provide the lower bound on the number of samples required to determine causal factors (also called causal instances in [23] ) within 95% confidence intervals (see Theorem 2) . -Based on the standard MI assumption [26] , we capture that causal factors have a direct effect on the loss function of the single-instance learning method. We provide a lower bound on the loss function of the SIL method when the TTD assumption holds (see Theorem 3). And we extend the conclusion of Theorem 3 to the situation where the distribution changes (see Theorem 4 ). -We provide a rigorous theoretical analysis of the effect of the parameters in the above theorem on model performance and validate our conclusions with an object classification experiment. Distribution change In general, distribution change refers to the fact that the training and testing instances do not follow the TTD assumption, the causes of which are diverse. Ignoring the difference between the training and testing samples will lead to a decrease in the predictive ability of the model built based on the standard supervised approach. Therefore, how to solve the problem of MIL distribution change is a research hotspot [16, 27] . The covariate shift setting is a typical representative, based on which Sugiyama et al. [17] propose an estimation approach. The advantage is that this method does not rely on density estimation. Park et al. [28] propose an algorithm for calibrating prediction based on the probability of the covariate shift. In addition, many researchers also try to resolve the problem of distribution change in MIL from different perspectives [15, 29] . For instance, Zhou et al. [30] present an effective way to analyze the case that the training instances are not independent identically distributed in MIL. Wang et al. [31] construct a new MIL-based neural network to improve its ability to diagnose diseases based on medical images in the presence of unbalanced data. Recently, it has become a popular trend to explore the intrinsic connection between causal inference theory [20] and MIL problems [18, 32, 33] . Kuang et al. [21] develop an algorithm, called DGBR, which uses causality to achieve stable predictions. Shen et al. [22] propose an algorithm, named CRLR, which effectively improves the learning ability of the model in the presence of agnostic selection bias. Zhang et al. [23] obtain causal instances by evaluating the causal effects of the labels of bags and instances. They define a specific form of causal instance and propose a novel MIL algorithm that improves the robustness of the classifier. The connection between SIL and MIL Approaches to solving MIL problems can be broadly classified into two categories: One class of approaches solves the MIL problem directly at the bag-level or instance-level. Another type of approach uses SIL methods to solve MIL problems (this paper focuses on this type of approach). We briefly summarize some recent related work in Table 1 that demonstrates the advantages and disadvantages of SIL and MIL for different tasks. Although a large number of prior studies demonstrate the performance of MIL and SIL separately in different tasks (e.g., Table 1 ), most of the conclusions are summarized by the results of specific experiments, while few studies critically analyze the rationale behind solving MIL problems using SIL methods at the theoretical level [16, 23, 24] . Inspired by the above work, in this paper, we first provide a lower bound on the number of samples required to identify causal factors using causal inference theory. Subsequently, we demonstrate the superiority and limitations of the SIL approaches to solving MIL problems by rigorous theoretical analysis. Finally, we extend our conclusions to the case of distribution change. Our conclusions reveal to a certain extent why SIL methods can effectively solve MIL problems. In this section, the key notations and some basic notions about multi-instance learning, causal factor are reviewed. Table 2 summarizes the key notations commonly used in this paper and their descriptions. Multi-instance Learning (MIL)) Let finite set X ins = {x} denote the space consisting of all instances x, and let N(X ) = {o(x)}, where o(x) be the function that can be described by o(x) : X ins → N. The goal of MIL can be formalized as learning the function map MI L : N(X ) → Y, where Y represents the label set. In particular, the binary MIL problem can be described as: In our work, we only consider the binary MIL problem, unless otherwise specified. To simplify the presentation, for any finite set, we use {·} n to denote the set containing n elements, i.e., {x 1 , x 2 , ..., x n } {x i } n , equivalently, |{x i } n | = n (where '| · |' denotes the cardinal number of the set). Let B = {B i } r be the set containing r pairs (B i , Y +/− ), where B i {x i } n represents that a bag contains n instances. Object class recognition and drug activity prediction [35] MILES transforms MIL into the SIL, which improves classification accuracy and robustness to labeling uncertainty without satisfying the standard assumptions. SIL Prediction on agnostic test data [21] The classifier constructed based on causality can achieve stable prediction for unknown test environments. SIL Classification tasks under positive instance sparsity [36, 37] SI classifiers are inferior to MI classifiers when the bag with positive labels contains fewer positive instances. SIL Large-scale MIL problems [38] As an extended version of [35] , [38] [39] Linear programming procedures can perform multiinstance classification tasks with adjusting instance contributions while ignoring standard assumptions. Learning from similar and dissimilar bags [24, 40] [ 24] learns from similar and dissimilar bags and obtains a series of performance analyses on MI classifiers and SI classifiers. In this paper, we assume that |B i | = |B j =i |, which is a reasonable extension. For example, for any B − i , the label of the bag has no relation with the number of instances inside the bag. We use B(+) = {B + i } k + and B(−) = {B − i } k − to represent that the set of all bags labeled Y + and Y − , respectively. Given a bag B i = {x i } n , and φ : X ins → Y i,i∈{+,−} , the MIL function BIN MI L ψ(·) can be equivalently described in the following form [26] : Where ψ is denoted as the labeling function of the bag and '(· ∨ ·)' is the Boolean OR function. Then the instance x is denoted as causal factorx. By Definition 2, instances in the bag can now be divided into two categories: One category is the set consisting of causal factors (denoted as {x}, we use causal factors instead of causal instances here to emphasize the importance of these instances in the MIL task becausex is the unique factor that causes the bag to be labeled Y + ) and the other category is the set consisting of non-causal instances (denoted as {x ncf }). Obviously, the instance x can be formalized as: The causal factors ensure that ψ(B) = Y + holds for any bag B that contains them, and the non-causal instances do not affect the label of the bag. For example, we imagine such an image, a cat playing on the lawn (i.e., Fig. 1 ). The classifier will label images based on the presence or absence of the cat in the image. If there is at least one cat in the image, the classifier outputs Y + , otherwise, the classifier outputs Y − . Apparently, the cat in the image is a causal factor, while the rest, such as lawn, flower, bamboo basket, etc., are non-causal instances, and these noncausal instances do not affect the output of an oracle classifier. Therefore, in this paper, we follow the stander multi-instance assumption, i.e., the bag is labeled Y − , if and only if the bag does not contain any causal factor. Fortunately, causal factors can be obtained by estimating the causal effect of an instance on the label of the bag. Specifically, for the label Y , the causal effect of an intervention (e.g., adding or removing a candidate instance x to or from a bag) is a comparison of the potential outcomes of the two label states under the POF [23] . Inspired by the application of causal inference to MIL tasks, in the next section, we aim to capture the connection between causal factors and MIL tasks. In this section, we focus on three main problems. Specifically, in Section 4.1, we obtain a minimum number of candidate instances required to determine the causal factor. In Section 4.2, we analyze the connection between causal factors and the SIL loss function. In Section 4.3, we analyze the effect of parameters on the decision threshold of the classifier. In this section, we determine whether an instance x is a causal factor by estimating the average causal effect (ACE) [41] (also called average treatment effect (ATE), denoted as (x)) on the bag label. Specifically, let Y T (x∈B) be the potential label of bag B if x ∈ B. T (x ∈ B) can be viewed as a treatment of B. (i.e., adding x to B). Similarly, let Y T (x / ∈B) be the potential label of bag B if the instance x is not present in B. Therefore, the (x) can be defined as: where E(·) represents the expected value of the bag label. Note that T (x ∈ B) can be obtained by adding the candidate instance to B (we use C + = {x|x ∈ B + } = {x i } q to denote the candidate set and thex to denote the causal factor in C + ). Similarly, T (x / ∈ B) can be obtained by removing the candidate instance from bag B. Therefore, in the MIL task, (4) can be equivalently described as: where Y be the new label of bag B after being treated. Note that if the performance of the classifier is good enough (e.g., a perfect classifier), (5) can assist the classifier to effectively distinguish whether an instance x is a causal factor or a non-causal instance. The specific conclusion is shown in Theorem 1. Causal effect of instance x on label Y +/− ) Given an instance x, the estimated value of (x) can be approximated as: This theorem is elaborated upon in [23] . Theorem 1 describes the causal relationship between the label of the bag and the instances it contains under an ideal experimental situation. The key is how to estimate the value of E[ Y |Y B = Y − , B x∈B ] in actual tasks. An intuitive idea is to traverse all the instances, and finally obtain the value of (x). However, even in a relatively small dataset, the cost of this calculation is very huge. Therefore, a certain number of samples are usually collected to estimate the value of (x). For example, we use the sample average to estimate the first term in Theorem 1 (i.e., E[ Y |Y B = Y − , B x∈B ]) as: where ξ(·) is a classifier with .., q}) as input and a label as output. The core idea of this approach is to determine whether x i is a causal factor by using the label of Because if x i is (not) a causal factor, then the ideal classifier will predict the label of B − i (x i ) as 1 (0). In the MIL task, we can determine which instances are causal factors by estimating (7) . For example, by setting the threshold t, and we select the higherscoring x as the causal factorx (this goal can be achieved by Algorithm 1 proposed in [23] ). However, [23] ignores a crucial constraint on the C + . In other words, to obtain a valuation of (7) within an appropriate confidence interval, we need to determine how many instances the candidate set C + contains at least. To achieve the determination of causal factors at the lowest possible cost. We prove a lower bound of the samples required to estimate (7), which is the minimum value of q = |C + | of the candidate set, by the following theorem. In an ideal MIL task, the sample lower bound for acquiring the 95%confidence interval at a sub-linear cost is q ≥ log 40 2 2 . Proof Given the instance Utilizing Hoeffding's inequality [42] , we have Furthermore, (8) can be extended to the form of the twosided variant as follows [43] : Let [δ c − , δ c + ] be the confidence interval and let α c be the level of significance for [δ c − , δ c + ] (i.e., the probability of making an error), then we have Solving the above inequality, we require at least q ≥ log( 2 α c )/2 2 samples to acquire (1 − α c ) confidence interval [δ c − , δ c + ]. In particular, let α c = 0.05, we have q ≥ log 40 2 2 . The above theorem provides the cost of acquiring the confidence interval. Therefore, we can use Theorem 2 to obtain a lower bound (i.e., log 40 2 2 ) on the number of samples (i.e., q) used to determine the causal factorx within 95% confidence interval. Using the SIL approach to solve MIL problems is a popular way [21] . Therefore, in this section, we follow the framework proposed by [23] for further theoretical analysis of MIL using the SIL approach. Specifically, we will explore the potential connection between the number of causal factors in the bag and the loss function of the SIL method. Please note that in our subsequent analysis, we do not care about the specific experimental details of the model in determining causal factors (this problem is elaborated upon in [23] ). Therefore, we assume that the model has determined the causal factorx, i.e., there exists an ideal classifier ξ(·) to determine causal factors. The core of our study is to describe the effect of the number of causal factors in the bag on the loss function in different situations where the TTD assumption holds or not. And further explains why using SIL methods to study MIL problems is an effective tool. Specifically, in Theorem 3, based on the TTD assumption, we use rigorous mathematical language to perfect the conclusion in [16] . Furthermore, we extend the conclusion of Theorem 3 to the case where the distribution changes (see Theorem 4). These conclusions help us to capture more thoroughly the connection between data distribution, causal factors, and loss function. where '+' indicates that the label of the bag is Y + , the subscript 'c' means that the causal factors. Similarity, let where '−' indicates that the label of the bag is Y − , the subscript 'n' represents the non-causal instances. For the sake of computational simplicity, we assume that all bags contain the same number of instances, i.e., |B i | = |B i =j |. Note that, according to the definition of causal factor, for any B + i , no matter whether there are one or more causal factors in B + i , the label of bag B + i is always Y + . Therefore, we set each bag B + i to have the same number of causal factors (This setting is just to simplify the theoretical calculations, and in real experimental scenarios in Section 5, we estimate the proportion of causal factors in each B + ). Formally, let P + i ⊆ + (P + j ⊆ + ) denote the subset of all causal factors in B + i (B + j ), we can set |P + i | = |P + j | < τ + c . Similarly, the number of remaining non-causal instances in Similarly, for B − i , no matter how many non-causal instances be the set of non-causal instances in the B + , and | + | = τ + n , where '+' means the label of the bag is Y + , the subscript 'n' means the non-causal instances. Intuitively, we have that Assuming that the instances in bag B + contain causal factors and non-causal instances, we have that for each B + i , To sum up, we have the following assumptions, i.e., Based on the above analysis, in the next, we consider a situation that the instances in + and − follow the TTD assumption. Based on the TTD assumption, we obtain the lower bound of the SIL loss function. Let ξ(·) denote the Heaviside step function such that ξ(x ∈ + ) = 1, and ξ(x / ∈ + ) = 0. If the instances in + and − follow TTD assumption, then the SIL loss function L(x) can be minimized by linear function ξ m of ξ such that where β = τ + n · (τ + n + τ − n ) −1 . Proof First, we introduce the Heaviside step function 1 ξ(·) and a linear function ξ m (·) of ξ(·). Since ξ(x) = 1 if and only if x ∈ + , i.e., ξ(x) = 1, ξ(·) can be denoted as 1 The Heaviside function may be defined as a piecewise function, which can describe the ideal binary classification. where α is the weight of the causal factors in the B + and β is the weight of non-causal instances. Intuitively, for α, since all causal factors are only in B + , α = 1 in (17) . For β, we have at least two alternative forms for the value of β, which are Next, we need to determine β (whether β = β 1 or β = β 2 ) by calculating the extreme value of loss functions L(x) [16] , We first determine the specific form of ξ(·). Assuming that ξ(·) be a composite function about ξ σ (·), where 0 < ξ σ (x) < 1 ensures that the loss function L(x) has extreme value in the interval (0, 1). According to the definition of ξ(·), (20) can be equivalently described as x=x log ξ σ (x) = 0. ξ σ (x) can be expressed as a scalar that follows the distribution Dis (denoted as ξ σ ∼ Dis). Therefore, L(x) can be represented as: Assuming that i log ξ σ (x i ) be a large sample data, thus we can use the expected value of data to replace the average of i log ξ σ (x i ) as: Then we have − L(x) = | + | · E ξ σ ∼Dis log ξ σ (x) To simplify the presentation, we let h(ξ σ (x)) = | + | · log ξ σ (x) + | − | · log(1 − ξ σ (x)). (24) Next, we obtain the extreme value h(ξ σ (x)) by solving the derivative h (ξ σ (x)). After a simple calculation, we obtain that Therefore, the coefficient β in ξ m (x) can be determined to be β 1 . Next, we calculate the minimum value of L(x). Specifically, Obviously, all parameters in E ξ ∼Dis [·] are constants, thus (26) is equivalent to which proves the theorem. In particular, we consider an extreme case where there is only one causal factor in the bag, i.e., τ + c = 1. According to our previous assumption, there is τ + n ≈ τ − n , Thus, we obtain a minimum value of L(x) is τ − n log 4. Based on the TTD assumption, Theorem 3 shows that the function ξ m (·) can optimize the loss objective function L(x) to a minimum value. However, the TTD assumption is often violated in most real-world applications. Therefore, we extend the conclusion of Theorem 3 to the situation where the distribution changes. The conclusion is shown in Theorem 4. Proof According to Theorem 3 and (22), we know that Since D + (x) = D − (x), we assume that the new distribution Dis is an affine combination of distribution D + (x) and D − (x), i.e., where β 1 and β 2 are the coefficients in (18) . Given a Heaviside step function ξ(·) H[ξ σ (·)], which implies that According to (17) in Theorem 3, we can determine that the function ξ m (·) has the following form, where α = 1(because all causal factors are only in B + ). Similar to the setting of parameters β 1 and β 2 in Theorem 3, we assume that According to (30) , it is not difficult to find that Note that the coefficient β is a coefficient function β β( Dis) about the distribution Dis. Therefore, ξ m (x) = 1 and ξ m (x / ∈ + ) = β 1 or β 2 . According to (22) and the Dis contains the distribution information of D + (x) and D − (x), we have We obtain the extreme value of the function h 1 (ξ σ (x)) by solving the derivative h 1 (ξ σ (x) ). For a determined instance x, Dis(x) and Dis(x) can be regarded as ratio constants, the extreme solution of h 1 (ξ σ (x)) is Therefore, the coefficient β in the function ξ m (x) can be determined to be β 1 . Furthermore, according to (30) , we have that To make the expression more concise, we set Thus h 1 (ξ σ (x)) max can be rewritten as: Dis(x) Therefore, we obtain that (39) which completes the proof. In this section, we discuss the effect of different parameters on the decision threshold (denoted as d t ) of the classifier when the TTD assumption holds and does not hold, respectively. (1) The effect of parameters τ − n , τ + n on function ξ m (x) when TTD assumption holds As Theorem 3 reveals, we find that τ − n and τ + n have a direct effect on the extremes of the loss function L(x), which can be minimized by ξ m (x). This implies that the performance of the classifier can be improved with the aid of ξ m (x). Therefore, we explore the essential connection between the parameters τ − n and τ + n and ξ m (x). Note that ξ m (x) is a new classifier developed from ξ(x), and ξ m (x) can be formalized as: where β 1 = | + | · (| + | + | − |) −1 = τ + n · (τ + n + τ − n ) −1 . Since τ + n and τ − n take different values for different bags, this directly affects the range of β 1 and the determination of the decision threshold d t for the classifier in real task. Therefore, our goal is to find a stable decision threshold d t such that max(β 1 ) < d t < 1. Specifically, if the instances in + and − follow TTD assumption, in terms of Theorem 3, we have For the sake of descriptive brevity, we set τ − n τ + n = k τ , then we obtain that As discussed previously, we set the number of instances in all bags to the same number, hence, according to (15) , k τ can be rewritten as: Note that in a real experiment, τ − n is a certain number (e.g., if the image is considered as a bag, the image can be partitioned into a certain number of patches), and the more causal factors in the bag (i.e., τ + c ), the closer to 0 β 1 is. Therefore, in this case, the decision threshold d t is empirically chosen as d t > 0.5. Besides, we consider another special case where the bags with positive labels contain fewer causal instances (i.e., sparse positive instances), and we introduce the witness rate (WR [44] ) to analyze the d t selection of the classifier in the case of sparse causal instances (i.e., low witness rate). Decision threshold selection at low witness rate. In MIL, the witness rate describes the ratio of the number of positive instances to the total number of instances. Therefore, in this paper, WR can be equivalently written as: When the WR is high, the classification task can be trained by a conventional supervised learning algorithm. However, the performance of the algorithm is affected at a low WR [45] . Obviously, in our task, when the total number of instances are determined, the value of WR depends only on | + |. According to (15) , we have that which is equivalent to Thus, if | + | is small, we have that τ − n ≈ τ + n , i.e., k τ ≈ 1. This means that even at a low WR, β 1 can still reach 0.5. In summary, the decision threshold d t > 0.5 can stably satisfy the classifier in the experiment. In other words, ξ m (x) shows a more stable classification performance at a low witness rate. (2) The effect of parameters τ − n , τ + n , D − (x) and D + (x) on function ξ m (x) when TTD does not hold. Similar to the analysis of (40) in Theorem 3, we analyze the effect of + and − on ξ m (x) if the instances in + and − are drawn from the different distributions D + (x) and D − (x), respectively. In this scenario, ξ m (x) is a new classifier constructed by ξ(x), and ξ m (x) can be formalized in the form of where . Specifically, if the instances in + and − do not follow the TTD assumption, then we have In contrast to (41) N 1 (μ 1 , σ 1 ) and D + (x) ∼ N 2 (μ 2 , σ 2 ). For the sake of computational simplicity, we assume that distributions N 1 and N 2 share an identical variance, i.e., σ 1 = σ 2 = σ c . In this scenario, we have Let μ 2 μ 1 = k μ , then (49) can be rewritten as: Therefore, for fixed instance x, mean μ 1 and standard deviation σ c , −xμ 1 σ 2 c and −μ 2 1 +2xμ 1 2σ c in (50) are considered constants. f (k μ | μ 1 , σ c , x) is a quadratic function with k μ as the independent variable, and since μ 1 2σ c > 0, there exists a minimal value of the function f (k μ | μ 1 , σ c , x). After a simple calculation, we can obtain that the minimum value of the function f (k μ | μ 1 , σ c , x), i.e., when k μ = x μ 1 (i.e., x = μ 2 ). Through the previous analysis, we confirm that both tends to 1 (i.e., low witness rate). In this case, the value of β 1 will be close to 1, and the d t should be relatively large, e.g., d t > 1 1+min(k ) . In other words, the decision threshold of the classifier should be set as large as possible when the TTD assumption is violated. In this section, we perform experiments on the object classification task to validate our theoretical conclusion. Without exact coordinates for each object in the image, the object classification task can be seen as a classic standard MIL problem, the overview of the classification framework is shown in Fig. 2 . The following experiments will focus on verifying the conclusion of Theorem 3 in a real task scenario. Since Theorem 4 is a reasonable extension on Theorem 3 and the only difference is that the training data and the testing data are from different distributions. This means that experimentally verifying Theorem 4 is only a minor difference from verifying Theorem 3 at the level of experiment setup and model training, and is easy to implement. Therefore, in the rest of this paper, we do COCO [46] . The experiments to verify our conclusions are conducted on dataset COCO, 2 which is a large-scale dataset for object detection, segmentation, and captioning tasks published by Microsoft. COCO contains 1.5 million object instances. In addition, this dataset provides 5 captions per image. We use all nouns in the captions as labels of the object classification task. -Image Processing: we consider each image I (size of 576 × 576) as a bag containing a certain number of objects (12 × 12 patches of size 224 × 224 with stride 32, i.e., each image contains 144 patches). -Model Setup: we adopt the VGG16 network [47] (the parameters are shown in Fig. 2 ) to obtain the feature maps (size of 12 × 12), and we use a one-dimension convolution followed by a sigmoid to generate the probability of 80 words for every image patch. Let {y i } l be the label set, according to the experimental setup, we have l = 80. For a fixed label y j , the MIL objective can be obtained by where ξ y j (x i ) is the output of the classifier ξ(·) (in Theorem 3) with x i as input, under the label y j , x i is a patch (144 in total) of an image I . Hence, we can obtain the total cost by summing all O y j (I ), i.e., where G y j (I ) is ground truth corresponding to label y j , CE stands for cross-entropy. Similarly, for a specific label y j , the SIL objective can be obtained by where ξ y j m (x i ) is the output of classifier ξ m (·) (i.e, (17) in Theorem 3) corresponding to the label y j . Then, we compare the performance of the model at the bag level. Specifically, for an image I and a specific label y j , we use as the bag level score. Finally, we compare S y j (I ) with G y j (I ) using four popular metrics, which will be given in the next section. We compare the performance of the model with ξ(·) and the optimized model with ξ m (·) by four metrics [48] , mean average precision (mAP), hamming loss (HL), coverage (COV), and ranking loss (RL). -Mean average precision evaluates the mean of average precision for all classes, where average precision (AP) can be formalized as: whereŷ is the ground truth and rank (x i , y) refers to the ranking of the object x i predicted to be y. n is the total number of samples in the testing data. The value of mAP ranges from 0 to 1. The closer to 1 the value is, the better the performance of the model is. -Hamming loss measures the number of times a label is misclassified, e.g., a label belonging to a sample is not correctly predicted, or a label that does not belong to a sample is incorrectly predicted as belonging to that sample. It can be formalized as: where f (·) is a multi-label classifier and SD(·) evaluates the symmetric difference between two sets. T is the total number of labels. The value of HL ranges from 0 to 1. The closer to 0 the value is, the better the performance of the model is. -Coverage denotes the mean of the least ranked true label among all samples. It can be formalized as: The value of CE ranges from 0 to 1. The closer to 0 the value is, the better the performance of the model is. -Ranking loss measures the number of times that the predicted probability value of a relevant label is smaller than the predicted probability value of an irrelevant label. Ranking loss (RL) can be formalized as: Where Y c i is the complementary set of Y i . The value of RL ranges from 0 to 1. The closer to 0 the value is, the better the performance of the model is. The core of the experiment is to compare ξ(·) in the MIL approach (i.e., Fig. 2 ) with ξ m (·) in the SIL approach (i.e., Theorem 3). The results are given in Table 3 . As shown in Table 3 , we find that the model ξ m (·) outperforms the original multi-instance classification model in all four metrics. Overall, the results in Table 3 indicate that: -The model with ξ m (·) improves mAP by 2.92%. This is a good indication of the validity of the model with ξ m (·). -Model with ξ m (·) also achieves 0.035, 0.353, and 0.003 improvements in metrics HL, COV, and RL. The results show that the improvement of mAP does not come at the expense of other metrics, which sufficiently proves the conclusion of our theoretical analysis. -Causal factors have two effects on multi-instance learning, one is that the causal factors in a bag determine the bag label, and the other is that their number affects the performance of the model. Specifically, we know that the main difference between the model with ξ(·) and the model with ξ m (·) is the coefficient β = τ + n · (τ + n + τ − n ) −1 . When the number of instances in the bag is determined (e.g., each image is divided into 144 patches in our experiments), then the coefficient β will strictly depend on the number of causal factors in the bag (because τ + n = τ − n − τ + c ). To our surprise, the coefficient τ + n plays a positive role in all four of these main evaluation metrics. -Together, these results provide important insights into capturing the inner relationship between multi-instance learning and causal factors. In this paper, we conduct a theoretical analysis of MIL problems by using causal inference theory and the SIL method. We first analyze the role of positive factors in the bag using causal inference theory. In Theorem 2, we prove a lower bounder of the number of sample instances needed to effectively determine the causal factors within a certain confidence interval. We then analyze the impact of data distribution on the multi-instance learning problem. Most of the previous MIL tasks are based on a strong constraint (i.e., the TTD assumption) on data distribution. However, in many real applications, the TTD assumption is not followed. We capture the relationship between the number of instances in the bag and the loss function when the TTD assumption holds and does not hold, respectively i.e., Theorem 3 and Theorem 4. Specifically, we exhaustively analyze the number of instances in the bag and the advantages of using a singleinstance approach to solve multi-instance tasks. Although some previous studies have illustrated the drawbacks of the single-instance approach in multi-instance learning tasks, our conclusions show that the single-instance approach demonstrates good robustness in solving multi-instance learning problems with acceptable time costs. In addition, an important detail is that we tend to ignore the effect of the number of positive and negative instances in the bag on the model when constructing the bag. Our conclusions show that although the number of positive instances in the bag does not affect the bag label (based on the definition of bag label in multi-instance learning), it has a direct impact on the performance of the model. In addition, we analyze the effect of some parameters on the decision threshold of the classifier. The theoretical analysis shows that when the training and testing samples follow independent identical distributions, the decision threshold can be empirically chosen to be 0.5, while when the training and testing samples do not follow independent identical distributions, the decision threshold should be appropriately increased to ensure the performance and effectiveness of the classifier. Finally, we verify our above theoretical analysis by a classical MIL task, the experimental results and the theoretical analysis provide important insights for researchers using causal inference theory and the SIL method to study multi-instance learning tasks. Integrated segmentation and recognition of hand-printed numerals Solving the multiple instance problem with axis-parallel rectangles Multi-instance clustering with applications to multi-instance prediction Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices Corpus-level finegrained entity typing Unsupervised multi-instance learning for protein structure determination Multi-instance iris remote authentication using private multi-class perceptron on malicious cloud server Privacypreserving and verifiable multi-instance iris remote authentication using public auditor Jointly learning multi-instance hand-based biometric descriptor Multi-instance cancellable biometrics schemes based on generative adversarial network Amil: Adversarial multi-instance learning for human pose estimation Localization of critical findings in chest x-ray without local annotations using multi-instance learning Synergistic learning of lung lobe segmentation and hierarchical multi-instance classification for automated severity assessment of covid-19 in ct images A multi-instance support vector machine with incomplete data for clinical outcome prediction of covid-19 Multi-instance learning with key instance shift Multi instance learning for unbalanced data Density ratio estimation in machine learning Robust classification under sample selection bias Multi-instance learning with distribution change The book of why: the new science of cause and effect Stable prediction across unknown environments Causally regularized learning with agnostic data selection bias Robust multi-instance learning with stable instances Multiple-instance learning from similar and dissimilar bags Causal inference for statistics, social, and biomedical sciences: An introduction A review of multi-instance learning assumptions Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary Calibrated prediction with covariate shift via unsupervised domain adaptation Multiple instance learning: A survey of problem characteristics and applications Multi-instance learning by treating instances as non-iid samples Ami-net+: A novel multiinstance neural network for medical diagnosis from incomplete and imbalanced data Does distributionally robust supervised learning give robust classifiers Bi-directional mapping for multi-label learning of label-specific features Single-vs. multiple-instance classification Miles: Multiple-instance learning via embedded instance selection Multiple instance learning for sparse positive bags Instance elimination strategy for non-convex multiple-instance learning using sparse positive bags Scalable algorithms for multiinstance learning A linear programming approach to multiple instance learning Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization Statistics and causal inference Probability inequalities for sums of bounded random variables Probability inequalities for the sum in sampling without replacement Convex multiple-instance learning by estimating likelihood ratio Multiple-instance learning as a classifier combining problem Microsoft coco: Common objects in context Very deep convolutional networks for large-scale image recognition Ml-knn: A lazy learning approach to multi-label learning