key: cord-0756417-kaxcxynk
authors: Hoffmann, Geoffrey W.
title: Proteomic analyser with applications to diagnostics and vaccines
date: 2004-06-21
journal: J Theor Biol
DOI: 10.1016/j.jtbi.2004.02.011
sha: a7230f520d66b7ac49e9abe75c4cb1c1a678b388
doc_id: 756417
cord_uid: kaxcxynk

This paper describes a method for proteomic analysis with applications to diagnostics and vaccines. A panel of N (⪢1) reagents called X(j), with j = 1 to N, is used. The binding strength of each of the X(j) reagents to each other is measured, for example by an ELISA assay, giving an N × N matrix K. The matrix K is used to define another set of N reagents called Y(j), with j = 1 to N, each of which is a linear combination of the X(j) reagents and each of which is tailored to be complementary to one of the X(j) reagents. Each of the N pairs of reagents X(j) and Y(j) defines an axis in an N-dimensional shape space. The definition of these axes facilitates proteomic analysis of diverse biological samples, for example, mixtures of proteins such as serum samples or T cell extracts. A method for defining and measuring similarity between pairs of biological samples and between sets of biological samples in the context of the set of N reagent pairs is described. This leads to methods for using the N reagent pairs in the diagnosis of diseases and in the formulation of preventive and therapeutic vaccines. The relationship of this work to previous research on shape space is discussed.

Immune system V region proteomics is important because the immune system V region repertoire is changed or "skewed" in many diseases, including cancer, autoimmune diseases and graft versus host disease (Pilch et al., 2002; Wucherpfennig et al., 1992; Imberti et al., 1991; Smith et al., 1995; Rebai et al., 1994; Ebling et al., 1988). This skewing opens possibilities for innovations in diagnostic testing.
It is also possible that some diseases can be prevented and/or treated if the skewing is sufficiently characterized and counteracted by a suitable perturbation, namely an immunization precisely tailored to reverse the particular skewing. A full proteomic description of the specific (V region) components of a particular immune system would constitute a list of the concentrations of each of millions of lymphocytes, antibodies and specific T cell factors, together with the isotypes, amino acid sequences and three-dimensional structures of the corresponding V regions. Even with the spectacular advances that are currently being made in proteomics, such a description is not a realistic goal, and even if it were, achieving it might not be particularly useful. Each individual has his or her own set of V regions, due to different V region genes, different MHC (major histocompatibility complex) genes that affect the expressed repertoire of T cells, and different histories of exposure to a wide range of antigens. Furthermore, different somatic mutations in each individual contribute significantly to the generation of the V region repertoire.

One recent approach to diagnostic proteomics is the SELDI-MS technology (Surface-Enhanced Laser Desorption/Ionization-Mass Spectrometry) coupled to pattern recognition software. This is not suited to V region proteomics because it is based on mass differences between proteins, and while (for example) IgG antibodies with different V regions can have slightly different masses, each person has a unique spectrum of antibodies. On the other hand, ELISA-based protein array technologies are becoming available that are suitable for V region proteomics as described in this paper.

I here describe a method for proteomic analysis that builds on our previously defined concept of serological distance coefficients (Hoffmann and Tufaro, 1989).
In the earlier work, experimentally measurable similarity coefficients S[A,B|C] specify the extent to which a pair of substances, A and B, are similar in the context of a diverse reagent, C. The definition of S[A,B|C] is the fraction of C that binds both A and B divided by the sum of (i) the fraction that binds A but not B, (ii) the fraction that binds B but not A and (iii) the fraction that binds both A and B. The value of S[A,B|C] is then necessarily a number between zero and one. This definition was applied (conceptually) also to similarities between complex mixtures of substances, such as the antibodies of two serum samples, A and B. A "distance coefficient" D[A,B|C] between two sera, A and B, in the context of C, was defined as one minus the similarity coefficient in the same context. Methods for the experimental measurement of these coefficients and their possible use in the diagnosis and prognosis of disease conditions were described.

The improved method utilizes a number N (>>1) of reagents, rather than a single diverse reagent. Each reagent can be an individual substance, for example a protein, possibly an antibody, or a mixture of substances. This produces a much larger data set than using a single diverse reagent, but it is still a very small set compared with the complete listing of V regions and their concentrations described above. The result is a measure of similarity based on an N-dimensional shape space, which is a more powerful tool for applications to diagnostics and vaccines. The concept of an N-dimensional shape space has been discussed by Perelson and Oster (1979), and a formulation that permits an experimental determination of the dimensionality of a shape space has been described by Lapedes and Farber (2001). It will become clear that the N-dimensional shape space of this paper is different from both of these; I compare the different approaches near the end of the paper.
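As a minimal illustration of the similarity and distance coefficients recalled above from the earlier work, the following sketch computes S[A,B|C] and D[A,B|C] from three measured fractions of the reagent C. The function names and the example fractions are hypothetical and are not taken from the paper.

```python
def similarity(both, a_only, b_only):
    """S[A,B|C] computed from fractions of the diverse reagent C.

    both   -- fraction of C that binds both A and B
    a_only -- fraction of C that binds A but not B
    b_only -- fraction of C that binds B but not A
    """
    bound = both + a_only + b_only
    if bound <= 0:
        raise ValueError("no fraction of C binds A or B; S is undefined")
    return both / bound


def distance(both, a_only, b_only):
    """D[A,B|C], defined as one minus the similarity coefficient."""
    return 1.0 - similarity(both, a_only, b_only)


# Hypothetical fractions: 20% of C binds both, 10% only A, 10% only B.
print(similarity(0.2, 0.1, 0.1), distance(0.2, 0.1, 0.1))
```

Because the numerator is one of the three non-negative terms in the denominator, the value necessarily lies between zero and one, as stated in the text.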
We denote the N reagents by X(j) (with j = 1 to N), and use them most simply at a uniform concentration $C_0$. We measure the binding (relative affinity) of each of these reagents to each other using, for example, an ELISA assay. This produces a matrix K with elements $K_{jk}$ (j = 1 to N, k = 1 to N). Such K matrices for IgM antibodies have been described by Holmberg et al. (1989) and Kearney et al. (1987).

We next define N additional reagents, which we denote by Y(j) (j = 1 to N). Each of the Y(j) reagents is made up of a linear combination of the X(j) reagents, with the amount of the kth component being proportional to $K_{jk}$. Those components that have strong binding to X(j) are present in Y(j) at a high concentration, while those with little or no binding are included at a low or zero concentration. For a given value of j, X(j) and Y(j) are complementary to each other, and together the pair defines an axis in the N-dimensional shape space. There are N such pairs, which together define the N axes of an N-dimensional shape space.

There are two possible ways of normalizing the concentrations of the Y(j) reagents to establish a symmetry between the X(j) reagents and the Y(j) reagents. One is to make the total concentration of the components of Y(j) such that the binding signal obtained for Y(j) binding to X(j) (in the case of an ELISA assay, with Y(j) binding to X(j) on the plate), in the linear range of the assay, is equal to the converse binding signal (binding of X(j) to Y(j), also in the linear range of the assay). The other method is simply to set the total concentration of each Y(j) equal to $C_0$. The former method leads to the definition of a convenient virtual N-dimensional origin for the shape space, namely a hypothetical sample to which X(j) and Y(j) bind equally in the assay, for all values of j.

We measure the binding of each X(j) reagent (j = 1 to N) to each Y(k) reagent (k = 1 to N). This produces the N × N matrix J with elements $J_{jk}$.
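The construction of the Y(j) reagents described above can be sketched as follows, using the second normalization (total concentration of each Y(j) set equal to C_0). The function name `y_compositions` and the 3 × 3 matrix are hypothetical illustrations, and indices in the code are zero-based whereas the text counts from 1.

```python
def y_compositions(K, c0=1.0):
    """Return the composition of each Y(j) as a list of X(k) concentrations.

    Row j of the result is proportional to row j of K and is scaled so
    that the total concentration of Y(j) equals c0.
    """
    mixtures = []
    for j, row in enumerate(K):
        total = sum(row)
        if total <= 0:
            raise ValueError(f"X({j + 1}) binds none of the reagents")
        mixtures.append([c0 * k_jk / total for k_jk in row])
    return mixtures


# Hypothetical 3 x 3 binding matrix (arbitrary absorbance units).
K = [
    [0.0, 2.0, 2.0],
    [2.0, 0.0, 6.0],
    [2.0, 6.0, 0.0],
]
Y = y_compositions(K, c0=1.0)
for j, mix in enumerate(Y, start=1):
    print(f"Y({j}) =", mix)
```

The first normalization (matching the two converse binding signals) would need measured signals and is not attempted here.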
On the basis of mass action, and subject to linearity of the assay, the expected relative values of the elements of J are

$$J_{jk} \propto \sum_{l=1}^{N} K_{jl} K_{kl}.$$

The diagonal elements of this matrix (j = k) specify the level of binding between the reagents X(j) and Y(j), which have been specifically tailored to be complementary to each other. Hence their mutual binding will produce a strong signal, while there will be relatively weak signals for off-diagonal terms. Thus J is an approximately diagonal matrix.

We now consider a set of biological samples obtained from M individuals. These samples may be, for example but not exclusively, serum, T-lymphocyte extracts, saliva or urine. We use the index i for the samples, so i = 1 to M. We measure the binding of each of the reagents X(j) (j = 1 to N) to each of the samples, again using for example an ELISA assay. For each sample i we thus obtain N absorbance values $A_{iX(j)}$. Together all the elements $A_{iX(j)}$ constitute an M × N matrix that we call $A_X$.

We repeat this process using the set of N complementary reagents, Y(j). We measure the binding of each Y(j) reagent to each sample i, to obtain the matrix $A_Y$ consisting of the elements $A_{iY(j)}$. Subject to the assay being linear, we can however also compute expected relative values of $A_{iY(j)}$ using the product of the matrix $A_X$ and the matrix K:

$$A_{iY(j)} \propto \sum_{k=1}^{N} A_{iX(k)} K_{jk}.$$

The results of these summations are then normalized such that the average of the computed $A_{iY(j)}$ matrix elements is the same as the average of the $A_{iX(j)}$ matrix elements. Hence, remarkably, we can have the benefit of an analysis in terms of the N X(j)/Y(j) axes in shape space without needing to prepare the Y(j) reagents, and without making measurements on all our samples using them! This is because the $A_X$ and K matrices already contain all the physical information.
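The two mass-action predictions in this passage can be sketched numerically. The code assumes that Y(k) contains X(l) in an amount proportional to K[k][l], so that the expected J[j][k] is proportional to the sum over l of K[j][l]·K[k][l], and that the predicted binding of Y(j) to sample i is proportional to the sum over k of K[j][k]·A_X[i][k], rescaled so that the predicted and measured matrices have the same average. The function names and all numbers are hypothetical.

```python
def expected_J(K):
    """Expected relative binding of X(j) to Y(k) under mass action."""
    n = len(K)
    return [[sum(K[j][l] * K[k][l] for l in range(n)) for k in range(n)]
            for j in range(n)]


def predicted_A_Y(A_X, K):
    """Predict the M x N matrix A_Y from A_X (M x N) and K (N x N).

    The raw sums are rescaled so that the mean of the predicted
    elements equals the mean of the elements of A_X.
    """
    n = len(K)
    raw = [[sum(K[j][k] * row[k] for k in range(n)) for j in range(n)]
           for row in A_X]
    mean_x = sum(map(sum, A_X)) / (len(A_X) * n)
    mean_raw = sum(map(sum, raw)) / (len(raw) * n)
    if mean_raw == 0:
        raise ValueError("predicted binding is zero everywhere")
    scale = mean_x / mean_raw
    return [[scale * value for value in row] for row in raw]


# Hypothetical symmetric binding matrix and absorbances for two samples.
K = [
    [0.1, 1.0, 0.2],
    [1.0, 0.1, 0.3],
    [0.2, 0.3, 1.0],
]
A_X = [
    [0.9, 0.2, 0.4],
    [0.1, 0.8, 0.3],
]
J = expected_J(K)
A_Y = predicted_A_Y(A_X, K)
```

Comparing such a predicted matrix with a measured A_Y is one way to carry out the self-consistency check that the text describes next.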
On the other hand, by including the actual measurement of $A_{iY(j)}$ using the Y(j) reagents we have a technology that is more robust, because the individual measurements are then automatically screened for self-consistency. This is analogous to sequencing both strands of DNA, in which case any sequencing errors are immediately revealed, since one sequence predicts the other.

The difference $A_{iX(j)} - A_{iY(j)}$ is a coordinate for the sample i on the X(j)-Y(j) axis, which can be either positive or negative. It specifies whether the sample is more similar to X(j) or to Y(j). There are N such coordinates for each sample. Fig. 1 illustrates this for just two of the N coordinates. It is expected that the N-dimensional coordinates for young, healthy individuals form one cluster (Hoffmann, submitted) while the points for individuals with various diseases cluster around other, disease-specific points.

Let a subset of the M samples be derived, for example, from people who have been classified as having a given disease (the "D set", consisting of, say, $M_D$ samples) and let another subset be from healthy individuals (the "H set", consisting of $M_H$ samples). We obtain $M_H N$ ELISA absorbance results $A_{H(i)X(j)}$ for the healthy group, where i is an index for the sample that goes from 1 to $M_H$, and j is the index for the reagents X(j) that goes from 1 to N. We likewise obtain $M_D N$ results $A_{D(i)X(j)}$ from the disease group, where i goes from 1 to $M_D$. For each value of j we average the values of $A_{H(i)X(j)}$ for i = 1 to $M_H$:

$$A_{H_{av}X(j)} = \frac{1}{M_H} \sum_{i=1}^{M_H} A_{H(i)X(j)}.$$

We likewise average the values of $A_{D(i)X(j)}$:

$$A_{D_{av}X(j)} = \frac{1}{M_D} \sum_{i=1}^{M_D} A_{D(i)X(j)}.$$

Similarly, using the Y(j) reagents we obtain the average values

$$A_{H_{av}Y(j)} = \frac{1}{M_H} \sum_{i=1}^{M_H} A_{H(i)Y(j)}$$

and

$$A_{D_{av}Y(j)} = \frac{1}{M_D} \sum_{i=1}^{M_D} A_{D(i)Y(j)}.$$

Now we consider a set of $M_U$ samples that are unknown in that they are from individuals that may or may not have the disease.
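The coordinates and group averages just described can be computed as in the sketch below. The helper names and the absorbance values are hypothetical; each row is one sample and each column one reagent.

```python
def column_means(rows):
    """Average each reagent column over the samples (rows) of one group."""
    if not rows:
        raise ValueError("the group contains no samples")
    return [sum(column) / len(rows) for column in zip(*rows)]


def coordinates(a_x, a_y):
    """Coordinates of one sample: A_X(j) - A_Y(j) for every axis j."""
    return [x - y for x, y in zip(a_x, a_y)]


# Hypothetical absorbances for two healthy samples and N = 2 reagent pairs.
A_HX = [[0.50, 0.25],
        [0.75, 0.25]]
A_HY = [[0.25, 0.50],
        [0.25, 0.75]]

h_av_x = column_means(A_HX)          # averages over the H set, X reagents
h_av_y = column_means(A_HY)          # averages over the H set, Y reagents
h_av = coordinates(h_av_x, h_av_y)   # coordinates of the average healthy state
print(h_av)
```

The same two helpers apply unchanged to the D set and to the unknown samples.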
We measure the binding of each of the N reagents X(j) to each of the U(i) samples. The corresponding similarity of U(i) to $H_{av}$ in the context of the complete set of the N reagent pairs X(j)/Y(j) is obtained by summing over j. The similarity of sample U(i) to the average of the disease set of samples ("$D_{av}$") is obtained likewise. These measures of similarity or other measures of clustering in the N-dimensional space can then be used as the basis for a diagnosis. The same set of reagents X(j) and Y(j), j = 1 to N, can be used for diagnosis of multiple diseases. All that is additionally needed is a set of samples for each disease, from which the values of $A_{D_{av}X(j)}$ and $A_{D_{av}Y(j)}$ (j = 1 to N) for each disease are determined.

Fig. 1. The reagents X(1) and Y(1) are complementary to each other and define an axis in shape space, and the reagents X(2) and Y(2) define a second axis. The coordinates of sample i are determined by measuring the amount of binding of the reagents X(1), Y(1), X(2) and Y(2) to the sample. Here sample i binds more to X(1) than Y(1) and more to X(2) than Y(2). Hence it is more similar to Y(1) than to X(1) and more similar to Y(2) than to X(2).

So far we have included all of the N reagents in the analysis. We do not need to do this. For the diagnosis of a particular disease or condition we can instead include only those reagents that optimize specificity, sensitivity and simplicity, either individually or jointly.

An advantage of this diagnostic method is that it is based on N-dimensional shape space, with N>>1, in contrast to the two-dimensional map of the previously published serological distance coefficient diagnostic method (Hoffmann and Tufaro, 1989). N-dimensional vectors with N>>1 contain much more precise information than two-dimensional vectors. The method consequently is expected to provide more specific diagnoses.
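The text allows for "other measures of clustering in the N-dimensional space" as a basis for diagnosis. The sketch below illustrates one such alternative only: it assigns an unknown sample to the group whose average coordinates are nearest in Euclidean distance. This is an assumed stand-in chosen for illustration, not the similarity measure defined in the paper, and the function name, group labels and coordinates are hypothetical.

```python
import math


def nearest_group(u, centroids):
    """Return (label, distance) for the group centroid closest to u.

    u         -- coordinates A_X(j) - A_Y(j) of the unknown sample
    centroids -- dict mapping a group label to its average coordinates
    """
    if not centroids:
        raise ValueError("at least one reference group is required")
    label = min(centroids, key=lambda name: math.dist(u, centroids[name]))
    return label, math.dist(u, centroids[label])


# Hypothetical average coordinates for a healthy set and one disease set.
centroids = {
    "H": [0.375, -0.375],
    "D": [-0.25, 0.5],
}
label, dist = nearest_group([0.25, -0.25], centroids)
print(label, dist)
```

Further diseases can be handled by adding one entry per disease to `centroids`, which mirrors the remark that the same reagents serve for multiple diseases.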
Another advantage of this method over the precursor method (Hoffmann and Tufaro, 1989) is that it eliminates the need to do absorptions, which is the most labour-intensive part of that earlier method.

We are currently faced with an important new disease, namely SARS. A coronavirus has been identified as the culprit, but in Canada only about 50% of confirmed SARS patients were found to be positive by direct detection of the virus, namely by polymerase chain reaction or virus culture (Frank Plummer, personal communication). Ultimately, about 95% of confirmed cases developed antibody to SARS coronavirus at 4 weeks. This raises the question of whether SARS can be caused by a proteomic stimulus similar to that caused by the virus. Several years ago there was a similar situation with AIDS and HIV, but then cases of the syndrome that were negative for HIV were defined as "idiopathic CD4+ T-lymphocytopenia", rather than AIDS (Smith et al., 1993; Ho et al., 1993; Spira et al., 1993; Duncan et al., 1993). The definition of AIDS was narrowed to include only those people who are positive for HIV (Morbidity and Mortality Weekly Report, 1999). The method described here may be useful for identifying any additional causes of SARS. The SARS coronavirus may produce one form of repertoire skewing, while other agents may induce a similar but distinct skewing. The method described may thus enable a diagnosis for SARS that is independent of the detection of a coronavirus or any other virus.

In addition to their diagnostic role, the formalism and method developed here are useful for designing and evaluating highly specific multi-component proteomic perturbations to the immune system that function as preventive and/or therapeutic vaccines. For a single pair of reagents X(j) and Y(j) and a given disease D, we can plot the values $A_{D_{av}X(j)}$, $A_{H_{av}X(j)}$, $A_{D_{av}Y(j)}$ and $A_{H_{av}Y(j)}$ on the axes $A_{X(j)}$ and $A_{Y(j)}$ as shown in Fig. 2.
Hence the points labelled $A_{D_{av}X(j)Y(j)}$ and $A_{H_{av}X(j)Y(j)}$ are defined for the average disease and average healthy states, respectively. We need a stimulus that (firstly for this pair of reagents) moves the system from $A_{D_{av}X(j)Y(j)}$ towards $A_{H_{av}X(j)Y(j)}$. An appropriate stimulus consists of two components, one for motion in the horizontal direction (from right to left in the example of Fig. 2) and one for motion in the vertical direction. The reagent Y(j) stimulates the complementary X(j) cells, and hence moves the system along the X(j) axis (the horizontal axis). The reagent X(j) stimulates Y(j) cells, and moves the system in the vertical direction.

We next need to determine the appropriate concentrations of the reagents. At first sight, we might choose a concentration of Y(j) proportional to $A_{H_{av}X(j)} - A_{D_{av}X(j)}$ and a concentration of X(j) proportional to $A_{H_{av}Y(j)} - A_{D_{av}Y(j)}$. A problem with this, however, is that some such tentative relative concentrations are negative, and we cannot include a negative amount of a reagent in the formulation of a vaccine. This problem can be resolved by substituting a positive amount of the reagent X(j) for any computed negative amount of reagent Y(j) [since X(j) is complementary to Y(j)], and likewise a positive amount of Y(j) for any negative amount of X(j). The relative amount of X(j) needed in the vaccine, from the perspective of the X(j)/Y(j) pair of reagents, will be denoted by R[X(j)] and is given by

$$R[X(j)] = \tfrac{1}{2}\left[1 + \mathrm{sign}\left(A_{H_{av}Y(j)} - A_{D_{av}Y(j)}\right)\right]\left(A_{H_{av}Y(j)} - A_{D_{av}Y(j)}\right) - \tfrac{1}{2}\left[1 - \mathrm{sign}\left(A_{H_{av}X(j)} - A_{D_{av}X(j)}\right)\right]\left(A_{H_{av}X(j)} - A_{D_{av}X(j)}\right), \quad (12)$$

where sign x = 1 for x > 0, and sign x = -1 for x < 0. Similarly, the relative amount of Y(j) in the vaccine, denoted by R[Y(j)], is given by

$$R[Y(j)] = \tfrac{1}{2}\left[1 + \mathrm{sign}\left(A_{H_{av}X(j)} - A_{D_{av}X(j)}\right)\right]\left(A_{H_{av}X(j)} - A_{D_{av}X(j)}\right) - \tfrac{1}{2}\left[1 - \mathrm{sign}\left(A_{H_{av}Y(j)} - A_{D_{av}Y(j)}\right)\right]\left(A_{H_{av}Y(j)} - A_{D_{av}Y(j)}\right). \quad (13)$$

In the example of Fig. 2, both components in the expression for R[X(j)] are positive, and both components in the expression for R[Y(j)] are zero. The total specific component of the vaccine is then obtained by summing over j. This is thus a method for formulating an immunogenic (vaccine) stimulus using the base set of N reagents.
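The substitution rule described above (a positive amount of the complementary reagent replaces any negative tentative amount) can be sketched as follows. The implementation uses `max(..., 0.0)` to take positive parts, which is an assumed but equivalent way of writing the rule stated in the text; the function name and the absorbance values are hypothetical.

```python
def vaccine_amounts(h_x, h_y, d_x, d_y):
    """Relative amounts R[X(j)] and R[Y(j)] for every reagent pair j.

    h_x, h_y -- average healthy absorbances with the X(j) and Y(j) reagents
    d_x, d_y -- average disease absorbances with the X(j) and Y(j) reagents
    """
    r_x, r_y = [], []
    for hx, hy, dx, dy in zip(h_x, h_y, d_x, d_y):
        tentative_y = hx - dx  # Y(j) moves the system along the X(j) axis
        tentative_x = hy - dy  # X(j) moves the system along the Y(j) axis
        # A negative tentative amount of one reagent is replaced by a
        # positive amount of its complementary partner.
        r_x.append(max(tentative_x, 0.0) + max(-tentative_y, 0.0))
        r_y.append(max(tentative_y, 0.0) + max(-tentative_x, 0.0))
    return r_x, r_y


# Hypothetical single pair in which the disease average lies to the right
# of and below the healthy average.
r_x, r_y = vaccine_amounts(h_x=[0.25], h_y=[0.75], d_x=[0.75], d_y=[0.25])
print(r_x, r_y)
```

In this example both contributions to the amount of X(j) are positive and both contributions to the amount of Y(j) are zero, matching the situation the text describes for Fig. 2. Replacing the disease averages by one patient's own values gives a personally tailored variant.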
We then still have a single undetermined parameter, namely the ratio of the actual total concentration needed in the vaccine to the numerical values as computed. This parameter can be determined empirically by titration.

The preceding description is in terms of vaccines suitable for a particular disease and for many people. Such vaccines are applicable especially as preventive immunisations. An individual patient may however have skewing that is unique to that patient. In such cases a personally tailored approach may be beneficial. One method is to replace the average absorbance values $A_{D_{av}X(j)}$ and $A_{D_{av}Y(j)}$ with the patient's absorbance values $A_{D(i)X(j)}$ and $A_{D(i)Y(j)}$, respectively, in Eqs. (12) and (13). Another step in the direction of personally tailored vaccines is to replace $A_{H_{av}X(j)}$ with $A_{H(i)X(j)}$ and $A_{H_{av}Y(j)}$ with $A_{H(i)Y(j)}$ in Eqs. (12) and (13), where $A_{H(i)X(j)}$ and $A_{H(i)Y(j)}$ are obtained using historical samples from when the individual i was healthy. Hence N-dimensional perturbations can be tailored to inhibit and/or reverse pathological skewing of V region repertoires at the levels of both populations and individuals.

While the concept of using X(j)/Y(j) axis coordinates emerged in the context of the V region network of interactions of the immune system, this method can also be used more generally to characterise and monitor broader proteomic changes in an individual. Similarity coefficients as defined here can be expected to be a powerful tool for gaining an improved understanding of the idiotypic network. The idiotypic network is the network of V regions that recognise each other (in addition to foreign substances) and is believed to play a central role in the regulation of the immune system (Hoffmann et al., 1988).

The N reagents X(j) need to be substances with reproducible, stable, diverse three-dimensional shapes. They may include for example monoclonal antibodies and/or other proteins from one or more species.
One possibility is that all of the X(j) reagents are monoclonal antibodies, for example all of the IgG class. This would create a symmetry in the system that allows for essentially unlimited diversity in shapes, while ensuring that all the reagents have a similar intrinsic ability to cross-link complementary receptors. (IgG antibodies have two V regions, and thus a single IgG molecule is able to cross-link complementary receptors.) This is relevant for applications to vaccine formulation, since cross-linking of receptors is believed to be the mechanism for the specific stimulation of lymphocytes. This would be preferable to using proteins with varying degrees of polymerization, some of which would be much stronger immunogenic stimuli than others.

Traditionally immunologists have focussed on high affinity interactions, such that an antibody is "specific for" (has a high affinity for) only a very small number of substances. If we include low affinity interactions, each antibody interacts with a much larger fraction of substances, including other antibodies. ELISA technology provides the option of measuring relatively low-affinity interactions, and in order to define directions in shape space precisely, we would prefer that the matrices K and A be not too sparse. This can be achieved by adjusting the conditions of the ELISA such that low-affinity interactions fall within the dynamic range of the assay.

Another possibility for the choice of the X(j) reagents is to use exclusively soluble proteins of a size comparable to each other and without any repeating determinants, again ensuring that they are of similar immunogenicity. The focus of the method is on three-dimensional shapes, rather than on sequences (as in RNA or DNA nucleotide sequences). The method does not require any of the X(j) reagents to be proteins, but proteins do constitute a convenient library of diverse shapes. We would again be interested in including low affinity interactions.
The specificity of the method depends on the value of N and the accuracy of the assay method. If the values of $A_{iX(j)} - A_{iY(j)}$ are obtained simply as Boolean numbers, then with N = 20 the shape space would have $2^{20}$ distinguishable points. With an ELISA assay the results are however analogue rather than Boolean, and each coordinate might have 10 distinguishable values. Then already with N = 5 the shape space would have $10^{5}$ distinguishable points, and with N = 20 there would be $10^{20}$ distinguishable points. This remarkable theoretical resolution is expected to be important for applications to diagnostics and vaccines. It can be tested in experiments in which known mixtures of the X(j) reagents themselves are analysed using the method, and the experimentally determined coordinates are compared with the theoretical predictions.

In their work on shape space Perelson and Oster estimated limits on the size of the repertoire that is needed to reliably respond to antigen, and were also concerned with the necessity not to make antibodies to self. The focus of the theory is the relationship between the volume of shape space covered by the reactivity of a single antibody and the total volume of shape space, and hence the number of different antibodies needed to reliably cover shape space. The main parameters in the theory are the dimension of their shape space N, the size of the repertoire $N_{Ab}$, and the distance in shape space within which an antibody can bind all antigens, ε. These parameters are interdependent, and the theory did not include a method for measuring N or ε. On the basis of literature values of the frequencies of antigen-specific cells, they estimated that N could not be more than 5 or 10.

Lapedes and Farber described a shape space for which a dimensionality can be determined using experimental data.
They used MN experimental data points, namely the binding of M antigens to N antisera, to map the shapes of each of the antigens and sera to points in a D-dimensional shape space (Lapedes and Farber, 2001). The method involves minimizing a function of the experimental data points and the shape space coordinates. The relationship of this shape space to that of Perelson and Oster is not clear to me, since it does not have ε as a parameter. They found D to have a value of 4 to 5.

The earlier papers are based on the premise that there is an intrinsic dimensionality for shape space relevant to immunological recognition. This premise plays no role in our theory, which is a distinct formalism. Our theory is an extension of and improvement on our earlier paper on serological distance coefficients, in which similarity was defined in the context of a single diverse reagent (Hoffmann and Tufaro, 1989). Here we define similarity in the context of an approximately orthogonal set of N axes in shape space. In immunology, context is of overriding importance, since antibodies are made in the context of a set of self antigens, T cells and other antibodies. The dimension N of the space is something we are free to choose, and the choice determines the level of specificity. The larger the value of N, the higher the specificity of the method. The theory leads to new methods for diagnostics and vaccines.

Idiopathic CD4+ T-lymphocytopenia-four patients with opportunistic infections and no evidence of HIV infection
Idiotypic spreading promotes the production of pathogenic autoantibodies
Is the immune system a self-symmetrizing system?
Serological distance coefficients
Establishment and functional implications of B-cell connectivity
Idiopathic CD4+ T-lymphocytopenia-immunodeficiency without evidence of HIV infection
The N-Dimensional Network
Selective depletion in HIV infection of T cells that bear specific T cell receptor Vβ sequences
Non-random VH gene expression and idiotype-antiidiotype expression on early B cells
The geometry of shape space: application to influenza
Theoretical studies of clonal selection: minimal antibody repertoire size and reliability of self-nonself discrimination
Antigen-driven T-cell selection in patients with cervical cancer as evidenced by T-cell receptor analysis and recognition of autologous tumor
Analysis of the T-cell receptor β-chain variable-region (Vβ) repertoire in monozygotic twins discordant for human immunodeficiency virus: evidence for perturbations of specific Vβ segments in CD4+ T cells of the virus-positive twins
T cell receptor repertoire of CD4+ and CD8+ T cell subsets in the allogeneic bone marrow transplant recipient
Unexplained opportunistic infections and CD4+ T-lymphocytopenia without HIV infection. An investigation of cases in the United States
Idiopathic CD4+ T-lymphocytopenia-an analysis of five patients with unexplained opportunistic infections
T cell receptor Vα-Vβ repertoire and cytokine gene expression in active multiple sclerosis lesions

I thank Robert Forsyth and Graham Clowes for helpful comments on the manuscript.