key: cord-0018651-43zwpi1x authors: Esener, Necati; Maciel-Guerra, Alexandre; Giebel, Katharina; Lea, Daniel; Green, Martin J.; Bradley, Andrew J.; Dottorini, Tania title: Mass spectrometry and machine learning for the accurate diagnosis of benzylpenicillin and multidrug resistance of Staphylococcus aureus in bovine mastitis date: 2021-06-11 journal: PLoS Comput Biol DOI: 10.1371/journal.pcbi.1009108 sha: cad02305451d076cbc61a484c5b3f8a596433f0d doc_id: 18651 cord_uid: 43zwpi1x Staphylococcus aureus is a serious human and animal pathogen threat exhibiting extraordinary capacity for acquiring new antibiotic resistance traits in the pathogen population worldwide. The development of fast, affordable and effective diagnostic solutions capable of discriminating between antibiotic-resistant and susceptible S. aureus strains would be of huge benefit for effective disease detection and treatment. Here we develop a diagnostics solution that uses Matrix-Assisted Laser Desorption/Ionisation–Time of Flight Mass Spectrometry (MALDI-TOF) and machine learning, to identify signature profiles of antibiotic resistance to either multidrug or benzylpenicillin in S. aureus isolates. Using ten different supervised learning techniques, we have analysed a set of 82 S. aureus isolates collected from 67 cows diagnosed with bovine mastitis across 24 farms. For the multidrug phenotyping analysis, LDA, linear SVM, RBF SVM, logistic regression, naïve Bayes, MLP neural network and QDA had Cohen’s kappa values over 85.00%. For the benzylpenicillin phenotyping analysis, RBF SVM, MLP neural network, naïve Bayes, logistic regression, linear SVM, QDA, LDA, and random forests had Cohen’s kappa values over 85.00%. For the benzylpenicillin the diagnostic systems achieved up to (mean result ± standard deviation over 30 runs on the test set): accuracy = 97.54% ± 1.91%, sensitivity = 99.93% ± 0.25%, specificity = 95.04% ± 3.83%, and Cohen’s kappa = 95.04% ± 3.83%. Moreover, the diagnostic platform complemented by a protein-protein network and 3D structural protein information framework allowed the identification of five molecular determinants underlying the susceptible and resistant profiles. Four proteins were able to classify multidrug-resistant and susceptible strains with 96.81% ± 0.43% accuracy. Five proteins, including the previous four, were able to classify benzylpenicillin resistant and susceptible strains with 97.54% ± 1.91% accuracy. Our approach may open up new avenues for the development of a fast, affordable and effective day-to-day diagnostic solution, which would offer new opportunities for targeting resistant bacteria. Introduction Staphylococcus aureus is a major opportunistic pathogen, infecting both humans and a wide variety of animals including dairy cattle, which have been recently proven to pose an important zoonotic potential, being the principal animal reservoir of novel human epidemic clones [1] . Worldwide, S. aureus is one of the most frequently isolated pathogens of bovine mastitis, which remains a significant problem in the dairy industry by affecting productivity, profitability, animal health and welfare [2] . The majority of bovine mastitis infections caused by S. aureus exhibit subclinical and chronic manifestations resulting in long-term intramammary persistence [3] . S. aureus can reproduce swiftly upon entering the mammary gland and induce immune reactions that can lead to tissue injuries [4] . Most of the time, the immune response of the cow itself cannot successfully eliminate the S. aureus infection and treatment is needed [4] . Existing S. aureus vaccines are not considered as a preventive solution due to their yet unproven effectiveness against infections [5] . In 2000, Gentillini et al. [6] indicated that beta-lactams (penicillins and cephalosporins), aminoglycosides, macrolides and lincosamides were the most commonly used antibiotics for treatment of bovine mastitis. In addition, according to a recent survey [7] in 2018, penicillins, aminoglycosides and third/fourth generation cephalosporins were the most common antibiotics used on the treatment for bovine mastitis in the UK. The first examples of using benzylpenicillin for bovine mastitis treatment can be traced back to the 1940s [8] . However, penicillinresistant S. aureus strains, carrying a penicillinase/beta-lactamase emerged shortly after its first resistance prediction [36] [37] [38] [39] [40] . Rapid and accurate classification of MRSA and methicillin-sensitive S. aureus (MSSA) based on MALDI-TOF spectral of clinical samples were obtained by several studies [36, 38, 39] . Analogously, high accuracy results have been obtained when applying machine learning approaches to MALDI-TOF spectral data for the prediction of the broad-spectrum antibiotic vancomycin. In particular, successful separation of vancomycin-intermediate (VISA) from vancomycin-susceptible S. aureus (VSSA) on the basis of MALDI-TOF data collected from clinical samples [37, 40, 41] . Recently, van Oosten and Klein [42] , developed classification models for S. aureus which assign the mechanisms of action of antibacterial drugs. The objective of this study was to find a fast and more accurate alternative to standard susceptibility tests, to profile multidrug and benzylpenicillin resistance in S. aureus isolates. To this end, we tested the discriminatory power given by the combination of supervised machine learning and MALDI-TOF, complemented by a protein-protein interaction (PPI) network and a protein structural analysis workflow. Here for the first time, we demonstrate that this approach can be used to develop diagnostic solutions that can discriminate with high performance between benzylpenicillin-and multidrug-resistant and susceptible bovine mastitiscausing S. aureus isolates. In this study, 82 S. aureus isolates had been cultured from milk samples collected between March 2004 and May 2005. The samples were from 24 herds each in a different farm (24 farms) where 23 farms were in England (most of them in the south) and one farm was in Wales (Llangathen, Carmarthen). The locations of the farms and S. aureus isolates collected from each farm are shown in Fig 1 and the breakdown of isolates per farm is shown on S1 Table. Moreover, S2 Table indicates the antimicrobial susceptibility profile of the resistant isolates that were obtained from the same animal. VITEK analysis showed that the cohort consisted of 31 benzylpenicillin-resistant and 51 benzylpenicillin-susceptible isolates. Amongst the resistant isolates, 16 isolates were found to be only penicillin-resistant, while 15 isolates had resistance to multiple antibiotics, among these 15 isolates 13 were found to be resistant to three or more antibiotics, with at least one antimicrobial agent in three antimicrobial classes (multidrug-resistant, MDR), while two isolates were resistant to two or more antibiotics with at least one antimicrobial agent in two antimicrobial classes (extensively drug-resistant, XDR). We considered the MDR and XDR as one class and named it as MDR for simplicity. As shown in Fig 2, out of 15 multidrug-resistant isolates, 11 isolates were resistant to benzylpenicillin, clindamycin, erythromycin, tilmicosin and tylosin; 1 isolate was resistant to benzylpenicillin, clindamycin, tilmicosin and tylosin; 1 isolate was resistant to benzylpenicillin, tetracycline and tilmicosin; 1 isolate was resistant to benzylpenicillin and tetracycline, and 1 isolate was resistant to benzylpenicillin, cefalotin, cefoxitin and oxacillin. 51 isolates were found to be susceptible to all antibiotics used in this study which were benzylpenicillin, cefoxitin, oxacillin, cefalotin, ceftiofur, cefquinome, amikacin, gentamicin, kanamycin, neomycin, enrofloxacin, clindamycin, erythromycin, tilmicosin, tylosin, tetracycline, florfenicol and trimethoprim/sulfamethoxazole. A total of 312 MALDI-TOF raw data spectra had been obtained from 82 S. aureus isolates, on average 4 replicate spectra per isolate. The peak lists, i.e. the lists of paired mass/charge (m/z) ratios and corresponding intensity values, were extracted from the raw spectra as specified in the Methods Section. Supervised machine learning algorithms were used to implement classifiers to verify if the MALDI-TOF peaks associated with isolates could be used to predict their resistance or susceptibility to benzylpenicillin and multidrug. Being based on supervised learning, all methods The circles represent the location of the farms and the size of the circles indicate the number of S. aureus isolates in the farms. The highest number of isolates provided by a single farm was 21, while the lowest was 1. The green colour represents the susceptible S. aureus isolates while the dark and light blue is for multidrug-resistant and benzylpenicillin-resistant only S. aureus isolates, respectively. Natural Earth was used as base to construct the map for the United Kingdom (https://www.naturalearthdata.com/downloads/ 10m-cultural-vectors/10m-admin-0-countries/), the map was created with rnaturalearth package in R. https://doi.org/10.1371/journal.pcbi.1009108.g001 required the availability of training datasets for model building and validation datasets for assessing the performance of the classifier. The prediction performance of each classifier was evaluated measuring accuracy, sensitivity, specificity and kappa. Thirty iterations of nested cross-validation (described in Methods) were used to train each classifier. The following classification methods, available in the scikit-learn library in Python, were tested: naïve Bayes, linear and non-linear (RBF kernel) support vector machines (SVM), decision tree, random forests, multi-layer perceptron neural networks (MLP), AdaBoost (Ada-Boost-SAMME version), logistic regression, linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). We first focused on investigating the possibility to develop a classifier to verify if MALDI-TOF peak lists associated with isolates could be used to predict their multidrug phenotype. Specifically, we considered the spectra of 15 multidrug-resistant isolates (13 MDR and 2 XDR) and 51 susceptible isolates (susceptible to all antibiotics tested in this study). A total of 249 raw spectra were analysed. The pre-processing led to the identification of four different peaks (Table 1) found to appear in at least 30% of all number of spectra. Due to the unbalanced nature of this specific data set (76% of samples were susceptible and only 24% were resistant), we first standardised the four features by a down-sampling method to build robust classifiers [43] . At each one of the 30 runs, 15 samples were randomly chosen out of the initial 51 susceptible samples and a final balanced (50% resistant, 50% susceptible) dataset was generated. The four peaks were then used as features to build ten classifiers and to develop predictive models for the multidrug phenotype. Before the classification, features were standardised (mean centred and unit variance) then resistant and susceptible isolates were labelled as 0 and 1, respectively. 30 runs using nested cross-validation were performed. Amongst the investigated machine learning approaches, LDA, linear SVM and RBF SVM were found as the top three best performance showing algorithms, respectively. Diagnostic systems trained on individual isolates coming from 24 different farms achieved up to (mean result ± standard deviation over 30 runs on the test set): accuracy = 96.81% ±0.43%, sensitivity = 99.88% ± 0.41%, specificity = 95.96% ± 0.52%, and kappa = 91.83% ± 1.37% in LDA algorithm. Detailed performance results of all classifiers on test data can be found in Fig 3. Next, we investigated resistance and susceptibility to benzylpenicillin only. This was to isolate specific patterns underlying resistance to this specific antibiotic. We chose benzylpenicillin because it was the only antibiotic for which we had singly resistant isolates. To this aim, the spectra of the 16 benzylpenicillin-resistant only isolates and 51 susceptible isolates (susceptible to all antibiotics tested in this study) were first pre-processed as described in the Methods Section. Five peaks were found in at least 30% of the overall number of spectra (Table 2 ). Due to the unbalanced nature of this specific data set (76% of samples are susceptible and only 24% are resistant), we first standardised the five features by a down-sampling method to build robust classifiers [43] . At each one of the 30 runs, 16 samples were randomly chosen out of the initial 51 susceptible samples and a final balanced (50% resistant, 50% susceptible) dataset was generated. The five peaks were then used as features to build ten classifiers and to develop predictive models for the benzylpenicillin phenotype. Before the classification, features were standardised (mean centred and unit variance) then resistant and susceptible isolates were labelled as 0 and 1, respectively. 30 runs using nested cross-validation was performed. Amongst the investigated machine learning approaches RBF SVM, neural network and logistic regression were those that achieved the best performance. Diagnostic systems trained on individual isolates coming from 24 different farms achieved up to (mean result ± standard deviation over 30 runs on the test set); accuracy = 97.54% ± 1.91%, sensitivity = 99.93% ± 0.25%, specificity = 95.04% ± 3.83%, and kappa = 95.04% ± 3.83% in RBF SVM algorithm. Detailed performance results of all classifiers on test data can be found in Fig 4. Notably, four peaks (4.807kDa, 6.422kDa, 6.891kDa and 9.621kDa) were found common in the analysis of benzylpenicillin-resistant vs susceptible and multidrug-resistant vs susceptible isolates. When comparing the intensities of these four peaks in the two datasets (resistant vs. susceptible) we observed that 4.807kDa, 6.891kDa and 9.621kDa had a higher average in susceptible isolates consistently while 6.422kDa had a higher average of intensity in benzylpenicillin-resistant only isolates class. 4.305kDa which was specific to benzylpenicillinresistant only analysis had higher average intensity in resistant than susceptible isolates. Because two of the five discriminant proteins found in this work were of ribosomal origins and ribosomal proteins have been used for the discrimination of major S. aureus lineages based on MALDI-TOF analysis [44-47], we performed further analyses in support that our classifiers were picking up susceptibility/resistance differences rather than strain differences. First, we investigated if and how in the sole presence of the ribosomal peaks as input features or in their absence the performance of the classifiers changed and how. As shown in S3 Table by removing only the ribosomal proteins from the analysis of both multidrug and benzyl-penicillin datasets, the performance of the classifiers decreases but not significantly, all indicators are still above 80%. However, when using only the ribosomal proteins as input features for the analysis of both multidrug and benzyl-penicillin datasets, the specificity and Cohen's kappa indicators drop to unacceptable values for both the multidrug and benzyl-penicillin predicted phenotypes. Altogether these results indicate that the ribosomal proteins in combination with the other discriminant proteins are contributing to the susceptibility/resistance classification but do not play a major role in the classification. The five peaks identified as providing optimal discrimination between benzylpenicillin-resistant only and susceptible isolates were further analysed to identify their correspondent S. aureus proteins. It should be noted that the four peaks identified as providing optimal discrimination between multidrug-resistant and susceptible were also amongst these peaks. When compared to the reference S. aureus Newbould 305 (ATCC 29740) proteome, the five peak masses identified the following five S. aureus proteins: two hypothetical proteins (molecular weights of 4801.95 and 6901.37 Da), RpmJ, RpmD and DNA-binding protein HU. The molecular weights of the corresponding proteins changed slightly from those in the original spectra as a result of the search criteria outlined in the Methods (Table 3 ). In order to better understand the functions and roles of these proteins within the drug resistance phenotype, we PTTA is the p-value of Welch's t-test; PKWK is the p-value of Wilcoxon test; index 1 refers to resistant isolates; index 2 refers to susceptible isolates; Ave is the overall intensity average; Ave1 is the intensity average of class 'Resistant'; Ave2 is the intensity average of class 'Susceptible'; StdDev is the overall intensity standard deviation; StdDev1 is the intensity standard deviation of class 'Resistant'; StdDev2 is the intensity standard deviation of class 'Susceptible'; PA is the overall proportion of appearance; PA1 is the proportion of appearance of class 'Resistant'; PA2 is the proportion of appearance of class 'Susceptible'. https://doi.org/10.1371/journal.pcbi.1009108.t002 characterised the molecular functions (MF), cellular components (CC), and biological processes (BP) they may carry out. RpmJ and RpmD are the 50S ribosomal proteins L36 and L30, respectively. HU is a histone-like DNA-binding protein, which interacts with DNA to protect . For the hypothetical proteins, we used 3D threading methods to predict the Gene Ontology (GO) functions ( Fig 5) . The hypothetical protein of 4801.95Da was annotated as COPII-coated vesicle cargo loading (BP), intracellular protein transport (BP), proteolysis (BP), homophilic cell adhesion via plasma membrane adhesion molecules (BP) and ion binding (MF). The hypothetical protein of 6901.37Da was annotated as being involved with the small molecule metabolic process (BP), antibiotic metabolic process (BP), lipid transport (BP) and ion binding (MF). With the aim to further characterise the function of these proteins we did a PSI-BLAST comparative analysis; all discriminant proteins with 100% coverage and significant e-values are shown in Table 3 . Next, we investigated the drug resistance interactome by building the protein-protein interaction network. The benzylpenicillin PPI network, including the four significant proteins (RpmJ, RpmD, HU and HP2) and their 149 first neighbours, was generated (Fig 6) . It should be noted that HP1 could not be found in the S. aureus proteome that was available in STRING database. GO and KEGG analyses of the network showed enrichment for ribosome, nucleic acid binding and catalytic activity (Fig 7) . Tetracycline resistance protein (TetM) and elongation factor G (FusA) were found as the first neighbours of RpmJ and RpmD based on the experimental findings of their homologs in E. coli [49, 50] . Additional four proteins (MecA, BlaZ, PbpA and metallo-beta-lactamase (MBL)) were associated with beta-lactams, rRNA adenine N-6-methyltransferase (ErmA), macrolides resistance, multidrug efflux pump (NorA) and ABC transporter protein (ABC-2). These proteins were found to interact with some first neighbours of the discriminant proteins in the network. Penicillin-binding protein 2 prime (MecA) was shown to share a common interactor, cell division protein (DivIB), with the discriminant protein RpmD. The interactions of MecA-DivIB (interaction score: 0.639) and DivIB-RpmD (interaction score: 0.864) are based on experimental/biological data coming from homologs in other species [51] . MecA was also shown to share a common interactor, DNA polymerase I (PolA), with the discriminant protein HU. While the interaction of MecA-PolA was based on text mining (interaction score: 0.432), the interaction of PolA-HU was based on experimental/biological data (interaction score: 0.668) obtained from homologs in other species [52, 53] . PolA was the only protein which links (based on text mining) HU to other beta-lactam resistance proteins such as penicillin-binding protein I (PbpA) (interaction score: 0.499) and beta-lactamase (BlaZ) (interaction score: 0.425) [52, 54] . PbpA was also shown to share the common interactor DivIB with discriminant proteins RpmD and RpmJ. ErmA was shown to share common nodes (ribosomal proteins) with the discriminant proteins RpmD and RpmJ. ErmA was shown, based on text mining, to also interact with PolA, linked to HU as previously described, (interaction score: 0.611) [55] and to other proteins (RpsA, MetG and GuaA), based on co-expression, gene fusion and co-occurrence (interaction scores >0.400). NorA was shown to share a common interactor, DNA topoisomerase (TopA) with the discriminant protein HU. ABC-2 was shown to share common interactors, signal recognition particle proteins FfH and FtsY with discriminant proteins RpmD and RpmJ. MBL was shown to share a common interactor, putative fatty oxidation complex protein (AID38649.1), with discriminant protein RpmJ based on coexpression, gene fusion and co-occurrence (interaction scores > 0.400). Notably, the PPI analysis of the benzylpenicillin-resistant proteome, 153 proteins-a total of 4 discriminant proteins and 149 first neighbour proteins-showed higher connectivity (clustering coefficient 0.728) than the complete S. aureus proteome network (clustering coefficient 0.421). The average number of neighbours per protein was 68.719 in the benzylpenicillin-resistant proteome network and 27.190 in the complete S. aureus proteome network. In terms of network density, the values ranged between 0.452 (benzylpenicillin-resistant proteome network) and 0.009 (complete S. aureus proteome network) and for the network heterogeneity the values ranged between 0.528 benzylpenicillin-resistant proteome network) and 1.243 (complete S. aureus proteome network). Antibiotic-resistant S. aureus infections are a major concern in human and veterinary medicine. Recently, dairy cattle have been shown to be an important risk factor for zoonotic transfer [1] . Fast, affordable and effective diagnostic solutions which are able to detect the specific S. aureus strains and their antibiotic resistance and susceptibility profiles are key to support effective and targeted treatment selection. Motivated by identifying the most effective method to discriminate (MDR-and benzylpenicillin-) resistant and susceptible S. aureus strains, we approached the task in a principled way by applying optimization techniques to overcome uncertainty in data features and by using a wide repertoire of classification methods. In general, most of the classifiers tested achieved high performance and had kappa values over 85.00%. However, amongst the investigated machine learning approaches RBF SVM, neural network and logistic regression were those that achieved the best performance. Diagnostic systems trained on individual isolates coming from 24 different farms achieved up to (mean result ± standard deviation over 30 runs on the PLOS COMPUTATIONAL BIOLOGY test set): accuracy = 97.54% ± 1.91%, sensitivity = 99.93% ± 0.25%, specificity = 95.04% ± 3.83%, and kappa = 95.04% ± 3.83% in RBF SVM algorithm. We showed that our classification methods while offering high out-of-sample accuracy can also be solved in practical computational times. While our primary aim was to develop machine learning-powered diagnostics discriminating benzylpenicillin-resistant and susceptible isolates of bovine mastitis-causing S. aureus, we also characterized the molecular determinants and interactions underlying the identified antibiotic resistance and susceptible patterns. Several isolates were obtained from the same animal, some of them also presented the same antimicrobial susceptibility profile, possibly suggesting that they represent the same strain. Moreover, none of the S. aureus isolates, except one, were found resistant to cefoxitin or oxacillin, despite being resistant to penicillin, suggesting that penicillin-resistant S. aureus isolates in this study were maybe indeed producers of penicillinase instead of being MRSA. This might be related to the fact that since the first report of S. aureus resistant to methicillin detected in a dairy herd in the United Kingdom [12] and from the first isolation in 2012, of both mecA and mecC LA-MRSA in bulk milk from dairy cattle in the UK [17] , frequency of detection of mecA and mecC LA-MRSA in the UK, gathered from surveillance and large-scale dairy cattle studies, [11, 17] remained low [15] . The low frequency of resistance to cefoxitin or oxacillin found in our cohort is possibly reflecting that LA-MRSA is present in the UK, possibly at a low prevalence level. Our findings showed that the five MALDI-TOF peaks recognized as significant by the trained classifiers were found to correspond to two ribosomal proteins (RpmJ and RpmD), DNA-binding HU protein and two hypothetical proteins. RpmD, DNA-binding HU protein and two hypothetical proteins were also found to give the best discrimination between multidrug-resistant and susceptible profiles of S. aureus. The notion that components of the ribosome are important in the growth rate and antibiotic resistance of bacteria is a well-known concept [56] . Among those determinants involved in intrinsic resistance, ribosomal proteins have been found to deal with the general response to stress [57] . Similarly, recent findings highlighted the existence of ribosomal mutations conferring resistance to antibiotics of several classes not targeting the ribosome [56] . Specifically, it has been shown that ribosomal mutations can contribute to the evolution of multidrug-resistant profiles, by inducing ribosomal mis-assembly, that in turn leads to a systematic transcriptional cell alteration, ultimately impacting resistance to multiple antibiotics by interfering with different cellular pathways [56] . RpmJ was shown to be up-regulated in Pseudomonas aeruginosa when treated with ciprofloxacin and fluoroquinolone [58] and similarly in S. epidermidis [59] . Moreover, rpmJ was shown to confer intrinsic multidrug resistance to a varied set of antibiotics (nitrofurantoin, sulfamethoxazole, rifampicin, tetracycline, vancomycin, ampicillin, colistin, erythromycin) in E. coli, where deletion of this gene caused the bacteria to become more sensitive than wild type [60] . In comparison, fewer literature works have been published about rpmD and antibiotic resistance. Sharma-Kuinekel and collaborators showed that rpmD was downregulated in S. aureus strains which had the antibiotic tolerance related LytSR system silenced [61] . The discriminant protein DNA-binding HU protein was found essential in the bacterial survival and growth of S. aureus [62] . It was also previously found to be correlated to antibiotic resistance by being upregulated in the mutant S. aureus isolates with silenced serine/threonine kinase PknB, which also has a penicillin-binding domain [63] . Besides the proteins with known functions, we also identified two hypothetical proteins, but we were unable to find any evidence so far linking them to antibiotic resistance. Although it was not possible for us to identify the function of these hypothetical proteins, by applying PSI-BLAST and PSORTb v3.0 together with 3D threading modelling searches, the hypothetical proteins are predicted to be involved in pathways such as antibiotic metabolic process, lipid/protein transport and ion binding. Although the elected mechanism to acquire resistance in S. aureus is through horizontal gene transfer, spontaneous mutations in the core genome and positive selection are also mechanisms used by the bacteria to acquire several resistances (e.g., fluoroquinolones, linezolid and daptomycin) [27] . The spontaneous mutation mechanisms involving ribosomal proteins in S. aureus has been previously found to raise antibiotic resistance (e.g. vancomycin) [64] . Future efforts may integrate genome sequencing analysis of the isolated strains towards elucidating and understanding the mechanisms underlying the antibiotic resistance. We were not surprised that known genes such as blaZ, mecA, pbpA, conferring resistance to penicillin in S. aureus were not amongst the MALDI-TOF peaks recognized as significant by the trained classifiers. This is because the mass range resolution of the MALDI-TOF was set to be between 2kDa and 12kDa, and the BlaZ, MecA, PbpA are all proteins with molecular weights higher than 20kDa. However, our PPI cluster analysis results showed that these proteins known to confer resistance have all been found to interact with most of the proteins corresponding to the MALDI-TOF peaks and to form a highly connected benzylpenicillin proteome network. While our approach successfully developed a diagnostic solution to identify antibiotic-resistant signatures, there are limitations to our method which future work may build upon. For one, the working range of 2-12kDa does not give the possibility to study the complete S. aureus proteome in relation to a specific phenotype. The MDR and XDR isolates, collectively named multidrug-resistant isolates, used in this study were all resistant to benzylpenicillin in addition to other antimicrobial agents. Therefore, there is a bias towards peaks determining resistance or susceptibility to benzylpenicillin, which may explain why all 4 multidrug discriminant peaks occurred within the set of benzylpenicillin-only discriminant peaks. In this work, we have opted to pre-process all the data together as previously done by several studies [42, [65] [66] [67] [68] instead of splitting it into a training and validation sets for several reasons. First, given the low number of samples in each of the two minority classes (multidrug resistant and benzylpenicillin-only resistant) it would have been not possible to have a sufficient number of observations in each set and each partition being enough representative to yield a good peak selection. Moreover, because some of the peaks appeared in just a subset of these samples (minority classes), the random sampling of the data performed could increase the chances of getting spurious peaks in the training set that would not represent the whole minority class. To avoid these problems, we pre-processed all the data together. Moreover, this study has been confined to a relatively small number of isolates. Ideally, a larger number of isolates would have allowed to refine the machine learning predictions. However, other studies attempted the analysis of antimicrobial resistance on S. aureus with MAL-DI-TOF and machine learning and similar sample size. For example, Tang 21 hVISA, and 38 VSSA isolates to develop their SVM based models. Usually, the larger the dataset the greater is the statistical power for pattern recognition. However, in our machine learning approach, we have used the Nested CV approach which is known to produce robust and unbiased performance estimates regardless of sample size [69] . The machine learning performance indicators associated with our models are high suggesting that models were sufficiently trained. In addition, we acknowledge, as a limitation of this study, that our data were collected from farms only in England and Wales. However, this should not pose a restriction on our method's ability to predict resistance or susceptibility in other farms across the globe. If it is given a sufficiently diverse distribution of data to train the supervised learning algorithms, this would reduce any geographical bias that could affect predictive capability. This study should be considered a proof-of-principle where we conducted a feasibility work to invest on with larger samples and geographical areas. Finally, the downside of requiring larger sample sizes is limitations in data availability, often requiring reliance on public databases and thus compromise on the type of available data and possible studies. Unfortunately, in omics and other technology-based data collection analysis, very often only small samples are available, this is because of limited in vivo experiments, protocols, involvement of human participants and costs. For example, whilst not being able to rely on large amounts of data, we had the unprecedented possibility to demonstrate that our methodology is associated with high classification accuracy even when using small sample size, this applicability may facilitate research scenarios where only limited data is available. In addition to the machine learning analyses undertaken to prove the effectiveness of our method to differentiate susceptibility/resistance profiles rather than strain differences, we also compared the MALDI-TOF spectral peaks spectral peaks (4305.59Da, 4807.21Da, 6422. [70] analysed 59 diverse S. aureus isolates from 6 different lineages using MALDI-TOF mass spectrometry. Based on their results over a gel view representation and a hierarchical cluster analysis, the authors indicated that, with a few exceptions, CC-specific biomarkers for S. aureus are an exception rather than a rule. The authors found 3 regions that could be considered biomarkers for some lineages: m/z 3875 and 3891 (CC5); m/z 6552 and 6592 (CC8); m/z 5002 and 5032 (CC22). Therefore, none of the peaks used in our study were considered biomarkers by Lasch et al. [70] . The results found by Lasch et al. [70] clearly suggests that typing S. aureus can be rather unsuccessful due to a lack of stable biomarkers to distinct clonal groups, a low classification accuracy based on different CC types and a cluster analysis that indicate the limited possibilities to differentiate S. aureus below species levels. Further comparisons were also made with existing literature coupling MALDI-TOF mass spectrometry with a refined analysis framework to accurate classify resistant and susceptible S. aureus strains. In particular, the peaks (m/z 4305. 59 [37, 40] no similar peaks were detected under the experimental conditions chosen here. In particular, our peaks often mapped in the higher and non-overlapping mass range of the spectrum. Whereas, when we compared our peaks with those found by Asakura et al. [41] to differentiate VISA, hVISA, and VSSA clinical isolates, we found that one peak (m/z 4306) was in common between the two studies. This peak is among 23 other peaks that were found to be statistically significant among VISA, hVISA and VSSA (p < 10 −4 , Kruskal-Wallis test). This peak corresponds to the ribosomal protein RpmJ. Indicating that ribosomal proteins can be correlated with resistance phenotypes. This was also reported by Josten et al. [44] when analysing the peak pattern of 401 MRSA and MSSA strains (see above). Although we have not typed our strains, which we acknowledge as a limitation of our study, we believe that it is not unreasonable to assume that we have classified the resistance/susceptibility phenotype and not the strains. Our and Lasch et al. [70] , it is very unlikely that we could separate all the different strains circulating in just two groups and importantly with such high performance indicators. From a machine learning point of view, given the limited number of observations, relative high number of possible strains, binary outcome, number of genetic/molecular traits different among the strains it would not had been possible to separate the different strains in just two groups especially with such high-performance scores. This is also in agreement with Lasch et al. [70] that although performing an elegant modular/hierarchical ANN analysis of spectra from the S. aureus data set (we only did a one-step machine learning classification), apart from a fairly good classification accuracy for CC8 strains of S. aureus and, to a lesser extent for strains of CC5 (80%) and CC30 (78%), the classification accuracy for the other strains was unacceptably low. Despite intensive efforts aiming at improving these outcomes, neither variations of the spectral pre-processing nor of the network topology resulted in better classification results according to the authors. Overall, we demonstrated that the combination of supervised machine learning and MAL-DI-TOF mass spectrometry can be used to develop an effective computational diagnostic solution that can discriminate between benzylpenicillin/multidrug-resistant and susceptible S. aureus strains. Our solution could save time and money with respect to traditional susceptibility testing which is not viable for day-to-day monitoring of antibiotic resistance. Our solution could support farmers with timely, accurate and targeted treatment selection. This study received an ethical review and approval from the Clinical Ethical Review panel at the School of Veterinary Medicine and Science, University of Nottingham (approval Reference number: 2067 170717). All data is owned by QMMS ltd. Bovine mastitis-causing S. aureus isolates were tested on VITEK 2 AST-GP79 using one Antibiotic Susceptibility Testing (AST) card per isolate. Each card was filled with at least one positive control well with no antibiotic and multiple wells with increasing concentrations of antibiotics. We tested susceptibility to the following antibiotics: benzylpenicillin, cefoxitin, oxacillin, cefalotin, ceftiofur, cefquinome, amikacin, gentamicin, kanamycin, neomycin, enrofloxacin, clindamycin, erythromycin, tilmicosin, tylosin, tetracycline, florfenicol and trimethoprim/sulfamethoxazole. Using the VITEK 2 we measured the growth and viability of the isolates in all wells compared to the control wells. Relative bacterial growth in each antibiotic well was calculated and compared with the positive control wells. The minimum inhibitory concentration (MIC) values were calculated by comparing the growth of the bacteria to the growth of isolates with known MICs. The S. aureus isolates were labelled as either resistant or susceptible according to their antibiotic resistance profiles based on CLSI breakpoints (VET01-S3) [71] . All S. aureus isolates were stored at -80˚C since their recovery in 2004/5 using a microbead preservation system (Technical Service Consultants Ltd, Lancashire). Isolates were recovered onto Blood agar and incubated at 37˚C for 24 hours. If no growth was initially observed the isolates were sub-cultured another 24 hours. All isolated were sub-cultured on blood agar at 37˚C for 24 hours prior to MALDI-TOF analysis. The same storage and growth conditions were applied to all isolates. The pure cultures were then analysed using the Time-of-flight (TOF) MALDI mass spectrometer (Bruker Daltonics, Billerica, MA), Microflex-Flex Control Version 3.4, Bruker Daltonics. The order of sample analysis was randomised, the Bruker Bacterial Test Standard (BTS) (Bruker Daltonics) was used for calibration control on every plate. For each isolate, six technical replicates were generated from 240 desorption's per replicate (6 x 40 shots), and protein mass spectra acquired in the range 2000 to 20,000 Da were generated. Spectra were compared visually using Biotyper 3.1 (Bruker Daltonics) to remove low intensity spectra or spectra with substantial background noise. All the samples used in this study were further analysed visually on Matlab for insufficient resolution (defined as a measure to distinguish two peaks of slightly different m/z values [72] ), low intensity or substantial background. However, no samples were discarded for these reasons. The. Technical replicates were further compared using composite correlation indices (CCI) to remove dissimilar spectra with CCI < 0.99 [73] . At least three good quality spectra per isolate were required for inclusion of the isolate in the analysis. Moreover, when three qualifying technical replicates could not be obtained the sample was re-analysed in order to get at least 3 replicates. All the 82 isolates used in this study had three good quality technical replicates. The pre-processing steps of MALDI-TOF mass spectra were performed using MATLAB Bioinformatics Toolbox Release 2017b, The MathWorks, Inc., Natick, Massachusetts, United States. Our analysis was done using 82 S. aureus isolates with each sample having 3 to 6 replicates. The pre-processing followed these 8 steps: 1. Mean Computing: the replicates of each biological isolate were averaged. 2. M/Z Cropping: the mass range was cropped to be between 2kDa and 12kDa. 3. Resampling: the data was up-sampled from 13,740 to 20,000 points. 4. Baseline Correction: for each biological isolate, baseline correction was applied by using a window of 200 Da with a step size of 200 Da to shift the window. The quantile method (10% value) was used to find the likely baseline value in every window. Shape-preserving piecewise cubic interpolation approximation was applied to regress the varying baseline. The regressed baseline was not smoothed. The resulting baseline was subtracted from the spectrum. 5. Normalisation: the area under the curve (AUC) of every spectrum was normalised to the median and post-rescaled such that the maximum intensity was 100. 6. Noise reduction: each sample was denoised using least-squares polynomial with a window of 35 Da and a 2-degree polynomial function. 7. Alignment: to align the spectrograms, a set of reference peaks was required. Specifically, the peaks were selected if present in at least 30% of all spectra. The 30% threshold was chosen following the workflow suggested in the ClinProTools software documentation [74] . In addition, the first pre-processing step of our workflow consists of averaging all the 3 or more technical replicates of each sample. Therefore, after this averaging step we have one spectrum per sample and consequently the 30% threshold used to select the peaks is applied to all samples. By applying the 30% threshold we are selecting only the peaks that are present and hence relevant across both the resistant and susceptible classes, as shown in Tables 1 and 2 in the Results section. The alignment was estimated using the default values of msalign function (Bioinformatics Toolbox). To retain a reasonable intensity a signal-to-noise ratio threshold was defined at 10% to discard all peaks below it. Therefore, since the spectra were previously normalised to an overall maximum intensity of 100, any point below 10 is considered noise. A minimum distance of 20Da between neighbouring peaks was set, i.e., two peaks must be at least 20Da apart to be considered different. After detecting all the peaks in each spectrum, a peak list report was prepared similarly to Clin-ProTools 3.0 [74] . Specifically, the peaks were selected if present in at least 30% of all spectra. The selected peaks were further pre-processed to have zero mean and unit variance. Such peaks represented the spectral features used in the classification analysis. The performance of the classifiers, naïve Bayes [75] , linear and non-linear (RBF kernel) support vector machines (SVM) [76] , decision tree [77] , random forests [78] , multi-layer perceptron neural networks (MLP) [79] , AdaBoost (AdaBoost-SAMME version [80] ), logistic regression [81] , linear discriminant analysis (LDA) [82] and quadratic discriminant analysis (QDA) [82] , was investigated using the scikit-learn library in Python [83] . For the classifiers, the following set of values were employed for the hyper-parameter searches: • Naive Bayes, LDA and QDA do not have hyper-parameters. The prediction performance of each classifier was evaluated by considering the following indicators, assuming P and N as the total number of positive (benzylpenicillin/multidrug-resistant) and negative (multidrug susceptible) isolates, respectively and using T for true (correct) and F for false (wrong) predictions: • Nested Cross-validation (NCV) [84] , which is a well-established cross-validation technique was employed to assess the performance and select the hyper-parameters of the proposed classifiers. In NCV there is an outer loop split of the data set into test and training sets. For each training set, a grid search (inner loop) is run, in order to find the best hyper-parameters of the classifier using accuracy as a performance metric. Then, the test set is used to score the best classifier found in the inner loop. These scores tell us how well the classifier model generalises, given the best hyper-parameters found in the inner loop. Thirty iterations were carried out, wherein each iteration an NCV was employed. The inner loop of the NCV finds the best hyper-parameters of each classifier (when suited) using a stratified 3-fold cross-validation; the outer loop measures the accuracy, sensitivity, specificity and kappa using a 5-fold stratified cross-validation, in order to compare all the classifiers [85] . A dedicated bioinformatics pipeline was developed to find correspondences between individual peaks selected by the machine learning-based classifiers and actual proteins of S. aureus. First, amino acid sequences of the proteins in the S. aureus Newbould 305 (ATCC 29740) proteome, which is considered the model bovine mastitis strain [86] , were retrieved from the PATRIC database in FASTA format. The molecular weights of the proteins were calculated using the Compute pI/Mw tool on ExPASy [87] . The proteins were filtered in the range of ± 200Da of the mass of individual peaks. Then, N-terminal methionine cleavage was predicted using the online prediction tool TermiNator [88] and the theoretical molecular weights of the proteins were re-calculated using compute pI/Mw tool according to presence or absence of the initial methionine. Finally, proteins with a maximum of 0.2% difference in mass to the individual peaks for the successful identification of correspondence were selected. To further investigate the function of the identified proteins, we studied protein-protein interactions (PPI) as previously described [89] . The PPI dataset of S. aureus (strain NCTC 8325/PS 47) was obtained from the STRING database [90] and nodes (proteins) with interaction scores lower than medium confidence level (interaction scores <0.400) were filtered out. The remaining nodes (proteins) were analysed in Cytoscape 3.7.1 based on the following parameters: the average number of neighbours, clustering coefficient, network density and network heterogeneity [91] [92] [93] . The characterisation of antibiotic-resistant genes of the beta-lactam, macrolide and tetracycline antibiotic classes in the PPIs, were obtained from ResFinder v3.1 [94] and using them as queries in a comparative BLAST search against the S. aureus proteome. The functions of the genes in the network were annotated with Gene Ontology terms (biological process, molecular function and cellular component) and KEGG pathways. Finally, to gain a more in-depth understanding of the protein functions, homology and threading 3D models for discriminant proteins were built. 3D homology modelling was used for the proteins with good quality templates in the Swiss-Model repository [95] and the models built by using Swiss-PdbViewer [96] . The 3D models of hypothetical proteins were generated by using the threading technique on I-TASSER, where biological functions were predicted as well [97] . The 3D Models of all discriminant proteins were visualized and edited in UCSF Chimera [98] . Homologs of the discriminant proteins were checked in the NCBI database by position-specific iterative basic local alignment tool (PSI-BLAST). Functional domains were searched against the CDD v3.17-52910 PSSMs database. PSORTb v3.0 was used to predict cellular locations of the discriminant proteins [99] . Supporting information S1 Table. A) Supervised machine learning prediction of multidrug resistance spectral signature profiles using the Linear Discriminant Analysis (LDA) classifier. Prediction performance results using all the peaks (4807m/z, 6422m/z, 6891m/z and 9621m/z); only the non-ribosomal peaks (4807m/z, 6422m/z, 6891m/z and 9621m/z) and only the ribosomal peak (6422m/z). B) Supervised machine learning prediction of multidrug resistance spectral signature profiles using a non-linear (RBF kernel) support vector machine (RBF-SVM) classifier. Prediction performance results using all the peaks (4305m/z, 4807m/z, 6422m/z, 6891m/z and 9621m/z); only the non-ribosomal peaks (4807m/z, 6422m/z, 6891m/z and 9621m/z) and only the ribosomal peak (4305m/z and 6422m/z). Gene exchange drives the ecological success of a multi-host bacterial pathogen Pathogen-specific production losses in bovine mastitis Host-response patterns of intramammary infections in dairy cows Virulence factors involved in the pathogenesis of bovine intramammary infections due to Staphylococcus aureus Knowledge gaps and research priorities in Staphylococcus aureus mastitis control. Transboundary and emerging diseases Antimicrobial susceptibility of Staphylococcus aureus isolated from bovine mastitis in Argentina A survey of dairy cow farmers in the United Kingdom: knowledge, attitudes and practices surrounding antimicrobial use and resistance. Veterinary Record Development of Penicillin Resistance among Staphylococcus aureus Isolated from Bovine Mastitis in Denmark and Other Countries Methicillin resistance in staphylococci: molecular and biochemical basis and clinical implications Methicillin resistance in Staphylococcus aureus: mechanisms and modulation. Science progress The emergence of mecC methicillin-resistant Staphylococcus aureus Meticillin-resistant Staphylococcus aureus with a novel mecA homologue in human and bovine populations in the UK and Denmark: a descriptive study. The Lancet Infectious Diseases Antibiotic resistance in staphylococci Livestock-associated Staphylococcus aureus Livestock-associated methicillin-resistant Staphylococcus aureus from animals and animal products in the UK. Frontiers in microbiology Livestock-associated Staphylococcus aureus: the United States experience First detection of livestock-associated meticillin-resistant Staphylococcus aureus CC398 in bulk tank milk in the United Kingdom Livestock origin for a human pandemic clone of community-associated methicillin-resistant Staphylococcus aureus Tracking the evolutionary origin of the methicillin resistance gene: cloning and sequencing of a homologue of mecA from a methicillin susceptible strain of Staphylococcus sciuri. Microbial drug resistance Detection of staphylococcal cassette chromosome mec type XI carrying highly divergent mecA, mecI, mecR1, blaZ, and ccr genes in human clinical isolates of clonal complex 130 methicillin-resistant Staphylococcus aureus MRSA colonization and infection among persons with occupational livestock exposure in Europe: prevalence, preventive options and evidence. Veterinary microbiology Mechanisms of Methicillin Resistance in Staphylococcus aureus Genetic characterization of antimicrobial resistance in Staphylococcus aureus isolated from bovine mastitis cases in Northwest China. Journal of integrative agriculture Burden and antimicrobial resistance of S. aureus in dairy farms in Mekelle UK veterinary antibiotic resistance and sales surveillance report Genetics of antimicrobial resistance in Staphylococcus aureus Mechanisms of antibiotic resistance in Staphylococcus aureus Antibiotic resistance in Staphylococcus aureus. Current status and future prospects Complete genomes of two clinical Staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance Systematic Genetic Screens Reveal the Dynamic Global Functional Organization of the Bacterial Translation Machinery The nucleoid protein Dps binds genomic DNA of Escherichia coli in a non-random manner Evidence from artificial septal targeting and site-directed mutagenesis that residues in the extracytoplasmic beta domain of DivIB mediate its interaction with the divisomal transpeptidase PBP 2B Evolution of the Pseudomonas aeruginosa mutational resistome in an international Cystic Fibrosis clone Evidence of a thermal unfolding dimeric intermediate for the Escherichia coli histone-like HU proteins: thermodynamics and structure Comparative genomics of methicillin-resistant Staphylococcus aureus ST239: distinct geographical variants in Beijing and Hong Kong The distribution of mobile genetic elements (MGEs) in MRSA CC398 is associated with both host and country Ribosomal mutations promote the evolution of antibiotic resistance in a multidrug environment The intrinsic resistome of bacterial pathogens Selective Proteomic Analysis of Antibiotic-Tolerant Cellular Subpopulations in Pseudomonas aeruginosa Impact of the Staphylococcus epidermidis LytSR two-component regulatory system on murein hydrolase activity, pyruvate utilization and global transcriptional profile Antibiotic sensitivity profiles determined with an Escherichia coli gene knockout collection: generating an antibiotic bar code The Staphylococcus aureus LytSR two-component regulatory system affects biofilm formation Comprehensive identification of essential Staphylococcus aureus genes using Transposon-Mediated Differential Hybridisation (TMDH) Transcriptome and functional analysis of the eukaryotic-type serine/threonine kinase PknB in Staphylococcus aureus Comprehensive Identification of Mutations Responsible for Heterogeneous Vancomycin-Intermediate Staphylococcus aureus (hVISA)-to-VISA Conversion in Laboratory-Generated VISA Strains Derived from hVISA Clinical Strain Mu3 Detection of SARS-CoV-2 in nasal swabs using MALDI-MS Peak intensity prediction in MALDI-TOF mass spectrometry: A machine learning study to support quantitative proteomics Mass-Up: an all-in-one open software application for MALDI-TOF mass spectrometry knowledge discovery A simpler method of preprocessing MALDI-TOF MS data for differential biomarker analysis: stem cell and melanoma cancer studies Machine learning algorithm validation with a limited sample size Insufficient discriminatory power of MALDI-TOF mass spectrometry for typing of Enterococcus faecium and Staphylococcus aureus isolates Performance standards for antimicrobial disk and dilution susceptibility tests for bacteria isolated from animals 2-Mass Spectrometry Prediction of Streptococcus uberis clinical mastitis risk using Matrix-assisted laser desorption ionization time of flight mass spectrometry (MALDI-TOF MS) in dairy herds. Preventive veterinary medicine ClinProTools 3.0: User Manual. Bremen: Bruker Daltonik GmbH Data mining concepts and techniques third edition Machine learning. Support vector networks Classification and regression trees Random forests. Machine learning Fast learning in networks of locally-tuned processing units Multi-class adaboost The regression analysis of binary sequences Discriminant analysis and statistical pattern recognition Scikit-learn: Machine learning in Python On over-fitting in model selection and subsequent selection bias in performance evaluation Empirical evaluation of resampling procedures for optimising SVM hyperparameters. The Journal of Machine Learning Research Genome Sequence of Staphylococcus aureus Newbould 305, a Strain Associated with Mild Bovine Mastitis The proteomics protocols handbook The proteomics of N-terminal methionine cleavage Discrimination of contagious and environmental strains of Streptococcus uberis in dairy herds by means of mass spectrometry and machine-learning STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets Cytoscape: a software environment for integrated models of biomolecular interaction networks Hierarchical organization of modularity in metabolic networks Understanding network concepts in modules Identification of acquired antimicrobial resistance genes SWISS-MODEL: homology modelling of protein structures and complexes Automated comparative protein structure modeling with SWISS-MODEL and Swiss-PdbViewer: A historical perspective I-TASSER server: new development for protein structure and function predictions UCSF Chimera-a visualization system for exploratory research and analysis PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes