Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'. The Distant Reader harvested & cached your content into a collection/corpus. It then applied sets of natural language processing and text mining against the collection. The results of this process were reduced to a database file -- a 'study carrel'. The study carrel can then be queried, thus bringing to light specific characteristics of your collection. These characteristics can help you summarize the collection as well as enumerate things you might want to investigate more closely.

Eric Lease Morgan
May 27, 2019

Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
50

Average length of all items measured in words; "More or less, how big is each item?"
------------------------------------------------------------------------------------
7632

Average readability score of all items (0 = difficult; 100 = easy)
------------------------------------------------------------------
52

Top 50 statistically significant keywords; "What is my collection about?"
-------------------------------------------------------------------------
29 January
9 international
8 International
6 SARS
6 RNA
6 Figure
5 cell
5 Fig
3 Biology
2 variant
2 gene
2 figure
2 dna
2 disease
2 datum
2 blast
2 Supp
2 GWAS
2 ACE2
1 𝐷27
1 𝐷23
1 vntr
1 type
1 tumor
1 trajectory
1 time
1 table
1 subclonal
1 study
1 stat1
1 sequence
1 prescription
1 population
1 phage
1 network
1 multiplet
1 mri
1 module
1 model
1 integration
1 information
1 il-6
1 http://paperpile.com/b/m1KDL1/qAA80
1 host
1 genome
1 egfr
1 editing
1 drug
1 curvature
1 cluster

Top 50 lemmatized nouns; "What is discussed?"
--------------------------------------------- 2730 cell 2223 preprint 1752 gene 1632 datum 1470 author 1453 version 1359 review 1350 copyright 1339 funder 1338 holder 1333 peer 1250 preprintthis 1138 sequence 995 % 977 model 976 type 896 license 877 perpetuity 859 analysis 858 sample 801 number 732 method 674 dataset 641 value 621 expression 602 cancer 549 score 534 figure 532 line 517 time 517 fig 500 p 489 genome 486 annotation 470 right 467 study 465 result 450 permission 449 reuse 444 protein 438 site 436 level 435 licenseavailable 428 seq 423 set 402 motif 377 variant 370 specie 368 information 364 network Top 50 proper nouns; "What are the names of persons or places?" -------------------------------------------------------------- 1350 January 1282 al 927 et 915 � 710 ⋅ 703 J. 665 M. 636 NC 557 RNA 511 S. 478 ND 472 International 448 Figure 419 − 412 IL-27 410 SARS 410 A. 409 C 394 R. 364 D. 360 C. 333 CoV-2 330 S 298 M 290 Fig 280 L. 279 P. 279 E. 268 H. 260 K. 252 T. 251 B. 238 Supplementary 238 G. 220 A 218 Methods 215 p 210 . 204 Y. 198 J 197 B 195 Cell 194 HLA 192 BY 188 CCN 182 F. 181 N. 177 W. 172 Cancer 168 Data Top 50 personal pronouns nouns; "To whom are things referred?" ------------------------------------------------------------- 2564 we 1464 it 358 i 187 they 114 them 62 us 28 itself 19 https://doi.org/10.1101/2021.01.06.425569 17 one 15 adroit 12 swcam 11 he 11 em 6 themselves 6 https://doi.org/10.1101/2021.01.07.425716 6 bl 5 http://paperpile.com/b/h8ctd0/cq1b 4 ng 3 you 3 yj 3 mine 3 il-27ra 2 λ 2 us- 2 ua 2 u 2 she 2 s 2 ourselves 2 n 2 matchdrugwithdisease 2 m 2 https://doi.org/10.1101/2021.01.08.425918 2 https://doi.org/10.1016/j.cell.2011.02.013 1 𝜆𝜃 1 𝜆 1 𝒗𝑖 1 𝑢- 1 𝑘1 1 𝑖𝑗 1 𝑓 1 𝑉- 1 σ 1 x 1 wcα−cα 1 tpmrss4 1 tmprss11a 1 ti 1 t 1 rseqtu Top 50 lemmatized verbs; "What do things do?" 
--------------------------------------------- 13226 be 2179 use 2124 have 1337 post 1337 certify 928 display 890 grant 868 biorxiv 845 show 699 base 696 make 604 allow 468 identify 450 find 449 reserve 415 include 374 compare 358 � 327 do 319 see 317 contain 313 represent 312 provide 312 predict 304 associate 287 generate 282 obtain 270 apply 265 give 252 indicate 241 select 240 read 237 calculate 236 follow 231 define 225 perform 222 compute 218 detect 217 know 217 consider 207 estimate 199 observe 199 increase 186 bind 184 result 177 reveal 175 correspond 174 measure 174 derive 170 set Top 50 lemmatized adjectives and adverbs; "How are things described?" --------------------------------------------------------------------- 2077 not 720 - 717 available 705 high 620 also 550 different 539 more 469 single 463 other 446 international 437 human 373 low 372 only 366 same 359 then 359 specific 349 well 338 first 336 such 300 non 290 most 283 large 254 however 232 new 224 average 221 multiple 205 similar 202 further 198 scalar 198 respectively 190 significant 174 variant 174 small 173 relative 173 negative 170 biological 166 positive 166 genomic 157 functional 156 genome 151 common 151 as 150 thus 150 therefore 150 here 143 clinical 142 significantly 142 regulatory 141 novel 140 top Top 50 lemmatized superlative adjectives; "How are things described to the extreme?" ------------------------------------------------------------------------- 112 high 94 most 50 least 47 good 41 low 33 large 29 close 20 near 16 Most 14 strong 11 late 10 transcriptome 8 long 8 great 7 small 7 simple 7 bad 5 slow 5 dense 4 short 3 early 2 ​t​-t 2 fast 1 weak 1 topmost 1 steep 1 spac 1 furth 1 fine 1 few 1 editosome 1 c(.05 1 broad 1 big 1 SH3BGRL3 1 COVID-19 Top 50 lemmatized superlative adverbs; "How do things do to the extreme?" 
------------------------------------------------------------------------ 196 most 78 least 23 well 3 transcriptome 3 highest 1 topmost 1 pstat1/3 1 lowest 1 biosphere Top 50 Internet domains; "What Webbed places are alluded to in this corpus?" ---------------------------------------------------------------------------- 2835 doi.org 1725 paperpile.com 879 creativecommons.org 92 www.codecogs.com 73 github.com 59 dx.doi.org 28 covid19risk.ai 17 gitlab.com 17 f1000.com 16 console.cloud.google.com 12 www.biorxiv.org 12 support.10xgenomics.com 11 www.ncbi.nlm.nih.gov 11 closedloop.ai 9 pubmed.ncbi.nlm.nih.gov 8 qxmd.com 6 www.sciencedirect.com 6 figshare.com 5 stm.sciencemag.org 5 score.depmap.sanger.ac.uk 4 tianyulu.shinyapps.io 4 imb-dev.gitlab.io 4 cran.r-project.org 4 clincancerres.aacrjournals.org 4 academia.nferx.com 3 virological.org 3 raw.githubusercontent.com 3 egg2.wustl.edu 3 depmap.org 3 cprdcw.cprd.com 3 biorxiv.org 3 bigd.big.ac.cn 3 ai.googleblog.com 3 cran.r-project.org 2 zfin.org 2 www.nature.com 2 www.krisp.org.za 2 www.jstor.org 2 www.jstatsoft.org 2 www.internationalgenome.org 2 www.gsea-msigdb.org 2 www.foastat.org 2 www.ebi.ac.uk 2 www.cancer.gov 2 www.cahanlab.org 2 www.biomedcentral.com 2 www.10xgenomics.com 2 ssbd.qbic.riken.jp 2 scikit-learn.org 2 rc.hms.harvard.edu Top 50 URLs; "What is hyperlinked from this corpus?" 
---------------------------------------------------- 434 http://creativecommons.org/licenses/by-nc-nd/4.0/ 211 http://creativecommons.org/licenses/by/4.0/ 189 http://creativecommons.org/licenses/by-nc/4.0/ 106 http://paperpile.com/b/m1KDL1/qAA80 92 http://paperpile.com/b/m1KDL1/gtT4 84 http://paperpile.com/b/m1KDL1/Jg0A 74 http://paperpile.com/b/m1KDL1/lMI7H 74 http://doi.org/10.1101/2021.01.08.425379doi: 74 http://doi.org/10.1101/2021.01.08.425379 70 http://paperpile.com/b/m1KDL1/bIJVC 67 http://paperpile.com/b/m1KDL1/VeCw 64 http://paperpile.com/b/m1KDL1/8aHWG 60 http://doi.org/10.1101/2020.12.14.422697doi: 60 http://doi.org/10.1101/2020.12.14.422697 57 http://paperpile.com/b/m1KDL1/UDuS 56 http://paperpile.com/b/m1KDL1/wfRst 56 http://doi.org/10.1101/2021.01.08.425885doi: 56 http://doi.org/10.1101/2021.01.08.425885 55 http://doi.org/10.1101/2020.03.27.012757doi: 55 http://doi.org/10.1101/2020.03.27.012757 52 http://paperpile.com/b/m1KDL1/sJ8WE 52 http://paperpile.com/b/m1KDL1/6Xp3Y 50 http://doi.org/10.1101/2021.01.06.425560 49 http://paperpile.com/b/m1KDL1/ISaG 48 http://doi.org/10.1101/436634doi: 48 http://doi.org/10.1101/436634 48 http://doi.org/10.1101/2021.01.06.425560doi: 46 http://doi.org/10.1101/2021.01.08.425897doi: 46 http://doi.org/10.1101/2021.01.08.425897 45 http://doi.org/10.1101/2021.01.07.425716doi: 45 http://doi.org/10.1101/2021.01.07.425716 45 http://doi.org/10.1101/2021.01.04.425285doi: 45 http://doi.org/10.1101/2021.01.04.425285 44 http://creativecommons.org/licenses/by-nd/4.0/ 43 http://doi.org/10.1101/2021.01.02.425006doi: 43 http://doi.org/10.1101/2021.01.02.425006 39 http://doi.org/10.1101/2021.01.07.425637doi: 39 http://doi.org/10.1101/2021.01.07.425637 37 http://doi.org/10.1101/2021.01.04.425335doi: 37 http://doi.org/10.1101/2021.01.04.425335 34 http://doi.org/10.1101/2021.01.06.425544doi: 34 http://doi.org/10.1101/2021.01.06.425544 33 http://doi.org/10.1101/2020.10.26.351783doi: 33 http://doi.org/10.1101/2020.10.26.351783 32 
http://paperpile.com/b/m1KDL1/VjIm 32 http://doi.org/10.1101/2021.01.08.425918doi: 32 http://doi.org/10.1101/2021.01.08.425918 32 http://doi.org/10.1101/2021.01.04.425250doi: 32 http://doi.org/10.1101/2021.01.04.425250 32 http://doi.org/10.1101/2020.08.13.249839doi: Top 50 email addresses; "Who are you gonna call?" ------------------------------------------------- Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?" ------------------------------------------------------------------------------- 1337 version posted january 395 � � � 6 � � − 5 � � = 5 � � | 4 cells were then 4 data is available 4 sequence did not 4 � � δ 3 data are available 3 data are then 3 data is often 3 data was normalized 3 data were also 3 gene � � 3 genes were then 3 sample has ccn 3 samples were then 3 � � ∈ 2 analysis identify pan 2 analysis is available 2 analysis was then 2 cells are not 2 cells show high 2 cells using nanobodies 2 cells were randomly 2 data are lc 2 data is not 2 data was also 2 data were normalized 2 genes are conditionally 2 models are available 2 sample is not 2 samples are available 2 sequences are available 2 type were randomly 2 � � c(t 2 � � coex 2 � � min 2 � � ⁄ 1 % predicted purity 1 % were putative 1 % � � 1 analyses gave very 1 analyses using ded- 1 analyses using diverse 1 analysis are available 1 analysis associated several 1 analysis displays cyclical 1 analysis does not Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?" --------------------------------------------------------------------------------------- 2 data is not available 1 cells were not well 1 data are not more 1 data showed no clear 1 data showed no significant 1 funders had no role 1 genes does not sufficiently 1 genes were not de 1 methods are not biased 1 sample has no ccn 1 sample is not available 1 sample is not pure 1 sequences do not always 1 sequences have not yet Sizes of items; "Measures in words, how big is each item?" 
---------------------------------------------------------- 15548 10_1101-2021_01_06_425560 14993 10_1101-436634 13578 10_1101-2020_01_29_925354 12153 10_1101-2021_01_07_425637 11555 10_1101-2021_01_06_425569 10020 10_1101-2020_08_28_271981 9715 10_1101-2021_01_07_425794 9658 10_1101-2021_01_04_425250 9346 10_1101-332965 7878 10_1101-2020_09_09_289074 7333 10_1101-2021_01_04_425315 7280 10_1101-2021_01_08_425855 6912 10_1101-2021_01_06_425550 6886 10_1101-2021_01_08_425887 6841 10_1101-2021_01_07_425801 6528 10_1101-2020_04_17_043323 5391 10_1101-2021_01_07_425697 5110 10_1101-2021_01_05_425384 4921 10_1101-2020_12_26_424429 3682 10_1101-2021_01_08_426008 72 10_1101-2021_01_08_425976 72 10_1101-2021_01_06_425494 72 10_1101-2021_01_06_425546 10_1101-2021_01_08_425897 10_1101-2020_03_27_012757 10_1101-2021_01_08_425918 10_1101-2021_01_08_425952 10_1101-2021_01_08_425967 10_1101-2021_01_08_425379 10_1101-2021_01_06_425581 10_1101-2021_01_07_425782 10_1101-2021_01_07_425773 10_1101-2021_01_08_425885 10_1101-2020_10_26_351783 10_1101-2021_01_06_425544 10_1101-2021_01_05_425266 10_1101-2021_01_05_425414 10_1101-2021_01_07_425716 10_1101-2020_11_13_381475 10_1101-2020_05_22_110247 10_1101-2021_01_02_425006 10_1101-2021_01_05_425409 10_1101-2021_01_05_425417 10_1101-2021_01_05_425508 10_1101-2021_01_04_425335 10_1101-2020_12_14_422697 10_1101-2020_08_13_249839 10_1101-2021_01_04_425285 10_1101-2020_12_24_424332 10_1101-2021_01_04_425288 Readability of items; "How difficult is each item to read?" 
----------------------------------------------------------- 72.0 10_1101-332965 72.0 10_1101-2020_08_28_271981 71.0 10_1101-2021_01_07_425801 67.0 10_1101-2021_01_06_425560 67.0 10_1101-2020_09_09_289074 66.0 10_1101-2021_01_07_425794 66.0 10_1101-2021_01_04_425315 63.0 10_1101-2021_01_08_426008 61.0 10_1101-2020_12_26_424429 61.0 10_1101-2020_01_29_925354 59.0 10_1101-2021_01_06_425569 59.0 10_1101-2021_01_07_425697 59.0 10_1101-2021_01_04_425250 59.0 10_1101-436634 58.0 10_1101-2021_01_07_425637 55.0 10_1101-2021_01_06_425550 54.0 10_1101-2020_04_17_043323 53.0 10_1101-2021_01_08_425887 53.0 10_1101-2021_01_08_425855 15.0 10_1101-2021_01_08_425976 15.0 10_1101-2021_01_06_425494 15.0 10_1101-2021_01_06_425546 -15.0 10_1101-2021_01_05_425384 10_1101-2021_01_08_425897 10_1101-2020_03_27_012757 10_1101-2021_01_08_425918 10_1101-2021_01_08_425952 10_1101-2021_01_08_425967 10_1101-2021_01_08_425379 10_1101-2021_01_06_425581 10_1101-2021_01_07_425782 10_1101-2021_01_07_425773 10_1101-2021_01_08_425885 10_1101-2020_10_26_351783 10_1101-2021_01_06_425544 10_1101-2021_01_05_425266 10_1101-2021_01_05_425414 10_1101-2021_01_07_425716 10_1101-2020_11_13_381475 10_1101-2020_05_22_110247 10_1101-2021_01_02_425006 10_1101-2021_01_05_425409 10_1101-2021_01_05_425417 10_1101-2021_01_05_425508 10_1101-2021_01_04_425335 10_1101-2020_12_14_422697 10_1101-2020_08_13_249839 10_1101-2021_01_04_425285 10_1101-2020_12_24_424332 10_1101-2021_01_04_425288 Item summaries; "In a narrative form, how can each item be abstracted?" ----------------------------------------------------------------------- 10_1101-2020_01_29_925354 As the state-of-the art approach for the openview detection of pathogens is genome sequencing (5, 6), it learning (17) to predict host range for a small set of three wellstudied species directly from viral sequences. predicting whether a new virus can potentially infect humans. 
boundary separates human viruses from other DNA sequences generated the reads from the genomes of human-infecting constituting reads yields a prediction for the whole sequence. predictions from all the reads originating from a given genome In the Fig. 1 we present example filters, visualized as "maxcontrib" sequence logos based on mean partial Shapley values prediction directly from next-generation sequencing reads Three receptor-binding domains (RBDs) are colored in blue, white and red according to the predicted infectious potential of the corresponding genomic sequence. Interpretable detection of novel human viruses from genome sequencing data 10_1101-2020_03_27_012757 10_1101-2020_04_17_043323 linus: Conveniently explore, share, and present large-scale biological trajectory data from a web browser In biology, we are often confronted with information-rich, large-scale trajectory data, but exploring and communicating We provide a python script that reads trajectory data and enriches them with additional features, such as edge bundling or custom axes and generates an interactive web-based visualisation that can be shared offline from diffusion MRI imaging (Liu et al., 2020), or tracking data such as cell trajectories or animal trails (Romero-Ferrero et visualisation tool linus, making it easier to explore 3D trajectory data from any device without a local installation of Creating a visualisation package with linus is done in a few simple steps (Fig. 1a): The user imports trajectory data from a Figure 1 Browser-based exploration and sharing of trajectory visualizations with linus. 10_1101-2020_05_22_110247 10_1101-2020_08_13_249839 10_1101-2020_08_28_271981 Full-length de novo protein structure determination from cryo-EM maps using deep learning structure types were predicted by a second DenseNet.
Finally, the protein sequence was aligned to the main-chain according to the predicted Cα probabilities, amino acid types, and secondary structure amino acid type, and secondary structure type for each main-chain point, the target protein sequence The second network (i.e. DenseNet B) is used to predict the amino acid type and secondary structure type of a main-chain local dense point (LDP). Figure 3 shows a comparison of the predicted Cα models for the protein chains of different lengths The authors acknowledge professor Daisuke Kihara and his students Genki Terashi and Sai Raghavendra Maddhuri Venkata Subramaniya from Purdue University for providing their datasets. A New Protocol for Atomic-Level Protein Structure Modeling and Refinement Using Low-to-Medium Resolution Cryo-EM Density Maps. Figure 8: Protein models reconstructed by DeepMM and Phenix for the Chain A of 6DW1 10_1101-2020_09_09_289074 Structural Genetics of circulating variants affecting the SARS-CoV-2 Spike / human ACE2 complex SARS-CoV-2, COVID-19, mutations, Spike, ACE2 SARS-CoV-2 entry in human cells is mediated by the interaction between the viral Spike protein and protein variants in the SARS-CoV-2 population as the result of mutations, and it is unclear if these SARS-CoV-2 (the COVID-19 virus) and human cells, through the analysis of Spike/ACE2 complexes. future mutations targeting the ACE2/Spike binding and detected by sequencing SARS-CoV-2 on a We obtained structural models of the SARS-CoV-2 Spike interacting with the human ACE2 from three contributing to the interaction between Spike and ACE2, according to GBPM (see Table 1 and Fig 3 for A less frequent mutation amongst those predicted to contribute to the ACE2/Spike interaction is population and non-zero GBPM average score in the ACE2/Spike interaction models. ACE2 variants with non-zero GBPM score in the Spike interaction model. 
10_1101-2020_10_26_351783 10_1101-2020_11_13_381475 10_1101-2020_12_14_422697 10_1101-2020_12_24_424332 10_1101-2020_12_26_424429 SARS-CoV-2 primers in use today by measuring the number of mismatches between primer sequence and genome targets with respect to the sequenced SARS-CoV-2 genomes, we can measure how the targeted proteins are mutating. primer sequences and protocols developed for six different regions – USA, Germany, China, Hong Kong, Japan, and Thailand – percent of genomes hit by each PCR test, labelled by the country and target gene region. Figure 6 shows the average number of mismatches over time, grouped by the genomes sampled The results of this study also demonstrate that each primer target develops a different number of mismatches over time The mutations that lead to mismatches between gene PCR primers and their targets reflect the sequence evolution of the consistent with the observed increasing number of mismatches over time, and shows that evolution of SARS-CoV-2 genomes is of the RdRp gene, primer target creates a disproportionate number of mismatches when compared to genomes sequenced within 10_1101-2021_01_02_425006 10_1101-2021_01_04_425250 A read count-based method to detect multiplets and their cellular origins from snATAC-seq data Similar to other droplet-based single cell assays, single nucleus ATAC-seq (snATAC-seq) data harbor multiplets found that when snATAC-seq samples were adequately sequenced (e.g., >20k valid read pairs per cell), ATAC- ATAC-DoubletDetector detected heterotypic multiplets introduced in PBMC samples with high recall detected multiplets were homotypic (76.7-84.3% in islets, 63-78.7% in PBMCs), with cell types being distributed with respect to their cell proportions for both homotypic and heterotypic multiplet types (Fig. 5d-e, Extended Data ATAC-DoubletDetector for identifying multiplets from snATAC-seq data with enough reads per nuclei, it can also Fig. 2: ATAC-DoubletDetector identifies heterotypic and homotypic multiplets in human PBMC snATAC-seq data. e, The number of cells and percentage of multiplets detected by ATAC-DoubletDetector in PBMC and islet samples. Extended Data Fig. 6: ATAC-DoubletDetector detects both homotypic and heterotypic multiplets at high read depth.
10_1101-2021_01_04_425285 10_1101-2021_01_04_425288 10_1101-2021_01_04_425315 tool enables statistically-principled subtype-level downstream analyses, such as detecting subtype-specific differentially expressed genes (sDEG) and differential dependency networks (DDN) nuclear-norm regularized low-rank matrix factorization problem (Wang, Hoffman et al. regularization to optimize the estimation of between-sample variations in each subtype to recover sample-specific deconvolution and optimization solver used in swCAM algorithm, followed by Sample-specific deconvolution problem formulation and the assumption of hidden low-rank pattern CAM-estimated subtype-specific expression matrix serves as the initial reference 𝑺. The objective function of swCAM for sample-specific deconvolution problem and its reformulation As swCAM focuses on subtype-specific variation estimation, simulating biological variance The observations for 300 genes in 50 samples were simulated with subtype-specific expression Gene co-expressed function modules detected by WGCNA on swCAM estimated sample-specific expression for each subtype with λ=5 and δ=1 or 0.1. capacity of swCAM to estimate sample-specific signals in each subtype using simulations where 10_1101-2021_01_04_425335 10_1101-2021_01_05_425266 10_1101-2021_01_05_425384 of COVID-19 patients, expediting the models' transition from research to clinical practice. The open source website https://covid19risk.ai/ currently incorporates nine models from six an inclusive platform for predictive models related to COVID-19.
supplement their judgment with patient-specific predictions from externally-validated models Keywords: Covid-19, predictive models, diagnosis, prognosis, nomogram, machine We, as researchers working on COVID-19 models, saw an urgent need for a web-based Our aim for this platform is to include validated prediction models (TRIPOD type 2b and 3) published AI prediction models related to all aspects of COVID-19, including diagnosis, COVID-19 predictive models will serve as a decision aid for doctors. Decision Support System for Severity Risk Prediction and Triage of COVID-19 Patients Covid19Risk.ai: An open source repository and online calculator of prediction models for early diagnosis and prognosis of Covid-19 10_1101-2021_01_05_425409 10_1101-2021_01_05_425414 10_1101-2021_01_05_425417 10_1101-2021_01_05_425508 10_1101-2021_01_06_425494 bioRxiv.org the preprint server for Biology Skip to main content Home Submit ALERTS / RSS Search for this keyword Advanced Search Subject Areas All Articles Animal Behavior and Cognition Biochemistry Bioengineering Bioinformatics Biophysics Cancer Biology Cell Biology Clinical Trials Developmental Biology Ecology Epidemiology Evolutionary Biology Genetics Genomics Immunology Microbiology Molecular Biology Neuroscience Paleontology Pathology Pharmacology and Toxicology Physiology Plant Biology Scientific Communication and Education Synthetic Biology Systems Biology Zoology View by Month 10_1101-2021_01_06_425544 10_1101-2021_01_06_425546
10_1101-2021_01_06_425550 We further evaluated the performance of the models using two whole-exome sequencing (WES) datasets from a recently released set of genome and exome data [23] (Figure 2). Among the population-resolved false-positive errors, more than two third (71.0%) are uncommon (allele frequency ≤ 5%) among the 1000Genomes samples, whereas there are only 11.4% uncommon variants for population-induced false positives. This observation supports the hypothesis that the population-aware model uses allele frequency to adjust its variant calls. A potential concern for population-aware variant calling models is increasing false negative rate for novel alleles. To better understand the zero-frequency variants, we called variants using the DeepVariant PacBio model with the PrecisionFDA v2 35x HG003 reads set sequenced with the We evaluate potential biases introduced by population information in variant calling by comparing population-aware models that use allele frequencies from different Despite greater overall accuracy, we note that the population-aware model underperforms on variants with zero allele frequencies in 1000Genomes. 10_1101-2021_01_06_425560 Review and performance evaluation of trait-based between-community dissimilarity measures 2. In this paper we reviewed the trait-based dissimilarity indices available in the dissimilarities calculated by different indices correlate with environmental distances. beta diversity, dissimilarity index, distance metric, community ecology, functional traits including several families of trait-based dissimilarity indices.
FDissim indices incorporate trait information into the calculation of dissimilarity in different Indices following this approach represent each community with a typical trait value, and 2005) or trait-based dissimilarity of species (Lepš of the similarity indices for presence/absence data disregarding species properties, while the ordinariness values in the species-based (dis-)similarity indices. Ricotta & Pavoine (2015) introduced a new family of trait-based similarity measures called For species-based analyses, Ricotta & Podani (2017) suggested a general formula of distance compared how strongly the dissimilarity indices correlate with the environmental distance 10_1101-2021_01_06_425569 to yeast and mouse data, we identify a half-dozen novel metabolites, including thiamine and taurine Peak annotation occurs in a single global optimization step, based on linear programming, connected nodes matches the atom mass difference and (ii) only co-eluting peaks are connected by edges receive a positive score for MS2 spectra similarity match between the connected nodes, and With a score assigned for each potential node and edge annotation, we formulate the global network A final edge annotation score S(𝑢, 𝑣, 𝑎, 𝑏, 𝐷) for choosing candidate formula 𝑎 for node u, A global network optimization approach for untargeted metabolomics data annotation NetID applies global optimization for metabolomics data annotation and metabolite A global network optimization approach for untargeted metabolomics data annotation (NetID). 10_1101-2021_01_06_425581 10_1101-2021_01_07_425637 characterize the CpGs on the mammalian methylation array with various genomic annotations.
Array probes are sequences of length 50bp flanking a target CpG based on the human reference We added probes targeting 1986 CpGs to the mammalian methylation array based on All 37488 CpGs profiled on the mammalian methylation array apply to humans, but only a CpGs on the mammalian array cover 6871 human and 5659 mouse genes when each DNA methylation samples for three species: human (n=10 arrays), mouse (n=20), and rat (n=15), synthetic DNA data from 3 species: human (n=10 mammalian arrays), mouse (n=20), and rat CpG and gene coverage of probes on the mammalian methylation array across CpG island and chromatin state analysis of mammalian methylation probes. probes targeting the same CpG that can also be found on the human EPIC array that were not mammalian methylation array to the human (hg19) and mouse (mm10) genome using QUASR 10_1101-2021_01_07_425697 Capsule network for protein ubiquitination site prediction Capsule network for protein ubiquitination site years, some calculation methods have been developed to predict potential ubiquitination sites. this paper, a deep learning model, "Caps-Ubi," is proposed that uses a capsule network for protein network layer are used as a feature extractor to obtain the functional domains in the protein Data of protein ubiquitination sites the amino acid sequence around the protein ubiquitination site; namely, one-of-K encoding and the performance of various window sizes in one-of-21 and amino acid continuous encoding modes. modes is the best on the capsule network: this proposed Caps-Ubi model achieved an accuracy, In this paper, a new deep learning model for predicting protein ubiquitination sites is proposed, ubiquitination sites in proteins. 
Large-scale prediction of protein ubiquitination sites machine learning method with substrate motifs to predict ubiquitin-conjugation site on 10_1101-2021_01_07_425716 10_1101-2021_01_07_425773 10_1101-2021_01_07_425782 10_1101-2021_01_07_425794 Despite the importance of gene annotations in RNA-seq data analysis, very little research has been conducted to examine how differences in annotations impact on gene compared the effect of human genome annotations from popular databases including Ensembl, GENCODE and RefSeq on various aspects of RNA-seq analysis and they showed gene-level expression quantification in an RNA-seq data analysis pipeline. The Ensembl, RefSeq-NCBI and RefSeq-Rsubread annotations were provided to featureCounts to generate read counts for genes Gene expression data generated using TaqMan RT-PCR and Illumina's BeadChip microarray were used to validate the gene-level quantification results from the RNA-seq The Ensembl and NCBI RefSeq annotations are among the most widely used gene annotations that have been utilized for RNA-seq gene expression quantification in the field. led to a better concordance in gene expression between the RNA-seq data and the RT-PCR data, compared to the use of Ensembl and RefSeq-NCBI annotations. 10_1101-2021_01_07_425801 proteases, may cleave the furin site of SARS-CoV-2 S protein and  subunits of epithelial sodium channels ( TMPRSS2, and ACE2 were significantly upregulated in severe COVID-19 patients and SARS-CoV-2 infected Plasmin cleaves the furin site in SARS-CoV S protein (Kam et al. lung epithelial cells and whether SARS-CoV-2 infection alters their expression at the single-cell level. severe/moderate COVID-19 patients and SARS-CoV-2 infected cell lines, mainly owing to ciliated cells.
The expression levels of proteases (PLAU, FURIN, TMPRSS2, PLG), ACE2, and SCNN1G in 11 cell Expression levels of PLAU, SCNN1G, and ACE2 in SARS-CoV-2 infection epithelial cell lines infected with SARS-CoV-2: A549, Calu-3, and NHBE (Blanco-Melo et al. CoV-2 infection also increased the expression level of ACE2 in A549 cells (P < 0.05) (Smith et al. Our data showed that the respiratory cells co-express SARS-CoV-2 receptor, ENaC Changes of proteases, ACE2, and SCNN1G in respiratory cell lines after SARS-CoV-2 10_1101-2021_01_08_425379 10_1101-2021_01_08_425855 DeepHBV: A deep learning model to predict hepatitis B virus (HBV) integration sites. deep learning model DeepHBV to predict HBV integration sites by learning local learning model DeepHBV to predict HBV integration sites by learning local genomic DeepHBV effectively predicts HBV integration sites by adding genomic features. mixed HBV integration sequences, positive genome feature samples, and randomly peaks and DeepHBV with HBV integration sequences + TCGA Pan Cancer peaks) on model trained with HBV integrated sequences + TCGA Pan Cancer showed an performed better compared with DeepHBV model with HBV integration sequences + HBV integration sites + TCGA Pan Cancer, a cluster of attention weights much output of DeepHBV with HBV integration sites plus TCGA Pan Cancer showed the of DeepHBV with HBV integration sequences + TCGA Pan Cancer showed strong DeepHBV with HBV integration sequences + TCGA Pan Cancer model on (a) 10_1101-2021_01_08_425885 10_1101-2021_01_08_425887 the same structured model, so that these can be used as input to rule-based or deep learning algorithms for data extraction. example, at this point in this article the main headers are 'abstract' followed by 'introduction' and 'materials and methods' that could make up a digraph.
We use this process to evaluate new potential synonyms for existing terms and identify abstract → introduction → materials → results → discussion → conclusion → acknowledgements → footnotes section → references. Based on the digraph, we then assigned data and data description to be synonyms of the materials section, and participants From the analysis of ego-networks four new potential categories were identified: disclosure, graphical abstract, highlights and participants. Newly identified synonyms for existing IAO terms (00006xx) from the digraph mapping of 2,441 publications.
10_1101-2021_01_08_425897
10_1101-2021_01_08_425918
10_1101-2021_01_08_425952
10_1101-2021_01_08_425967
10_1101-2021_01_08_425976
10_1101-2021_01_08_426008
AncestralClust: Clustering of Divergent Nucleotide Sequences by Ancestral Sequence Reconstruction using Phylogenetic Trees Despite the exponential increase in the size of sequence databases of homologous genes, few methods exist to cluster At low identities, these methods produce uneven clusters where the majority of sequences are are no clustering methods that can accurately cluster large taxonomically divergent metabarcoding reference databases such as databases (Schoch et al., 2020), there is a need for new computationally efficient methods that can
cluster divergent sequences. To cluster divergent sequences, we developed AncestralClust clustering methods: UCLUST (Edgar, 2010), meshclust2 (James dataset against UCLUST because it is the most widely used clustering program and it performs better than CD-HIT on low identity We developed a phylogenetic-based clustering method, AncestralClust, specifically to cluster divergent metabarcode sequences. Comparisons of clustering methods using 13,043 COI sequences from 11 different species.
10_1101-332965
protease well by making favorable interactions with important residues of the enzyme. Keywords: Vinyl sulfone inhibitors, Cryptopain-1, Cysteine protease, Molecular residues that were contacted by ligand subgroups across the enzymatic cleft, in one or The ligand subgroup-contacting residues in each complex had been mutated to Alanine; favorably interacting subsite residues (derived from Supplementary Table 1) in the Interactions: enzyme subsite residues ligand subgroups with the ligand ring systems showed highly positive ddGbind values for thiophen group in always, showed favorable interactions even when no ligand group was placed near it. with no ligand group placed near the residue, the interactions were unfavorable. around the ligand subgroups of the best-scored vinyl sulfones compounds (PubChem IDs Figure 1: Illustration of the typical binding of vinyl sulfone inhibitors to cysteine protease enzymes.
Figure 3: All the residues that are contacted by one or more ligands in the docked complexes of
10_1101-436634
RegTools: Integrated analysis of genomic and transcriptomic data for the discovery of splicing variants in cancer somatic variants from genomic data with splice junctions from transcriptomic data to identify isoforms, we annotated them with the Variant Effect Predictor (VEP), SpliceAI, and Genotype-Tissue Expression (GTEx) junction counts and compared our results to other tools that integrate tools, the unbiased nature of RegTools has allowed us to identify novel splice variants and identify potential cis-acting splice-relevant variants in tumors (www.regtools.org). To demonstrate the utility of RegTools in identifying potential splice-relevant variants from tumor transcriptome as described above and its associated variants based on splice junction region For our analysis, we annotated the pairs of associated variants and junctions identified by Pan-cancer analysis of 35 tumor types identifies somatic variants that alter canonical We also identify recurrent splice-altering variants in genes not known to be cancer genes RegTools contains three sub-modules: "variants", "junctions", and "cis-splice-effects".