Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'. The Distant Reader harvested & cached your content into a collection/corpus. It then applied sets of natural language processing and text mining against the collection. The results of this process were reduced to a database file -- a 'study carrel'. The study carrel can then be queried, thus bringing to light specific characteristics of your collection. These characteristics can help you summarize the collection as well as enumerate things you might want to investigate more closely.

This report is a terse narrative report, and when processing is complete you will be linked to a more complete narrative report.

Eric Lease Morgan

Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
50

Average length of all items measured in words; "More or less, how big is each item?"
------------------------------------------------------------------------------------
9329

Average readability score of all items (0 = difficult; 100 = easy)
------------------------------------------------------------------
61

Top 50 statistically significant keywords; "What is my collection about?"
-------------------------------------------------------------------------
19 February 15 International 8 RNA 7 cell 5 gene 3 mutation 3 Supplementary 3 Seq 3 Fig 3 Data 3 Alzheimer 2 set 2 sequence 2 motif 2 method 2 international 2 figure 2 covid-19 2 SARS 2 PCA 2 GWAS 2 Figure 2 Cancer 1 δaf 1 target 1 subject 1 strain 1 single 1 siamese 1 seq 1 rrna 1 rate 1 protein 1 preprint 1 polar 1 phase 1 phage 1 nitrogen 1 model 1 mer 1 llps 1 lineage 1 kinase 1 inhibitor 1 image 1 high 1 glm+egs 1 genome 1 fry 1 file

Top 50 lemmatized nouns; "What is discussed?"
--------------------------------------------- 2327 gene 2265 cell 2114 preprint 1821 datum 1194 version 1191 method 1191 author 1169 sequence 1101 copyright 1083 review 1057 model 1028 funder 1027 holder 1023 peer 968 dataset 944 license 936 analysis 935 set 868 value 857 % 818 preprintthis 723 perpetuity 675 number 666 sample 651 result 651 licenseavailable 643 expression 607 type 582 mutation 542 cancer 515 figure 494 structure 472 seq 457 size 453 study 429 file 428 approach 421 motif 414 feature 411 drug 410 time 386 genome 385 site 376 p 375 effect 362 protein 361 network 360 information 355 ad 344 level Top 50 proper nouns; "What are the names of persons or places?" -------------------------------------------------------------- 1477 al 1039 February 1013 et 776 International 758 J. 715 M. 611 . 603 NC 488 S. 487 RNA 439 D. 427 C. 421 A. 417 ND 371 R. 334 L. 330 P. 320 C 294 B. 291 E. 286 G. 283 Figure 272 Fig 267 T. 259 M 256 K. 252 S 252 A 247 BY 242 Supplementary 234 H. 233 k 229 µM 228 J 220 N. 194 http://creativecommons.org/licenses/by-nc/4.0/ 183 N 176 K 176 F. 175 R 173 Data 173 Alzheimer 164 Seq 163 Y. 160 Li 158 Cancer 157 T 155 SPARC 155 Raloxifene 154 W. Top 50 personal pronouns nouns; "To whom are things referred?" ------------------------------------------------------------- 2679 we 1639 it 243 they 164 i 129 them 57 us 37 one 28 itself 13 you 9 themselves 6 https://doi.org/10.1101/2020.01.28.923532 6 he 5 𝒙 4 ​sample​ 4 u 4 s 3 ∆̂′ 3 m′ 2 𝑙𝑎 2 λ 2 ourselves 2 ours 2 n 2 il- 2 http://paperpile.com/b/5tes3g/x5omi 1 𝜟 1 𝑒𝑖 1 𝑆∗of 1 ’s 1 τ2 1 α 1 ʻʻuniprotdom_postmodenzʼ 1 y∗ 1 yij 1 yes 1 when398 1 uw 1 us- 1 ub 1 to136 1 theirs 1 the355 1 pepos 1 mj 1 kb 1 influ- 1 in- 1 https://www.10xgenomics.com/resources/publications/ 1 https://paperpile.com/c/rqvmzs/bhgv 1 https://paperpile.com/c/5tes3g/x5omi Top 50 lemmatized verbs; "What do things do?" 
--------------------------------------------- 12666 be 2222 use 2129 have 1148 make 1030 post 1027 certify 905 display 867 grant 802 base 726 show 702 biorxiv 559 https://www.zotero.org/google-docs/?hsltkm 492 include 478 provide 457 identify 451 select 414 do 414 associate 408 compare 348 � 334 perform 330 follow 328 find 326 generate 320 allow 309 represent 301 give 300 obtain 284 contain 273 set 258 apply 246 predict 246 estimate 225 see 222 require 222 describe 220 define 211 calculate 190 compute 189 take 189 develop 189 consider 182 bind 179 achieve 176 test 176 cluster 172 target 170 propose 170 know 170 exist Top 50 lemmatized adjectives and adverbs; "How are things described?" --------------------------------------------------------------------- 2100 not 681 - 650 single 637 different 608 also 574 more 570 high 556 available 544 other 455 only 428 well 377 first 370 such 367 specific 335 non 335 large 330 low 327 then 326 most 325 same 320 genome 292 human 274 small 264 however 261 e.g. 257 new 242 multiple 240 similar 221 good 207 many 201 top 197 thus 185 random 183 clinical 175 as 169 individual 168 significant 168 average 167 linear 163 functional 161 genomic 159 here 157 original 157 biological 155 further 153 wide 149 international 148 polar 147 less 142 deep Top 50 lemmatized superlative adjectives; "How are things described to the extreme?" ------------------------------------------------------------------------- 182 good 137 most 69 least 28 high 24 Most 18 near 18 large 17 small 14 close 13 short 12 late 10 transcriptome 10 low 8 bad 6 great 5 fast 4 manif 4 deep 3 simple 2 young 2 old 2 long 1 ​k​-near 1 would71 1 wide 1 was337 1 the98 1 the439 1 ter 1 strong 1 sparse 1 slow 1 pbmc_10k_v3 1 http://kinametrix.com/ 1 fine 1 e 1 dense 1 broad 1 Least 1 CDC2L6 Top 50 lemmatized superlative adverbs; "How do things do to the extreme?" 
------------------------------------------------------------------------ 189 most 60 least 30 well 9 transcriptome 5 highest Top 50 Internet domains; "What Webbed places are alluded to in this corpus?" ---------------------------------------------------------------------------- 2261 doi.org 871 creativecommons.org 701 paperpile.com 675 www.zotero.org 141 dx.doi.org 135 github.com 131 www.protocols.io 91 docs.google.com 23 www.ncbi.nlm.nih.gov 20 www.codecogs.com 8 scicrunch.org 8 arxiv.org 7 bigd.big.ac.cn 6 satijalab.org 6 preprocessed-connectomes-project.org 6 osf.io 6 gitlab.com 5 www.nlm.nih.gov 4 www.synapse.org 4 reframedb.org 4 rdcu.be 4 pypi.org 4 gatk.broadinstitute.org 4 dunbrack.fccc.edu 4 colororacle.org 4 bioinf.itmat.upenn.edu 4 biapss.chem.iastate.edu 3 www.wolfram.com 3 www.ukbiobank.ac.uk 3 www.nature.com 3 www.cell.com 3 www.r-project.org 3 plants.ensembl.org 3 ftp.tue.mpg.de 3 ftp.ncbi.nlm.nih.gov 3 ecrlife420999811.wordpress.com 3 commonfund.nih.gov 3 combine-lab.github.io 3 cf.10xgenomics.com 3 bitbucket.org 3 adni.bitbucket.io 2 www.uniprot.org 2 www.springer.com 2 www.qt.io 2 www.proteinatlas.org 2 www.ontobee.org 2 www.internationalgenome.org 2 www.highcharts.com 2 www.helsinki.fi 2 www.gurobi.com Top 50 URLs; "What is hyperlinked from this corpus?" 
---------------------------------------------------- 571 http://www.zotero.org/google-docs/?HsLTKM 360 http://creativecommons.org/licenses/by-nc-nd/4.0/ 244 http://creativecommons.org/licenses/by-nc/4.0/ 205 http://creativecommons.org/licenses/by/4.0/ 127 http://www.protocols.io/view/corchea-paper-based-microfluidic-device-vtwe6pe 62 http://creativecommons.org/licenses/by-nd/4.0/ 47 http://doi.org/10.1101/2021.02.09.430536doi: 47 http://doi.org/10.1101/2021.02.09.430536 46 http://doi.org/10.1101/2020.01.28.923532doi: 46 http://doi.org/10.1101/2020.01.28.923532 41 http://doi.org/10.1101/2021.02.10.430512doi: 41 http://doi.org/10.1101/2021.02.10.430512 39 http://doi.org/10.1101/2021.02.09.430460doi: 39 http://doi.org/10.1101/2021.02.09.430460 37 http://doi.org/10.1101/698605doi: 37 http://doi.org/10.1101/698605 37 http://doi.org/10.1101/2021.02.09.430550doi: 37 http://doi.org/10.1101/2021.02.09.430550 36 http://doi.org/10.1101/2021.02.13.429885doi: 36 http://doi.org/10.1101/2021.02.13.429885 36 http://doi.org/10.1101/2020.10.08.327718doi: 36 http://doi.org/10.1101/2020.10.08.327718 32 http://doi.org/10.1101/2020.09.21.305516doi: 32 http://doi.org/10.1101/2020.09.21.305516 32 http://doi.org/10.1101/2020.09.02.279521doi: 32 http://doi.org/10.1101/2020.09.02.279521 31 http://doi.org/10.1101/2021.02.10.430606doi: 31 http://doi.org/10.1101/2021.02.10.430606 31 http://doi.org/10.1101/2020.09.23.308239doi: 31 http://doi.org/10.1101/2020.09.23.308239 29 http://doi.org/10.1101/2021.02.08.430343doi: 29 http://doi.org/10.1101/2021.02.08.430343 28 http://doi.org/10.1101/727867doi: 28 http://doi.org/10.1101/727867 28 http://doi.org/10.1101/2021.02.11.430762doi: 28 http://doi.org/10.1101/2021.02.11.430762 27 http://doi.org/10.1101/2021.02.12.430989doi: 27 http://doi.org/10.1101/2021.02.12.430989 25 http://doi.org/10.1101/2021.02.11.430847doi: 25 http://doi.org/10.1101/2021.02.11.430847 24 http://doi.org/10.1101/2021.02.11.430871doi: 24 http://doi.org/10.1101/2021.02.11.430871 24 
http://doi.org/10.1101/2021.02.10.430705doi: 24 http://doi.org/10.1101/2021.02.10.430705 24 http://doi.org/10.1101/2021.02.01.429246doi: 24 http://doi.org/10.1101/2021.02.01.429246 23 http://doi.org/10.1101/2021.02.09.430405 23 http://doi.org/10.1101/2021.02.08.430280doi: 23 http://doi.org/10.1101/2021.02.08.430280 22 http://doi.org/10.1101/2021.02.08.430270doi: Top 50 email addresses; "Who are you gonna call?" ------------------------------------------------- 3 andrea.tangherloni@unibg.it 2 wu.bin.kmu@qq.com 2 tracey.weissgerber@charite.de 2 ooluwada@uccs.edu 2 michael.schroeder@tu-dresden.de 2 mararabra@yahoo.co.uk 2 l165018@lhr.nu.edu.pk 2 huang@southalabama.edu 2 ggrant@pennmedicine.upenn.edu 2 gcaravagna@units.it 2 eytan.ruppin@nih.gov 2 clement.abi-nader@inria.fr 2 borchert@southalabama.edu 2 allenem@pennmedicine.upenn.edu 2 alejandro.schaffer@nih.gov 1 zhang.jianbo@gmail.com 1 yaozhong@ims.u-tokyo.ac.jp 1 wjshen@stu.edu.cn 1 tyagin@udel.edu 1 tsofer@bwh.harvard.edu 1 tg76@st-andrews.ac.uk 1 smgomez@unc.edu 1 shtutmanm@sccp.sc.edu 1 sacha@labsquare.org 1 rob@cs.umd.edu 1 potoyan@iastate.edu 1 pl219@cam.ac.uk 1 pierre-luc.germain@hest.ethz.ch 1 noble@uw.edu 1 nicholas.youngblut@tuebingen.mpg.de 1 nawrocke@ncbi.nlm.nih.gov 1 mmartone@ucsd.edu 1 matta.krish@charterschool.org 1 martin.hofmann-apitius@scai.fraunhofer.de 1 marco.gallo@ucalgary.ca 1 maizie.zhou@vanderbilt.edu 1 lswang@pennmedicine.upenn.edu 1 kevin.da-silva@inria.fr 1 jxwang@mail.csu.edu.cn 1 jsybrandt@google.com 1 jsybran@clemson.edu 1 jli@stat.ucla.edu 1 isafro@udel.edu 1 gunnar.ratsch@ratschlab.org 1 gmarcais@cs.cmu.edu 1 f.ricciuti@campus.unimib.it 1 eg2912@columbia.edu 1 dilip_panthee@ncsu.edu 1 david.gibbs@isbscience.org 1 daniela.besozzi@unimib.it Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?" 
------------------------------------------------------------------------------- 1027 version posted february 16 value is better 11 gene set gs 8 gene set enrichment 6 methods do not 4 genes are more 4 genes were more 4 sequence file 4 sequences do not 3 cells using nanoliter 3 data are available 3 data using t 3 datasets are available 3 gene is highly 3 genes are not 3 genes were functionally 3 values are n 2 analysis using only 2 data are already 2 data is available 2 data is important 2 data is normalized 2 data set figure 2 data using data 2 data using empirical 2 data using hierarchical 2 data using online 2 data using umap 2 dataset does not 2 dataset was not 2 gene is differentially 2 gene set analyses 2 gene set level 2 gene set response 2 gene set signal 2 genes are likely 2 genes being differentially 2 genes have similar 2 genes were likely 2 method is also 2 method were then 2 methods are not 2 methods are similar 2 methods did not 2 methods show limited 2 methods were scanpy 2 model does not 2 model was then 2 models are available 2 sequences are likely Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?" 
--------------------------------------------------------------------------------------- 2 dataset does not currently 2 methods are not significantly 1 % have not yet 1 al are not directly 1 authors have no financial 1 data does not necessarily 1 data has no indels 1 data is not always 1 datasets are not equal 1 funders had no role 1 genes are not as 1 genes provide no explanation 1 methods did not correctly 1 methods do not directly 1 methods made no positive 1 model did not significantly 1 model was not able 1 models provide no pharmacological 1 sequences did not much341 1 sequences is not important 1 sequences were no more 1 sequences were not used.345 1 sets are not too 1 sets does not always 1 sets is not fundamentally 1 values are not necessarily 1 values are not reliable Sizes of items; "Measures in words, how big is each item?" ---------------------------------------------------------- 20656 10_1101-2021_02_09_430536 16705 10_1101-2020_01_28_923532 16496 10_1101-2021_02_11_430762 15659 10_1101-2021_02_09_430460 15440 10_1101-2021_02_01_429246 15281 10_1101-727867 13590 10_1101-2021_02_10_430705 13512 10_1101-2021_02_09_430550 13439 10_1101-2021_02_09_430363 12853 10_1101-698605 12824 10_1101-2020_10_08_327718 12363 10_1101-2021_02_08_430280 12164 10_1101-2020_09_02_279521 12013 10_1101-2021_02_10_430606 11859 10_1101-2021_02_10_430512 11335 10_1101-2021_02_08_430343 10901 10_1101-2021_02_10_430563 10624 10_1101-2021_02_12_430979 10584 10_1101-2021_02_13_429885 10376 10_1101-2020_09_21_305516 10071 10_1101-2021_02_11_430871 9518 10_1101-2021_02_12_430764 9478 10_1101-2021_02_10_430623 9408 10_1101-2021_02_11_430789 8797 10_1101-2020_09_23_308239 8418 10_1101-2021_02_10_430649 8205 10_1101-2020_11_17_386649 8181 10_1101-2021_02_12_430830 8136 10_1101-2021_02_12_430989 8121 10_1101-2020_02_04_934216 7973 10_1101-2021_02_11_430695 7913 10_1101-2021_02_12_430923 7909 10_1101-2021_02_08_428881 7516 10_1101-2021_02_12_430739 7219 10_1101-2021_02_08_430270 6849 
10_1101-2021_02_11_430847 6710 10_1101-2021_02_12_430963 6557 10_1101-2021_02_10_430656 6404 10_1101-2021_02_08_430275 5987 10_1101-2020_09_23_310276 5941 10_1101-2021_02_09_430405 5183 10_1101-2021_02_08_430070 4932 10_1101-2021_02_10_430619 4875 10_1101-2021_02_10_430367 4252 10_1101-2020_12_24_424317 3786 10_1101-2021_02_12_431018 3128 10_1101-2021_02_09_430036 2698 10_1101-2021_02_11_430806 2191 10_1101-2020_05_15_090266 1409 10_1101-2021_02_10_430604 Readability of items; "How difficult is each item to read?" ----------------------------------------------------------- 79.0 10_1101-2021_02_09_430536 72.0 10_1101-727867 71.0 10_1101-2021_02_10_430649 71.0 10_1101-2021_02_01_429246 70.0 10_1101-2021_02_12_430739 70.0 10_1101-2021_02_09_430363 70.0 10_1101-2021_02_09_430036 69.0 10_1101-2021_02_08_430270 68.0 10_1101-2021_02_12_430830 68.0 10_1101-2021_02_08_430275 67.0 10_1101-2020_09_21_305516 67.0 10_1101-2020_11_17_386649 66.0 10_1101-2020_01_28_923532 66.0 10_1101-2021_02_12_430979 66.0 10_1101-2021_02_10_430619 66.0 10_1101-2021_02_10_430367 66.0 10_1101-2021_02_08_428881 65.0 10_1101-2021_02_11_430762 64.0 10_1101-2021_02_12_430764 64.0 10_1101-2021_02_09_430550 64.0 10_1101-2021_02_10_430656 64.0 10_1101-2020_05_15_090266 64.0 10_1101-2021_02_09_430405 63.0 10_1101-2021_02_11_430871 63.0 10_1101-2021_02_10_430606 62.0 10_1101-2021_02_12_430963 62.0 10_1101-2021_02_12_430923 61.0 10_1101-2021_02_12_431018 60.0 10_1101-2021_02_11_430695 60.0 10_1101-2021_02_08_430070 59.0 10_1101-2020_02_04_934216 58.0 10_1101-2021_02_11_430789 58.0 10_1101-2021_02_10_430623 58.0 10_1101-2021_02_08_430343 58.0 10_1101-2021_02_08_430280 57.0 10_1101-2021_02_11_430847 57.0 10_1101-2020_09_23_308239 57.0 10_1101-2020_12_24_424317 56.0 10_1101-2020_10_08_327718 56.0 10_1101-2021_02_10_430512 56.0 10_1101-2021_02_10_430604 56.0 10_1101-2020_09_02_279521 55.0 10_1101-698605 53.0 10_1101-2021_02_11_430806 53.0 10_1101-2021_02_10_430705 52.0 10_1101-2020_09_23_310276 49.0 
10_1101-2021_02_13_429885 48.0 10_1101-2021_02_12_430989 48.0 10_1101-2021_02_10_430563 40.0 10_1101-2021_02_09_430460 Item summaries; "In a narrative form, how can each item be abstracted?" ----------------------------------------------------------------------- 10_1101-2020_01_28_923532 We focus our analysis on genes encoding protein targets that encode receptors on the cell all "modular", including one part that specifically targets the tumor cell via one gene/protein and MadHitter and each patient receives an optimal personalized combination of targeted therapies from a prespecified set (pill bottle). Cohort and Individual Target Set Sizes as Functions of Tumor Killing and Given the single-cell tumor data sets and the ILP optimization framework described above, we filtering as this threshold is decreased), decreases the size of the target cell surface receptor gene heterogeneity of the cancer, number of patients within the data set, size of target gene set, lack of used for filtering the gene set to avoid targeting non-cancerous tissues. the genes in the optimal target set, the expression of that gene in that non-tumor cell exceeds the set of genes which is known to be targetable to cell 𝐶. 10_1101-2020_02_04_934216 EMBER: Multi-label prediction of kinase-substrate phosphorylation events through deep learning task of kinase-motif phosphorylation prediction as a multi-label kinase or substrate, as well as protein scaffolds that facilitate structural orientation and downstream catalysis of the reaction, modify the efficacy of motif phosphorylation. prediction of phosphorylation events), a deep learning approach for predicting multi-label kinase-motif phosphorylation relationships. example, the TLK kinase family only has nine positive labels (verified TLK-motif interactions) and more than 10,000 resulting data set is comprised of 7302 phosphorylatable motifs and their reaction-associated kinase families (Table 1). 
The final output is a vector, k, of length eight, where each value corresponds to the probability that the motif a was phosphorylated by one of the kinase families indicated in We sought to illuminate the relationship between kinase-family dissimilarity and phosphorylated motif-group dissimilarity described results provide motivation to incorporate both motif dissimilarity and kinase relatedness into the predictive model, as of kinase-motif prediction compared to the single-label approaches. 10_1101-2020_05_15_090266 Summary: SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching SpacePHARER by searching a comprehensive spacer list against all complete phage genomes. methods compare individual CRISPR spacers with phage To increase sensitivity, (1) we compare protein coding sequences because phage genomes are mostly coding, and, (0) Preprocess input: scan the phage genome and CRISPR spacers in six ORFs q of CRISPR spacers extracted from one prokaryotic genome, and each target set T comprises the putative protein sequences t from a single phage. 
The performance of SpacePHARER was evaluated on the spacer test set against a target database predicted the correct host for more phages than BLASTN BLASTN in detecting phage-host pairs, due to searching 10_1101-2020_09_02_279521 Simulating the outcome of amyloid treatments in Alzheimer's disease from imaging and clinical data When applied to multimodal imaging and clinical data from the Alzheimer's Disease Neuroimaging Initiative our * Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database Keywords: Alzheimer's Disease; Clinical trials; Disease progression; Amyloid hypothesis; of large datasets of different data modalities, such as clinical scores, or brain imaging measures to model Alzheimer's disease progression based on specific assumptions on the biochemical combining traditional DPMs with dynamical models of Alzheimer's disease progression. In this work we present a novel computational model of Alzheimer's disease progression to multi-modal imaging and clinical data from the Alzheimer's Disease Neuroimaging To simulate the long-term progression of Alzheimer's disease we first project the AD subjects Figure 3 Model-based progression of Alzheimer's disease. clinical data, based on the estimation of latent biomarkers' relationships governing Alzheimer's 10_1101-2020_09_21_305516 Copy-scAT: Deconvoluting single-cell chromatin accessibility of genetic subclones in cancer uses single-cell epigenomic data to infer copy number variants (CNVs) that define cancer cells. We have tested the ability of Copy-scAT to use scATAC data to call CNVs with three different approaches genome sequencing (WGS) data for adult GBM (aGBM) surgical resections (n = 4 samples, 3,647 cells). 
adult GBM samples identified using both methods, versus total numbers of gains detected by scATAC or Number of chromosome-arm level gains detected in adult GBM samples identified using both methods, (c) Multiple myeloma samples were profiled by both scATAC and the single-cell CNV assay. chromosome-arm level gains detected in adult GBM samples identified using both methods, versus total CNVs are detected in scATAC clusters with Copy-scAT in pediatric GBM samples. 10_1101-2020_09_23_308239 The COVID-19 PHARMACOME: A method for the rational selection of drug repurposing COVID-19 PHARMACOME, a comprehensive drug-target-mechanism graph generated from a initial version of the COVID-19 PHARMACOME, a comprehensive drug-target-mechanism graph representing COVID-19 pathophysiology mechanisms that includes both drug targets Figure 3: Overlap of compound hits between different drug repurposing screening experiments. space overlap between different COVID-19 drug repurposing screenings. The COVID-19 PHARMACOME associates pathways derived from drug repurposing targets Figure 4 shows the distribution of repurposing drugs in the COVID-19 cause-and-effect graph, overlap analysis allows for the identification of repurposing drugs targeting mechanisms that Virus-response mechanisms are targets for repurposing drugs Figure 5: Visualization of drug repurposing candidates (and their targets) used in combination treatment as our own drug repurposing screening results, we were able to identify mechanisms targeted COVID-19 PHARMACOME, we are now able to link repurposing drugs, their targets and the SARS-CoV-2 protein interaction map reveals targets for drug repurposing. 
10_1101-2020_09_23_310276 The NIAGADS Alzheimer's Genomics Database (GenomicsDB) is an interactive knowledgebase for Alzheimer's disease (AD) genetics that provides access to GWAS summary statistics datasets The website makes available >70 genome-wide summary statistics datasets from GWAS and efficient real-time data analysis and variant or gene report generation. Gene reports provide summaries of co-located ADRD risk-associated variants and have pages linking summary statistics to variant and gene annotations, this resource makes these summary statistics available for browsing (on dataset, gene, and variant reports and as genome NIAGADS GenomicsDB variant reports and a track is available on the genome browser. The NIAGADS GenomicsDB includes allele frequency data from 1000 Genomes (phase 3, version visualizations for summarizing search results and annotations in gene and variant reports. compare NIAGADS GWAS summary statistics tracks to each other, against annotated gene or A detailed report is provided for each of the GWAS summary statistics and ADSP meta-analysis 10_1101-2020_10_08_327718 journals in three fields; plant sciences, cell biology and physiology (n=580 papers). figures were uncommon (physiology 16%, cell biology 12%, plant sciences 2%). among papers published in top journals in plant sciences, cell biology and physiology. contained images (plant science: 68%, cell biology: 72%, physiology: 55%). in physiology (49%) and cell biology (55%), and 28% of plant science papers provided and 29% of plant sciences papers contained no scale information on any image. 
Some publications use insets to show the same image at two different scales (cell Figure 1: Image types and reporting of scale information and insets physiology and plant science papers contained some images that were inaccessible to B: Most papers explain colors in image-based figures, however, explanations are less Figure 4: Using scale bars to annotate image size Creating clear and informative image-based figures for scientific publications. 10_1101-2020_11_17_386649 Experiments on 10,000 RNA-seq datasets show that RowDiff combined with MultiBRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most a binary matrix, where the k-mer set indexes the rows and each annotation label specifies a column. Starting from any vertex in the de Bruijn graph, Algorithm 1 defines a traversal leading to an anchor Each row in a RowDiff-transformed annotation matrix has the same or fewer set bits than A naïve implementation of the RowDiff construction would be to load the matrix A in memory, and gradually replace its rows with their sparsified counterpart, while traversing the graph. We now note that, when querying annotations for paths in the graph, or sets of rows corresponding to vertices We constructed annotated de Bruijn graphs from the RNA-Seq data set in the same We now compare the representation size for RowDiff and other state-of-the-art graph annotation compression methods. 10_1101-2020_12_24_424317 classification, feature extraction and relevant gene identification through deep learning methods for 12 This research picks up from detection of different types of cancer RNA-Seq expressions using deep neural classification of gene expression profiles for different kinds of cancers. 
Hence, the effectiveness of deep learning models for feature extraction and relevant gene identification is performed revealing substantial results and they produced five high-ranked gene sets and reduced feature This study was aimed at classifying 12 types of cancer and identifying relevant genes and the results show were able to identify cancer-relevant pathways and genes for the sets, that different experiments generated, A deep learning approach for cancer detection and relevant gene Tumor gene expression data classification via sample expansion-based deep learning. Identification of a multi-cancer gene expression Multi-class Cancer Classification and Biomarker Identification using Deep Learning 10_1101-2021_02_01_429246 minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets suggests, a UHS is a set of k-mers that "hits" every w-long window of every possible sequence (hence the the elements of the polar sets are in the sequence: the higher the energy, the more spread apart the k-mers have densities upper bounded by |U|/σ^k, because only k-mers from the universal hitting set can be selected. Section 2.2 gives a formal definition of the link energy of a polar set and Theorem 1 gives upper and lower bounds using this link energy for the density of a minimizer compatible with a polar set. form a link, which in turn is the number of k-mer pairs in the polar set that are exactly w bases away on S. 
A context is charged if the minimizer selects a different k-mer in the first window than in the second 10_1101-2021_02_08_428881 A common workflow in single-cell RNA-seq analysis is to project the data to a latent space, cluster the cells in that space, and identify sets of marker genes that explain the differences among the nonlinear embedding model which maps the gene expression to the low-dimensional representation where the groups A notable feature of ACE's approach is that, by identifying genes jointly, the method moves away from the notion Input: gene expression matrix Deep autoencoder learns low-dimensional representation Embedding clustering Clustering is neuralized and concatenated with the encoder Differentiation analysis by ACE Output: gene relevance ACE takes as input a single-cell gene expression matrix and learns a low-dimensional representation for each Next, a neuralized version of the k-means algorithm is applied to the learned representation to identify cell groups. input gene expression profile that lead the neuralized clustering model to alter the assignment from one group to the other. 10_1101-2021_02_08_430070 On the application of BERT models for nanopore methylation detection with deep learning models, have achieved significant performance improvements on nanopore methylation recurrent patterns of positional-signal-shift in the context window surrounding target 5-methylcytosine that the refined BERT model can achieve competitive or even better results than the state-of-the-art biRNN of datasets from the different research groups, BERT models demonstrate a good generalization Fig. 1: Basic BERT's and refined BERT's model structure used for methylation detection. 
a refined BERT model to take account of signal-shift patterns in the proposed refined BERT model achieves a competitive or even better result explore applying the BERT model for the nanopore methylation detection 2.2 Applying BERT models for nanopore methylation For the cross-sample evaluation, we train models on one dataset and test a BERT model to pay more attention to center positions. In-sample evaluation of different deep learning models on 5mC datasets. 10_1101-2021_02_08_430270 Scalable Bias-corrected Linkage Disequilibrium Estimation Under Genotype Uncertainty Keywords and phrases: attenuation bias, genotype likelihood, linkage disequilibrium, polyploidy, reliability ratio. Let XiA and XiB be the posterior means at loci A and B for individual Equations (5)–(7) take the naive estimators most researchers use in practice (the sample covariance/correlation of posterior means) and inflate these by a multiplicative effect. Gerard and Ferrão, 2019] to obtain the posterior moments for each individual's genotype at each SNP reliability ratios of most SNPs only increase their correlation estimates by less than 10%. To evaluate the LD estimates of high reliability ratio SNPs, we calculated the MLEs for ρ2 applied to simple linear regression with an additive effects model (where the SNP effect is proportional to the dosage), result in the standard ordinary least squares estimates when using the extreme reliability ratio of PotVar0080327, the genotype-error adjusted correlation estimate is -1. 10_1101-2021_02_08_430275 Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes identified using BSA-Seq, a technology in which next-generation sequencing (NGS) is applied to bulked segregant analysis (BSA). 
recently developed the significant structural variant method for BSA-Seq data analysis that exhibits higher detection power than standard to analyze BSA-Seq data in which genome sequences of one parent served as the reference sequences in genotype calling, and thus We analyzed a public BSA-Seq dataset using our modified method and the standard allele frequency and Gmethod allows the detection of such associations without sequencing the parental genomes, leading to further lower the the BSA-Seq data with the genome sequences of both the parents when the parental genome sequences are used to aid BSA-Seq data The allele frequency method: The ΔAF value of each SNP in BSA-Seq data analysis using the genome sequences of both the parents and the bulks. BSA-Seq data analysis using only the bulk genome sequences. 10_1101-2021_02_08_430280 given transcriptome provided as either a raw user-generated RNA-Seq dataset or NCBI SRR file identifier. SURFR identifies all ncRNA fragments (both annotated and novel) and their expressions in up to ten datasets per comprehensively compare all fragment expressions identified in up to 30 individual datasets by entering multiple SURFR session IDs window detailing each fragment identified in the individual, selected small RNA-Seq dataset. of the results page redirects the user to a SURFR window detailing the expressions of all full length sncRNAs in the provided datasets. Fragments" window (Figure 2D) for each fragment identified in the individual, selected small RNA-Seq dataset within its host gene along with the fragment's expression (RPM) in each individual small RNA-Seq dataset, and lncRNAs expressed in a given human transcriptome from either a user-provided RNA-Seq dataset or publicly More importantly, however, LAGOOn identified MALAT1 as the most highly expressed lncRNA in MDA-MB-231 breast cancer cells (Figure 9). 
10_1101-2021_02_08_430343 tumor microenvironment, the method identified ligands, receptors and cells meeting certain criteria of 9,234 samples in The Cancer Genome Atlas (TCGA), starting from a network of 64 cell types and 1,894 Data sources including TCGA and cell-sorted gene expression, bulk tumor expression, cell type scores, ligands and receptors for each of the 64 cell types in xCell, using the source gene expression data. With this procedure, a network scaffold is induced, where cells produce ligands that bind to receptors on (PFI) and tumor stage for each sample, a matrix of patient-specific edge weights was constructed number of high weight edges in each tumor type did not associate with the number of samples, as might in the tumor stage contrast, a majority of ligand-producing cells include GMP cells, Osteoblasts, MSC In the PFI results, Th1 cells appeared in 13 high scoring edges in SKCM, all with 10_1101-2021_02_09_430036 A comparative study of genomic adaptations to low nitrogen availability in Genlisea aurea is a carnivorous plant that grows on nitrogen-poor waterlogged sandstone aurea's genome, CDS and non-coding DNA 2) Determination of transcriptomic nitrogen content and codon usage bias associated with higher nitrogen content tRNAs (among codons that are coding for the same amino a considerably lower number of nitrogen atoms in its genome than the two other plant species. has higher nitrogen counts per molecular unit in genomic DNA, CDS, Non-Coding DNA, protein, aurea has a higher nitrogen usage in its DNA, RNA and proteins Figure 2: Average number of nitrogen atoms per molecular unit in genomic DNA, CDS, Non-Coding DNA, aurea had lower nitrogen content in tRNA sequences but not in other Figure 3: Bar graph representing the codon usage bias and tRNA nitrogen content in G.
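The Genlisea aurea extract above compares the average number of nitrogen atoms per molecular unit across sequence classes. As a rough illustration of that kind of count, the sketch below averages nitrogen atoms per nucleotide over a single strand, using the nitrogen content of each base's molecular formula (the sugar-phosphate backbone contains no nitrogen). The extract does not give the paper's exact definition of "molecular unit", so the single-strand convention, the function name, and the example sequences are all assumptions.

```python
# Nitrogen atoms per base, from the molecular formulas of the bases:
# adenine C5H5N5, guanine C5H5N5O, cytosine C4H5N3O,
# thymine C5H6N2O2, uracil C4H4N2O2.
NITROGEN_PER_BASE = {"A": 5, "G": 5, "C": 3, "T": 2, "U": 2}


def mean_nitrogen_per_nucleotide(seq):
    """Average nitrogen atoms per nucleotide of one strand.

    Characters other than A, C, G, T, U (for example N) are skipped.
    """
    counted = [NITROGEN_PER_BASE[b] for b in seq.upper() if b in NITROGEN_PER_BASE]
    if not counted:
        raise ValueError("sequence contains no recognised bases")
    return sum(counted) / len(counted)


# Hypothetical sequences for illustration only.
dna_mean = mean_nitrogen_per_nucleotide("ATGCGC")   # (5+2+5+3+5+3) / 6
rna_mean = mean_nitrogen_per_nucleotide("acgu")     # (5+3+5+2) / 4
```

Under this convention an A/T-rich strand averages fewer nitrogen atoms per nucleotide than a G/C-rich one, which is the sort of difference a comparison across genomes or between coding and non-coding DNA would pick up.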
10_1101-2021_02_09_430363 Accommodating site variation in neuroimaging data using hierarchical and Bayesian models The potential of normative modeling to make individualized predictions has led to structural neuroimaging results that go beyond the case-control approach. in a similar way for multi-site modeling in a pooled neuroimaging data set, which contained 7499 participants that org/abide/) data set to compare a non-linear, Gaussian version of the model, to a linear hierarchical Bayesian version and mathematical description of our approach to include site as predictor in a normative hierarchical Bayesian model. With the aim to create reliable normative models in multi-site neuroimaging data, we developed and compared two model is also able to capture non-linear effects between age and thickness of the cortical region ("Hierarchical Bayesian Gaussian Process term, which allows to model non-linear association between age and cortical thickness measures. The only models that perform better for most regions than the mean of the training data set are the Hierarchical Bayesian 10_1101-2021_02_09_430405 In-silico Structural and Molecular Docking-Based Drug Discovery Against Viral Protein (VP35) of Marburg Virus: A potent Agent of MAVD including structure-based drug-like compounds screening from online databases, molecular The final small molecules of drug-like compounds would have more effective and selected for the molecular docking with FGI-103 antiviral drug-using AutoDock 4.2 software. After that, FGI-103 was set and screen other drug-like compounds from PubChem databases. 
The finally selected drug-like compounds were docked with the P1 site of VP35 of based on a P1 site for ligand in every dock for VP35 MARV utilizing a grid chart of 50 × 50 × 50 The ADMET properties of finally selected drug-like compounds were checked to utilize 2D molecules structure of selected drug-like compounds (A) represents the 2D The molecule structure of three drug-like compounds is shown in Figure 6. "In-Silico Structural and Molecular Docking-Based Drug Discovery 10_1101-2021_02_09_430460 experimentally validated cancer mutation data in this study, we explored various string-based evolutionary features resulted in the development of a pan-cancer mutation effect prediction Distinguishing between driver and passenger mutations from sequenced cancer genomes is a Recent studies have identified specific signatures or patterns of mutations in different cancer than passenger mutations and built probabilistic models to identify driver genes that had this study, missense mutations from 58 genes that were pan-cancer-based were combined from We used the same datasets to judge our model's ability to predict rare driver mutations based Driver and Passenger Mutations' Features Used to Train NBDriver are Significantly Although our method's focus was to identify missense driver mutations from sequenced cancer surrounding driver and passenger mutations obtained from sequenced cancer genomes. computational prediction of driver missense mutations," Cancer Res., vol. functionally validated cancer-related missense mutations," Genome Biology, vol.
Figure 7: Differences in the distribution of features between driver and passenger mutations 10_1101-2021_02_09_430536 Genome-wide prediction and integrative functional characterization of Alzheimer''s disease-associated genes example, a module-trait network approach was proposed and applied to identify gene 63 functional enrichment-based approach to identify negative genes that are not likely 94 associated genes through an optimal selection of networks and machine learning 98 FGN, and prediction of AD-associated genes using machine learning models (Fig. 1). addition, we tested their enrichment in three AD-related gene sets associated with 122 The top-ranked genes are enriched in AD-associated functions and phenotypes 154 These results provide additional evidence that our predicted genes are associated with 194 The top-ranked genes are associated with AD based on miRNA-target networks 227 We investigated whether top-ranked genes were functionally related to AD-associated 229 We tested whether the top-ranked k genes were more likely to interact with AD-associated 576 related to AD-associated genes or miRNAs based on miRNA-target interaction networks. 10_1101-2021_02_09_430550 (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Therefore, for scRNA-seq data analysis, informative gene selection Besides scRNA-seq data analysis, informative gene selection is also crucial for designing number and a scRNA-seq dataset, scPNMF selects informative genes based on its weight matrix; First, the informative genes selected by scPNMF lead to the most accurate cell clustering. 
the informative genes and weight matrix of scPNMF lead to the best cell type prediction accuracy Figure 3: Benchmarking scPNMF against 11 informative gene selection methods on seven scRNA-seq (b) UMAP visualization of cells in the Zheng4 dataset based on 100 informative genes selected by We benchmark scPNMF against the 11 gene selection methods in terms of cell type prediction We propose scPNMF, an unsupervised gene selection and data projection method for scRNA-seq For cell type prediction, we project every targeted gene profiling dataset and its scRNA-seq 10_1101-2021_02_10_430367 Running title: Chen M et al / Genome Assembly Data Repository Genomics Data Center (NGDC), part of the China National Center for Bioinformation archive high-quality genome sequences and annotations, GWH is equipped with a Collectively, GWH serves as an important resource for genome-scale data https://bigd.big.ac.cn/) [13], the aim of GWH is to accept data submissions worldwide GWH is a centralized resource housing genome-scale data, with the purpose to GWH not only accepts genome assembly associated data through an on-line GWH will assign a unique accession number to the submitted genome assembly upon GWH provides data visualization for both genome Collectively, GWH is a user-friendly portal for genome data submission, release, and Database resources of the National Genomics Data Genome assembly accession number is prefixed with "GWH", followed by four 10_1101-2021_02_10_430512 into DDIs. In this study, a hierarchical machine learning model was created to predict DDI-associated ADRs and pharmacological insight thereof for any drug pair.
drugs'' chemical structures as inputs to predict their target, enzyme, and transporter (TET) Development of RFCs for Prediction of Target, Enzyme, and Transporter Profiles of Drugs Development of a Model for Prediction of DDI-associated ADRs from TET Profiles of Drugs ADR prediction from Target, Enzyme, and Transporter Profiles of Drug Pairs To predict ADRs of a drug pair from its TET profiles, Random Forest Classifier (RFC), Application of the SVM model for DDI-associated ADRs Involving Three Major Drugs through predicted PRR changes of drug pairs upon removal of each of the targets, enzymes, and changes of drug pairs were predicted by the model upon removal of each of the targets, enzymes, Target, enzyme, and transporter (TET) profiles of atorvastatin and concomitant drugs, 10_1101-2021_02_10_430563 investigators across the SPARC consortium that provide key details about organ-specific circuitry, including structural (BIDS), the SDS has been designed to capture the large variety of data generated by SPARC investigators who are description of the SPARC curation process and the automated tools for complying with the SDS, including the SDS validator and Software to Organize Data Automatically (SODA) for SPARC. required to organize their data files and metadata organized according to the SPARC Data Structure data according to the SPARC Dataset Structure. is the preferred file format for tabular data in SPARC, the Data files are organized into 3 different top-level folders, The organization structure of the files and folders for a SPARC dataset. 
https://github.com/SciCrunch/sparc-curation/releases/tag/dataset-template-1.2.3 investigators include folders that organize data along a from these subjects, data files are organized within fields, the curation team developed a SPARC Dataset files/folders, and share datasets with the SPARC 10_1101-2021_02_10_430604 Struo2: efficient metagenome profiling database construction for ever-expanding microbial genome datasets Mapping metagenome reads to reference databases is the standard approach for reference databases often lack recently generated genomic data such as method for constructing custom databases; however, the pipeline does not scale well with the not allow for efficient database updating as new data are generated. HUMAnN3 databases that can be easily updated with new genomes and/or individual gene Struo2 enables feasible database generation for continually increasing large-scale ● Pre-built databases: http://ftp.tue.mpg.de/ebio/projects/struo2/ ● Utility tools: https://github.com/nick-youngblut/gtdb_to_taxdump Metagenome profiling involves mapping reads to reference sequence databases and is computational resources, which led us to create Struo for straight-forward custom metagenome CPU hours per genome versus ~2.4 for Struo (Figure 1B). taxonomy (available at https://github.com/nick-youngblut/gtdb_to_taxdump). (2020) Struo: a pipeline for building custom databases for 10_1101-2021_02_10_430606 Each point is a decoupled motif generated by a sample set of sequences. Only the max activation value of the decoupled motifs in Fig. 3b are significantly higher than the decoupled motifs of other neurons in layer 3 of Basset-3 model. discovered (q-value < 0.001) from the neuron in convolutional output layer of Basset, BD-5 and BD-10 model.
c, The number of motif discovered (q-value < 0.01) from the neuron in layer 3 of Basset model using different sub-patterns in the input feature map of the max pooling layer to split the sequences set of which are DNA-sequence based DCNN models with 3 general convolutional layers for stacking sequences of different synonymous motifs with the maximum activation value In summary, we presented NeuronMotif as an effective algorithm to reveal the cisregulatory motif grammar learned by DCNN model that use DNA sequence to annotate sequences indicate more synonymous motif mixture in this DCNN model. 10_1101-2021_02_10_430619 Cutevariant: a GUI-based desktop application to explore genetics variations Cutevariant is a user-friendly GUI based desktop application for genomic research designed to search for variations in DNA samples collected in annotated files and encoded in the Variant Calling Format. application imports data into a local relational database wherefrom complex filter-queries can be built either Key words: genomics, DNA variant, desktop application, Domain Specific Language, Graphic User Interface applications import the data from VCF files into an indexed Cutevariant imports data from VCF files into a normalized Fig. 2: The Cutevariant main view showing the variants list sub-window (middle), different controllers sub-windows but not all are Just like Variant Tools, Cutevariant supports operations Features Cutevariant BrowseVCF VCF-Miner VCF-Explorer VCF-Server VCF-Filters GEMINI Variant Tools SnpSift Comparaison of time performance between cutevariant and VCF-miner for importation and query execution. 3. 
Pablo Cingolani, Adrian Platts, Le Lily Wang, Melissa VCF-Miner: GUI-based application for mining variants 10_1101-2021_02_10_430623 published S3-type N-of-1-pathways MixEnrich to two paired samples (e.g., diseased vs unaffected tissues) for determining patient-specific enriched genes sets: Odds Ratios (S3-OR) and S3-variance using these models to derive effect sizes and statistical significance in singlesubject studies of transcriptomes, these samples are isogenic or quasi-isogenic, and thus do not necessarily generalize to a group of subjects (cohort-level signal). The novel bioinformatic method identifies meaningful biomechanism differences between very small cohorts by using single-subject-study-derived effect sizes for gene sets. (B) For the generalized linear model-based analyses, we applied a different filtering process to the raw data where we eliminated all the transcripts with 0 counts for each subject and then calculated the coefficient 2.3 Description of the Generalized Linear Models and application of Inter-N-of-1 methods for small cohort comparison and their evaluation in the Breast Cancer Data the analysis of subsets of the TCGA Breast Cancer data, genes were declared differentially expressed if their abs(log2FC) > log2(1.2) and their FDR-adjusted p-value < 10_1101-2021_02_10_430649 Bfimpute: A Bayesian factorization method to recover single-cell RNA sequencing data Recovering dropout events in a sparse gene expression matrix for scRNA-seq data is a long-standing matrix completion We introduce Bfimpute, a Bayesian factorization imputation algorithm that reconstructs two latent gene and cell matrices to impute final gene expression matrix within each cell group, with or without the aid of cell type labels or bulk Bfimpute achieves better accuracy than other six publicly notable scRNA-seq imputation methods on simulated Key words: single cell; RNA-seq; imputation; Bayesian factorization impute dropout events by adopting the bulk RNA-seq data imputation 
of single cell RNA-seq data could be applied by Bfimpute recovers dropout values and improves cell type identification in the simulated data. and the imputed data by Bfimpute, scImpute, and DrImpute for the human embryonic stem cell differentiation study. imputation method scimpute for single-cell rna-seq data. 10_1101-2021_02_10_430656 A like-for-like comparison of lightweight-mapping pipelines for single-cell RNA-seq data pre-processing benchmark comparing the kallisto-bustools pipeline (2) for single-cell demonstrate that, when configured to match the computational complexity of kallisto-bustools as closely as possible, alevin-fry processes Alevin-fry (3) is a new pipeline for single-cell RNA-seq benchmarking STARsolo (9), kallisto-bustools (2) and alevin-fry (3), out new tools like alevin-fry for the pre-processing of single-cell data, (1), we have now created a simple-to-follow tutorial for speedoptimized single-cell pre-processing using alevin-fry (https:// by Booeshaghi and Pachter (1) change when a like-for-like comparison between alevin-fry and kallisto-bustools is carried out, we The time and memory used by the relevant steps of the alevin-fry and kallisto-bustools pipelines for pre-processing the 20 diverse tagged-end single-cell RNA-seq datasets used in (1). A comparison of the resulting count matrices obtained from alevin-fry and kallisto-bustools, as run in this manuscript, for the pbmc_10k_v3 dataset. 
peak memory than alevin-fry, with the kallisto-bustools pipeline using 10_1101-2021_02_10_430705 VIA: Generalized and scalable trajectory inference in single-cell omics data strategy to compute pseudotime, and reconstruct cell lineages based on lazy-teleporting random walks Step 1: Single-cell level graph is clustered such that each node user defined start cell) is first computed by the expected hitting time for a lazy-teleporting random walk along an network topology and single-cell level pseudotime/lineage probability properties onto an embedding using GAMs, as The cell fates and their lineage pathways are then computed by a two-stage probabilistic method, graph-traversal allows it to infer cell fates when the underlying data spans combinations of multifurcating detected cell fates annotated (o) lineage pathway and gene-pseudotime trend shown for the CD41 Megakaryocytic Figure 3 VIA infers trajectories in single-cell multi-omic and image datasets (a) Major lineages of human Single cells are represented by graph nodes that are connected based on 10_1101-2021_02_11_430695 Log-ratios are an important class of features for analyzing high-throughput sequencing (HTS) metagenomic data for HTS data, and more generally, high-dimensional CoDa. Unlike existing methods, CoDaCoRe is simultaneously scalable, interpretable, sparse, and accurate. unlabelled datasets, {x_i}_{i=1}^{n}, as a method for identiLearning Sparse Log-Ratios for High-Throughput Sequencing Data CoDaCoRe variable selection for the first (most explanatory) log-ratio on the Crohn disease data (Rivera-Pinto et al., 2018). more generally, in the field of CoDa.
10_1101-2021_02_11_430762 Ribovore: ribosomal RNA sequence analysis for GenBank submissions and database curation alignments of SSU, LSU and 5S rRNA from all three domains as well as from organelles, along with secondary structure predictions for selected sequences. Ribovore software package for the analysis of SSU rRNA and LSU rRNA sequences 18S SSU rRNA database of 1091 sequences was updated most recently on September 27, 2018 by running version 0.28 of the Ribovore program ribodbmaker on an input set of 579,279 GenBank sequences returned from the eukaryotic SSU rRNA The results of ribotyper and rRNA sensor are combined and each sequence is separated into one of four outcome classes depending on whether it passed or failed each input a set of candidate sequences and a specified rRNA model (e.g. SSU.Bacteria) two blastn databases: one of 1267 bacterial and archaeal 16S SSU rRNA sequences 10_1101-2021_02_11_430789 Accelerating COVID-19 research with graph mining and transformer-based learning develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. is currently customized and available in the open domain to massively process COVID-19 related queries.
Both systems are the next generation of the AGATHA knowledge network mining transformer model [37]. (1) Most of the existing HG systems are domain-specific (e.g., genedisease interactions) that is usually expressed in limiting the processed information (e.g., significant filtering vocabulary and papers a trained deep bi-LSTM model for extracting predicates from unstructured text. For instance, the node representing the entity "COVID-19" is connected to every sentence and predicate that The prior AGATHA semantic network only includes UMLS terms that appear in SemMedDB predicates [18] which is a major limitation. obtain embeddings per node in the semantic graph, we train AGATHA system ranking model. 10_1101-2021_02_11_430806 BIAPSS BioInformatic Analysis of liquid-liquid Phase-Separating protein Sequences web platform named BIAPSS (BioInformatic Analysis of liquidliquid Phase-Separating protein Sequences) which offers the users interactive data analytic tools for facilitating the discovery of statistically significant sequence signals for proteins with Phase-Separating protein Sequences. The objective of BIAPSS is to enable a rapid and on-the-fly deep statistical analysis of LLPS-driver proteins using the pool of sequences with The comparison to benchmarks of various protein groups enables statistical inference of specific phase-separating affinities. Furthermore, the residue-resolution biophysical regularities inferred from BIAPSS will help not only to accurately identify regions prone to phase separation but also to design sequence modifications targeting various biomedical applications. for comprehensive sequence-based analysis of LLPS proteins. the driving forces for phase separation of prion-like RNA binding proteins. disordered protein regions encode a driving force for liquid-liquid phase separation? of proteins driving liquid-liquid phase separation. 
10_1101-2021_02_11_430847 SearcHPV: a novel approach to identify and assemble human papillomavirus-host genomic integration events in cancer squamous cell carcinomas; however, the impact of HPV integration into the host human genome SearcHPV uncovered HPV integration sites adjacent to known cancer-related detection of HPV-human integration sites from targeted capture DNA sequencing data. developed a novel HPV integration detection tool for targeted capture sequencing data, which we SearcHPV showed a high frequency of HPV16 integration with a total of six events in UM-SCC In this study, SearcHPV also called HPV integration sites within TP63. HPV integration sites have been associated with structural variations in the human genome [3, 8, 37], which supports an additional genetic mechanism as to why HPV integration sites Genome-wide analysis of HPV integration in human and their integration sites in host genomes through next generation sequencing data. identify viruses and their integration sites using next-generation sequencing of human cancer 10_1101-2021_02_11_430871 ParticleChromo3D: A Particle Swarm Optimization Algorithm for Chromosome and Genome 3D Structure Prediction from Hi-C Data chromosome and genome structure reconstruction from Hi-C data using Particle Swarm Optimization approach chromosome bin, according to the particle swarm algorithm, and then iterates its position towards a global best This paper presents ParticleChromo3D, a new distance-based algorithm for chromosome 3D structure The structures generated by ParticleChromo3D also show that the result at swarm size Structures generated by ParticleChromo3D at different swarm size values. obtained by comparing the ParticleChromo3D algorithm's output structure to the simulated dataset's true plot of ParticleChromo3D SCC performance on 500KB GM12878 cell Hi-C data for chromosome 1 to 23.
chromosome 3D structure reconstruction algorithms on the GM12878 data set at both the 1MB and 500KB chromosome and genome structures reconstructed from Hi-C data. 10_1101-2021_02_12_430739 Mutations in bdcA and valS correlate with quinolone resistance in wastewater Escherichia Coli Here, we systematically screen for candidate quinolone resistance-conferring mutations. coli and performed a genome-wide association study (GWAS) correlating over 200,000 mutations against quinolone resistance phenotypes. significant mutations including one located at the active site of the biofilm dispersal genes bdcA and six silent In summary, we demonstrate that GWAS effectively and comprehensively identifies resistance mutations Keywords: E Coli; Quinolone; Antibiotic Resistance; Genome-Wide Association Study (GWAS) direct route to resistance is mutations in the drug targets gyrA and parC. In summary, we aim to show that a bacterial genomewide association study can effectively and comprehensively identify targets relevant to antibiotic resistance. Based on representative resistance phenotypes, the authors selected 103 isolates for sequencing with Illumina MiSeq, 92 of which are available from coli bdcA may act indirectly on antibiotic resistance. 10_1101-2021_02_12_430764 Triku: a feature selection method based on nearest neighbors for single-cell data Triku is a feature selection method that favours genes defining the main Single-cell RNA sequencing (scRNA-seq) is a powerful technology to study the biological heterogeneity of tissues at the individual cell level, allowing the characterization of new cell populations and cell states–i.e. cell types responding to different scRNA-seq datasets are multidimensional, i.e. the expression profile per cell consists of multiple genes. 
feature selection method: 1) the ability to recover basic dataset structure (main cell low, meaning that features selected with the different methods yielded clustering solutions that were quite similar to the manually-labeled cell types, although there are We first studied the expression pattern of genes selected by triku and other methods, To evaluate the cluster expression of selected genes in benchmarking datasets, for proteins within the genes selected by different FS methods in the two sets of benchmarking datasets. 10_1101-2021_02_12_430830 Simultaneous estimation of per cell division mutation rate and turnover rate from bulk tumor sequence data widely available bulk sequencing data where mutations from individual cells are and genomic mutation rate from bulk sequencing data. based on the maximum likelihood estimation of the parameters of a generative model of tumor growth and mutations. human hepatocellular carcinoma sample reveals an elevated per cell division mutation rate and high cell turnover. Due to the limitations of bulk sequencing, which only essays mutation frequencies for a population of cells from each tumor sample and does not The estimation is based on a maximum likelihood fit of the parameters of a birth-death model to the measured mutant and be estimated from readcount data, to separate the effects of the mutation rate We use pre-generated division trees from the ELynx suite at predetermined turnover rate values. Using the turnover rate, we also estimated the number of cell 10_1101-2021_02_12_430923 Kincore: a web resource for structural classification of protein kinases and their inhibitors result, among the DFGin structures, we distinguished between the catalytically active kinase conformation pages for kinase phylogenetic groups, genes, conformational labels, PDBids, ligands and ligand types. 
options to download data – database tables as tab-separated files; the kinase structures as PyMOL Kincore provides conformational assignments and ligand type labels to protein kinase structures from Figure 1: Representative protein kinase structure (3ETA_A) displaying the residues used to define inhibitor The distribution of different ligand types across kinase conformations is provided in Table 1. Table 1: Distribution of ligand types across protein kinase conformations (Number of chains). including conformational and ligand type labels and C-helix position, kinase family, gene name, Uniprot provides the number of kinase chains in the group across different conformations with their Database table provides the list of all the PDB chains with conformational labels and ligand 10_1101-2021_02_12_430963 adenylation site databases to enable differential 3' UTR usage analysis. Conclusions: diffUTR enables differential 3' UTR analysis and more generally facilitates DEU Popular bin-based DEU methods are provided by the limma [25,24], edgeR [23] and DEXSeq [22] Bins are prepared from various types of gene annotations as well as, optionally, additional APA-driven segmentation and extension, then read counts among statistically-significant genes, especially for bins with a higher expression (Figure 3A). diffUTR provides three main plot types to explore differential bin usage analyses, each with a Plotted are the UTR bins found statistically significant (bin- and gene-level FDR deuBinPlot (Figure 4B) provides bin-level statistic plots for a given gene, similar to those than CDS bins, including counts of 3' UTR when calculating overall gene expression could under- diffUTR streamlines DEU analysis and outperforms alternative methods in inferring UTR changes, For differential UTR analysis, gene-level results are ob- 10_1101-2021_02_12_430979 StrainFLAIR: Strain-level profiling of metagenomic samples using variation graphs results show that StrainFLAIR was
able to distinguish and estimate the abundances of close strains, as approaches to handle multiple similar genomes as with strains use gene clustering and then select the StrainFLAIR assigns and estimates species and strain abundances of a bacterial metagenomic sample graph, called the "node abundance", is computed, first focusing on unique mapped reads (first step). Strain-level abundances are then obtained by exploiting the specific genes of each reference genome from the reference variation graph thus simulating a new strain to be identified and quantified. strains from a sequenced sample, mapped onto this graph. Reference strains relative abundances expected and computed by StrainFLAIR or 10_1101-2021_02_12_430989 Benchmarking Association Analyses of Continuous Exposures with RNA-seq in Observational Studies as well as linear regression-based analyses for studying the association of continuous exposures generation of empirical null distribution of association p-values, and we apply the pipeline to Many studies of phenotypes associated with gene expression from RNA-seq consist of small Residual permutation approach for simulations and for empirical p-value computation covariates, and outcome distributions; and (b) their relationships, aside from the exposure-outcome association, are the same as in the real data, we used a residual permutation approach.
association studies applied to residual permutations were included to compute empirical p- approach to study the distribution of p-values under the null of no association between the phenotypes and RNA-seq, and used this approach to further study power, and to compute approaches for transcriptome-wide analysis of RNA-seq in population-based studies, including more comprehensive study of statistical permutation approaches for RNA-seq association
10_1101-2021_02_12_431018 HaVoC, a bioinformatic pipeline for reference-based consensus assembly and lineage assignment for SARS-CoV-2 sequences. Several new variants of SARS-CoV-2 have emerged globally, of which the based assemblies on raw SARS-CoV-2 sequences in addition to identifying lineages to detect variants of concern, we have developed an open source bioinformatic pipeline called HaVoC monitor the spread of SARS-CoV-2 variants of concern during local outbreaks. currently being used in Finland for monitoring the spread of SARS-CoV-2 variants. SARS-CoV2, variant detection, reference assembly, lineage identification, coronavirus, surveillance of virus variants by sequencing the SARS-CoV-2 genomes would provide a fast to query SARS-CoV-2 fastq sequence libraries and assigns lineages to them individually in processing and a reference genome of SARS-CoV-2 in a separate FASTA file. The likelihood of emergence of novel SARS-CoV-2 variants of concern is increased and Emerging SARS-CoV-2 Variants.
10_1101-2021_02_13_429885 know tumour purity and the ploidy of a CNA segment, then the VAF mutations mapped A fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing.
10_1101-698605 Comparative evaluation of full-length isoform quantification from RNA-Seq Full-length isoform quantification from RNA-Seq is a key goal in transcriptomics analyses benchmarking, isoform quantification, simulated data, pseudo-alignment, RNA-Seq, short Given the difficulty in full-length isoform quantification, many RNA-Seq studies simply analysis performed on the known true isoform quantifications of the simulated data to the For the simulated data we started with 11 real RNA-Seq samples: six liver and six the isoform expression level using idealized and realistic simulated data, with full and true counts), for the set of expressed isoforms in sample 1 in C) idealized and D) realistic data. Method effect on differential expression analysis, using realistic data. RSEM is a gene/isoform abundance tool for RNA-Seq data which uses a generative model S1 Fig. Method effect on full-length isoform quantification using simulated data.
10_1101-727867 scAEspy: a tool for autoencoder-based analysis of single-cell RNA sequencing data This computational tool allows for coupling low-dimensional probabilistic representation of gene expression data with the downstream analysis to consider the Finally, the currently available AEs cannot be directly exploited to obtain the latent space or to generate synthetic cells. to show the cells in this embedded space or as a starting point for other dimensionality reduction approaches (e.g., t-SNE and UMAP) as well as downstream analyses Non-linear approaches for dimensionality reduction can be effectively used to capture the non-linearities among the gene interactions that may exist in the high-dimensional expression space of scRNA-Seq data [16]. be effectively applied to analyse disparate types of single-cell data from different flexible method developed to cluster single-cell data; (ii) a centroid is calculated batch-effect correction methods for single-cell RNA sequencing data. Wang, D., Gu, J.: VASC: dimension reduction and visualization of single-cell RNA-seq data by deep