Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'.

The Distant Reader harvested and cached your content into a
collection (corpus). It then applied a set of natural language
processing and text mining routines to the collection. The results of
this process were reduced to a database file -- a 'study carrel'.
The study carrel can then be queried, bringing to light
specific characteristics of your collection. These
characteristics can help you summarize the collection as well as
enumerate things you might want to investigate more closely.

This report is a terse summary. When processing is complete,
you will be linked to a more complete narrative report.

                               Eric Lease Morgan <emorgan@nd.edu>


Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
50


Average length of all items measured in words; "More or less, how big is each item?"
------------------------------------------------------------------------------------
9329


Average readability score of all items (0 = difficult; 100 = easy)
------------------------------------------------------------------
61


Top 50 statistically significant keywords; "What is my collection about?"
-------------------------------------------------------------------------
19	February
15	International
8	RNA
7	cell
5	gene
3	mutation
3	Supplementary
3	Seq
3	Fig
3	Data
3	Alzheimer
2	set
2	sequence
2	motif
2	method
2	international
2	figure
2	covid-19
2	SARS
2	PCA
2	GWAS
2	Figure
2	Cancer
1	δaf
1	target
1	subject
1	strain
1	single
1	siamese
1	seq
1	rrna
1	rate
1	protein
1	preprint
1	polar
1	phase
1	phage
1	nitrogen
1	model
1	mer
1	llps
1	lineage
1	kinase
1	inhibitor
1	image
1	high
1	glm+egs
1	genome
1	fry
1	file


Top 50 lemmatized nouns; "What is discussed?"
---------------------------------------------
2327	gene
2265	cell
2114	preprint
1821	datum
1194	version
1191	method
1191	author
1169	sequence
1101	copyright
1083	review
1057	model
1028	funder
1027	holder
1023	peer
968	dataset
944	license
936	analysis
935	set
868	value
857	%
818	preprintthis
723	perpetuity
675	number
666	sample
651	result
651	licenseavailable
643	expression
607	type
582	mutation
542	cancer
515	figure
494	structure
472	seq
457	size
453	study
429	file
428	approach
421	motif
414	feature
411	drug
410	time
386	genome
385	site
376	p
375	effect
362	protein
361	network
360	information
355	ad
344	level


Top 50 proper nouns; "What are the names of persons or places?"
--------------------------------------------------------------
1477	al
1039	February
1013	et
776	International
758	J.
715	M.
611	.
603	NC
488	S.
487	RNA
439	D.
427	C.
421	A.
417	ND
371	R.
334	L.
330	P.
320	C
294	B.
291	E.
286	G.
283	Figure
272	Fig
267	T.
259	M
256	K.
252	S
252	A
247	BY
242	Supplementary
234	H.
233	k
229	µM
228	J
220	N.
194	http://creativecommons.org/licenses/by-nc/4.0/
183	N
176	K
176	F.
175	R
173	Data
173	Alzheimer
164	Seq
163	Y.
160	Li
158	Cancer
157	T
155	SPARC
155	Raloxifene
154	W.


Top 50 personal pronouns; "To whom are things referred?"
--------------------------------------------------------
2679	we
1639	it
243	they
164	i
129	them
57	us
37	one
28	itself
13	you
9	themselves
6	https://doi.org/10.1101/2020.01.28.923532
6	he
5	𝒙
4	​sample​
4	u
4	s
3	∆̂′
3	m′
2	𝑙𝑎
2	λ
2	ourselves
2	ours
2	n
2	il-
2	http://paperpile.com/b/5tes3g/x5omi
1	𝜟
1	𝑒𝑖
1	𝑆∗of
1	’s
1	τ2
1	α
1	ʻʻuniprotdom_postmodenzʼ
1	y∗
1	yij
1	yes
1	when398
1	uw
1	us-
1	ub
1	to136
1	theirs
1	the355
1	pepos
1	mj
1	kb
1	influ-
1	in-
1	https://www.10xgenomics.com/resources/publications/
1	https://paperpile.com/c/rqvmzs/bhgv
1	https://paperpile.com/c/5tes3g/x5omi


Top 50 lemmatized verbs; "What do things do?"
---------------------------------------------
12666	be
2222	use
2129	have
1148	make
1030	post
1027	certify
905	display
867	grant
802	base
726	show
702	biorxiv
559	https://www.zotero.org/google-docs/?hsltkm
492	include
478	provide
457	identify
451	select
414	do
414	associate
408	compare
348	�
334	perform
330	follow
328	find
326	generate
320	allow
309	represent
301	give
300	obtain
284	contain
273	set
258	apply
246	predict
246	estimate
225	see
222	require
222	describe
220	define
211	calculate
190	compute
189	take
189	develop
189	consider
182	bind
179	achieve
176	test
176	cluster
172	target
170	propose
170	know
170	exist


Top 50 lemmatized adjectives and adverbs; "How are things described?"
---------------------------------------------------------------------
2100	not
681	-
650	single
637	different
608	also
574	more
570	high
556	available
544	other
455	only
428	well
377	first
370	such
367	specific
335	non
335	large
330	low
327	then
326	most
325	same
320	genome
292	human
274	small
264	however
261	e.g.
257	new
242	multiple
240	similar
221	good
207	many
201	top
197	thus
185	random
183	clinical
175	as
169	individual
168	significant
168	average
167	linear
163	functional
161	genomic
159	here
157	original
157	biological
155	further
153	wide
149	international
148	polar
147	less
142	deep


Top 50 lemmatized superlative adjectives; "How are things described to the extreme?"
-------------------------------------------------------------------------
182	good
137	most
69	least
28	high
24	Most
18	near
18	large
17	small
14	close
13	short
12	late
10	transcriptome
10	low
8	bad
6	great
5	fast
4	manif
4	deep
3	simple
2	young
2	old
2	long
1	​k​-near
1	would71
1	wide
1	was337
1	the98
1	the439
1	ter
1	strong
1	sparse
1	slow
1	pbmc_10k_v3
1	http://kinametrix.com/
1	fine
1	e
1	dense
1	broad
1	Least
1	CDC2L6


Top 50 lemmatized superlative adverbs; "How do things do to the extreme?"
------------------------------------------------------------------------
189	most
60	least
30	well
9	transcriptome
5	highest


Top 50 Internet domains; "What Webbed places are alluded to in this corpus?"
----------------------------------------------------------------------------
2261	doi.org
871	creativecommons.org
701	paperpile.com
675	www.zotero.org
141	dx.doi.org
135	github.com
131	www.protocols.io
91	docs.google.com
23	www.ncbi.nlm.nih.gov
20	www.codecogs.com
8	scicrunch.org
8	arxiv.org
7	bigd.big.ac.cn
6	satijalab.org
6	preprocessed-connectomes-project.org
6	osf.io
6	gitlab.com
5	www.nlm.nih.gov
4	www.synapse.org
4	reframedb.org
4	rdcu.be
4	pypi.org
4	gatk.broadinstitute.org
4	dunbrack.fccc.edu
4	colororacle.org
4	bioinf.itmat.upenn.edu
4	biapss.chem.iastate.edu
3	www.wolfram.com
3	www.ukbiobank.ac.uk
3	www.nature.com
3	www.cell.com
3	www.r-project.org
3	plants.ensembl.org
3	ftp.tue.mpg.de
3	ftp.ncbi.nlm.nih.gov
3	ecrlife420999811.wordpress.com
3	commonfund.nih.gov
3	combine-lab.github.io
3	cf.10xgenomics.com
3	bitbucket.org
3	adni.bitbucket.io
2	www.uniprot.org
2	www.springer.com
2	www.qt.io
2	www.proteinatlas.org
2	www.ontobee.org
2	www.internationalgenome.org
2	www.highcharts.com
2	www.helsinki.fi
2	www.gurobi.com


Top 50 URLs; "What is hyperlinked from this corpus?"
----------------------------------------------------
571	http://www.zotero.org/google-docs/?HsLTKM
360	http://creativecommons.org/licenses/by-nc-nd/4.0/
244	http://creativecommons.org/licenses/by-nc/4.0/
205	http://creativecommons.org/licenses/by/4.0/
127	http://www.protocols.io/view/corchea-paper-based-microfluidic-device-vtwe6pe
62	http://creativecommons.org/licenses/by-nd/4.0/
47	http://doi.org/10.1101/2021.02.09.430536doi:
47	http://doi.org/10.1101/2021.02.09.430536
46	http://doi.org/10.1101/2020.01.28.923532doi:
46	http://doi.org/10.1101/2020.01.28.923532
41	http://doi.org/10.1101/2021.02.10.430512doi:
41	http://doi.org/10.1101/2021.02.10.430512
39	http://doi.org/10.1101/2021.02.09.430460doi:
39	http://doi.org/10.1101/2021.02.09.430460
37	http://doi.org/10.1101/698605doi:
37	http://doi.org/10.1101/698605
37	http://doi.org/10.1101/2021.02.09.430550doi:
37	http://doi.org/10.1101/2021.02.09.430550
36	http://doi.org/10.1101/2021.02.13.429885doi:
36	http://doi.org/10.1101/2021.02.13.429885
36	http://doi.org/10.1101/2020.10.08.327718doi:
36	http://doi.org/10.1101/2020.10.08.327718
32	http://doi.org/10.1101/2020.09.21.305516doi:
32	http://doi.org/10.1101/2020.09.21.305516
32	http://doi.org/10.1101/2020.09.02.279521doi:
32	http://doi.org/10.1101/2020.09.02.279521
31	http://doi.org/10.1101/2021.02.10.430606doi:
31	http://doi.org/10.1101/2021.02.10.430606
31	http://doi.org/10.1101/2020.09.23.308239doi:
31	http://doi.org/10.1101/2020.09.23.308239
29	http://doi.org/10.1101/2021.02.08.430343doi:
29	http://doi.org/10.1101/2021.02.08.430343
28	http://doi.org/10.1101/727867doi:
28	http://doi.org/10.1101/727867
28	http://doi.org/10.1101/2021.02.11.430762doi:
28	http://doi.org/10.1101/2021.02.11.430762
27	http://doi.org/10.1101/2021.02.12.430989doi:
27	http://doi.org/10.1101/2021.02.12.430989
25	http://doi.org/10.1101/2021.02.11.430847doi:
25	http://doi.org/10.1101/2021.02.11.430847
24	http://doi.org/10.1101/2021.02.11.430871doi:
24	http://doi.org/10.1101/2021.02.11.430871
24	http://doi.org/10.1101/2021.02.10.430705doi:
24	http://doi.org/10.1101/2021.02.10.430705
24	http://doi.org/10.1101/2021.02.01.429246doi:
24	http://doi.org/10.1101/2021.02.01.429246
23	http://doi.org/10.1101/2021.02.09.430405
23	http://doi.org/10.1101/2021.02.08.430280doi:
23	http://doi.org/10.1101/2021.02.08.430280
22	http://doi.org/10.1101/2021.02.08.430270doi:


Top 50 email addresses; "Who are you gonna call?"
-------------------------------------------------
3	andrea.tangherloni@unibg.it
2	wu.bin.kmu@qq.com
2	tracey.weissgerber@charite.de
2	ooluwada@uccs.edu
2	michael.schroeder@tu-dresden.de
2	mararabra@yahoo.co.uk
2	l165018@lhr.nu.edu.pk
2	huang@southalabama.edu
2	ggrant@pennmedicine.upenn.edu
2	gcaravagna@units.it
2	eytan.ruppin@nih.gov
2	clement.abi-nader@inria.fr
2	borchert@southalabama.edu
2	allenem@pennmedicine.upenn.edu
2	alejandro.schaffer@nih.gov
1	zhang.jianbo@gmail.com
1	yaozhong@ims.u-tokyo.ac.jp
1	wjshen@stu.edu.cn
1	tyagin@udel.edu
1	tsofer@bwh.harvard.edu
1	tg76@st-andrews.ac.uk
1	smgomez@unc.edu
1	shtutmanm@sccp.sc.edu
1	sacha@labsquare.org
1	rob@cs.umd.edu
1	potoyan@iastate.edu
1	pl219@cam.ac.uk
1	pierre-luc.germain@hest.ethz.ch
1	noble@uw.edu
1	nicholas.youngblut@tuebingen.mpg.de
1	nawrocke@ncbi.nlm.nih.gov
1	mmartone@ucsd.edu
1	matta.krish@charterschool.org
1	martin.hofmann-apitius@scai.fraunhofer.de
1	marco.gallo@ucalgary.ca
1	maizie.zhou@vanderbilt.edu
1	lswang@pennmedicine.upenn.edu
1	kevin.da-silva@inria.fr
1	jxwang@mail.csu.edu.cn
1	jsybrandt@google.com
1	jsybran@clemson.edu
1	jli@stat.ucla.edu
1	isafro@udel.edu
1	gunnar.ratsch@ratschlab.org
1	gmarcais@cs.cmu.edu
1	f.ricciuti@campus.unimib.it
1	eg2912@columbia.edu
1	dilip_panthee@ncsu.edu
1	david.gibbs@isbscience.org
1	daniela.besozzi@unimib.it


Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?"
-------------------------------------------------------------------------------
1027	version posted february
16	value is better
11	gene set gs
8	gene set enrichment
6	methods do not
4	genes are more
4	genes were more
4	sequence file
4	sequences do not
3	cells using nanoliter
3	data are available
3	data using t
3	datasets are available
3	gene is highly
3	genes are not
3	genes were functionally
3	values are n
2	analysis using only
2	data are already
2	data is available
2	data is important
2	data is normalized
2	data set figure
2	data using data
2	data using empirical
2	data using hierarchical
2	data using online
2	data using umap
2	dataset does not
2	dataset was not
2	gene is differentially
2	gene set analyses
2	gene set level
2	gene set response
2	gene set signal
2	genes are likely
2	genes being differentially
2	genes have similar
2	genes were likely
2	method is also
2	method were then
2	methods are not
2	methods are similar
2	methods did not
2	methods show limited
2	methods were scanpy
2	model does not
2	model was then
2	models are available
2	sequences are likely


Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?"
---------------------------------------------------------------------------------------
2	dataset does not currently
2	methods are not significantly
1	% have not yet
1	al are not directly
1	authors have no financial
1	data does not necessarily
1	data has no indels
1	data is not always
1	datasets are not equal
1	funders had no role
1	genes are not as
1	genes provide no explanation
1	methods did not correctly
1	methods do not directly
1	methods made no positive
1	model did not significantly
1	model was not able
1	models provide no pharmacological
1	sequences did not much341
1	sequences is not important
1	sequences were no more
1	sequences were not used.345
1	sets are not too
1	sets does not always
1	sets is not fundamentally
1	values are not necessarily
1	values are not reliable


Sizes of items; "Measures in words, how big is each item?"
----------------------------------------------------------
20656	10_1101-2021_02_09_430536
16705	10_1101-2020_01_28_923532
16496	10_1101-2021_02_11_430762
15659	10_1101-2021_02_09_430460
15440	10_1101-2021_02_01_429246
15281	10_1101-727867
13590	10_1101-2021_02_10_430705
13512	10_1101-2021_02_09_430550
13439	10_1101-2021_02_09_430363
12853	10_1101-698605
12824	10_1101-2020_10_08_327718
12363	10_1101-2021_02_08_430280
12164	10_1101-2020_09_02_279521
12013	10_1101-2021_02_10_430606
11859	10_1101-2021_02_10_430512
11335	10_1101-2021_02_08_430343
10901	10_1101-2021_02_10_430563
10624	10_1101-2021_02_12_430979
10584	10_1101-2021_02_13_429885
10376	10_1101-2020_09_21_305516
10071	10_1101-2021_02_11_430871
9518	10_1101-2021_02_12_430764
9478	10_1101-2021_02_10_430623
9408	10_1101-2021_02_11_430789
8797	10_1101-2020_09_23_308239
8418	10_1101-2021_02_10_430649
8205	10_1101-2020_11_17_386649
8181	10_1101-2021_02_12_430830
8136	10_1101-2021_02_12_430989
8121	10_1101-2020_02_04_934216
7973	10_1101-2021_02_11_430695
7913	10_1101-2021_02_12_430923
7909	10_1101-2021_02_08_428881
7516	10_1101-2021_02_12_430739
7219	10_1101-2021_02_08_430270
6849	10_1101-2021_02_11_430847
6710	10_1101-2021_02_12_430963
6557	10_1101-2021_02_10_430656
6404	10_1101-2021_02_08_430275
5987	10_1101-2020_09_23_310276
5941	10_1101-2021_02_09_430405
5183	10_1101-2021_02_08_430070
4932	10_1101-2021_02_10_430619
4875	10_1101-2021_02_10_430367
4252	10_1101-2020_12_24_424317
3786	10_1101-2021_02_12_431018
3128	10_1101-2021_02_09_430036
2698	10_1101-2021_02_11_430806
2191	10_1101-2020_05_15_090266
1409	10_1101-2021_02_10_430604
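
The per-item word counts listed above can be used to cross-check the
reported average length. The following is a minimal sketch, not part
of the Distant Reader itself; the variable name `word_counts` is an
illustrative assumption, and the values are copied from the listing
above in the same order.

```python
from statistics import mean

# Word counts copied from the "Sizes of items" listing, largest first.
word_counts = [
    20656, 16705, 16496, 15659, 15440, 15281, 13590, 13512, 13439, 12853,
    12824, 12363, 12164, 12013, 11859, 11335, 10901, 10624, 10584, 10376,
    10071, 9518, 9478, 9408, 8797, 8418, 8205, 8181, 8136, 8121,
    7973, 7913, 7909, 7516, 7219, 6849, 6710, 6557, 6404, 5987,
    5941, 5183, 4932, 4875, 4252, 3786, 3128, 2698, 2191, 1409,
]

# The collection is reported to hold 50 items.
print("items:", len(word_counts))

# Arithmetic mean, rounded to the nearest whole word.
print("average words per item:", round(mean(word_counts)))
```

Rounded to the nearest word, the mean of these 50 values agrees with
the average length of 9329 reported near the top of this summary.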


Readability of items; "How difficult is each item to read?"
-----------------------------------------------------------
79.0	10_1101-2021_02_09_430536
72.0	10_1101-727867
71.0	10_1101-2021_02_10_430649
71.0	10_1101-2021_02_01_429246
70.0	10_1101-2021_02_12_430739
70.0	10_1101-2021_02_09_430363
70.0	10_1101-2021_02_09_430036
69.0	10_1101-2021_02_08_430270
68.0	10_1101-2021_02_12_430830
68.0	10_1101-2021_02_08_430275
67.0	10_1101-2020_09_21_305516
67.0	10_1101-2020_11_17_386649
66.0	10_1101-2020_01_28_923532
66.0	10_1101-2021_02_12_430979
66.0	10_1101-2021_02_10_430619
66.0	10_1101-2021_02_10_430367
66.0	10_1101-2021_02_08_428881
65.0	10_1101-2021_02_11_430762
64.0	10_1101-2021_02_12_430764
64.0	10_1101-2021_02_09_430550
64.0	10_1101-2021_02_10_430656
64.0	10_1101-2020_05_15_090266
64.0	10_1101-2021_02_09_430405
63.0	10_1101-2021_02_11_430871
63.0	10_1101-2021_02_10_430606
62.0	10_1101-2021_02_12_430963
62.0	10_1101-2021_02_12_430923
61.0	10_1101-2021_02_12_431018
60.0	10_1101-2021_02_11_430695
60.0	10_1101-2021_02_08_430070
59.0	10_1101-2020_02_04_934216
58.0	10_1101-2021_02_11_430789
58.0	10_1101-2021_02_10_430623
58.0	10_1101-2021_02_08_430343
58.0	10_1101-2021_02_08_430280
57.0	10_1101-2021_02_11_430847
57.0	10_1101-2020_09_23_308239
57.0	10_1101-2020_12_24_424317
56.0	10_1101-2020_10_08_327718
56.0	10_1101-2021_02_10_430512
56.0	10_1101-2021_02_10_430604
56.0	10_1101-2020_09_02_279521
55.0	10_1101-698605
53.0	10_1101-2021_02_11_430806
53.0	10_1101-2021_02_10_430705
52.0	10_1101-2020_09_23_310276
49.0	10_1101-2021_02_13_429885
48.0	10_1101-2021_02_12_430989
48.0	10_1101-2021_02_10_430563
40.0	10_1101-2021_02_09_430460
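
The item identifiers used in the listings above appear to be DOIs with
the punctuation replaced: compare 10_1101-2021_02_09_430536 with the
URL http://doi.org/10.1101/2021.02.09.430536 in the URL listing. The
sketch below converts an identifier back into a DOI under that
assumption. The function name `carrel_id_to_doi` is hypothetical, and
the mapping is inferred from this report rather than documented, so it
may not hold for identifiers whose DOI suffix itself contains an
underscore or hyphen.

```python
def carrel_id_to_doi(item_id: str) -> str:
    """Convert an item identifier such as '10_1101-698605' to a DOI.

    Assumption (inferred from this report): the first hyphen stands
    for the slash between DOI prefix and suffix, and every underscore
    stands for a period.
    """
    prefix, _, suffix = item_id.partition("-")
    return prefix.replace("_", ".") + "/" + suffix.replace("_", ".")


# Two identifiers taken from the listings above.
for item_id in ("10_1101-2021_02_09_430536", "10_1101-698605"):
    print(item_id, "->", "http://doi.org/" + carrel_id_to_doi(item_id))
```

Both resulting URLs occur in the "Top 50 URLs" listing, which is the
only evidence offered here for the mapping.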


Item summaries; "In a narrative form, how can each item be abstracted?"
-----------------------------------------------------------------------
10_1101-2020_01_28_923532	We focus our analysis on genes encoding protein targets that encode receptors on the cell all "modular", including one part that specifically targets the tumor cell via one gene/protein and MadHitter and each patient receives an optimal personalized combination of targeted therapies from a prespecified set (pill bottle). Cohort and Individual Target Set Sizes as Functions of Tumor Killing and Given the single-cell tumor data sets and the ILP optimization framework described above, we filtering as this threshold is decreased), decreases the size of the target cell surface receptor gene heterogeneity of the cancer, number of patients within the data set, size of target gene set, lack of used for filtering the gene set to avoid targeting non-cancerous tissues. the genes in the optimal target set, the expression of that gene in that non-tumor cell exceeds the set of genes which is known to be targetable to cell 𝐶.

10_1101-2020_02_04_934216	EMBER: Multi-label prediction of kinase-substrate phosphorylation events through deep learning task of kinase-motif phosphorylation prediction as a multi-label kinase or substrate, as well as protein scaffolds that facilitate structural orientation and downstream catalysis of the reaction, modify the efficacy of motif phosphorylation. prediction of phosphorylation events), a deep learning approach for predicting multi-label kinase-motif phosphorylation relationships. example, the TLK kinase family only has nine positive labels (verified TLK-motif interactions) and more than 10,000 resulting data set is comprised of 7302 phosphorylatable motifs and their reaction-associated kinase families (Table 1). The final output is a vector, k, of length eight, where each value corresponds to the probability that the motif a was phosphorylated by one of the kinase families indicated in We sought to illuminate the relationship between kinase-family dissimilarity and phosphorylated motif-group dissimilarity described results provide motivation to incorporate both motif dissimilarity and kinase relatedness into the predictive model, as of kinase-motif prediction compared to the single-label approaches.

10_1101-2020_05_15_090266	Summary: SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching SpacePHARER by searching a comprehensive spacer list against all complete phage genomes. methods compare individual CRISPR spacers with phage To increase sensitivity, (1) we compare protein coding sequences because phage genomes are mostly coding, and, (0) Preprocess input: scan the phage genome and CRISPR spacers in six ORFs q of CRISPR spacers extracted from one prokaryotic genome, and each target set T comprises the putative protein sequences t from a single phage. The performance of SpacePHARER was evaluated on the spacer test set against a target database predicted the correct host for more phages than BLASTN BLASTN in detecting phage-host pairs, due to searching

10_1101-2020_09_02_279521	Simulating the outcome of amyloid treatments in Alzheimer's disease from imaging and clinical data When applied to multimodal imaging and clinical data from the Alzheimer's Disease Neuroimaging Initiative our * Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database Keywords : Alzheimer's Disease ; Clinical trials ; Disease progression; Amyloid hypothesis; of large datasets of different data modalities, such as clinical scores, or brain imaging measures to model Alzheimer's disease progression based on specific assumptions on the biochemical combining traditional DPMs with dynamical models of Alzheimer's disease progression. In this work we present a novel computational model of Alzheimer's disease progression to multi-modal imaging and clinical data from the Alzheimer's Disease Neuroimaging To simulate the long-term progression of Alzheimer's disease we first project the AD subjects Figure 3 Model-based progression of Alzheimer's disease. clinical data, based on the estimation of latent biomarkers' relationships governing Alzheimer's

10_1101-2020_09_21_305516	Copy-scAT: Deconvoluting single-cell chromatin accessibility of genetic subclones in cancer 1 Copy-scAT: Deconvoluting single-cell chromatin accessibility of genetic subclones in cancer 1 uses single-cell epigenomic data to infer copy number variants (CNVs) that define cancer cells. We have tested the ability of Copy-scAT to use scATAC data to call CNVs with three different approaches 100 genome sequencing (WGS) data for adult GBM (aGBM) surgical resections (n = 4 samples, 3,647 cells). adult GBM samples identified using both methods, versus total numbers of gains detected by scATAC or 160 Number of chromosome-arm level gains detected in adult GBM samples identified using both methods, 163 (c) Multiple myeloma samples were profiled by both scATAC and the single-cell CNV assay. chromosome-arm level gains detected in adult GBM samples identified using both methods, versus total 166 CNVs are detected in scATAC clusters with Copy-scAT in pediatric GBM samples.

10_1101-2020_09_23_308239	The COVID-19 PHARMACOME: A method for the rational selection of drug repurposing COVID-19 PHARMACOME, a comprehensive drug-target-mechanism graph generated from a initial version of the COVID-19 PHARMACOME, a comprehensive drug-target-mechanism graph representing COVID-19 pathophysiology mechanisms that includes both drug targets Figure 3: Overlap of compound hits between different drug repurposing screening experiments. space overlap between different COVID-19 drug repurposing screenings. The COVID-19 PHARMACOME associates pathways derived from drug repurposing targets Figure 4 shows the distribution of repurposing drugs in the COVID-19 cause-and-effect graph, overlap analysis allows for the identification of repurposing drugs targeting mechanisms that Virus-response mechanisms are targets for repurposing drugs Figure 5: Visualization of drug repurposing candidates (and their targets) used in combination treatment as our own drug repurposing screening results, we were able to identify mechanisms targeted COVID-19 PHARMACOME, we are now able to link repurposing drugs, their targets and the SARS-CoV-2 protein interaction map reveals targets for drug repurposing.

10_1101-2020_09_23_310276	The NIAGADS Alzheimer's Genomics Database (GenomicsDB) is an interactive knowledgebase for Alzheimer's disease (AD) genetics that provides access to GWAS summary statistics datasets The website makes available >70 genome-wide summary statistics datasets from GWAS and efficient real-time data analysis and variant or gene report generation. Gene reports provide summaries of co-located ADRD risk-associated variants and have pages linking summary statistics to variant and gene annotations, this resource makes these summary statistics available for browsing (on dataset, gene, and variant reports and as genome NIAGADS GenomicsDB variant reports and a track is available on the genome browser. The NIAGADS GenomicsDB includes allele frequency data from 1000 Genomes (phase 3, version visualizations for summarizing search results and annotations in gene and variant reports. compare NIAGADS GWAS summary statistics tracks to each other, against annotated gene or A detailed report is provided for each of the GWAS summary statistics and ADSP meta-analysis

10_1101-2020_10_08_327718	journals in three fields; plant sciences, cell biology and physiology (n=580 papers). figures were uncommon (physiology 16%, cell biology 12%, plant sciences 2%). among papers published in top journals in plant sciences, cell biology and physiology. contained images (plant science: 68%, cell biology: 72%, physiology: 55%). in physiology (49%) and cell biology (55%), and 28% of plant science papers provided and 29% of plant sciences papers contained no scale information on any image. Some publications use insets to show the same image at two different scales (cell Figure 1: Image types and reporting of scale information and insets physiology and plant science papers contained some images that were inaccessible to B: Most papers explain colors in image-based figures, however, explanations are less Figure 4: Using scale bars to annotate image size Creating clear and informative image-based figures for scientific publications.

10_1101-2020_11_17_386649	Experiments on 10,000 RNA-seq datasets show that RowDiff combined with MultiBRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most a binary matrix, where the k-mer set indexes the rows and each annotation label specifies a column. Starting from any vertex in the de Bruijn graph, Algorithm 1 defines a traversal leading to an anchor Each row in a RowDiff-transformed annotation matrix has the same or fewer set bits than A naı̈ve implementation of the RowDiff construction would be to load the matrix A in memory, and gradually replace its rows with their sparsified counterpart, while traversing the graph. We now note that, when querying annotations for paths in the graph, or sets of rows corresponding to vertices We constructed annotated de Bruijn graphs from the RNA-Seq data set in the same We now compare the representation size for RowDiff and other state-of-the-art graph annotation compression methods.

10_1101-2020_12_24_424317	classification, feature extraction and relevant gene identification through deep learning methods for 12 This research picks up from detection of different types of cancer RNA-Seq expressions using deep neural classification of gene expression profiles for different kinds of cancers. Hence, the effectiveness of deep learning models for feature extraction and relevant gene identification is performed revealing substantial results and they produced five high-ranked gene sets and reduced feature This study was aimed at classifying 12 types of cancer and identifying relevant genes and the results show were able to identify cancer-relevant pathways and genes for the sets, that different experiments generated, A deep learning approach for cancer detection and relevant gene Tumor gene expression data classification via sample expansionbased deep learning. Identification of a multi-cancer gene expression Multi-class Cancer Classification and Biomarker Identification using Deep Learning

10_1101-2021_02_01_429246	minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets suggests, a UHS is a set of k-mers that "hits" every w-long window of every possible sequence (hence the the elements of the polar sets are in the sequence: the higher the energy, the more spread apart the k-mers have densities upper bounded by |U|/σk, because only k-mers from the universal hitting set can be selected. Section 2.2 gives a formal definition of the link energy of a polar set and Theorem 1 gives upper and lower bounds using this link energy for the density of a minimizer compatible with a polar set. form a link, which in turn is the number of k-mer pairs in the polar set that are exactly w bases away on S. A context is charged if the minimizer selects a different k-mer in the first window than in the second

10_1101-2021_02_08_428881	A common workflow in single-cell RNA-seq analysis is to project the data to a latent space, cluster the cells in that space, and identify sets of marker genes that explain the differences among the nonlinear embedding model which maps the gene expression to the low-dimensional representation where the groups A notable feature of ACE's approach is that, by identifying genes jointly, the method moves away from the notion Input: gene expression matrix Deep autoencoder learns low-dimensional representation Embedding clustering Clustering is neuralized and concatenated with the encoder Differentiation analysis by ACE Output: gene relevance ACE takes as input a single-cell gene expression matrix and learns a low-dimensional representation for each Next, a neuralized version of the k-means algorithm is applied to the learned representation to identify cell groups. input gene expression profile that lead the neuralized clustering model to alter the assignment from one group to the other.

10_1101-2021_02_08_430070	On the application of BERT models for nanopore methylation detection with deep learning models, have achieved significant performance improvements on nanopore methylation recurrent patterns of positional-signal-shift in the context window surrounding target 5-methylcytosine that the refined BERT model can achieve competitive or even better results than the state-of-the-art biRNN of datasets from the different research groups, BERT models demonstrate a good generalization Fig. 1: Basic BERT's and refined BERT's model structure used for methylation detection. a refined BERT model to take account of signal-shift patterns in the proposed refined BERT model achieves a competitive or even better result explore applying the BERT model for the nanopore methylation detection 2.2 Applying BERT models for nanopore methylation For the cross-sample evaluation, we train models on one dataset and test a BERT model to pay more attention to center positions. In-sample evaluation of different deep learning models on 5mC datasets.

10_1101-2021_02_08_430270	Scalable Bias-corrected Linkage Disequilibrium Estimation Under Genotype Uncertainty Keywords and phrases: attenuation bias, genotype likelihood, linkage disequilibrium, polyploidy, reliability ratio. Let XiA and XiB be the posterior means at loci A and B for individual Equations (5)–(7) take the naive estimators most researchers use in practice (the sample covariance/correlation of posterior means) and inflate these by a multiplicative effect. Gerard and Ferrão, 2019] to obtain the posterior moments for each individual's genotype at each SNP reliability ratios of most SNPs only increase their correlation estimates by less than 10%. To evaluate the LD estimates of high reliability ratio SNPs, we calculated the MLEs for ρ2 applied to simple linear regression with an additive effects model (where the SNP effect is proportional to the dosage), result in the standard ordinary least squares estimates when using the extreme reliability ratio of PotVar0080327, the genotype-error adjusted correlation estimate is -1.

10_1101-2021_02_08_430275	Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes identified using BSA-Seq, a technology in which next-generation sequencing (NGS) is applied to bulked segregant analysis (BSA). recently developed the significant structural variant method for BSASeq data analysis that exhibits higher detection power than standard to analyze BSA-Seq data in which genome sequences of one parent served as the reference sequences in genotype calling, and thus We analyzed a public BSA-Seq dataset using our modified method and the standard allele frequency and Gmethod allows the detection of such associations without sequencing the parental genomes, leading to further lower the the BSA-Seq data with the genome sequences of both the parents101 when the parental genome sequences are used to aid BSA-Seq data 193 The allele frequency method: The ΔAF value of each SNP in 267 BSA-Seq data analysis using the genome sequences of both the parents and the bulks. BSA-Seq data analysis using only the bulk genome sequences.

10_1101-2021_02_08_430280	given transcriptome provided as either a raw user-generated RNA-Seq dataset or NCBI SRR file identifier. SURFR identifies all ncRNA fragments (both annotated and novel) and their expressions in up to ten datasets per comprehensively compare all fragment expressions identified in up to 30 individual datasets by entering multiple SURFR session IDs window detailing each fragment identified in the individual, selected small RNA-Seq dataset. of the results page redirects the user to a SURFR window detailing the expressions of all full length sncRNAs in the provided datasets. Fragments" window (Figure 2D) for each fragment identified in the individual, selected small RNA-Seq dataset within its host gene along with the fragment's expression (RPM) in each individual small RNA-Seq dataset, and lncRNAs expressed in a given human transcriptome from either a user-provided RNA-Seq dataset or publicly More importantly, however, LAGOOn identified MALAT1 as the most highly expressed lncRNA in MDA-MB-231 breast cancer cells (Figure 9).

10_1101-2021_02_08_430343	tumor microenvironment, the method identified ligands, receptors and cells meeting certain criteria of 56 9,234 samples in The Cancer Genome Atlas (TCGA), starting from a network of 64 cell types and 1,894 62 Data sources including TCGA and cell-sorted gene expression, bulk tumor expression, cell type scores, 78 ligands and receptors for each of the 64 cell types in xCell, using the source gene expression data. With this procedure, a network scaffold is induced, where cells produce ligands that bind to receptors on 113 (PFI) and tumor stage for each sample, a matrix of patient-specific edge weights was constructed 206 number of high weight edges in each tumor type did not associate with the number of samples, as might 254 in the tumor stage contrast, a majority of ligand-producing cells include GMP cells, Osteoblasts, MSC 283 In the PFI results, Th1 cells appeared in 13 high scoring edges in SKCM, all with 394 

10_1101-2021_02_09_430036	A comparative study of genomic adaptations to low nitrogen availability in Genlisea aurea is a carnivorous plant that grows on nitrogen-poor waterlogged sandstone aurea's genome, CDS and non-coding DNA 2) Determination of transcriptomic nitrogen content and codon usage bias associated with higher nitrogen content tRNAs (among codons that are coding for the same amino a considerably lower number of nitrogen atoms in its genome than the two other plant species. has higher nitrogen counts per molecular unit in genomic DNA, CDS, Non-Coding DNA, protein, aurea has a higher nitrogen usage in its DNA, RNA and proteins Figure 2: Average number of nitrogen atoms per molecular unit in genomic DNA, CDS, Non-Coding DNA, aurea had lower nitrogen content in tRNA sequences but not in other Figure 3: Bar graph representing the codon usage bias and tRNA nitrogen content in G.

10_1101-2021_02_09_430363	Accommodating site variation in neuroimaging data using hierarchical and Bayesian models The potential of normative modeling to make individualized predictions has led to structural neuroimaging results that go beyond the case-control approach. in a similar way for multi-site modeling in a pooled neuroimaging data set, which contained 7499 participants that org/abide/) data set to compare a non-linear, Gaussian version of the model, to a linear hierarchical Bayesian version and mathematical description of our approach to include site as predictor in a normative hierarchical Bayesian model. With the aim to create reliable normative models in multi-site neuroimaging data, we developed and compared two model is also able to capture non-linear effects between age and thickness of the cortical region ("Hierarchical Bayesian Gaussian Process term, which allows to model non-linear association between age and cortical thickness measures. The only models that perform better for most regions than the mean of the training data set are the Hierarchical Bayesian

10_1101-2021_02_09_430405	In-silico Structural and Molecular Docking-Based Drug Discovery Against Viral Protein (VP35) of Marburg Virus: A potent Agent of MAVD including structure-based drug-like compounds screening from online databases, molecular The final small molecules of drug-like compounds would have more effective and selected for the molecular docking with FGI-103 antiviral drug-using AutoDock 4.2 software. After that, FGI-103 was set and screen other drug-like compounds from PubChem databases. The finally selected drug-like compounds were docked with the P1 site of VP35 of based on ap1 site for ligand in every dock for VP35 MARV utilizing a grid chart of 50 × 50 × 50 The ADMET properties of finally selected drug-like compounds were checked to utilize 2D molecules structure of selected drug-like compounds (A) represents the 2D The molecule structure of three drug-like compounds is shown in Figure 6. "In-Silico Structural and Molecular Docking-Based Drug Discovery 

10_1101-2021_02_09_430460	experimentally validated cancer mutation data in this study, we explored various string-based evolutionary features resulted in the development of a pan-cancer mutation effect prediction Distinguishing between driver and passenger mutations from sequenced cancer genomes is a Recent studies have identified specific signatures or patterns of mutations in different cancer than passenger mutations and built probabilistic models to identify driver genes that had this study, missense mutations from 58 genes that were pan-cancer-based were combined from We used the same datasets to judge our model's ability to predict rare driver mutations based Driver and Passenger Mutations' Features Used to Train NBDriver are Significantly Although our method's focus was to identify missense driver mutations from sequenced cancer surrounding driver and passenger mutations obtained from sequenced cancer genomes. computational prediction of driver missense mutations," Cancer Res., vol. functionally validated cancer-related missense mutations," Genome Biology, vol. Figure 7: Differences in the distribution of features between driver and passenger mutations 

10_1101-2021_02_09_430536	Genome-wide prediction and integrative functional characterization of Alzheimer's disease-associated genes example, a module-trait network approach was proposed and applied to identify gene functional enrichment-based approach to identify negative genes that are not likely associated genes through an optimal selection of networks and machine learning FGN, and prediction of AD-associated genes using machine learning models (Fig. 1). addition, we tested their enrichment in three AD-related gene sets associated with The top-ranked genes are enriched in AD-associated functions and phenotypes These results provide additional evidence that our predicted genes are associated with The top-ranked genes are associated with AD based on miRNA-target networks We investigated whether top-ranked genes were functionally related to AD-associated We tested whether the top-ranked k genes were more likely to interact with AD-associated related to AD-associated genes or miRNAs based on miRNA-target interaction networks.

10_1101-2021_02_09_430550	(scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Therefore, for scRNA-seq data analysis, informative gene selection Besides scRNA-seq data analysis, informative gene selection is also crucial for designing number and a scRNA-seq dataset, scPNMF selects informative genes based on its weight matrix; First, the informative genes selected by scPNMF lead to the most accurate cell clustering. the informative genes and weight matrix of scPNMF lead to the best cell type prediction accuracy Figure 3: Benchmarking scPNMF against 11 informative gene selection methods on seven scRNA-seq (b) UMAP visualization of cells in the Zheng4 dataset based on 100 informative genes selected by We benchmark scPNMF against the 11 gene selection methods in terms of cell type prediction We propose scPNMF, an unsupervised gene selection and data projection method for scRNA-seq For cell type prediction, we project every targeted gene profiling dataset and its scRNA-seq

10_1101-2021_02_10_430367	Running title: Chen M et al / Genome Assembly Data Repository Genomics Data Center (NGDC), part of the China National Center for Bioinformation archive high-quality genome sequences and annotations, GWH is equipped with a Collectively, GWH serves as an important resource for genome-scale data https://bigd.big.ac.cn/) [13], the aim of GWH is to accept data submissions worldwide GWH is a centralized resource housing genome-scale data, with the purpose to GWH not only accepts genome assembly associated data through an on-line GWH will assign a unique accession number to the submitted genome assembly upon GWH provides data visualization for both genome Collectively, GWH is a user-friendly portal for genome data submission, release, and Database resources of the National Genomics Data Genome assembly accession number is prefixed with "GWH", followed by four 

10_1101-2021_02_10_430512	into DDIs. In this study, a hierarchical machine learning model was created to predict DDI-associated ADRs and pharmacological insight thereof for any drug pair. drugs' chemical structures as inputs to predict their target, enzyme, and transporter (TET) Development of RFCs for Prediction of Target, Enzyme, and Transporter Profiles of Drugs Development of a Model for Prediction of DDI-associated ADRs from TET Profiles of Drugs ADR prediction from Target, Enzyme, and Transporter Profiles of Drug Pairs To predict ADRs of a drug pair from its TET profiles, Random Forest Classifier (RFC), Application of the SVM model for DDI-associated ADRs Involving Three Major Drugs through predicted PRR changes of drug pairs upon removal of each of the targets, enzymes, and changes of drug pairs were predicted by the model upon removal of each of the targets, enzymes, Target, enzyme, and transporter (TET) profiles of atorvastatin and concomitant drugs, 

10_1101-2021_02_10_430563	investigators across the SPARC consortium that provide key details about organ-specific circuitry, including structural (BIDS), the SDS has been designed to capture the large variety of data generated by SPARC investigators who are description of the SPARC curation process and the automated tools for complying with the SDS, including the SDS validator and Software to Organize Data Automatically (SODA) for SPARC. required to organize their data files and metadata organized according to the SPARC Data Structure data according to the SPARC Dataset Structure. is the preferred file format for tabular data in SPARC, the Data files are organized into 3 different top-level folders, The organization structure of the files and folders for a SPARC dataset. https://github.com/SciCrunch/sparc-curation/releases/tag/dataset-template-1.2.3 investigators include folders that organize data along a from these subjects, data files are organized within fields, the curation team developed a SPARC Dataset files/folders, and share datasets with the SPARC 

10_1101-2021_02_10_430604	Struo2: efficient metagenome profiling database construction for ever-expanding microbial genome datasets Mapping metagenome reads to reference databases is the standard approach for reference databases often lack recently generated genomic data such as method for constructing custom databases; however, the pipeline does not scale well with the not allow for efficient database updating as new data are generated. HUMAnN3 databases that can be easily updated with new genomes and/or individual gene Struo2 enables feasible database generation for continually increasing large-scale ● Pre-built databases: http://ftp.tue.mpg.de/ebio/projects/struo2/ ● Utility tools: https://github.com/nick-youngblut/gtdb_to_taxdump Metagenome profiling involves mapping reads to reference sequence databases and is computational resources, which led us to create Struo for straight-forward custom metagenome CPU hours per genome versus ~2.4 for Struo (Figure 1B). taxonomy (available at https://github.com/nick-youngblut/gtdb_to_taxdump ). (2020) Struo: a pipeline for building custom databases for 

10_1101-2021_02_10_430606	Each point is a decoupled motif generated by a sample set of sequence. Only the max activation value of the decoupled motifs in Fig. 3b are significantly higher than the decoupled motifs of other neurons in layer 3 of Basset-3 model. discovered (q-value < 0.001) from the neuron in convolutional output layer of Basset, BD-5 and BD-10 model. c, The number of motif discovered (q-value < 0.01) from the neuron in layer 3 of Basset model using different sub-patterns in the input feature map of the max pooling layer to split the sequences set of which are DNA-sequence based DCNN models with 3 general convolutional layers for stacking sequences of different synonymous motifs with the maximum activation value In summary, we presented NeuronMotif as an effective algorithm to reveal the cis-regulatory motif grammar learned by DCNN model that use DNA sequence to annotate sequences indicate more synonymous motif mixture in this DCNN model.

10_1101-2021_02_10_430619	Cutevariant: a GUI-based desktop application to explore genetics variations Cutevariant is a user-friendly GUI based desktop application for genomic research designed to search for variations in DNA samples collected in annotated files and encoded in the Variant Calling Format. application imports data into a local relational database wherefrom complex filter-queries can be built either Key words: genomics, DNA variant, desktop application, Domain Specific Language, Graphic User Interface applications import the data from VCF files into an indexed Cutevariant imports data from VCF files into a normalized Fig. 2: The Cutevariant main view showing the variants list sub-window (middle), different controllers sub-windows but not all are Just like Variant Tools, Cutevariant supports operations Features Cutevariant BrowseVCF VCF-Miner VCF-Explorer VCF-Server VCF-Filters GEMINI Variant Tools SnpSift Comparison of time performance between cutevariant and VCF-miner for importation and query execution. 3. Pablo Cingolani, Adrian Platts, Le Lily Wang, Melissa VCF-Miner: GUI-based application for mining variants

10_1101-2021_02_10_430623	published S3-type N-of-1-pathways MixEnrich to two paired samples (e.g., diseased vs unaffected tissues) for determining patient-specific enriched genes sets: Odds Ratios (S3-OR) and S3-variance using these models to derive effect sizes and statistical significance in single-subject studies of transcriptomes, these samples are isogenic or quasi-isogenic, and thus do not necessarily generalize to a group of subjects (cohort-level signal). The novel bioinformatic method identifies meaningful biomechanism differences between very small cohorts by using single-subject-study-derived effect sizes for gene sets. (B) For the generalized linear model-based analyses, we applied a different filtering process to the raw data where we eliminated all the transcripts with 0 counts for each subject and then calculated the coefficient 2.3 Description of the Generalized Linear Models and application of Inter-N-of-1 methods for small cohort comparison and their evaluation in the Breast Cancer Data the analysis of subsets of the TCGA Breast Cancer data, genes were declared differentially expressed if their abs(log2FC) > log2(1.2) and their FDR-adjusted p-value < 

10_1101-2021_02_10_430649	Bfimpute: A Bayesian factorization method to recover single-cell RNA sequencing data Recovering dropout events in a sparse gene expression matrix for scRNA-seq data is a long-standing matrix completion We introduce Bfimpute, a Bayesian factorization imputation algorithm that reconstructs two latent gene and cell matrices to impute final gene expression matrix within each cell group, with or without the aid of cell type labels or bulk Bfimpute achieves better accuracy than other six publicly notable scRNA-seq imputation methods on simulated Key words: single cell; RNA-seq; imputation; Bayesian factorization impute dropout events by adopting the bulk RNA-seq data imputation of single cell RNA-seq data could be applied by Bfimpute recovers dropout values and improves cell type identification in the simulated data. and the imputed data by Bfimpute, scImpute, and DrImpute for the human embryonic stem cell differentiation study. imputation method scimpute for single-cell rna-seq data.

10_1101-2021_02_10_430656	A like-for-like comparison of lightweight-mapping pipelines for single-cell RNA-seq data pre-processing benchmark comparing the kallisto-bustools pipeline (2) for single-cell demonstrate that, when configured to match the computational complexity of kallisto-bustools as closely as possible, alevin-fry processes Alevin-fry (3) is a new pipeline for single-cell RNA-seq benchmarking STARsolo (9), kallisto-bustools (2) and alevin-fry (3), out new tools like alevin-fry for the pre-processing of single-cell data, (1), we have now created a simple-to-follow tutorial for speed-optimized single-cell pre-processing using alevin-fry (https:// by Booeshaghi and Pachter (1) change when a like-for-like comparison between alevin-fry and kallisto-bustools is carried out, we The time and memory used by the relevant steps of the alevin-fry and kallisto-bustools pipelines for pre-processing the 20 diverse tagged-end single-cell RNA-seq datasets used in (1). A comparison of the resulting count matrices obtained from alevin-fry and kallisto-bustools, as run in this manuscript, for the pbmc_10k_v3 dataset. peak memory than alevin-fry, with the kallisto-bustools pipeline using

10_1101-2021_02_10_430705	VIA: Generalized and scalable trajectory inference in single-cell omics data strategy to compute pseudotime, and reconstruct cell lineages based on lazy-teleporting random walks Step 1: Single-cell level graph is clustered such that each node user defined start cell) is first computed by the expected hitting time for a lazy-teleporting random walk along an network topology and single-cell level pseudotime/lineage probability properties onto an embedding using GAMs, as The cell fates and their lineage pathways are then computed by a two-stage probabilistic method, graph-traversal allows it to infer cell fates when the underlying data spans combinations of multifurcating detected cell fates annotated (o) lineage pathway and gene-pseudotime trend shown for the CD41 Megakaryocytic Figure 3 VIA infers trajectories in single-cell multi-omic and image datasets (a) Major lineages of human Single cells are represented by graph nodes that are connected based on 

10_1101-2021_02_11_430695	Log-ratios are an important class of features for analyzing high-throughput sequencing (HTS) metagenomic data for HTS data, and more generally, high-dimensional CoDa. Unlike existing methods, CoDaCoRe is simultaneously scalable, interpretable, sparse, and accurate. unlabelled datasets, {xi}ni=1, as a method for identi- Learning Sparse Log-Ratios for High-Throughput Sequencing Data CoDaCoRe variable selection for the first (most explanatory) log-ratio on the Crohn disease data (Rivera-Pinto et al., 2018). more generally, in the field of CoDa.

10_1101-2021_02_11_430762	Ribovore: ribosomal RNA sequence analysis for GenBank submissions and database curation alignments of SSU, LSU and 5S rRNA from all three domains as well as from organelles, along with secondary structure predictions for selected sequences. Ribovore software package for the analysis of SSU rRNA and LSU rRNA sequences 18S SSU rRNA database of 1091 sequences was updated most recently on September 27, 2018 by running version 0.28 of the Ribovore program ribodbmaker on an input set of 579,279 GenBank sequences returned from the eukaryotic SSU rRNA The results of ribotyper and rRNA sensor are combined and each sequence is separated into one of four outcome classes depending on whether it passed or failed each input a set of candidate sequences and a specified rRNA model (e.g. SSU.Bacteria) two blastn databases: one of 1267 bacterial and archaeal 16S SSU rRNA sequences

10_1101-2021_02_11_430789	Accelerating COVID-19 research with graph mining and transformer-based learning develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. is currently customized and available in the open domain to massively process COVID-19 related queries. Both systems are the next generation of the AGATHA knowledge network mining transformer model [37]. (1) Most of the existing HG systems are domain-specific (e.g., gene-disease interactions) that is usually expressed in limiting the processed information (e.g., significant filtering vocabulary and papers a trained deep bi-LSTM model for extracting predicates from unstructured text. For instance, the node representing the entity "COVID-19" is connected to every sentence and predicate that The prior AGATHA semantic network only includes UMLS terms that appear in SemMedDB predicates [18] which is a major limitation. obtain embeddings per node in the semantic graph, we train AGATHA system ranking model.

10_1101-2021_02_11_430806	BIAPSS BioInformatic Analysis of liquid-liquid Phase-Separating protein Sequences web platform named BIAPSS (BioInformatic Analysis of liquid-liquid Phase-Separating protein Sequences) which offers the users interactive data analytic tools for facilitating the discovery of statistically significant sequence signals for proteins with Phase-Separating protein Sequences. The objective of BIAPSS is to enable a rapid and on-the-fly deep statistical analysis of LLPS-driver proteins using the pool of sequences with The comparison to benchmarks of various protein groups enables statistical inference of specific phase-separating affinities. Furthermore, the residue-resolution biophysical regularities inferred from BIAPSS will help not only to accurately identify regions prone to phase separation but also to design sequence modifications targeting various biomedical applications. for comprehensive sequence-based analysis of LLPS proteins. the driving forces for phase separation of prion-like RNA binding proteins. disordered protein regions encode a driving force for liquid-liquid phase separation? of proteins driving liquid-liquid phase separation.

10_1101-2021_02_11_430847	SearcHPV: a novel approach to identify and assemble human papillomavirus-host genomic integration events in cancer squamous cell carcinomas; however, the impact of HPV integration into the host human genome SearcHPV uncovered HPV integration sites adjacent to known cancer-related detection of HPV-human integration sites from targeted capture DNA sequencing data. developed a novel HPV integration detection tool for targeted capture sequencing data, which we SearcHPV showed a high frequency of HPV16 integration with a total of six events in UM-SCC In this study, SearcHPV also called HPV integration sites within TP63. HPV integration sites have been associated with structural variations in the human genome3, 8, 37, which supports an additional genetic mechanism as to why HPV integration sites Genome-wide analysis of HPV integration in human and their integration sites in host genomes through next generation sequencing data. identify viruses and their integration sites using next-generation sequencing of human cancer 

10_1101-2021_02_11_430871	ParticleChromo3D: A Particle Swarm Optimization Algorithm for Chromosome and Genome 3D Structure Prediction from Hi-C Data chromosome and genome structure reconstruction from Hi-C data using Particle Swarm Optimization approach chromosome bin, according to the particle swarm algorithm, and then iterates its position towards a global best This paper presents ParticleChromo3D, a new distance-based algorithm for chromosome 3D structure The structures generated by ParticleChromo3D also shows that the result at swarm size Structures generated by ParticleChromo3D at different swarm size values. obtained by comparing the ParticleChromo3D algorithm's output structure to the simulated dataset's true plot of ParticleChromo3D SCC performance on 500KB GM12878 cell Hi-C data for chromosome 1 to 23. chromosome 3D structure reconstruction algorithms on the GM12878 data set at both the 1MB and 500KB chromosome and genome structures reconstructed from Hi-C data.

10_1101-2021_02_12_430739	Mutations in bdcA and valS correlate with quinolone resistance in wastewater Escherichia Coli Here, we systematically screen for candidate quinolone resistance-conferring mutations. coli and performed a genome-wide association study (GWAS) correlating over 200,000 mutations against quinolone resistance phenotypes. significant mutations including one located at the active site of the biofilm dispersal genes bdcA and six silent In summary, we demonstrate that GWAS effectively and comprehensively identifies resistance mutations Keywords: E Coli; Quinolone; Antibiotic Resistance; Genome-Wide Association Study (GWAS) direct route to resistance is mutations in the drug targets gyrA and parC. In summary, we aim to show that a bacterial genome-wide association study can effectively and comprehensively identify targets relevant to antibiotic resistance. Based on representative resistance phenotypes, the authors selected 103 isolates for sequencing with Illumina MiSeq, 92 of which are available from coli bdcA may act indirectly on antibiotic resistance.

10_1101-2021_02_12_430764	Triku: a feature selection method based on nearest neighbors for single-cell data Triku is a feature selection method that favours genes defining the main Single-cell RNA sequencing (scRNA-seq) is a powerful technology to study the biological heterogeneity of tissues at the individual cell level, allowing the characterization of new cell populations and cell states–i.e. cell types responding to different scRNA-seq datasets are multidimensional, i.e. the expression profile per cell consists of multiple genes. feature selection method: 1) the ability to recover basic dataset structure (main cell low, meaning that features selected with the different methods yielded clustering solutions that were quite similar to the manually-labeled cell types, although there are We first studied the expression pattern of genes selected by triku and other methods, To evaluate the cluster expression of selected genes in benchmarking datasets, for proteins within the genes selected by different FS methods in the two sets of benchmarking datasets.

10_1101-2021_02_12_430830	Simultaneous estimation of per cell division mutation rate and turnover rate from bulk tumor sequence data widely available bulk sequencing data where mutations from individual cells are and genomic mutation rate from bulk sequencing data. based on the maximum likelihood estimation of the parameters of a generative model of tumor growth and mutations. human hepatocellular carcinoma sample reveals an elevated per cell division mutation rate and high cell turnover. Due to the limitations of bulk sequencing, which only assays mutation frequencies for a population of cells from each tumor sample and does not The estimation is based on a maximum likelihood fit of the parameters of a birth-death model to the measured mutant and be estimated from readcount data, to separate the effects of the mutation rate We use pre-generated division trees from the ELynx suite at predetermined turnover rate values. Using the turnover rate, we also estimated the number of cell

10_1101-2021_02_12_430923	Kincore: a web resource for structural classification of protein kinases and their inhibitors result, among the DFGin structures, we distinguished between the catalytically active kinase conformation pages for kinase phylogenetic groups, genes, conformational labels, PDBids, ligands and ligand types. options to download data – database tables as a tab separated files; the kinase structures as PyMOL Kincore provides conformational assignments and ligand type labels to protein kinase structures from Figure 1: Representative protein kinase structure (3ETA_A) displaying the residues used to define inhibitor The distribution of different ligand types across kinase conformations is provided in Table 1. Table 1: Distribution of ligand types across protein kinase conformations (Number of chains). including conformational and ligand type labels and C-helix position, kinase family, gene name, Uniprot provides the number of kinase chains in the group across different conformations with their Database table provides the list of all the PDB chains with conformational labels and ligand 

10_1101-2021_02_12_430963	adenylation site databases to enable differential 3' UTR usage analysis. Conclusions: diffUTR enables differential 3' UTR analysis and more generally facilitates DEU Popular bin-based DEU methods are provided by the limma [25,24], edgeR [23] and DEXSeq [22] Bins are prepared from various types of gene annotations as well as, optionally, additional APA-driven segmentation and extension, then read counts among statistically-significant genes, especially for bins with a higher expression (Figure 3A). diffUTR provides three main plot types to explore differential bin usage analyses, each with a Plotted are the UTR bins found statistically significant (bin- and gene-level FDR deuBinPlot (Figure 4B) provides bin-level statistic plots for a given gene, similar to those than CDS bins, including counts of 3' UTR when calculating overall gene expression could under- diffUTR streamlines DEU analysis and outperforms alternative methods in inferring UTR changes, For differential UTR analysis, gene-level results are ob-

10_1101-2021_02_12_430979	StrainFLAIR: Strain-level profiling of metagenomic samples using variation graphs results show that StrainFLAIR was able to distinguish and estimate the abundances of close strains, as approaches to handle multiple similar genomes as with strains use gene clustering and then select the StrainFLAIR assigns and estimates species and strain abundances of a bacterial metagenomic sample graph, called the "node abundance", is computed, first focusing on unique mapped reads (first step). Strain-level abundances are then obtained by exploiting the specific genes of each reference genome from the reference variation graph thus simulating a new strain to be identified and quantified. strains from a sequenced sample, mapped onto this graph. Reference strains relative abundances expected and computed by StrainFLAIR or

10_1101-2021_02_12_430989	Benchmarking Association Analyses of Continuous Exposures with RNA-seq in Observational Studies ... as well as linear regression-based analyses for studying the association of continuous exposures ... generation of empirical null distribution of association p-values, and we apply the pipeline to ... Many studies of phenotypes associated with gene expression from RNA-seq consist of small ... Residual permutation approach for simulations and for empirical p-value computation ... covariates, and outcome distributions; and (b) their relationships, aside from the exposure-outcome association, are the same as in the real data, we used a residual permutation approach. ... association studies applied to residual permutations were included to compute empirical p- ... approach to study the distribution of p-values under the null of no association between the phenotypes and RNA-seq, and used this approach to further study power, and to compute ... approaches for transcriptome-wide analysis of RNA-seq in population-based studies, including ... more comprehensive study of statistical permutation approaches for RNA-seq association ...

10_1101-2021_02_12_431018	HaVoC, a bioinformatic pipeline for reference-based consensus assembly and lineage assignment for SARS-CoV-2 sequences. Several new variants of SARS-CoV-2 have emerged globally, of which the ... based assemblies on raw SARS-CoV-2 sequences in addition to identifying lineages to detect ... variants of concern, we have developed an open source bioinformatic pipeline called HaVoC ... monitor the spread of SARS-CoV-2 variants of concern during local outbreaks. ... currently being used in Finland for monitoring the spread of SARS-CoV-2 variants. SARS-CoV-2, variant detection, reference assembly, lineage identification, coronavirus, ... surveillance of virus variants by sequencing the SARS-CoV-2 genomes would provide a fast ... to query SARS-CoV-2 fastq sequence libraries and assigns lineages to them individually in ... processing and a reference genome of SARS-CoV-2 in a separate FASTA file. The likelihood of emergence of novel SARS-CoV-2 variants of concern is increased and ... Emerging SARS-CoV-2 Variants.

10_1101-2021_02_13_429885	know tumour purity and the ploidy of a CNA segment, then the VAF mutations mapped ... A fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing.

10_1101-698605	Comparative evaluation of full-length isoform quantification from RNA-Seq ... Full-length isoform quantification from RNA-Seq is a key goal in transcriptomics analyses ... benchmarking, isoform quantification, simulated data, pseudo-alignment, RNA-Seq, short ... Given the difficulty in full-length isoform quantification, many RNA-Seq studies simply ... analysis performed on the known true isoform quantifications of the simulated data to the ... For the simulated data we started with 11 real RNA-Seq samples: six liver and six ... the isoform expression level using idealized and realistic simulated data, with full and ... true counts), for the set of expressed isoforms in sample 1 in C) idealized and D) realistic data. Method effect on differential expression analysis, using realistic data. RSEM is a gene/isoform abundance tool for RNA-Seq data which uses a generative model ... S1 Fig. Method effect on full-length isoform quantification using simulated data.

10_1101-727867	scAEspy: a tool for autoencoder-based analysis of single-cell RNA sequencing data ... This computational tool allows for coupling low-dimensional probabilistic representation of gene expression data with the downstream analysis to consider the ... Finally, the currently available AEs cannot be directly exploited to obtain the latent space or to generate synthetic cells. ... to show the cells in this embedded space or as a starting point for other dimensionality reduction approaches (e.g., t-SNE and UMAP) as well as downstream analyses ... Non-linear approaches for dimensionality reduction can be effectively used to capture the non-linearities among the gene interactions that may exist in the high-dimensional expression space of scRNA-Seq data [16]. ... be effectively applied to analyse disparate types of single-cell data from different ... flexible method developed to cluster single-cell data; (ii) a centroid is calculated ... batch-effect correction methods for single-cell RNA sequencing data. Wang, D., Gu, J.: VASC: dimension reduction and visualization of single-cell RNA-seq data by deep ...