DeepHBV: A deep learning model to predict hepatitis B virus (HBV) integration sites

Canbiao Wu1¶, Xiaofang Guo2¶, Mengyuan Li3¶, Xiayu Fu4, Zeliang Hou1, Manman Zhai1,5, Jingxian Shen1, Xiaofan Qiu1, Zifeng Cui3, Hongxian Xie6, Pengmin Qin5, Xuchu Weng1, Zheng Hu3,7*, Jiuxing Liang1*

1 Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, China; Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, China
2 Department of Medical Oncology of the Eastern Hospital, the First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong, China
3 Department of Gynecological Oncology, the First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong, China
4 Department of Thoracic Surgery, the First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong, China
5 School of Psychology, South China Normal University, Guangzhou, Guangdong, China
6 Generulor Company Bio-X Lab, Guangzhou, Guangdong, China
7 Department of Obstetrics and Gynecology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China

*Corresponding authors. Email: huzheng1998@163.com (ZH), liangjiuxing@m.scnu.edu.cn (JL)
¶These authors contributed equally to this work.

Abstract

Hepatitis B virus (HBV) is one of the main causes of viral hepatitis and liver cancer. Previous studies showed that HBV can integrate into the host genome and promote malignant transformation. In this study, we developed DeepHBV, an attention-based deep learning model that predicts HBV integration sites by automatically learning local genomic features. We trained and tested DeepHBV on HBV integration site data from the dsVIS database. With HBV integration sequences alone, DeepHBV showed an AUROC of 0.6363 and an AUPR of 0.5471. Adding repeat peaks or TCGA Pan Cancer peaks significantly improved model performance, yielding AUROCs of 0.8378 and 0.9430 and AUPRs of 0.7535 and 0.9310, respectively. On an independent validation dataset of HBV integration sites from VISDB, DeepHBV with HBV integration sequences plus TCGA Pan Cancer peaks (AUROC 0.7603, AUPR 0.6189) performed better than DeepHBV with HBV integration sequences plus repeat peaks (AUROC 0.6657, AUPR 0.5737). We further found that transcription factor binding sites (TFBS) were significantly enriched near the genomic positions attended to by the convolutional neural network. The binding sites of AR-halfsite, Arnt, Atf1, bHLHE40, bHLHE41, BMAL1, CLOCK, c-Myc, COUP-TFII, E2A, EBF1, Erra and Foxo3 were highlighted by the DeepHBV attention mechanism in both the dsVIS and VISDB datasets, revealing an HBV integration preference. In summary, DeepHBV is a robust and explainable deep learning model, useful both for predicting HBV integration sites and for further mechanistic study of HBV-induced cancer.
Author summary

Hepatitis B virus (HBV) is one of the main causes of viral hepatitis and liver cancer. Previous studies showed that HBV can integrate into the host genome and promote malignant transformation. In this study, we developed DeepHBV, an attention-based deep learning model that predicts HBV integration sites by automatically learning local genomic features. The performance of DeepHBV improved significantly after adding genomic features, reaching an AUROC of 0.9430 and an AUPR of 0.9310. Furthermore, transcription factor binding sites were enriched at the positions attended to by the convolutional neural network. In summary, DeepHBV is a robust and explainable deep learning model, useful both for predicting HBV integration sites and for further study of the HBV integration mechanism.

Introduction

HBV is the main cause of viral hepatitis and liver cancer (hepatocellular carcinoma, HCC) [1]. It is a small DNA virus that can integrate into the host genome via an RNA intermediate [1]. HBV first attaches to and enters hepatocytes, then transports its nucleocapsid, which contains a relaxed circular DNA (rcDNA), to the host nucleus. In the nucleus, rcDNA is converted into covalently closed circular DNA (cccDNA), which is transcribed into messenger RNAs (mRNA) and pregenomic RNA (pgRNA). Through reverse transcription, pgRNA produces new rcDNA and double-stranded linear DNA (dslDNA), which tends to integrate into the host cell genome [2]. A previous study showed that HBV integration breakpoints are distributed randomly across the whole genome, with a handful of hotspots [3]. For instance, HBV was reported to recurrently integrate into the telomerase reverse transcriptase (TERT) and myeloid/lymphoid or mixed-lineage leukemia 4 (MLL4, also known as KMT2B) genes, and these insertional events were accompanied by altered expression of the integrated genes [2,3,5], indicating important biological impacts on the local genome. Further analysis revealed an association between HBV integration and genomic instability at these insertion events [4]. Moreover, HBV integration was significantly enriched near the following genomic features in tumours compared with non-tumour tissue: repetitive regions, fragile sites, CpG islands and telomeres [2]. However, the pattern and mechanism of HBV integration remain to be explored.
Many HBV integration sites are distributed throughout the human genome and appear completely random [4,6,7]. Whether the features and patterns of these “random” viral integration events can be learned and extracted has remained an open question; answering it would greatly improve our understanding of HBV integration-induced carcinogenesis.

Deep learning has shown excellent performance in computational biology, for example in medical image analysis [8] and in discovering motifs in protein sequences [9]. The convolutional neural network (CNN) is a core component of deep learning, enabling a computer to learn features directly from training data [10]. Although deep learning performs well in a variety of fields, how it reaches a decision is hard to explain because of its black-box nature. The attention mechanism, which highlights the parts of the input that contribute most to the output, was therefore introduced to open this “black box” [11,12]. In this study, we developed DeepHBV, an attention-based deep learning model to predict HBV integration sites. The attention mechanism calculates an attention weight for each position while connecting the encoder and the decoder. It highlights the regions that DeepHBV concentrates on and helps reveal the patterns the model attends to. DeepHBV predicts HBV integration sites accurately and specifically, and its attention mechanism identifies positions with potentially important biological meaning.

Results

DeepHBV effectively predicts HBV integration sites after adding genomic features

The DeepHBV model structure and the scheme for encoding a 2 kb sample into a binary matrix are described in Fig 1. DeepHBV was tested with our HBV integration site database dsVIS (http://dsvis.wuhansoftware.com). HBV integration sequences were prepared from HBV integration sites as positive and negative samples following the steps in Methods. The number of negative samples was set to twice the number of positive samples to keep the data reasonably balanced and to improve the confidence level. The positive samples were divided into 2902 training and 1264 testing samples; correspondingly, we extracted 5804 negative training samples and 2528 negative testing samples. DeepHINT, an existing deep learning model that predicts HIV integration sites from their surrounding sequences [15], was also evaluated using the same HBV integration sequences for training and testing. Both models were trained on the same HBV integration training dataset and evaluated on the same testing dataset. DeepHBV with HBV integration sequences showed an AUROC of 0.6363 and an AUPR of 0.5471, while DeepHINT with HBV integration sequences showed an AUROC of 0.6199 and an AUPR of 0.5152 (Fig 2). The comparison of DeepHBV and DeepHINT is described in the Discussion.
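For illustration, a minimal Python sketch of how such AUROC and AUPR values can be computed from predicted scores with scikit-learn is given below; the variable names and placeholder arrays are illustrative assumptions rather than the exact evaluation script used here.

# Minimal sketch: computing AUROC and AUPR from model scores with scikit-learn.
# y_true and y_score are placeholders; in practice y_score would be the sigmoid
# output of the model for each 2 kb test sequence.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([1, 1, 0, 0, 1, 0])                      # 1 = integration, 0 = background
y_score = np.array([0.91, 0.40, 0.22, 0.05, 0.77, 0.60])   # predicted probabilities

auroc = roc_auc_score(y_true, y_score)            # area under the ROC curve
aupr = average_precision_score(y_true, y_score)   # area under the precision-recall curve
print(f"AUROC = {auroc:.4f}, AUPR = {aupr:.4f}")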
Several previous studies showed that HBV integration has a preference for surrounding genomic features such as repeats, histone marks and CpG islands [2,4]. We therefore added these genomic features to DeepHBV by mixing genomic feature samples with HBV integration sequences into new datasets, and then trained and tested the updated DeepHBV models. We downloaded the following genomic features from different sources [16-18] and organised them into four subgroups: (1) DNase Clusters, Fragile site, RepeatMasker; (2) CpG islands, GeneHancer; (3) Cons 20 Mammals, TCGA Pan Cancer; (4) H3K4Me3 ChIP-seq, H3K27ac ChIP-seq (S2 Fig). After obtaining the genomic feature positions (sources are listed in S2 Table), we extended each position to 2000 bp and extracted the corresponding sequences from the hg38 reference genome. We defined these sequences as positive genomic feature samples. We then mixed HBV integration sequences, positive genomic feature samples and randomly picked negative genomic feature samples (see Methods) and trained the DeepHBV model. Whenever a subgroup performed well, we re-tested each genomic feature in that subgroup to determine which specific feature significantly affected model performance (S2 Fig; AUROC and AUPR values are recorded in S3 Table). From the ROC and PR curves, DeepHBV with HBV integration sequences plus repeat peaks (AUROC: 0.8378, AUPR: 0.7535) or plus TCGA Pan Cancer peaks (AUROC: 0.9430, AUPR: 0.9310) significantly improved HBV integration site prediction compared with DeepHBV trained on HBV integration sequences alone (Fig 2). We performed the same test on DeepHINT but did not find any subgroup that substantially improved its performance (S3 Table). Together, adding repeat or TCGA Pan Cancer peaks to the HBV integration sequences significantly improves DeepHBV performance.
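As an illustration of the sequence-extraction step described above, the following Python sketch centres a 2000 bp window on each genomic feature position and pulls the corresponding hg38 sequence with pyfaidx; the file name, the placeholder peak coordinates and the centring on the peak midpoint are illustrative assumptions.

# Sketch: extend a genomic feature position to a 2 kb window and extract its hg38 sequence.
from pyfaidx import Fasta

genome = Fasta("hg38.fa")  # local copy of the hg38 reference (path is an assumption)

def extract_2kb(chrom, start, end, width=2000):
    """Centre a fixed-width window on the feature and return its sequence."""
    mid = (start + end) // 2
    win_start = max(0, mid - width // 2)
    return str(genome[chrom][win_start:win_start + width])

# Example with placeholder peak coordinates
seq = extract_2kb("chr5", 1_294_500, 1_295_100)
print(len(seq), seq[:40])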
Validation of DeepHBV using the independent dataset VISDB

To confirm that DeepHBV generalises to other datasets, we tested the pre-trained DeepHBV models (DeepHBV with HBV integration sequences + repeat peaks and DeepHBV with HBV integration sequences + TCGA Pan Cancer peaks) on the HBV integration sites from another viral integration site (VIS) database, VISDB [19]. The model trained with HBV integration sequences + repeat peaks showed an AUROC of 0.6657 and an AUPR of 0.5737, while the model trained with HBV integration sequences + TCGA Pan Cancer peaks showed an AUROC of 0.7603 and an AUPR of 0.6189.

The DeepHBV model with HBV integration sequences + TCGA Pan Cancer peaks thus performed better than the model with HBV integration sequences + repeat peaks and was more robust on both the dsVIS testing dataset (AUROC: 0.9430, AUPR: 0.9310) and the independent VISDB testing dataset (AUROC: 0.7603, AUPR: 0.6189). We therefore used this model for the subsequent HBV integration site analyses.

Studying the preference pattern of HBV integration through conserved sequence elements

The pooling operation allows DeepHBV to extract features with translation invariance, so the model can recognise a pattern even when the feature is slightly shifted. Incorporating the attention mechanism into the DeepHBV framework partly opens the deep learning black box by assigning an attention weight to each position. Each attention weight represents the computational importance of that position in the DeepHBV decision. The attention weights were extracted from the attention layer after two de-convolution and one de-pooling operations, giving an output of shape 667×1, so each score represents the attention weight of a 3 bp region. Positions with higher attention weights are likely to have a stronger influence on the pattern recognition of DeepHBV and may therefore be critical points for identifying HBV integration positive samples. We first averaged the attention score fractions over all HBV integration sequences and normalised them to the mean of all positions. Visualising these fractions revealed peak-valley-peak patterns only in the positive samples (Fig 3). We were particularly interested in the positions with the highest attention weights. In the attention weight distribution of DeepHBV with HBV integration sequences + TCGA Pan Cancer, a cluster of attention weights much higher than the others often occurred in positive samples, whereas DeepHBV with HBV integration sequences + repeat did not show this pattern (Fig 3). To further explore the pattern behind these high-attention positions, we defined the sites with the top 5% of attention weights as attention intensive sites and the 10 bp regions around them as attention intensive regions. We mapped these attention intensive sites onto the hg38 reference genome together with genomic features (Fig 4), but the positional relationship between attention intensive sites and genomic features was not clear. This suggested that some other specific pattern closely related to HBV integration preference is recognised by the DeepHBV model.
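A small numpy sketch of how attention intensive sites and regions could be derived from the 667×1 attention vector of one 2 kb sample is given below; the variable names, the index-to-coordinate mapping (each index covering roughly 3 bp) and the quantile convention are illustrative assumptions.

# Sketch: select the top 5% attention weights and build +/-10 bp attention intensive regions.
import numpy as np

def attention_intensive_regions(att, window_start, top_frac=0.05, flank=10):
    """att: attention weights of shape (667,); window_start: hg38 start of the 2 kb window."""
    threshold = np.quantile(att, 1.0 - top_frac)      # cutoff for the top 5% of weights
    intensive_idx = np.where(att >= threshold)[0]     # attention intensive sites (indices)
    regions = []
    for i in intensive_idx:
        centre = window_start + i * 3                 # each index covers ~3 bp after pooling
        regions.append((centre - flank, centre + 3 + flank))
    return intensive_idx, regions

rng = np.random.default_rng(0)                        # random weights stand in for model output
sites, regions = attention_intensive_regions(rng.random(667), window_start=1_294_063)
print(len(sites), regions[:3])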
In deep learning, the convolution and pooling module learns patterns with translation invariance, so the network tends to learn motifs that recur across different samples within the same pooling matrix even when the learned feature is not at the same position in those samples [20,21]. Attention intensive regions are therefore more likely to correspond to conserved elements and may give hints about the selection preference of HBV integration sites. Transcription factor binding site (TFBS) motifs are conserved genomic elements that can be critical for the regulation of downstream genes, so we tested whether TFBS play an important role in HBV integration preference. Using all HBV integration samples with prediction scores above 0.95 from dsVIS and from VISDB separately, we enriched local TFBS motifs in the attention intensive regions with HOMER v4.11.1 [22] and its vertebrate transcription factor databases (Table 1). For DeepHBV with HBV integration sequences + TCGA Pan Cancer, binding sites of AR-halfsite, Arnt, Atf1, bHLHE40, bHLHE41, BMAL1, CLOCK, c-Myc, COUP-TFII, E2A, EBF1, Erra, Foxo3, HEB, HIC1, HIF-1b, LRF, Meis1, MITF, MNT, MyoG, n-Myc, NPAS2, NPAS, Nr5a2, Ptf1a, Snail1, Tbx5, Tbx6, TCF7, TEAD1, TEAD3, TEAD4, TEAD, Tgif1, Tgif2, THRb, USF1, Usf2, Zac1, ZEB1, ZFX, ZNF692 and ZNF711 were enriched in the attention intensive regions of both the dsVIS and the VISDB sequences. We selected two representative samples for a more intuitive display: genomic features, HBV integration sites from dsVIS and VISDB, attention intensive sites and TFBS were aligned on the hg38 reference genome (Fig 4). Most attention intensive sites mapped to enriched TF motifs, and the clusters of high attention weights from DeepHBV with HBV integration sequences + TCGA Pan Cancer overlapped the binding sites of the tumour suppressor gene HIC1 and the circadian clock-related factors BMAL1, CLOCK, c-Myc and NPAS2 (Fig 4). These data provide novel insights into HBV integration site selection preference and reveal biological signals that warrant future experimental confirmation.
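For reference, a minimal sketch of a HOMER motif enrichment run on attention intensive regions is shown below; the file names, the placeholder coordinates and the "-size given" option are illustrative assumptions, as the exact HOMER parameters are not specified beyond version 4.11.1 and the vertebrate motif databases.

# Sketch: write attention intensive regions to BED and run HOMER's findMotifsGenome.pl.
import subprocess

regions = [("chr5", 1294950, 1294973), ("chr5", 1295310, 1295333)]  # placeholder regions

with open("attention_intensive_regions.bed", "w") as bed:
    for i, (chrom, start, end) in enumerate(regions):
        bed.write(f"{chrom}\t{start}\t{end}\tregion{i}\t0\t+\n")

# Basic HOMER usage: findMotifsGenome.pl <peak/BED file> <genome> <output dir> [options]
subprocess.run(
    ["findMotifsGenome.pl", "attention_intensive_regions.bed", "hg38",
     "homer_output", "-size", "given"],
    check=True,
)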
Table 1. Enriched TFBS from attention intensive regions of DeepHBV with HBV integration sites + TCGA Pan Cancer peaks.

HOMER known motif enrichment results (Rank, Name, P-value):
1  BMAL1  1E-323
2  NPAS  1.00E-259
3  CLOCK  1.00E-165
4  c-Myc  1.00E-126
5  ZFX  1.00E-108
6  Tgif2  1.00E-75
7  MNT  1.00E-71
8  LRF  1.00E-62
9  Tbx5  1.00E-62
10  ZNF711  1.00E-57
11  n-Myc  1.00E-54
12  ZNF416  1.00E-52
13  USF1  1.00E-47
14  bHLHE40  1.00E-45
15  Rbpj1  1.00E-36
16  Zac1  1.00E-35
17  Tgif1  1.00E-32
18  ZEB1  1.00E-30
19  THRb  1.00E-29
20  Ptf1a  1.00E-29
21  bHLHE41  1.00E-29
22  TEAD1  1.00E-27
23  Stat3  1.00E-24
24  Meis1  1.00E-21
25  c-Myc  1.00E-21
26  Usf2  1.00E-20
27  NPAS2  1.00E-17
28  HIC1  1.00E-17
29  TEAD  1.00E-17
30  TEAD4  1.00E-16
31  AR-halfsite  1.00E-16
32  STAT6  1.00E-15
33  TCF4  1.00E-13
34  MITF  1.00E-13
35  TEAD3  1.00E-13
36  Atf1  1.00E-12
37  HIF-1b  1.00E-11
38  Foxo3  1.00E-10
39  E2A  1.00E-09
40  TEAD2  1.00E-09
41  Mef2a  1.00E-08
42  ZNF692  1.00E-07
43  Nkx3.1  1.00E-07
44  COUP-TFII  1.00E-07
45  MyoG  1.00E-07
46  Nkx2.5  1.00E-06
47  Snail1  1.00E-05
48  HEB  1.00E-05
49  Tbx6  1.00E-05
50  SCRT1  1.00E-04
51  Nr5a2  1.00E-04
52  Nanog  1.00E-03
53  Oct11  1.00E-03
54  Elk1  1.00E-03
55  Erra  1.00E-03
56  Gata6  1.00E-03
57  BHLHA15  1.00E-03
58  AMYB  1.00E-03
59  Nr5a2  1.00E-03
60  NFkB-p65-Rel  1.00E-02
61  Zic  1.00E-02
62  TRPS1  1.00E-02
63  Hoxa9  1.00E-02
64  HIF2a  1.00E-02
65  Isl1  1.00E-02
66  CEBP:AP1  1.00E-02
67  EWS:FLI1-fusion  1.00E-02
68  FOXK1  1.00E-02
69  ETS  1.00E-02

HOMER de novo motif enrichment results (Rank, Best Match/Details, P-value):
1  TEAD3  1E-2283
2  EBF1  1E-1926
3  TCF7  1E-958
4  GRHL2  1E-504
5  Dux  1E-477
6  Ptf1a  1E-465
7  TEAD  1E-385
8  Ahr::Arnt  1.00E-302
9  Sox5  1.00E-245
10  TEAD  1.00E-233
11  Zic2  1.00E-204
12  Nr2e3  1.00E-197
13  SOX18  1.00E-182
14  ZBTB14  1.00E-174
15  USF2  1.00E-153
16  Isl1  1.00E-142
17  ZNF264  1.00E-142
18  Ascl2  1.00E-133
19  ZNF460  1.00E-120
20  LRF  1.00E-117
21  ZNF416  1.00E-117
22  PKNOX1  1.00E-103
23  Bcl6b  1.00E-91
24  Arnt  1.00E-90
25  Osr2  1.00E-88
26  TFAP2A  1.00E-79

Discussion

In this study, we developed DeepHBV, an explainable attention-based deep learning model to predict HBV integration sites. In the comparison between DeepHBV and DeepHINT on predicting HBV integration sites (S3 Table), DeepHBV outperformed DeepHINT after adding genomic features, owing to a model structure and parameters better suited to recognising the surroundings of HBV integration sites. DeepHBV uses two convolution layers (first layer: 128 convolution kernels of size 8; second layer: 256 convolution kernels of size 6) and one pooling layer (pooling size 3), whereas DeepHINT has only one convolution layer (64 convolution kernels of size 6) and one pooling layer (pooling size 3).
Adding convolution layers allows higher-dimensional information to be extracted, and adding convolution kernels allows more feature information to be extracted [23].

We trained the DeepHBV model with three strategies: (1) DNA sequences near HBV integration sites (HBV integration sequences); (2) HBV integration sequences + TCGA Pan Cancer peaks; (3) HBV integration sequences + repeat peaks. Adding either TCGA Pan Cancer or repeat peaks to the HBV integration sequences significantly improved model performance, and DeepHBV with HBV integration sequences + TCGA Pan Cancer peaks performed better on the independent test dataset VISDB. However, the attention intensive regions could not be well aligned to these genomic features, so we inferred that other features, such as TFBS motifs, may influence DeepHBV predictions. HOMER was therefore applied to identify TFBS that might be related to HBV-related diseases or cancer development. The attention intensive regions identified by the attention mechanism of DeepHBV with HBV integration sequences + TCGA Pan Cancer were strongly concentrated on the binding sites of the tumour suppressor gene HIC1, the circadian clock-related factors BMAL1, CLOCK, c-Myc and NPAS2, and the transcription factors TEAD and Nr5a2. These DNA-binding proteins are closely related to tumour development [24-30]. For instance, HIC1 is a tumour suppressor gene in hepatocarcinogenesis [24,25]. BMAL1, CLOCK, c-Myc and NPAS2 all participate in circadian clock regulation [26], which has been reported to promote HBV-related diseases [27,28]. Consistently, binding motifs of circadian clock-related factors were also enriched in the attention intensive regions of DeepHBV with HBV integration sequences + repeat peaks, further supporting these results (S4 Table). Regarding the other transcription factors identified by DeepHBV, TEAD deregulation affects well-established cancer genes such as BRAF, KRAS, MYC, NF2 and LKB1 and correlates strongly with clinicopathological parameters in human malignancies [29], while Nr5a2 (also known as liver receptor homolog-1, LRH-1) binds enhancer II (ENII) of the HBV genome and serves as a critical regulator of HBV gene expression [30]. In summary, DeepHBV is a robust convolutional neural network model for predicting HBV integration sites. Our data provide new insight into HBV integration preference and into the mechanisms of HBV-induced cancer.

Methods

Data preparation

A detailed step-by-step description of DeepHBV is provided in S1 and S2 Notes.
To obtain positive training and testing samples for DeepHBV, we extracted 1000 bp of DNA sequence upstream and 1000 bp downstream of each HBV integration site as the positive dataset. Each sample was denoted as $S = (n_1, n_2, \ldots, n_{2000})$, where $n_i$ is the nucleotide at position $i$. As a deep learning network, DeepHBV also requires negative samples that do not contain HBV integration sites as background. The existence of HBV integration hotspots containing several integration events within a 30-100 kb range [13] prompted us to select background regions at a sufficient distance from known HBV integration sites. We therefore discarded 50 kb regions around known HBV integration sites on the hg38 reference genome and randomly selected 2 kb DNA sequences from the remaining regions as negative samples. We encoded the extracted DNA sequences with one-hot encoding so that feature distances and similarities can be computed more accurately during training: each DNA sequence was converted into a binary matrix with a 4-bit dimension per position, one bit per nucleotide type. Thus, a 2000 bp DNA sequence becomes a 2000×4 binary matrix.

Feature extraction

DeepHBV first applies a convolution and pooling module to learn sequence features around HBV integration sites (S1 Fig). Each binary matrix representing a DNA sequence enters the convolution and pooling module, where multiple convolution kernels are applied to extract different features. With $S = (n_1, n_2, \ldots, n_{2000})$ denoting a DNA sequence and $E$ the binary matrix encoded from $S$, the convolution in the convolution layer, $X = \mathrm{conv}(E)$, can be written as:

$$X_{k,i} = \sum_{j=0}^{p-1} \sum_{l=1}^{L} W_{k,j,l}\, E_{l,i+j} \qquad (1)$$

where $1 \le k \le d$, with $d$ the number of kernels; $1 \le i \le n - p + 1$, with $i$ the position index; $p$ is the kernel size; $n$ is the input sequence length; $L$ is the number of channels (4 nucleotide types); and $W$ is the kernel weight tensor. After extracting the feature vectors, the convolution layer applies the Rectified Linear Unit (ReLU) activation, $f(x) = \max(0, x)$, to its output matrix, mapping each element onto a sparse matrix. ReLU imitates real neuron activation and helps the data fit the model better. We then applied a max-pooling strategy to reduce dimensionality while retaining the most predictive information. In this way, we obtain the final feature vector $F_c$ from the binary matrix representing the DNA sequence after feature extraction in the convolution and pooling module.
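A minimal Keras sketch of this convolution-and-pooling module is given below, assuming the layer sizes quoted in the Discussion (128 kernels of width 8, 256 kernels of width 6, max-pooling of size 3); the use of "same" padding, the layer names and the tf.keras API are illustrative assumptions, chosen so that the pooled length matches the 667 positions mentioned in the Results.

# Sketch of the DeepHBV convolution-and-pooling feature extraction module (not the exact code).
from tensorflow.keras import layers, models

def build_conv_module(seq_len=2000, channels=4):
    inputs = layers.Input(shape=(seq_len, channels), name="one_hot_sequence")
    x = layers.Conv1D(128, kernel_size=8, padding="same", activation="relu", name="conv1")(inputs)
    x = layers.Conv1D(256, kernel_size=6, padding="same", activation="relu", name="conv2")(x)
    x = layers.MaxPooling1D(pool_size=3, padding="same", name="maxpool")(x)  # feature matrix F_c
    return models.Model(inputs, x, name="conv_pooling_module")

print(build_conv_module().output_shape)  # (None, 667, 256): one feature vector per ~3 bp region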
Attention mechanism in the DeepHBV model

DeepHBV adds an attention mechanism to capture the contribution of each position in the extracted feature vector $F_c$. The feature vector enters the attention layer, which assigns a weight to each dimension of $F_c$. The attention weight represents the contribution of that position in the convolutional neural network (CNN). The attention output $t_j$ is a contribution score: a larger $t_j$ means that position contributes more to the HBV integration site prediction. All contribution scores are normalised to produce the dense feature vector $F_a$:

$$F_a = \sum_{j=1}^{q} a_j v_j \qquad (2)$$

with

$$a_j = \frac{\exp(t_j)}{\sum_{i=1}^{q} \exp(t_i)} \qquad (3)$$

where $a_j$ is the normalised score and $v_j$ is the feature vector at position $j$ of the input feature matrix; each position corresponds to a feature vector extracted by the convolution kernels. The convolution-pooling module and the attention module are combined during prediction; in other words, the feature vector $F_c$ and the feature importance scores $F_a$ work together to predict HBV integration sites. The values in $F_c$ are flattened and linearly mapped to a new vector $F_v$:

$$F_v = \mathrm{dense}(\mathrm{flatten}(F_c)) \qquad (4)$$

where the flatten layer, $\mathrm{flatten}()$, reduces the dimension and concatenates the data, and the dense layer, $\mathrm{dense}()$, maps the dimension-reduced data to a single value. The concatenation of $F_v$ and $F_a$ then enters a linear classifier to calculate the probability that an HBV integration occurs within the current sequence:

$$P = \mathrm{sigmoid}(\mathrm{concat}(F_a, F_v)) \qquad (5)$$

where $P$ is the predicted score, $\mathrm{sigmoid}()$ is the activation function acting as the classifier in the final output, and $\mathrm{concat}()$ is the concatenation operation. In addition, applying the attention mechanism to the feature vector $F_c$ from the convolution-and-pooling module yields the weight vector $W$:

$$W = \mathrm{att}(a_1, a_2, \ldots, a_q) \qquad (6)$$

where $\mathrm{att}()$ is the attention mechanism, $a_i$ is the normalised contribution score of the $i$th position of the feature matrix, and $W$ contains the contribution scores of each position in the feature matrix extracted by the convolution-and-pooling module.
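The following numpy sketch illustrates equations (2)-(5) for a single sample; the shapes (q = 667 pooled positions, 256 kernels), the way the raw scores t_j are produced from F_c, and the final linear weights before the sigmoid are illustrative assumptions following the text's description of a linear classifier.

# Sketch of the attention weighting and classifier head of equations (2)-(5).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
q, d = 667, 256
F_c = rng.normal(size=(q, d))         # pooled feature matrix: one d-dim vector v_j per position
w_att = rng.normal(size=d)            # projection producing the raw contribution scores t_j

t = F_c @ w_att                       # raw scores t_j
a = np.exp(t - t.max())               # softmax normalisation, equation (3)
a /= a.sum()
F_a = (a[:, None] * F_c).sum(axis=0)  # attention-weighted sum of position vectors, equation (2)

w_dense = rng.normal(size=q * d)      # flatten + dense to a single value, equation (4)
F_v = F_c.reshape(-1) @ w_dense

w_out = rng.normal(size=d + 1)        # linear classifier over concat(F_a, F_v), equation (5)
P = sigmoid(np.concatenate([F_a, [F_v]]) @ w_out)
print(f"predicted probability P = {P:.3f}")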
DeepHBV model training

After setting each parameter of DeepHBV (S1 Table), we trained the deep neural network with the binary cross-entropy loss, defined as:

$$\mathrm{loss} = -\sum_{i} \left[ y_i \log(P_i) + (1 - y_i) \log(1 - P_i) \right] \qquad (7)$$

where $y_i$ is the binary label of sequence $i$ (in this dataset, positive samples were labelled 1 and negative samples 0) and $P_i$ is the predicted score for that sequence. The back-propagation algorithm was used during training, and the Nesterov-accelerated adaptive moment estimation (Nadam) gradient descent algorithm was applied to optimise the model parameters. The model was implemented in Python 3.7 with the Keras library 2.2.4 [14] and trained and tested on three NVIDIA Tesla V100-PCIE-32G GPUs (NVIDIA Corporation, California, USA). Under these software and hardware settings, DeepHBV takes around 90 min for model training and about 30 s for testing.

Data Availability

DeepHBV is available as open-source software and can be downloaded from https://github.com/JiuxingLiang/DeepHBV.git

References

1. Liang TJ. Hepatitis B: the virus and disease. Hepatology 2009;49(5 Suppl):S13-21.
2. Tu T, Budzinska MA, Shackel NA et al. HBV DNA Integration: Molecular Mechanisms and Clinical Implications. Viruses 2017;9(4).
3. Sung WK, Zheng H, Li S et al. Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat Genet 2012;44(7):765-9.
4. Zhao LH, Liu X, Yan HX et al. Genomic and oncogenic preference of HBV integration in hepatocellular carcinoma. Nat Commun 2016;7:12992.
5. Ding D, Lou X, Hua D et al. Recurrent targeted genes of hepatitis B virus in the liver cancer genomes identified by a next-generation sequencing-based approach. PLoS Genet 2012;8(12):e1003065.
6. Tu T, Budzinska MA, Vondran FWR et al. Hepatitis B Virus DNA Integration Occurs Early in the Viral Life Cycle in an In Vitro Infection Model via Sodium Taurocholate Cotransporting Polypeptide-Dependent Uptake of Enveloped Virus Particles. J Virol 2018;92(11).
7. Mason WS, Gill US, Litwin S et al. HBV DNA Integration and Clonal Hepatocyte Expansion in Chronic Hepatitis B Patients Considered Immune Tolerant. Gastroenterology 2016;151(5):986-998 e4.
8. Litjens G, Kooi T, Bejnordi BE et al. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60-88.
9. Bailey TL, Baker ME, Elkan CP. An artificial intelligence approach to motif discovery in protein sequences: application to steroid dehydrogenases. J Steroid Biochem Mol Biol 1997;62(1):29-44.
10. Yamashita R, Nishio M, Do RKG et al. Convolutional neural networks: an overview and application in radiology. Insights Imaging 2018;9(4):611-629.
11. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint, 2014.
12. Guidotti R, Monreale A, Ruggieri S et al. A survey of methods for explaining black box models. ACM Comput Surv 2018;51(5):Article 93.
13. Hu Z, Zhu D, Wang W et al. Genome-wide profiling of HPV integration in cervical cancer identifies clustered genomic hot spots and a potential microhomology-mediated integration mechanism. Nat Genet 2015;47(2):158-63.
14. Chollet F. Keras. 2015.
15. Hu H, Xiao A, Zhang S et al. DeepHINT: understanding HIV-1 integration via deep learning with attention. Bioinformatics 2019;35(10):1660-1667.
16. Haeussler M, Zweig AS, Tyner C et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res 2019;47(D1):D853-D858.
17. Inoue F, Kircher M, Martin B et al. A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res 2017;27(1):38-52.
18. Robinson JT, Thorvaldsdottir H, Winckler W et al. Integrative genomics viewer. Nat Biotechnol 2011;29(1):24-26.
19. Tang D, Li B, Xu T et al. VISDB: a manually curated database of viral integration sites in the human genome. Nucleic Acids Res 2019.
20. Zhang W, Itoh K, Tanida J et al. Parallel distributed processing model with local space-invariant interconnections and its optical architecture. Appl Opt 1990;29(32):4790-7.
21. Bruna J, Zaremba W, Szlam A et al. Spectral networks and locally connected networks on graphs. arXiv preprint, 2013.
22. Heinz S, Benner C, Spann N et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 2010;38(4):576-589.
23. Seide F, Li G, Yu D. Conversational speech transcription using context-dependent deep neural networks. 2012.
24. Taniguchi K, Roberts LR, Aderca IN et al. Mutational spectrum of beta-catenin, AXIN1, and AXIN2 in hepatocellular carcinomas and hepatoblastomas. Oncogene 2002;21(31):4863-71.
25. Zheng J, Xiong D, Sun X et al. Signification of Hypermethylated in Cancer 1 (HIC1) as tumor suppressor gene in tumor progression. Cancer Microenviron 2012;5(3):285-93.
26. Paibomesai MI, Moghadam HK, Ferguson MM et al. Clock genes and their genomic distributions in three species of salmonid fishes: associations with genes regulating sexual maturation and cell cycling. BMC Res Notes 2010;3:215.
27. Fekry B, Ribas-Latre A, Baumgartner C et al. Incompatibility of the circadian protein BMAL1 and HNF4alpha in hepatocellular carcinoma. Nat Commun 2018;9(1):4349.
28. Mukherji A, Bailey SM, Staels B et al. The circadian clock and liver function in health and disease. J Hepatol 2019;71(1):200-211.
29. Huh HD, Kim DH, Jeong HS et al. Regulation of TEAD transcription factors in cancer biology. Cells 2019;8(6).
30. Cai YN, Zhou Q, Kong YY et al. LRH-1/hB1F and HNF1 synergistically up-regulate hepatitis B virus gene transcription and DNA replication. Cell Res 2003;13(6):451-458.
Figure legends

Figure 1. The deep learning framework applied in DeepHBV. (a) Scheme of encoding a 2 kb DNA sequence into a binary matrix using one-hot encoding; (b) a brief flowchart of the DeepHBV structure, with matrix shapes given in brackets; a detailed flowchart is shown in S1 Fig.

Figure 2. Evaluation of DeepHBV and DeepHINT prediction performance on the test dataset. (a) Receiver-operating characteristic (ROC) curves and (b) precision-recall (PR) curves. “DeepHBV with HBV integration sequences” refers to the DeepHBV model with only HBV integration sequences as input; “DeepHINT with HBV integration sequences” refers to the DeepHINT model with only HBV integration sequences as input; “DeepHBV with HBV integration sequences + repeat” refers to DeepHBV with HBV integration sequences and repeat sequences as input; “DeepHBV with HBV integration sequences + TCGA Pan Cancer” refers to DeepHBV with HBV integration sequences and TCGA Pan Cancer sequences as input; “DeepHBV with HBV integration sequences + repeat + (test) VISDB” refers to DeepHBV trained with HBV integration sequences and repeat sequences and tested on VISDB as an independent test dataset; “DeepHBV with HBV integration sequences + TCGA Pan Cancer + (test) VISDB” refers to DeepHBV trained with HBV integration sequences and TCGA Pan Cancer sequences and tested on VISDB as an independent test dataset.

Figure 3. The attention weight distribution produced by DeepHBV with HBV integration sequences + genomic features. (a) DeepHBV with HBV integration sequences + TCGA Pan Cancer peaks; (b) DeepHBV with HBV integration sequences + repeat peaks. The left graphs show the fractions of attention weight, averaged over all samples and normalised to the average of all positions; each index represents a 3 bp region because of the successive convolution and pooling operations. The graphs on the right show representative attention weight distributions for positive and negative samples.

Figure 4. Attention intensive regions highlight essential local genomic features for predicting HBV integration sites. Representative examples show the positional relationship between attention intensive sites and several genomic features using the DeepHBV with HBV integration sequences + TCGA Pan Cancer model on (a) chr5:1,294,063-1,296,063 (hg38) and (b) chr5:1,291,277-1,293,277 (hg38). Each of these two sequences contains HBV integration sites from both dsVIS and VISDB. Enriched DNA-binding protein motifs were detected by HOMER from the attention intensive regions using the DeepHBV output; FIMO [1] was then applied to locate the enriched motifs, which are labelled on the attention intensive regions. The UCSC genome browser [2] and Matplotlib [3] were used for visualisation. “HBV integration site” refers to the sites selected from our unpublished database and used as testing samples. “Attention Intensive Sites” denotes the sites with the top 5% of attention weights.
“RepeatMasker”, “TCGA Pan Cancer”, “DNase Clusters”, “Con20mammals”, “GeneHancer”, “Layered H3K27ac” and “Layered H3K36me3” are genomic features.

References
1. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics 2011;27(7):1017-8.
2. Haeussler M, Zweig AS, Tyner C et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res 2019;47(D1):D853-D858.
3. Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng 2007;9(3):90-95.

Supporting information

S1 Fig. DeepHBV framework. Each part represents a layer of the neural network, and n × n stands for the output dimension, which is explained in S2 Note. Two consecutive convolution layers are used to extract features; the max-pooling layer reduces the dimension while keeping the predictive information of the feature matrix; the dropout layer randomly drops some outputs to prevent over-fitting; the flatten layer reduces the dimensions and connects them; the dense layer maps the output of the previous layer to a single value; the attention layer and attention flatten layer assign a weight score to each dimension of the feature matrix; the concatenate layer joins the captured features and their importance scores from the convolution module and the attention mechanism module. The prediction output gives the final probability of HBV integration.

S2 Fig. Prediction performance on the HBV integration dataset with different types of genomic features added. Subgroups 1 and 3 outperformed the baseline DeepHBV model, with a significant increase in AUROC and AUPR, indicating that DeepHBV can capture genomic features from these subgroups effectively. We therefore analysed each item in subgroups 1 and 3 separately and found that repeats and TCGA Pan Cancer are the genomic features captured by DeepHBV that significantly improve model performance. DeepHBV with HBV integration sequences + repeats reached an AUROC of 0.8378 and an AUPR of 0.7535, while DeepHBV with HBV integration sequences + TCGA Pan Cancer reached an AUROC of 0.9430 and an AUPR of 0.9310.
S1 Table. The parameters of the deep neural network used in DeepHBV.

S2 Table. Genomic features and sources (access date: November 16th, 2019).

S3 Table. Comparison of DeepHBV and DeepHINT results.

S4 Table. Enriched TFBS from attention intensive regions of DeepHBV with HBV integration sites + repeat peaks.

S1 Note. DeepHBV framework. The DeepHBV neural network structure design and the hyperparameters used in DeepHBV are described.

S2 Note. Mathematical details of DeepHBV. Explanations of eight mathematical components of DeepHBV (encoding of DNA sequences, convolution layers, the max-pooling layer, the dropout layer, the attention layer, the concatenate layer, the linear classifier and the optimisation algorithm) are provided in this note.