Genome Analysis

On the application of BERT models for nanopore methylation detection

Yao-zhong Zhang 1,*, Sera Hatakeyama 1, Kiyoshi Yamaguchi 1, Yoichi Furukawa 1, Satoru Miyano 2, Rui Yamaguchi 3, and Seiya Imoto 1,*

1 Institute of Medical Science, the University of Tokyo, Tokyo, 108-0071, Japan
2 M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, 101-0062, Japan
3 Aichi Cancer Center Research Institute, Nagoya, 464-8681, Japan
* To whom correspondence should be addressed.

Abstract

Motivation: DNA methylation is a common epigenetic modification that is widely associated with biological processes such as gene expression, aging, and disease. Nanopore sequencing provides a promising approach to methylation detection by monitoring abnormal signal shifts to detect modified bases in target motif regions. Recently, model-based approaches, especially those using deep learning models, have achieved significant performance improvements in nanopore methylation detection. In this work, we explore bidirectional encoder representations from transformers (BERT) for this task, which provides a non-recurrent neural structure amenable to fast parallel computation.

Results: We find that the original BERT architecture does not work as well as the bidirectional recurrent neural network (biRNN) on the nanopore methylation prediction task. Through further analysis, we observe recurring patterns of positional signal-shift in the context window surrounding target 5-methylcytosine (5mC) and N6-methyladenine (6mA) motifs. We propose a refined BERT with relative position representation and concatenation of center hidden units, which incorporates these task-specific characteristics into the modeling. We perform systematic in-sample and cross-sample evaluations. The experimental results show that the refined BERT model achieves competitive or even better results than the state-of-the-art biRNN model, while its model inference speed is about 6x faster. In addition, in the cross-sample evaluation on datasets from different research groups, the BERT models demonstrate good generalization performance.

Availability: The source code and data are available at https://github.com/yaozhong/methBERT
Contact: yaozhong@ims.u-tokyo.ac.jp

1 Introduction

Methylation of DNA/RNA/histones is commonly observed in developmental disorders, aging, and genomic diseases such as cancer. Fast and accurate detection of methylation status is a fundamental requirement for finding distinctive biomarkers for aging/disease profiling. For virome/metagenome studies, quick and accurate epi-transcriptome detection also plays an important role in understanding unseen strains (Kim et al., 2020).

One commonly used DNA methylation detection approach is whole-genome bisulfite sequencing (WGBS). To detect modified bases, WGBS applies sodium bisulfite conversion before sequencing. As this chemical pre-treatment is relatively harsh, it fragments the DNA, and a large amount of input DNA is usually required. Furthermore, limited by read length, it is difficult to align short reads in low-complexity regions and to analyze methylation patterns over long ranges. The data processing of WGBS is sophisticated and time-consuming: various biases (e.g., GC content and fragment length), including those introduced by the bisulfite treatment, have to be dealt with in the analysis.
Moreover, WGBS can only be used for DNA samples, which limits its application to detecting RNA methylation.

Single-molecule sequencing (e.g., PacBio and Nanopore) provides a promising alternative by detecting abnormal signals in target motif regions, since modified bases usually produce different current signals. Compared with the sodium bisulfite approach, no extra chemical treatment is required, which helps to reduce potential biases. Existing nanopore methylation detection methods can be categorized into two types: testing-based methods (e.g., Tombo (Stoiber et al., 2016)) and model-based methods (e.g., nanopolish (Simpson et al., 2017), DeepMod (Liu et al., 2019), and DeepSignal (Ni et al., 2019)). A testing-based approach performs a statistical test on paired signals (candidate and reference) and does not require any training process; it can also be applied to any chemical modification. A model-based approach trains a model on known chemical modifications and predicts whether a signal sequence contains methylation signals or not. Sequential models, such as the hidden Markov model (HMM) and the bidirectional recurrent neural network (biRNN), are commonly used in the model-based approach. Although model-based approaches have already achieved competitive results, their sequential computation order makes them difficult to parallelize for fast inference. Meanwhile, finding discriminative signal patterns for identifying methylated signals is also important for developing novel detection algorithms.

Fig. 1: Basic BERT's and refined BERT's model structures used for methylation detection: (a) basic BERT for methylation detection; (b) refined BERT with relative position representation. Compared with the basic BERT, the enhanced constraints and additional edges are highlighted in red.

In this work, based on bidirectional encoder representations from transformers (BERT), we explore a non-recurrent modeling approach for nanopore methylation detection. By analyzing nucleotide sequences with both methylated and unmethylated signals, we profile positional signal-shifts for different motifs and methyltransferases. We find that the ±3bp region surrounding the center methylation candidate shows significant signal-shifts, and that different methylation types, such as 5-methylcytosine (5mC) and N6-methyladenine (6mA), demonstrate different signal-shift patterns. We hence propose a refined BERT model that takes these signal-shift patterns into account in the modeling.
We evaluate the proposed methods on publicly available benchmark datasets. In both in-sample and cross-sample evaluations, the proposed refined BERT model achieves competitive or even better results than the state-of-the-art biRNN model, while its model inference speed is about 6x faster. In the cross-sample evaluation, the BERT models also demonstrate transfer ability across datasets from different sources.

2 Methods

In this section, we introduce BERT (Devlin et al., 2018) and the refined BERT applied to nanopore methylation detection. BERT is built on the Transformer (Vaswani et al., 2017), which employs self-attention as the core module of its stacked network structure and was proposed to replace recurrent and convolution operations with a purely attention-based mechanism. A typical Transformer network consists of an encoding module and a decoding module; BERT uses only the encoding module, pre-trained on unsupervised data. BERT has achieved breakthrough results on many natural language understanding tasks. In this work, we apply the BERT model to the nanopore methylation detection task to leverage the power of advanced deep learning models.

2.1 BERT and refined BERT model

Figure 1 shows the structures of the BERT models used for nanopore methylation detection. We explore two types of BERT models: the most commonly used BERT (Figure 1(a)) and the refined BERT (Figure 1(b)), which is optimized for nanopore methylation detection.

2.1.1 Embedding module

Given the extracted features for each position in a sequence, the embedding layer maps input vectors into hidden spaces. Besides the event embedding, a positional embedding (PE) is also included. Since BERT is used to learn bidirectional contextual information, positional information is important in the modeling. The original PE (Vaswani et al., 2017) uses a fixed, non-learnable sinusoid embedding:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$

where $pos$ is the position and $i$ is the embedding dimension. For any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. According to recent progress (Huang et al., 2020), learnable PE and relative position embeddings can further improve BERT's performance. Therefore, in the refined BERT model, we use a learnable PE and relative position representations. The learnable PE treats the positional embedding vectors as parameters, which are updated during training.
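The two PE variants can be sketched in a few lines of PyTorch. This is a minimal illustration of the fixed sinusoid PE and the learnable PE described above, using the paper's sequence length (21) and hidden size (100) as defaults; the function and class names are illustrative and not taken from the released methBERT code.

```python
import math
import torch
import torch.nn as nn

def sinusoid_positional_encoding(seq_len: int = 21, d_model: int = 100) -> torch.Tensor:
    """Fixed (non-learnable) sinusoid PE of Vaswani et al. (2017)."""
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)        # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                  # 1 / 10000^(2i/d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                            # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                            # odd dimensions
    return pe                                                               # (seq_len, d_model)

class LearnablePositionalEmbedding(nn.Module):
    """Learnable PE used in the refined BERT: one trainable vector per position."""
    def __init__(self, seq_len: int = 21, d_model: int = 100):
        super().__init__()
        self.pe = nn.Parameter(torch.empty(seq_len, d_model))
        nn.init.normal_(self.pe, std=0.02)

    def forward(self, event_embedding: torch.Tensor) -> torch.Tensor:
        # event_embedding: (batch, seq_len, d_model); the PE is added to the event embedding
        return event_embedding + self.pe.unsqueeze(0)
```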
2.1.2 Self-attention module

Following the embedding layer, there are three stacked transformer blocks. Each transformer block consists of a multi-head self-attention layer and a position-wise fully connected feed-forward network. The self-attention mechanism describes the context information between different positions of the input under a deep learning framework. It loosely imitates human visual attention, giving the model the ability to zoom in or out on particular positions of an input sequence, and it has demonstrated its effectiveness in many tasks, including natural language understanding, image recognition, and several bioinformatics applications.

The attention function maps a query Q and a set of key-value (K, V) pairs to an output. Formally, for an input $x = (x_1, \ldots, x_n)$ of $n$ elements with $x_i \in \mathbb{R}^{d_x}$, we calculate query Q, key K, and value V vectors of dimension $d_z$ from the embedding vectors $\mathrm{embed}(x)$. The attention module generates a new sequence $z = (z_1, \ldots, z_n)$ of the same length as $x$, where $z_i$ is a weighted sum of linearly transformed input elements:

$$z_i = \sum_{j=1}^{n} a_{ij}\,(x_j W^V), \qquad a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}, \qquad e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}},$$

where $W^Q, W^K, W^V \in \mathbb{R}^{d_x \times d_z}$ are parameter matrices. Self-attention computes the pairwise correlation of $\mathrm{embed}(x_i)$ and $\mathrm{embed}(x_j)$, which can be calculated in parallel, whereas in a biRNN the recurrent hidden units must be computed successively. This architectural difference makes it possible to optimize BERT for fast inference.

2.1.3 Relative position representation in self-attention heads

In nanopore sequencing, the current signal is assumed to be affected mainly by the nucleotide passing through the pore, while its surrounding nucleotides may also have an effect. For nucleotides that are far away within the context window, it is intuitive to assume that they have less effect on the detected current signal. In the refined BERT model, we therefore add relative position representations to the attention module, following the method proposed by Shaw et al. (2018). For any two input elements $x_i$ and $x_j$, the relative position information is modeled with two distinct edge representations $a^V_{ij}$ and $a^K_{ij}$. For linear sequences, these edges capture the relative position difference between input elements. As the precise relative position is not useful beyond a certain distance, we clip the maximum distance (e.g., ±3bp) when calculating the attention $a_{ij} \in A$:

$$a^K_{ij} = W^K_{\mathrm{clip}(j-i,\,k)}, \qquad a^V_{ij} = W^V_{\mathrm{clip}(j-i,\,k)}, \qquad \mathrm{clip}(x, k) = \max(-k, \min(k, x)).$$

2.1.4 Final full connection layer

After the stacked transformer blocks, the hidden units of the center position are fed to a fully connected linear layer that makes the final prediction of whether the given input contains a methylated motif or not. In the refined BERT, besides the hidden units of the center position, the hidden units in its surrounding window (e.g., ±3bp) are concatenated to form the input of the final fully connected layer.
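Below is a minimal, single-head PyTorch sketch of the two refinements described above: self-attention with clipped relative position representations (Section 2.1.3, following Shaw et al., 2018) and concatenation of the center ±3bp hidden units before the final linear layer (Section 2.1.4). It is an illustrative assumption of how these pieces fit together, not the released methBERT implementation; a multi-head version would split d_model across heads in the usual way.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with clipped relative position edges (Shaw et al., 2018)."""
    def __init__(self, d_model: int = 100, k: int = 3):
        super().__init__()
        self.d_model, self.k = d_model, k
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # One learned edge vector per clipped relative distance in [-k, k]
        self.rel_k = nn.Embedding(2 * k + 1, d_model)   # a^K_{ij}
        self.rel_v = nn.Embedding(2 * k + 1, d_model)   # a^V_{ij}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                                # x: (batch, n, d_model)
        q, key, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # clip(j - i, k), shifted into [0, 2k] to index the edge embeddings
        idx = torch.arange(n, device=x.device)
        rel = (idx.unsqueeze(0) - idx.unsqueeze(1)).clamp(-self.k, self.k) + self.k   # (n, n)
        a_k, a_v = self.rel_k(rel), self.rel_v(rel)      # (n, n, d_model)
        # e_ij = q_i (k_j + a^K_ij)^T / sqrt(d)
        scores = torch.einsum('bid,bjd->bij', q, key) + torch.einsum('bid,ijd->bij', q, a_k)
        attn = F.softmax(scores / d ** 0.5, dim=-1)
        # z_i = sum_j attn_ij (v_j + a^V_ij)
        return torch.einsum('bij,bjd->bid', attn, v) + torch.einsum('bij,ijd->bid', attn, a_v)

def center_window_logits(hidden: torch.Tensor, classifier: nn.Linear, w: int = 3) -> torch.Tensor:
    """Concatenate the hidden units of the center position and its +/-w neighbours."""
    center = hidden.size(1) // 2                         # the 11th position of a 21bp window
    feats = hidden[:, center - w: center + w + 1, :]     # (batch, 2w+1, d_model)
    return classifier(feats.flatten(start_dim=1))        # classifier: nn.Linear((2w+1)*d_model, 2)
```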
2.2 Applying BERT models for nanopore methylation detection

The BERT models are applied to replace the classification model (e.g., biRNN) in a typical model-based methylation detection framework. In this framework, the raw signals of each read are first translated into nucleotide sequences (basecalling). The signals are then aligned to the corresponding reference nucleotides through the re-squiggle process. After that, the target motif (e.g., CpG) and its context region are localized through nucleotide matching, and the signals in a context window of fixed length (e.g., 21bp) are transformed into event-based features that serve as the input to the methylation caller. Typical event-based features include the signal mean, signal standard deviation, event length, and nucleotide information (Liu et al., 2019). We utilize the framework of DeepMod and perform the same pre-processing of the data: Tombo (ver. 1.5.1) is used for re-squiggling and Minimap2 (ver. 2.17-r941) is used to align events to the reference genome. We use E. coli K-12 MG1655 and H. sapiens GRCh38 as the reference genomes.

3 Experiments

We compare the BERT models with the state-of-the-art biRNN model, which is used as the basic network structure in DeepMod (Liu et al., 2019) and DeepSignal (Ni et al., 2019). To compare with non-deep-learning-based methods, we utilize the CpG benchmark pipeline (Yuen et al., 2020) as a pivot.

3.1 Data and model parameters

We train and test the models on the publicly accessible 5mC (Stoiber et al., 2016; Simpson et al., 2017) and 6mA (Stoiber et al., 2016) datasets. The datasets include samples of E. coli K-12 MG1655, K-12 ER2925, and H. sapiens NA12878. Negative control samples are amplified with PCR so that no modified bases are included. Positive control samples are synthetically methylated by specific enzymes after PCR amplification: SssI, HhaI, and MpeI methyltransferases for 5mC, and TaqI, EcoRI, and Dam for 6mA. We use the samples sequenced with Oxford Nanopore R9 flow cells. For each dataset, we randomly shuffle the reads of the positive and negative controls and construct the training, validation, and test sets according to a split proportion of 80/10/10 for the in-sample evaluation. For the cross-sample evaluation, we train models on one dataset and test them on another.

The biRNN uses the default model architecture and parameter setting of DeepMod, which consists of three stacked bidirectional recurrent layers (hidden_size=100) and one fully connected layer for the center position. The total number of biRNN parameters is 570,802 for an input length of 21bp. The BERTs use three attention layers (hidden_size=100, attention_head=4) and one fully connected layer. For the refined BERT, learnable positional encoding, attention with relative position representation, and center-hidden-unit concatenation are used. The basic BERT and the refined BERT have 364,902 and 368,202 parameters in total, respectively, which is around 35% less than the biRNN. More detailed information on the model structures is provided in the supplementary material. We implement the three models in PyTorch. All models are optimized with the Adam optimizer (Kingma and Ba, 2014), with a learning rate of 1e-4 and a maximum of 50 training epochs. Model parameters are selected based on the minimum validation loss.
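As a concrete reference, the optimization setting above (Adam, learning rate 1e-4, at most 50 epochs, parameters kept at the minimum validation loss) corresponds to a training loop along the lines of the minimal PyTorch sketch below; `model`, `train_loader`, and `valid_loader` are assumed to be supplied by the user and are not part of the published pipeline described here.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, valid_loader, device="cpu", epochs=50, lr=1e-4):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_loss, best_state = float("inf"), None

    for epoch in range(epochs):
        model.train()
        for features, labels in train_loader:          # features: (batch, 21, feat_dim)
            optimizer.zero_grad()
            loss = criterion(model(features.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()

        # Keep the parameters that give the lowest validation loss
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(f.to(device)), y.to(device)).item()
                           for f, y in valid_loader) / len(valid_loader)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)
    return model
```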
3.2 Exploring differentiated signal positions in the context window surrounding target motifs

Ideally, we assume that a modified nucleotide (e.g., the center position of XXXXXXXXXXC5mCGXXXXXXXXX) produces current signals that differ from those of the unmodified one. Because the boundaries of nucleotide/k-mer signals are not rigorous and surrounding nucleotides may also be affected, it is worthwhile to investigate methylation-related signal-shift patterns in a larger context.

To identify the signal-shift caused by methylation in a specific dataset, we use a simple quantification approach to calculate significant signal changes at each position in the context window. Given a dataset of a specific motif and methyltransferase, we first cluster instances with the same nucleotide sequence to remove the effect of the nucleotide sequence itself. We keep only the sequence clusters that contain both methylated and unmethylated instances (≥ 1 of each). Within each sequence cluster, we normalize the event signal values of the methylated samples by the corresponding position-wise average of the unmethylated event signal values. The signal-shift at the i-th position is then calculated as $s^{meth}_i - \mathrm{avg}(s^{unmeth}_i)$. For the normalized methylated samples, we compute basic statistics of the signal-shift at each position and draw boxplots for the 5mC and 6mA training sets (a short sketch of this computation is given at the end of this subsection).

Fig. 2: Boxplots of the positional signal-shift for 5mC and 6mA datasets of specific motifs and methyltransferases. (a1), (a2), and (a3) are on Stoiber's E. coli 5mC dataset (Cg_SssI, Cg_MpeI, gCgc_HhaI); (b1) and (b2) are on Simpson's 5mC dataset (E. coli Cg_SssI, H. sapiens Cg_SssI); (c1), (c2), and (c3) are on Stoiber's E. coli 6mA dataset (gaAttc_EcoRI, tcgA_TaqI, gAtc_Dam). Each dataset is named in the format dataSource_motif_methyltransferase.

As shown in Figure 2, in all datasets the positions with significant signal-shifts are located within ±3bp of the center position (the 11th), where the target nucleotide is located. For the remaining off-center positions, the average signal-shift values are close to 0. This indicates that a modified nucleotide affects not only its corresponding current signals but also the signals of its surrounding nucleotides. In addition, the 5mC and 6mA datasets show different positional signal-shift patterns. Specific positions, such as the -2bp position (9th) in the 5mC datasets and the +1bp position (12th) in the 6mA datasets, have larger average signal-shift values. Such patterns generalize across different datasets with the same motif and methyltransferase; for example, Figure 2 (a1), (b1), and (b2) show a similar positional signal-shift pattern. Among the different methyltransferases, HhaI (Figure 2(a3)) also shows a pattern similar to SssI, whereas MpeI does not show an obvious similar pattern (Figure 2(a2)).

These positional signal patterns can be directly modeled by a biRNN, whereas the basic BERT does not specifically account for them in its model structure. In a biRNN, such as the implementation in DeepMod, the final fully connected layer takes the hidden units of the center time step as input; moreover, the bidirectional structure and the information decay from both ends toward the center make the model focus more on the center positions. In the basic BERT, any pair of time steps is processed with the same attention module, so the importance of the center positions is not specifically considered. We therefore propose the refined BERT model, which incorporates relative-position attention and center-hidden-unit concatenation to enable the BERT model to pay more attention to the center positions.
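A minimal sketch of the positional signal-shift quantification described in this subsection: reads are grouped by identical context sequence, methylated event means are normalized by the per-position average of the unmethylated reads in the same group, and the per-position shifts are pooled for the boxplots. The data-structure and function names are assumptions for illustration, not the released code.

```python
from collections import defaultdict
import numpy as np

def positional_signal_shift(reads):
    """
    reads: iterable of (sequence, signal_means, is_methylated), where signal_means
           is a length-21 array of per-position event signal means.
    Returns: array of shape (num_methylated_samples, 21) of s^meth_i - avg(s^unmeth_i).
    """
    clusters = defaultdict(lambda: {"meth": [], "unmeth": []})
    for seq, means, is_meth in reads:
        clusters[seq]["meth" if is_meth else "unmeth"].append(np.asarray(means, dtype=float))

    shifts = []
    for group in clusters.values():
        # keep only clusters containing both methylated and unmethylated instances
        if not group["meth"] or not group["unmeth"]:
            continue
        unmeth_avg = np.mean(group["unmeth"], axis=0)      # avg(s^unmeth_i), per position
        for meth_means in group["meth"]:
            shifts.append(meth_means - unmeth_avg)         # s^meth_i - avg(s^unmeth_i)
    return np.stack(shifts)
```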
3.3 In-sample evaluation

To evaluate model performance, we first perform the in-sample evaluation on the 5mC and 6mA datasets. The predictions of the different models are evaluated at the read level and at the genomic level. For the genomic-level evaluation, we group all reads aligned to the same genomic coordinate and call a genomic position methylated when the predicted methylation percentage is ≥ 0.1 (the same threshold as DeepMod).
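A minimal sketch of this genomic-level grouping rule: read-level calls aligned to the same genomic coordinate are pooled, and a position is called methylated when the predicted methylation fraction reaches the 0.1 threshold. The input and field names are illustrative assumptions.

```python
from collections import defaultdict

def genomic_level_calls(read_predictions, threshold=0.1):
    """
    read_predictions: iterable of (chrom, position, is_methylated_read_call).
    Returns: dict mapping (chrom, position) -> 0/1 genomic-level methylation call.
    """
    per_site = defaultdict(list)
    for chrom, pos, read_call in read_predictions:
        per_site[(chrom, pos)].append(int(read_call))

    calls = {}
    for site, read_calls in per_site.items():
        meth_fraction = sum(read_calls) / len(read_calls)   # fraction of reads called methylated
        calls[site] = int(meth_fraction >= threshold)
    return calls
```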
In general, on the five 5mC datasets, the AUC performances of the three models are relatively close at both the read level and the genomic level (Table 1). The basic BERT model does not work as well as the biRNN model, with lower AUC scores. The refined BERT model achieves equivalent or better AUC scores at the genomic level. Note that on Stoiber_E.coli_CG_MpeI and Simpson_E.coli_CG_SssI, although the read-level AUCs of the refined BERT are 0.0014 and 0.005 lower than those of the biRNN, its genomic-level performance is equal to or considerably better than the biRNN's. This can be explained by more accurate predictions in several low-read-coverage regions.

Table 1. In-sample evaluation of different deep learning models on the 5mC datasets (read-level and genomic-level AUC, precision, and recall).

Dataset | Species    | Motif_Methyltransferase | Model        | Read-level AUC / Precision / Recall | Genomic-level (>=1) AUC / Precision / Recall
Stoiber | E. coli    | GCGC_HhaI               | biRNN        | 0.9205 / 0.9545 / 0.8593 | 0.9322 / 0.9320 / 0.9134
        |            |                         | BERT_basic   | 0.9183 / 0.9528 / 0.8556 | 0.9305 / 0.9299 / 0.9113
        |            |                         | BERT_refined | 0.9239 / 0.9563 / 0.8655 | 0.9351 / 0.9341 / 0.9177
        |            | CG_MpeI                 | biRNN        | 0.7184 / 0.8943 / 0.4555 | 0.7482 / 0.8764 / 0.5452
        |            |                         | BERT_basic   | 0.7045 / 0.8682 / 0.4316 | 0.7312 / 0.8494 / 0.5211
        |            |                         | BERT_refined | 0.7170 / 0.9017 / 0.4511 | 0.7482 / 0.8848 / 0.5412
        |            | CG_SssI                 | biRNN        | 0.9017 / 0.9576 / 0.8097 | 0.9127 / 0.9508 / 0.8420
        |            |                         | BERT_basic   | 0.9001 / 0.9534 / 0.8071 | 0.9107 / 0.9463 / 0.8395
        |            |                         | BERT_refined | 0.9068 / 0.9509 / 0.8210 | 0.9162 / 0.9433 / 0.8520
Simpson | E. coli    | CG_SssI                 | biRNN        | 0.9514 / 0.9512 / 0.9316 | 0.9284 / 0.8805 / 0.9854
        |            |                         | BERT_basic   | 0.9477 / 0.9469 / 0.9268 | 0.9227 / 0.8718 / 0.9845
        |            |                         | BERT_refined | 0.9464 / 0.9656 / 0.9124 | 0.9456 / 0.9135 / 0.9803
Simpson | H. sapiens | CG_SssI                 | biRNN        | 0.9004 / 0.8891 / 0.9230 | 0.9010 / 0.8900 / 0.9240
        |            |                         | BERT_basic   | 0.8962 / 0.8813 / 0.9248 | 0.8969 / 0.8823 / 0.9256
        |            |                         | BERT_refined | 0.9045 / 0.9143 / 0.8984 | 0.9053 / 0.9147 / 0.9003

Table 2. In-sample evaluation of different deep learning models on the 6mA datasets (read-level and genomic-level AUC, precision, and recall).

Dataset | Species | Motif_Methyltransferase | Model        | Read-level AUC / Precision / Recall | Genomic-level (>=1) AUC / Precision / Recall
Stoiber | E. coli | gaAttc_EcoRI            | biRNN        | 0.8524 / 0.8088 / 0.7497 | 0.8429 / 0.7797 / 0.8035
        |         |                         | BERT_basic   | 0.8607 / 0.8151 / 0.7653 | 0.8591 / 0.7969 / 0.8277
        |         |                         | BERT_refined | 0.8611 / 0.8826 / 0.7473 | 0.8655 / 0.8596 / 0.7987
        |         | tcgA_TaqI               | biRNN        | 0.7722 / 0.7922 / 0.5750 | 0.7750 / 0.7789 / 0.6290
        |         |                         | BERT_basic   | 0.7573 / 0.8168 / 0.5392 | 0.7653 / 0.8063 / 0.5937
        |         |                         | BERT_refined | 0.7857 / 0.7788 / 0.6064 | 0.7843 / 0.7643 / 0.6586
        |         | gAtc_Dam                | biRNN        | 0.6123 / 0.7656 / 0.2470 | 0.6337 / 0.7631 / 0.3241
        |         |                         | BERT_basic   | 0.6128 / 0.7329 / 0.2529 | 0.6310 / 0.7311 / 0.3305
        |         |                         | BERT_refined | 0.6188 / 0.7513 / 0.2634 | 0.6385 / 0.7471 / 0.3421

On the 6mA datasets (Table 2), the refined BERT model achieves the best AUC at both the read level and the genomic level. The performance of the basic BERT model is more variable: on Stoiber_E.coli_gaAttc_EcoRI and Stoiber_E.coli_gAtc_Dam, the basic BERT performs slightly better than the biRNN on the read-level AUC, but it falls clearly behind the biRNN on Stoiber_E.coli_tcgA_TaqI. In summary, in the in-sample evaluation, the refined BERT model achieves competitive or better results than the biRNN model on the benchmark 5mC and 6mA datasets.

3.4 Cross-sample evaluation

We then conduct the cross-sample evaluation. To compare with non-deep-learning-based methods, we utilize the benchmark pipeline of Yuen et al. (2020) as a pivot. We test the models on the same benchmark dataset, which is generated from Simpson's E. coli dataset with different methylation levels. In this dataset, 100 arbitrary sites are selected that contain a singleton CpG within a 10nt window, drawn from both the methylated and unmethylated instances of Simpson's E. coli dataset. Yuen et al. created 11 specific mixtures of methylated and unmethylated reads, containing 0%, 10%, ..., 100% methylated reads, with each mixture containing approximately 2,400 reads. More detailed information can be found in Yuen et al. (2020).

Unlike the DeepMod model used in the original benchmark pipeline, which is pre-trained on a mixture of all 5mC positive controls (Cg_SssI, Cg_MpeI, and gCgc_HhaI) and negative controls (UMR, con1, and con2), here we test models trained on a single dataset with the same methyltransferase, to reduce potential overlap between the training and test sets. All three models are trained separately on Stoiber_Ecoli_CG_SssI and on Simpson_Hsapiens_CG_SssI: Simpson_Hsapiens_CG_SssI is sequenced by the same group on a different species, while Stoiber_Ecoli_CG_SssI is sequenced by a different group on the same species. We use the METEORE pipeline (Yuen et al., 2020) to generate violin plots of the model predictions on each mixture. Pearson's correlation r, the coefficient of determination r2, and the root mean square error (RMSE) are used as the evaluation metrics for each model.
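For reference, the per-mixture metrics can be computed as in the short sketch below: the predicted methylation fraction of each mixture is compared against the true mixing fraction (0.0, 0.1, ..., 1.0), with r2 taken as the square of Pearson's r, consistent with the values reported here. The variable names and the scipy dependency are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def mixture_metrics(true_fractions, predicted_fractions):
    """Pearson's r, r^2, and RMSE between true and predicted methylation fractions."""
    true_fractions = np.asarray(true_fractions, dtype=float)
    predicted_fractions = np.asarray(predicted_fractions, dtype=float)
    r, _ = pearsonr(true_fractions, predicted_fractions)
    rmse = float(np.sqrt(np.mean((predicted_fractions - true_fractions) ** 2)))
    return {"r": float(r), "r2": float(r) ** 2, "rmse": rmse}

# Example: 11 mixtures (0%, 10%, ..., 100%) against hypothetical noisy predictions
truth = np.arange(0, 1.01, 0.1)
print(mixture_metrics(truth, truth + np.random.normal(0, 0.05, size=truth.size)))
```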
With the training data of Simpson_Hsapiens_CG_SssI, all three models achieve performances ranked just behind the best reported result of Megalodon (r=0.9860, r2=0.9723, RMSE=0.0758) on this dataset (Yuen et al., 2020). The biRNN achieves the best Pearson correlation (r=0.9828, r2=0.9658), while the refined BERT achieves the lowest RMSE (0.0732) among the three evaluated models.

When Stoiber_Ecoli_CG_SssI is used for training, the performances of all three models decrease, which indicates the challenge of using datasets sequenced by different research groups. In this setting, both BERT models perform better than the biRNN, as shown in Figure 3(b). The refined BERT achieves the best r=0.9446, r2=0.8924, and RMSE=0.1449 among the three models, which demonstrates its generalization ability on datasets sequenced by different research groups. Based on the reported benchmark results, its Pearson correlation ranks between those reported for DeepMod and DeepSignal (Megalodon > DeepMod_mixModel (0.9467) > refined BERT > DeepSignal_human_hx1 (0.9420) > Guppy > Nanopolish > Tombo).

Fig. 3: Violin plots of the prediction results of models trained on different datasets. (a) Models trained with the Simpson_Hsapiens_CG_SssI dataset. (b) Models trained with the Stoiber_Ecoli_CG_SssI dataset.

3.5 Model inference speed

The main motivation for applying BERT models is to use a non-recurrent modeling approach for the nanopore methylation detection task and thereby improve the model inference speed. We performed a speed test on a server with 24 CPU cores (Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz) and one NVIDIA V100 GPU card. During the runs, the CPUs are responsible for data loading and feature extraction, while the GPU performs model inference. We measured the model inference time and the total running time of the three models on the benchmark dataset. For each mixture split, we repeated the run 5 times and report the average.

Table 3. Model inference time and total running time on the benchmark dataset (all 26,402 reads).

Model        | Model inference time | Total running time
biRNN        | 162.91 s             | 711.56 s
BERT_basic   | 22.71 s              | 615.36 s
BERT_refined | 27.29 s              | 622.73 s

As shown in Table 3, the model inference speed of the BERT models is around 6x~7x faster than that of the biRNN (BERT_refined: 5.96x, BERT_basic: 7.16x). The inference time of the refined BERT is only slightly longer than that of the basic BERT. The gap in total running time is smaller (BERT_refined: 1.14x, BERT_basic: 1.16x), as data I/O and feature extraction take up most of the time. In the current implementation of the BERT models, we use reads as the basic data unit and integrate the data pre-processing into the read-batch loading process; the data I/O and feature extraction can be further accelerated.
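The timing above separates data handling (CPU) from the forward pass (GPU). A minimal sketch of how such a model-inference measurement can be made in PyTorch is shown below; batch shapes and names are illustrative, and CUDA synchronization is needed so that queued GPU work is included in the wall-clock time.

```python
import time
import torch

def time_inference(model, batches, device="cuda"):
    """Measure pure forward-pass time; `batches` yields CPU feature tensors (batch, 21, feat_dim)."""
    model = model.to(device).eval()
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for features in batches:
            model(features.to(device, non_blocking=True))
    if device.startswith("cuda"):
        torch.cuda.synchronize()                  # wait for all queued GPU kernels to finish
    return time.perf_counter() - start
```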
4 Discussion

A BERT model is commonly used in a pre-training and fine-tuning scheme. In the pre-training phase, BERT learns bidirectional representations from unlabeled data; the learned representations are then fine-tuned on task-specific data. This scheme has led to several state-of-the-art results on many downstream language understanding tasks. In line with the scale of the data, the number of BERT parameters is usually large, and training such a model requires a huge amount of computational resources; for example, the BERT models used for natural language modeling have 110M to 340M parameters (Devlin et al., 2018). In this work, we did not follow this scheme. Instead, we used the BERT architecture to provide a lightweight, non-recurrent alternative to the recurrent biRNN model. In our experiments, the BERT uses three attention layers with 4 attention heads and 100 hidden units per layer. The total number of model parameters is around 0.37M, which is even less than that of the biRNN (0.57M). In the future, when more nanopore methylation data becomes available, larger BERT models and the pre-training and fine-tuning scheme can be further explored.

5 Conclusion

In this work, we explored applying BERT models to nanopore methylation detection, aiming at a non-recurrent modeling approach for fast inference. We quantified the methylation-related positional signal-shift for different datasets of specific motifs/methyltransferases and found patterns that hold across datasets. During the evaluation, we found that the original BERT architecture does not work as well as the biRNN. We therefore proposed a refined BERT that incorporates task-specific characteristics into the modeling: compared with the original BERT, it uses a learnable positional encoding and self-attention with relative position representations, and it focuses more on the center positions within a ±3bp range. The experimental results show that the refined BERT achieves competitive or even better results than the state-of-the-art biRNN model on a set of 5mC and 6mA benchmark datasets, while its model inference speed is about 6x faster. In the cross-sample evaluation, where the training and test data come from different research groups, the BERT models (including the original BERT) perform better than the biRNN.

Acknowledgements

We would like to thank Marcus Stoiber and Jared Simpson for making their nanopore methylation data publicly available, Zaka Wing-Sze Yuen for providing the benchmark dataset and pipeline, and the authors of DeepMod and DeepSignal for providing their source code.

References

Devlin, J. et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Huang, Z. et al. (2020). Improve transformer models with better relative position embeddings. arXiv preprint arXiv:2009.13658.
Kim, D. et al. (2020). The architecture of SARS-CoV-2 transcriptome. Cell, 181(4), 914-921.
Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Liu, Q. et al. (2019). Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nature Communications, 10(1), 1-11.
Ni, P. et al. (2019). DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. Bioinformatics, 35(22), 4586-4595.
Shaw, P. et al. (2018). Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
Simpson, J. T. et al. (2017). Detecting DNA cytosine methylation using nanopore sequencing. Nature Methods, 14(4), 407.
Stoiber, M. H. et al. (2016). De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. bioRxiv, 094672.
Vaswani, A. et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.
Yuen, Z. W.-S. et al. (2020). Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing. bioRxiv.