title: MetaHTR: Towards Writer-Adaptive Handwritten Text Recognition
authors: Bhunia, Ayan Kumar; Ghose, Shuvozit; Kumar, Amandeep; Chowdhury, Pinaki Nath; Sain, Aneeshan; Song, Yi-Zhe
date: 2021-04-05

Handwritten Text Recognition (HTR) remains a challenging problem to date, largely due to the varying writing styles that exist amongst us. Prior works however generally operate with the assumption that there is a limited number of styles, most of which have already been captured by existing datasets. In this paper, we take a completely different perspective -- we work on the assumption that there is always a new style that is drastically different, and that we will only have very limited data during testing to perform adaptation. This results in a commercially viable solution -- the model has the best shot at adaptation being exposed to the new style, and the few-sample nature makes it practical to implement. We achieve this via a novel meta-learning framework which exploits additional new-writer data through a support set, and outputs a writer-adapted model via a single gradient-step update, all during inference. We discover and leverage the important insight that there exist a few key characters per writer that exhibit relatively larger style discrepancies. For that, we additionally propose to meta-learn instance-specific weights for a character-wise cross-entropy loss, which is specifically designed to work with the sequential nature of text data. Our writer-adaptive MetaHTR framework can be easily implemented on top of most state-of-the-art HTR models. Experiments show an average performance gain of 5-7% can be obtained by observing very few new-style data. We further demonstrate via a set of ablative studies the advantage of our meta design when compared with alternative adaptation mechanisms.

Handwritten Text Recognition (HTR) has been a long-standing research problem in computer vision [6, 35, 47, 29]. As a fundamental means of communication, handwritten text can appear in a variety of forms such as memos, whiteboards, handwritten notes, stylus input, postal automation, reading aids for the visually handicapped, etc. [49]. In general, the target of automatic HTR is to transcribe handwritten text to its digital content [40] so that the textual content can be made freely accessible. Handwriting recognition is inherently difficult due to its free-flowing nature and the complex shapes assumed by characters and their combinations [6]. Torn pages and warped or touching lines also make HTR more challenging. Most importantly however, handwritten texts are diverse across individual handwriting styles, where each style can be very unique [13, 23] -- while some might prefer an idiosyncratic style of writing certain characters like 'G' and 'Z', others may choose a cursive style with uneven line-spacing. Modern deep learning based HTR models [28, 40, 26] mostly tackle these challenges by resorting to a large amount of training data. The hope is that most style variations would have already been captured because of the data volume. Albeit with limited success, it has become apparent that these models tend to over-fit to styles already captured, while generalising poorly to those unseen.
This is largely owing to the uniquely different styles amongst writers -- there is always a new style that is unobserved, and is drastically different from those already captured (see Figure 1). The practical implication of this is, e.g., my iPad does not recognise my handwriting as well as it does my 4-year-old's. Our ultimate vision is therefore to offer an "adapt to my writing" button, where one is asked to write a specific sentence, so as to make recognition performance on my own writing on par with that on my child's. Prior work on resolving the style gap remains very limited. A very recent attempt turns to training on synthetic data, so as to help the model become more accommodating towards new styles [24]. However, synthetic data can hardly mimic all writer-specific styles found in the real world, especially when the style is very unique. Although domain adaptation and generalisation approaches might sound viable, they generally do not offer satisfactory performance (as shown later in experiments), and require additional training via multiple gradient update steps. The sub-optimal performance can mostly be attributed to the large and often very unique domain gaps that new writing styles bring, as opposed to the common dataset biases studied by domain adaptation/generalisation. In this paper, we turn to a meta-learning formulation, which not only yields performance that is of potential commercial value (from 81.3% to 89.2% Word Recognition Accuracy), but also offers quick adaptation (with just a single gradient update) using very few samples (≤16). The general motivation behind meta-learning [14, 31, 43] matches ours very well -- absorbing information from related tasks and generalising onto unseen ones, by performing quick adaptation using a small set of examples during testing. However, getting it to work with HTR has its own challenges, and to our knowledge it has not been tackled before in the literature. The main challenges come from the inherent character-sequence recognition nature of HTR, which is different from conventional meta-learning whose objective is mostly few-shot classification [14, 42]. Furthermore, we importantly discover that there also exist character-level style discrepancies, which, when unaccounted for, trigger a significant performance drop (see Section 3.4). To address these specific challenges, we first introduce a character-wise cross-entropy loss to our meta-learning framework. This, albeit a simple change, is crucial in light of the sequence recognition nature of our problem. We further guide the adaptation by introducing instance-specific weights on top of the character-wise loss, instead of treating all characters equally by simply averaging [40, 28]. Modelling such character-specific weights is however non-trivial, as no fixed weight labels exist to supervise the learning process. Consequently, we let the model learn-to-learn instance-specific weights for the character-wise cross-entropy loss during the adaptation step. Our final model, MetaHTR, is therefore a meta-learning pipeline for writer-adaptive HTR where the model itself adaptively weights different characters to prioritise learning from more discrepant characters. That is, during inference, our MetaHTR framework exploits a few additional handwritten images of a specific writer through a support set, and gives rise to a writer-adapted model via a single gradient update step (see Figure 1).
Our meta-learning design can be coupled with any state-of-the-art HTR model, and empirical investigation shows that the model-agnostic meta-learning (MAML) pipeline [14] provides a legitimate basis on which to design our MetaHTR framework. Contributions of this paper can be summarised as follows: (1) We introduce, for the first time, the problem of writer-adaptive HTR, where the model adapts to new writing styles with only very few samples during inference. (2) We introduce a meta-learning framework to tackle this new problem, by introducing learnable instance-wise weights for a character-specific loss specifically designed for HTR. (3) We confirm that our framework consistently improves upon even the most recent state-of-the-art methods. Text Recognition: The Connectionist Temporal Classification (CTC) layer [17] made end-to-end discriminative sequence learning possible. Subsequently, the CTC module was replaced by an attention-based decoding mechanism [25, 40] that encapsulates language modelling, weakly supervised character detection and character recognition under a single model. This involves a rectification network to handle irregular text, followed by the final text recognition network. Needless to say, the attentional decoder became the state-of-the-art paradigm for text recognition for both scene text [28, 50, 48, 51] and handwriting [6, 29, 47, 52]. Different incremental propositions have been made in this context, such as designing a multi-directional convolutional feature extractor [10], improving attention [9, 26], and stacking multiple BLSTM layers for better context modelling [28]. Besides word recognition accuracy, some works have focused on improving performance in the low-data regime, by designing an adversarial feature deformation module [6], learning an optimal augmentation strategy [29], and learning from synthetic training data via domain adaptation [52]. In this work, we introduce a new dimension of handwritten text recognition where the model can be adapted during inference based on a few handwritten word samples of the new writer, in order to cope with writer-specific handwriting style. Dealing with Writer Specificity: Writer identification [21, 20] has been a long-standing problem in the handwriting analysis community. Furthermore, the writer-specific nature of handwriting is accepted in forensic science [8, 21], and the handwritten signature [19] is used as an authentication medium in various official and banking sectors. Few-shot writer-specific handwriting generation started with online handwritten data [1] in coordinate form, and has since been realised for offline handwritten images [23]. Although the idea of style-specific adaptation [34, 15, 12] was introduced two decades ago, it has been limited to online handwritten characters [34], handcrafted feature normalisation [34], a few pre-defined handwriting styles (not user-specific) and a fixed lexicon vocabulary [12]. Nevertheless, there has been no work harnessing the full potential of end-to-end trainable deep models for writer-adaptive, lexicon-free offline handwritten word recognition. Adaptation can be done without any increase in model parameters -- leading to cost-effective deployment. Meta-learning: Meta-learning aims to train a model on a series of related tasks, such that it learns an unseen task with only a few training samples [14]. One way is to learn an optimal initialisation, such that the model quickly adapts to new tasks with little data [42, 45].
Various meta-learning algorithms can be broadly categorised into three groups. While memory-network based methods [32] learn across-task knowledge and aim to generalise to the unseen task, metric-based methods [42] aim to model a metric space in which learning is efficient with only a few samples. These two approaches are mostly architecture-specific, and have been employed for the few-shot classification problem. In contrast, there has been significant attention towards optimisation-based meta-learning algorithms [14, 31, 43, 2] due to their model-agnostic nature. Specifically, we choose the recently introduced model-agnostic meta-learning (MAML) algorithm [14], as it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems. MAML aims to encode the prior knowledge into the optimisation process for fast adaptation, and several variants have been proposed [31, 2, 38, 39]. Later, MAML++ [2] introduced a set of tricks to stabilise the training of MAML. MetaSGD [27] proposes to train learnable learning rates for every parameter. Inspired by the success of domain-adaptive dialog generation [36], we introduce MAML for writer-adaptive handwritten text recognition. Nevertheless, we extend MAML for the instant-adaptive sequence-recognition task beyond its off-the-shelf version [14], which was initially proposed for the non-sequential few-shot classification problem. Overview: Traditionally, an HTR model takes a handwritten text image X as input and generates an output character sequence Y = (y_1, y_2, ..., y_L), where L is the variable length of the text. Conventional HTR models [4] learn from multiple data instances, often denoted as a training dataset $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$. Due to such data-instance specific training, this ignores the writer-specific data distribution, without modelling the shared common knowledge [22] across different writers. Hence, the performance deteriorates on handwritten text images of unseen writers because of poor generalisation to diverse handwriting styles. In contrast, we take a meta-learning approach which seeks to learn the general rules of handwriting recognition from a distribution of multiple writer-specific handwritten text recognition tasks. Let $\mathcal{W}_S$ and $\mathcal{W}_T$ denote the disjoint training and testing writer sets respectively, i.e., $\mathcal{W}_S \cap \mathcal{W}_T = \varnothing$. The training and testing sets are denoted as $\mathcal{D}^S = \{\mathcal{D}^S_1, \dots, \mathcal{D}^S_{|\mathcal{W}_S|}\}$ and $\mathcal{D}^T = \{\mathcal{D}^T_1, \dots, \mathcal{D}^T_{|\mathcal{W}_T|}\}$. Every i-th writer, in both the training and testing set, has its own set of N_i labelled images $\mathcal{D}_i = \{(X_j, Y_j)\}_{j=1}^{N_i}$. During training, data is sampled across writer-specific tasks from the training set D^S to learn a good initialisation point θ, by modelling the shared knowledge across different writers -- such that it can quickly adapt to any new writer using few examples. During inference, with respect to the j-th writer from the testing set D^T_j, we consider having access to k (very few) labelled samples, based on which we update θ → θ'_j using just one gradient step to obtain a writer-specialised HTR model -- this is called k-shot adaptation. Nowadays, state-of-the-art text recognition networks, many of which were originally proposed for scene text [40], are simultaneously validated [6, 29, 52, 47] over HTR datasets, as both follow a unified framework and objective. Therefore, with the attentional-decoder based pipeline being the current state of the art for text recognition, we select three seminal works, namely ASTER [40], SAR [26] and SCATTER [28], to use as our baseline HTR models. Moreover, ours is a meta-framework and could be adopted with most deep text recognition pipelines.
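The writer-specific episodic sampling described above can be sketched as follows; this is a minimal illustration and not the authors' released code, and the function name `sample_meta_batch`, the `writer_data` dictionary layout and the split sizes are assumptions made for exposition.

```python
import random

def sample_meta_batch(writer_data, meta_batch_size=8, batch_size=16):
    """Sample a meta-batch of writer-specific tasks T_i ~ p(T).

    writer_data: dict mapping writer_id -> list of (image, label) pairs
                 (hypothetical layout). Each task draws 2*batch_size samples
                 from a single writer and splits them into a support set
                 (inner-loop update) and a validation set (outer-loop update).
    """
    eligible = [w for w, samples in writer_data.items()
                if len(samples) >= 2 * batch_size]
    tasks = []
    for writer_id in random.sample(eligible, meta_batch_size):
        picked = random.sample(writer_data[writer_id], 2 * batch_size)
        support, validation = picked[:batch_size], picked[batch_size:]
        tasks.append((support, validation))
    return tasks
```

The same routine, restricted to k samples from a single unseen writer, would produce the support set used for k-shot adaptation at inference.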
For completeness, we briefly summarise the outline of a text recognition model. In general, it consists of four components: (a) a convolutional feature extractor, (b) BLSTM layers for context modelling, (c) an RNN decoder predicting the characters autoregressively, one at each time step, and (d) an attentional block. Let the extracted convolutional feature map be $\mathbf{F} \in \mathbb{R}^{h' \times w' \times d}$ for a rectified image input, where h', w' and d signify its height, width and number of channels. Every d-dimensional feature at F_{i,j} encodes a particular local image region based on the receptive fields, and the feature map can be reshaped into a list of vectors which are fed to a BLSTM. The BLSTM captures long-range dependencies at every position, thus alleviating the constraint of the limited receptive field and producing a list of context-rich vectors. At every time step t, the decoder RNN predicts an output character or end-of-sequence (EOS) token y_t based on three factors: (a) the previous internal state s_{t-1} of the decoder RNN, (b) the character y_{t-1} predicted in the last step, and (c) a glimpse vector g_t representing the most relevant part of F for predicting y_t. In order to obtain g_t, the previous hidden state s_{t-1} acts as a query to discover the attentive regions over the context-rich vectors, and g_t is computed as their attention-weighted sum. The decoder state is then updated as s_t = RNN(s_{t-1}, [E(y_{t-1}); g_t]), where E(·) is a character embedding layer with embedding dimension R^128, and [·] signifies a concatenation operation. Finally, y_t is predicted as

$$p(y_t) = \mathrm{softmax}(W_o\, s_t + b_o).$$

We denote the complete set of parameters for every baseline as θ, and particularly that of the final classification layer as φ = {W_o, b_o}. SAR [26] uses 2D attention to eliminate the need for an image rectification network [40], and SCATTER [28] couples multiple BLSTM layers for richer context modelling on top of [40]. We refer the reader to [40, 26, 28] for further architectural details.

Figure 2 (caption): Our MetaHTR framework involves a bi-level optimisation process. The inner-loop optimisation computes a learnable character instance-weighted loss L_inner on the support set, followed by obtaining a pseudo-updated model (θ'). This includes a learnable character instance-specific weight prediction module (g_γ) and learnable layer-wise learning-rate parameters (α). We expect θ' to generalise well on the remaining validation set, thus finally updating the meta-parameters (θ, γ, α) by the outer-loop loss L_outer over the validation set.

A popular optimisation-based meta-learning algorithm is model-agnostic meta-learning (MAML) [14]. Here, the goal is to learn good initialisation parameters θ that represent across-task shared knowledge among related tasks, so that the model can quickly adapt to any novel task of the same distribution with only a few gradient update iterations. Let T represent multiple tasks, where T_i denotes the i-th task sampled from some task distribution p(T), i.e. T_i ∼ p(T). In our case, T_i is a task containing labelled training data from a specific writer, D^S_i. Each task T_i consists of a support set D^tr and a validation set D^val. Additionally, let a neural network be represented by f_θ, where θ is the initial parameter of the network. Intuitively, MAML tries to find a good initialisation of the parameters θ, representing the prior or meta-knowledge, so that a few updates of θ using D^tr can make large improvements by reducing the error measures and boosting the performance on D^val.
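Before detailing the meta-learning formulation, a minimal sketch of one decoding step of the recogniser outlined above may help; this is an illustrative PyTorch fragment under assumed module names and dimensions (e.g. `AttnDecoderStep`, dot-product attention, a GRU cell), not the exact architecture of ASTER, SAR or SCATTER.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    """One autoregressive decoding step of an attentional text recogniser.

    The previous hidden state queries the BLSTM context vectors, a glimpse
    vector g_t is formed as their attention-weighted sum, and the next
    character distribution is predicted from the updated state.
    """
    def __init__(self, ctx_dim=512, hidden_dim=256, num_classes=97, emb_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(num_classes, emb_dim)     # character embedding E(.)
        self.attn_query = nn.Linear(hidden_dim, ctx_dim)         # maps s_{t-1} to query space
        self.rnn = nn.GRUCell(emb_dim + ctx_dim, hidden_dim)     # consumes [E(y_{t-1}); g_t]
        self.classifier = nn.Linear(hidden_dim, num_classes)     # phi = {W_o, b_o}

    def forward(self, prev_state, prev_char, context):
        # context: (B, h'*w', ctx_dim) BLSTM outputs; prev_state: (B, hidden_dim)
        query = self.attn_query(prev_state)                          # (B, ctx_dim)
        scores = torch.bmm(context, query.unsqueeze(2)).squeeze(2)   # dot-product attention
        alpha = F.softmax(scores, dim=1)                             # attention weights
        glimpse = torch.bmm(alpha.unsqueeze(1), context).squeeze(1)  # g_t
        rnn_in = torch.cat([self.embedding(prev_char), glimpse], dim=1)
        state = self.rnn(rnn_in, prev_state)                         # s_t
        logits = self.classifier(state)                              # scores over characters
        return logits, state
```

A full recogniser would run this step autoregressively until the EOS token, feeding back the predicted character at each step.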
To learn this optimal initialisation parameter θ, we first adapt f_θ (task-specifically) to T_i using D^tr by fine-tuning:

$$\theta'_i = \theta - \alpha \nabla_{\theta}\, \mathcal{L}(\theta; \mathcal{D}^{tr}). \qquad (2)$$

Evaluation of the adapted model is performed on unseen examples sampled from the same task, D^val ∈ T_i, to measure the generalisation of f_θ'. This acts as feedback for MAML to adjust its initialisation parameters θ to achieve better generalisation on any T_i (across-task):

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}(\theta'_i; \mathcal{D}^{val}). \qquad (3)$$

Let the baseline text recognition model in our case (parameterised by θ) be represented as f_θ. Instead of naive writer-specific fine-tuning, which usually requires hundreds of gradient updates, we seek to learn the general rules [14] of handwriting recognition using multiple writer-specific handwritten text recognition tasks. Meta-training involves sampling tasks, which here are defined with respect to each specific writer. In particular, T_i ∼ p(T) indicates selecting 2B labelled samples from the i-th writer's training set D^S_i, out of which we make D^tr for the inner-loop update and D^val for the outer-loop update, each containing B samples. It should be noted that model parameters are updated by averaging the gradients [11] of the outer-loop loss over a meta-batch of tasks. Given this task definition and any baseline text recognition model (Section 3.1), one can train a meta-learning model [11], given access to a loss function. A naive approach would be to use the traditional cross-entropy loss [40], which usually trains any attentional-decoder based text recognition system, for both the inner (Eqn. 2) and outer loop (Eqn. 3). If the output from the text recognition model is Ȳ = {ȳ_1, ȳ_2, ..., ȳ_L}, the character-wise (CW) cross-entropy (ce) loss summed over the ground-truth output sequence Y = {y_1, y_2, ..., y_L} can be defined as:

$$\mathcal{L}_C = \frac{1}{L}\sum_{t=1}^{L} \mathcal{L}^{t}_{ce}, \qquad \mathcal{L}^{t}_{ce} = -\log p(\bar{y}_t = y_t \mid X). \qquad (4)$$

Motivation: Being a sequence recognition problem, L_C involves a summation operation [40] over the character sequence, thus treating every character-specific cross-entropy loss equally. We conjecture that this task-specific adaptation for sequence recognition could be boosted if the weight values for each character instance-specific loss were learned, such that the model adapts better with respect to those characters having a high discrepancy. Intuitively speaking, our model learns knowledge across tasks, where, given a word 'covid' from a new specific writer, the properties of certain handwritten characters (e.g. 'c', 'v', 'i') could be close to the encoded knowledge of MAML's initialisation parameters [5], making recognition easier. On the contrary, significant discrepancy could exist for certain characters (e.g. 'o', 'd') that are difficult to recognise using the average knowledge encapsulated inside MAML's initialisation parameters. Thus, during fast adaptation, the model needs to update itself by prioritising the optimisation with respect to those particular characters (e.g. 'o', 'd') whose style variation is largely unknown to the model's initialisation. In other words, for faster adaptation via the inner-loop loss, we intend to learn the instance-specific weights of the character-wise cross-entropy loss instead of simply averaging over all characters. Recent literature shows that meta-learning provides the flexibility to learn any hyperparameters [14], parameterised loss functions [7], learning rates [27], or weight attenuation [5] in the meta-learning process itself. Character-wise recognition accuracy for different writers before and after adaptation is plotted in Figure 3.
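A minimal sketch of the character-wise loss in Eqn. 4, and of the instance-weighted variant used later in the inner loop, could look as follows; the function names and the exact normalisation (mean versus sum) are assumptions for illustration rather than the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def charwise_ce(logits, targets):
    """Per-character cross-entropy losses L^t_ce for one decoded word.

    logits:  (L, num_classes) decoder outputs, one row per time step.
    targets: (L,) ground-truth character indices.
    Returns an unreduced (L,) tensor, one loss per character.
    """
    return F.cross_entropy(logits, targets, reduction="none")

def sequence_loss(logits, targets, char_weights=None):
    """Eqn. 4 when char_weights is None (all characters treated equally);
    otherwise the character instance-weighted sum used for the inner loop."""
    per_char = charwise_ce(logits, targets)
    if char_weights is None:
        return per_char.mean()                 # L_C
    return (char_weights * per_char).sum()     # weighted inner-loop loss
```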
From Figure 3, it can be seen that the characters which get low accuracy using MAML's initialisation parameters before adaptation (x-axis) also get low accuracy even after adaptation (y-axis). However, our proposed learnable character-instance specific weights in MetaHTR help to enhance the performance of those discrepant characters after adaptation. A more insightful analysis is given in Section 4.3. Meta-Optimisation: Naturally, the question arises as to what information could be used to determine these weights. Recent studies show that the gradients used for fast adaptation (inner loop) contain information [5] related to disagreement (i.e. knowledge that still needs to be learned or accumulated in the adaptation process) with respect to the model's initialisation parameters. As calculating gradients with respect to all the model's parameters is quite cumbersome, we calculate the gradient of the t-th character-specific cross-entropy loss with respect to the final classification layer (parameters φ) as ∇_φ L^t_ce(θ). It is then concatenated with the gradient of the mean loss (Eqn. 4), which sums over the character sequence, with respect to φ (both gradient matrices being flattened): G_t = concat(∇_φ L^t_ce(θ), ∇_φ L_C(θ)). We postulate that the gradients of the mean and character-instance specific losses provide knowledge towards determining how to weigh the different character-specific losses. Thus, we pass this G_t through a network g_γ predicting a scalar weight value for the t-th character-specific loss:

$$\gamma_t = g_{\gamma}(G_t). \qquad (5)$$

Here, g_γ is a 3-layer MLP network with parameters γ, followed by a sigmoid to generate the weights. Therefore, the instance-weighted inner-loop loss becomes:

$$\mathcal{L}_{inner} = \sum_{t=1}^{L} \gamma_t \cdot \mathcal{L}^{t}_{ce}. \qquad (6)$$

Traditional MAML uses a predefined constant learning rate α in the inner-level optimisation. Inspired by [27, 46], we instead specify a learnable rate for each layer:

$$\theta'_i = \theta - \alpha \odot \nabla_{\theta}\, \mathcal{L}_{inner}(\theta; \mathcal{D}^{tr}), \qquad (7)$$

where α is a vector of size equal to the number of layers in the baseline HTR model. The outer-loop loss is kept as the traditional L_C (see Figure 2). Please note that θ' is dependent on {θ, γ, α} through the inner-loop update (Eqn. 7), and all three meta-parameters (θ, γ, α) are meta-learned via the outer-loop update as (θ, γ, α) ← (θ, γ, α) − β ∇_(θ,γ,α) Σ_{T_i} L_outer(θ'_i; D^val). The training and inference processes are summarised in Algorithms 1 and 2, respectively. Datasets: We evaluate the performance of our writer-adaptive MetaHTR on two popular datasets of Latin script, IAM [30] and RIMES [18]. While IAM contains a total of 115,320 English handwritten word images written by 657 different writers, RIMES consists of 66,982 French word images from 1,300 different writers. Both datasets contain word samples with annotated writer information, thus enabling the sampling of writer-specific meta-batches to perform episodic training [14]. For RIMES, we use samples from a subset of 375 writers, which is also commonly used for the writer identification task [41]. Following [6], we use the same partition for training, validation and testing as provided for IAM, while the partition released by the ICDAR 2011 competition is used for RIMES. (Algorithm 2, writer-specific adaptation: θ'_j = θ − α ⊙ ∇_θ L_inner(θ; D^tr_j); return the writer-specialised HTR model parameters θ'_j.) Implementation Details: Following traditional supervised learning protocols [4], we first pre-train every considered baseline HTR model using the ADADELTA optimiser with learning rate 1 and a batch size of 64. Thereafter, we perform the meta-training process on the pre-trained baseline model's parameters for 20 epochs according to Algorithm 1.
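Putting the pieces of the meta-optimisation together, the sketch below illustrates one possible PyTorch realisation of the weighted inner-loop step (Eqns. 5-7); it is an assumption-laden illustration rather than the authors' code: the helper that supplies per-character support-set losses, the per-parameter-tensor layout of the learnable rates `alphas`, and the hidden size of the `WeightPredictor` MLP are hypothetical choices, and the single-word case is shown for clarity.

```python
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    """g_gamma: a 3-layer MLP followed by a sigmoid, mapping the flattened
    gradient feature G_t to a scalar weight gamma_t (Eqn. 5)."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, grad_feat):
        return self.mlp(grad_feat).squeeze(-1)

def gradient_features(per_char_losses, phi_params):
    """Build G_t = concat(grad_phi L^t_ce, grad_phi L_C) for every character t."""
    flat = lambda grads: torch.cat([g.reshape(-1) for g in grads])
    mean_grad = flat(torch.autograd.grad(per_char_losses.mean(), phi_params,
                                         retain_graph=True, create_graph=True))
    feats = [torch.cat([flat(torch.autograd.grad(loss_t, phi_params,
                                                 retain_graph=True, create_graph=True)),
                        mean_grad])
             for loss_t in per_char_losses]
    return torch.stack(feats)                      # shape (L, 2 * |phi|)

def metahtr_inner_step(model, gamma_net, alphas, support_losses, phi_names):
    """One inner-loop update: weight each character loss with g_gamma (Eqn. 6)
    and take a single gradient step with learnable per-layer rates (Eqn. 7).

    support_losses: (L,) per-character cross-entropy losses on the support set,
                    still attached to the computation graph of `model`.
    alphas: dict {parameter_name: learnable learning rate}, roughly one per layer.
    phi_names: names of the final classifier parameters, i.e. {W_o, b_o}.
    """
    params = dict(model.named_parameters())
    phi_params = [params[n] for n in phi_names]
    gammas = gamma_net(gradient_features(support_losses, phi_params))    # Eqn. (5)
    inner_loss = (gammas * support_losses).sum()                         # Eqn. (6)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    return {name: p - alphas[name] * g                                   # Eqn. (7)
            for (name, p), g in zip(params.items(), grads)}
```

The returned dictionary of adapted parameters would then be used for a functional forward pass over the validation set (for example via torch.func.functional_call), and the resulting plain L_C would be backpropagated into (θ, γ, α) by the meta-optimiser, keeping the second-order path through the inner step intact.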
Only one inner-loop update is used during inference (Algorithm 2) unless otherwise mentioned. Additionally, the effect of increasing the number of inner-loop updates is shown in our ablative study (Section 4.3). During meta-training, we consider a meta-batch size of M = 8 -- our meta-batch comprises 8 different writer-specific tasks T_i, which are used for updating the meta-parameters by taking the average gradient. Within each task, the batch size of the support and validation sets is B = 16. We use ADAM as the meta-optimiser with an outer-loop learning rate β of 0.0001, while the inner-loop learning rate α is meta-learned along with the instance-specific weights γ_t of the character-wise cross-entropy loss. We implemented our framework in PyTorch [33] and conducted experiments on an 11 GB Nvidia RTX 2080-Ti GPU. Evaluation Metric: We use Word Recognition Accuracy (WRA) [6] for both with-Lexicon (L) and No-Lexicon (NL) (unconstrained) HTR. As there is no separate adaptation set (support set) explicitly defined for the testing-set writers W_T in either of these datasets, we do the following: let N^T_j be the total number of images under test writer j; we take k random images (for k-shot adaptation) as the support set for adaptation, and the adapted model is evaluated on the remaining (N^T_j − k) images. We do this ten times, and cite the average result to reduce randomness. We use k = 16 for our cited results unless mentioned otherwise. Due to this adaptation-set constraint, only those writers having more than 32 word images contribute towards the accuracy calculation. For fairness, we ensured a uniform adaptation and testing set for all the competitive baselines. To the best of our knowledge, there exists no prior work particularly dealing with writer-specific adaptation for offline handwritten word images. However, we design several strong baselines from five different perspectives to justify our MetaHTR framework. (i) Learning Augmentation Approach: Recently, there have been attempts to learn an efficient data-augmentation strategy that captures the style variation present in handwritten data, using a learnable agent [29] or an adversarial feature deformation module [6]. (ii) Generative Approach: One can synthetically generate [23] multiple handwritten images with different words, mimicking someone's handwriting style from a few given handwritten examples (the adaptation set). Thereafter, naive fine-tuning [44] can be done over the large (5K in our experiment) synthetically generated data, treating it as the writer's style-specific training set. (iii) Meta-Learning based Adaptation: We follow this paradigm in our MetaHTR framework; however, there are some off-the-shelf alternatives. An obvious choice is naive fine-tuning [44] over the same labelled images of the adaptation set. We compare our method with the typical MAML [14] formulation, along with its first-order approximated version (MAML-FO) [14], to judge how far the performance drops while improving computational speed. We also compare with MetaSGD [27], which uses a learnable learning rate for each parameter, and ANIL [37], where only the final classification layer (φ) is pseudo-updated in the inner loop for computational ease. (iv) Domain Adaptation (DA) Approach: All the training writers are considered as a single source domain, and the trained model is adapted using samples from a specific writer via adversarial learning, as used in [24].
(v) Domain Generalisation (DG) Approach: Domain generalisation aims to learn a generalised model via episodic training from a writer-specific task distribution, which can directly perform well across unseen writers without any further gradient update. Following [16], we can twist our meta-learning pipeline to fit the objective of DG. Thus, we optimise the baseline HTR model using a weighted (λ = 0.5) summation of the gradient (over the meta-train split) and the meta-gradient (over the meta-test split through the inner-loop update). Mathematically, using our notation: argmin_θ λ · L(θ; D^tr) + (1 − λ) · L(θ'; D^val), where L is the loss function (see Eqn. 4) and θ' is the parameter pseudo-updated by the inner loop with learning rate 0.0005. It is worth noting that DG [16] and augmentation-based approaches [29] cannot be compared directly to ours, as they do not involve any model-update step at test time. The unconstrained WRA on IAM is used to cite any performance gap throughout the rest of the paper, unless mentioned otherwise. In Table 1, we compare our MetaHTR framework with the corresponding state-of-the-art (SOTA) baselines [40, 26, 28] and the naive fine-tuning [44] method. MetaHTR outperforms (Figure 4) every SOTA baseline by a significant margin of around 5-7%. Furthermore, we compare with five classes of alternative approaches in Table 2 to tackle the style variation from different writers. We observe the following: (i) GA: While naive fine-tuning hardly gives any improvement, the generative approach opens up the possibility of generating multiple images with different words by mimicking a particular writer's handwriting style from the same adaptation set as used in MetaHTR. Followed by naive fine-tuning, this does improve over the baseline HTR models by 2.3%, but lags behind our MetaHTR framework by 5.6% for the ASTER baseline. We attribute this to the inconsistency of style in the generated images [23] with respect to any given writer. Fine-tuning via synthetically generated images hurts the HTR model's performance on real handwritten samples due to the inherent domain gap [16]. Furthermore, style-conditioned handwritten image generation involves a separate cumbersome network, making it computationally more expensive. (ii) DG: Although the DG approach [16] improves the performance for unseen writers compared to the baseline models, it still lags behind our MetaHTR framework by 4.6%, 5.8% and 5.7% with respect to our three baselines, for the obvious reason of not exploiting writer-specific few-shot labelled data during inference. Moreover, it could be a very straightforward alternative for cases where we do not have any access to a specific writer's samples, but can still enrich the model with the writer-specific data distribution via episodic training [16], to learn common knowledge across writers for better performance than the baseline models. (iii) Augmentation: The augmentation-based approach improves the performance on top of the baseline models by incorporating learnable synthetic deformations, both in image space [29] and feature space [6]. Having no option of using a specific writer's handwritten examples, it too falls short of our proposed method. (iv) DA: The performance of DA is found to be limited due to the scarce adaptation data and the alleged instability of adversarial training. (v) Meta-Learning based Adaptation: Our closest competitors are the gradient-based meta-learning alternatives [22]. Out of all of these, MAML [14] scores quite close to ours, yet lags by 2.1%, 2.4% and 2.3% for ASTER, SAR and SCATTER respectively.
Although the first-order approximation of MAML (MAML-FO) is computationally simpler, its unconstrained WRA drops by 0.2% compared to MAML on the ASTER baseline. We want to emphasise that our model needs second-order gradient computation [14] in the outer-loop process, as g_γ is related to L_outer through the inner-loop update. MetaSGD [27] needs double the number of parameters of MAML, as it meta-learns a learning rate value for every parameter. To our surprise however, it performs worse than MAML fitted on top of our baseline text recognition models. Probably, the need for extensive parameter updates leads the meta-learning process towards over-fitting, thus failing to generalise. In contrast, we use layer-wise learnable learning rates in MetaHTR, which is computationally far less expensive and provides better generalisation. Although computationally cheaper, ANIL [37] is still inferior to the MAML baseline. We attribute the superiority of our method over other gradient-based meta-learning algorithms for the sequence recognition task (e.g. HTR) to two main factors: the learnable character-instance specific weighting mechanism for the inner-loop loss, and the re-designed layer-wise learnable learning rates. [i] Significance of learnable γ_t: (a) To show the efficacy of the learnable instance-specific weight for the character-specific loss, we remove g_γ and use the simple mean cross-entropy loss (Eqn. 4) for the inner-loop update. By doing this, the performance drops by 1.9%, 2.2% and 2.1% for ASTER, SAR and SCATTER, respectively, on the IAM dataset. (b) Next, we dig deeper to verify whether different characters of the same writer really show discrepancy [5] or not. For that, we evaluate character-specific accuracy using our MetaHTR model with the learned initialisation parameters [14]. HMM-based Viterbi forced alignment is used to locate and crop out every character from the word images in a cost-effective way. From Figure 6, it is qualitatively evident that there exists significant variation in recognition accuracy across different characters -- signifying that a few characters are harder to adapt to, or recognise, than others due to wide style discrepancy. For further analysis, we plot the average (over the support set) character instance-specific weight predicted by our meta-learned model with respect to a particular writer, and the result is fairly consistent with the character recognition result. This indicates that those characters which obtain low recognition accuracy mostly receive higher weights in the inner-loop loss calculation, and vice versa, which strongly supports our intuition. (c) Furthermore, we explore using the individual gradient ∇_φ L^t_ce coming from every t-th character prediction of the attentional decoder, without concatenating it with the mean gradient ∇_φ L_C, for the γ_t calculation (Eqn. 5). However, the resulting performance drop (by 1.9%) using the ASTER baseline implies that the character-instance specific gradient, along with the mean gradient, provides more context to judge the character-wise style discrepancy with respect to the initialisation parameters. [ii] Layer-wise learnable learning rate: To analyse the contribution of the learnable layer-wise learning-rate mechanism, we replace it with a fixed inner-loop learning rate of 0.001 (optimised), keeping the rest of the design the same. This leads to a drop of 0.8%, 0.6% and 0.6% for the respective three baselines, thus justifying its contribution. Furthermore, MetaSGD [27] is nearly 2.5x as computationally expensive as MAML on our HTR baselines during inference.
Although the computational overhead rises due to our module g_γ, the performance gain of nearly 2% outweighs the additional 0.2x inference time compared to MAML under a similar setup. [iii] Size of adaptation data: A few examples are enough to achieve instant adaptation, as suggested by many existing few-shot works [36, 11]. Here, we vary the size of the adaptation set (k) in Figure 5 to observe the effect on recognition accuracy. The performance nearly saturates between 10-20 samples, thus justifying the few-shot design. Moreover, our MetaHTR tends towards saturation with slightly fewer samples than MAML under the same setup. [iv] Number of adaptation steps: The number of adaptation steps during inference is varied in Figure 5. In summary, just a single gradient-step update, as used in most of our experiments, shows the highest performance gain. On the contrary, more updates sometimes showed diminishing results, which contradicts the tendency reported in [14]. The reason might be that the inner loop concentrates on unnecessary style details, thus forgetting the generic prior knowledge learned. In this paper, we proposed a novel writer-adaptive offline handwritten text recognition framework which aims to fully utilise the additional writer-specific information available at test time. We employ an extension of the model-agnostic meta-learning (MAML) algorithm to train our writer-adaptive HTR network, which can quickly adapt its parameters according to the writer's handwritten text. The proposed framework is applied to three existing text recognition models without changing their architectures, and shows consistently improved performance on multiple handwriting benchmark datasets.

References:
[1] Deepwriting: Making digital ink editable via deep generative modeling.
[2] How to train your MAML. ICLR.
[3] RIMES evaluation campaign for handwritten mail processing.
[4] Seong Joon Oh, and Hwalsuk Lee. What is wrong with scene text recognition model comparisons? Dataset and model analysis.
[5] Learning to forget for meta-learning.
[6] Handwriting recognition in low-resource scripts using adversarial learning.
[7] Meta-learning via learned loss.
[8] Chin-Teng Lin, Weiping Ding, and Zehong Cao. Semi-supervised feature learning for improving writer identification.
[9] Focusing attention: Towards accurate text recognition in natural images.
[10] AON: Towards arbitrarily-oriented text recognition.
[11] Scene-adaptive video frame interpolation via meta-learning.
[12] Writer adaptation for online handwriting recognition. T-PAMI.
[13] Text and style conditioned GAN for generation of offline handwriting lines.
[14] Model-agnostic meta-learning for fast adaptation of deep networks.
[15] Writer adaptation for handwritten word recognition using hidden Markov models.
[16] Learning meta face recognition in unseen domains.
[17] Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks.
[18] ICDAR 2009 handwriting recognition competition.
[19] Learning features for offline handwritten signature verification using deep convolutional neural networks.
[20] Deep adaptive learning for writer identification based on single handwritten word images.
[21] FragNet: Writer identification using deep fragment networks. T-IFS.
[22] Meta-learning in neural networks: A survey.
[23] GANwriting: Content-conditioned generation of styled handwritten word images.
[24] Unsupervised adaptation for synthetic-to-real handwritten word recognition.
[25] Recursive recurrent nets with attention modeling for OCR in the wild.
[26] Show, attend and read: A simple and strong baseline for irregular text recognition.
[27] MetaSGD: Learning to learn quickly for few-shot learning.
[28] SCATTER: Selective context attentional scene text recognizer.
[29] Learn to augment: Joint data augmentation and network optimization for text recognition.
[30] The IAM-database: An English sentence database for offline handwriting recognition.
[31] On first-order meta-learning algorithms.
[32] TADAM: Task dependent adaptive metric for improved few-shot learning.
[33] PyTorch: An imperative style, high-performance deep learning library.
[34] A constructive RBF network for writer adaptation.
[35] CNN-N-gram for handwriting word recognition.
[36] Domain adaptive dialog generation via meta learning.
[37] Rapid learning or feature reuse? Towards understanding the effectiveness of MAML.
[38] Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization.
[39] StyleMeUp: Towards style-agnostic sketch-based image retrieval.
[40] ASTER: An attentional scene text recognizer with flexible rectification. T-PAMI.
[41] Text independent writer recognition using redundant writing patterns with contour-based orientation and curvature features. Pattern Recognition.
[42] Prototypical networks for few-shot learning.
[43] Meta-transfer learning for few-shot learning.
[44] A dataset of datasets for learning to learn from few examples.
[45] Matching networks for one shot learning.
[46] Tracking by instance detection: A meta-learning approach.
[47] Decoupled attention network for text recognition.
[48] Symmetry-constrained rectification network for scene text recognition.
[49] OrigamiNet: Weakly-supervised, segmentation-free, one-step, full page text recognition by learning to unfold.
[50] Towards accurate scene text recognition with semantic reasoning networks.
[51] Verisimilar image synthesis for accurate detection and recognition of texts in scenes.
[52] Sequence-to-sequence domain adaptation network for robust text image recognition.