title: Transfer Learning for Multi-lingual Tasks -- a Survey
authors: Jafari, Amir Reza; Heidary, Behnam; Farahbakhsh, Reza; Salehi, Mostafa; Jalili, Mahdi
date: 2021-08-28

These days, platforms such as social media give clients from different backgrounds and languages the possibility to connect and exchange information. It is no longer surprising to see comments in different languages under posts published by international celebrities or data providers. In this era, understanding cross-language content and multilingualism in natural language processing (NLP) are hot topics, and multiple efforts have tried to leverage existing NLP technologies to tackle this challenging research problem. In this survey, we provide a comprehensive overview of the existing literature with a focus on transfer learning techniques in multilingual tasks. We also identify potential opportunities for further research in this domain.

The phenomenon of multilingualism in different NLP tasks is one of the most exciting and demanding topics in this area. During the past decade, these topics have been the center of attention in the linguistic and computer science communities, alongside the increasing use of transfer learning in NLP. The task is even more critical given the extensive usage of social media and the massive engagement of end-users across the world with trending topics. As multilingualism gains massive attention in order to reach better performance in multilingual NLP tasks and applications, an overview of the history of transfer learning in language models, the main published language models, and more specifically multilingual, cross-lingual, and even language-specific models is necessary to step into this field. In our survey, we mainly focus on multilingual models and tasks; see Figure 1 for an illustration of the main components presented in this paper. First, we start by reviewing the main concepts and a brief history of language models. We then categorize them into three main groups in Section 2. Later, in Section 3, we focus on reviewing architecture and structure from a multilingual perspective and specify the importance of cross-lingual and multilingual models in NLP. Furthermore, we introduce the available datasets for each application to help those who want to work on a specific domain in this subject. Since evaluating these language models is possible by analyzing different pre-defined NLP applications, we focus on these applications in Section 4 and review the existing literature that evaluates models on these applications in different languages. In Section 5, we discuss future directions and the challenges that exist in this subject, aiming to provide a useful perspective for future studies.
Table 1: Recent surveys in the field of transfer learning and multilingual NLP tasks
• A survey of transfer learning [105] (2016). Their focus: the transfer learning paradigm and its current solutions and applications. Our survey: an overview of transfer learning with fewer details and more focus on this paradigm in multilingual models and applications.
• A survey of cross-lingual word embedding models [89] (2017). Their focus: a comprehensive typology of cross-lingual word embedding models and a comparison of their data requirements and objective functions. Our survey: we focus on the outputs of models and discuss structure and word embeddings only as far as needed to help readers understand those outputs.
• Transfer Learning in Natural Language Processing (NLP) [88] (2019). Their focus: a succinct yet complete understanding of recent advances in NLP using deep learning, with a special focus on transfer learning and its potential advantages. Our survey: we assume readers have a basic knowledge of transfer learning and focus on transfer learning in multilingual models.
• A survey of multilingual neural machine translation [23] (2019). Their focus: an in-depth survey of the existing literature on MNMT, categorizing approaches by resource scenario and underlying modeling principles. Our survey: a more general overview of multilingual tasks that includes machine translation but is not limited to a specific task.
• Cross-lingual learning for text processing: A survey [80] (2021). Their focus: a comprehensive table of all the surveyed papers with various data related to the cross-lingual learning techniques they use. Our survey: we take a model perspective and focus more on multilingual language models.

The concept of transfer learning and its solutions in machine learning and data mining has been reviewed in [105], and a more detailed history of the evolution of transfer learning in NLP can be found in [88]. [85] provides a comprehensive survey of pre-trained models in NLP and organizes the existing pre-trained models into a taxonomy. Also, [23][79] reviewed multilingualism in specific NLP tasks such as machine translation and text processing. Our survey aims to provide a more general overview of multilingualism across different tasks and to introduce language models that deal with multiple languages or with a lower-resource language. Table 1 summarizes our main focus in this work and how it compares with other surveys. We do not provide many details about transfer learning itself; our survey is intended for readers with sufficient knowledge of transfer learning who are interested in applying it to multilingual models and tasks.

With the application of transfer learning to language models, a new era has emerged in the NLP domain. Most performance analyses of NLP-related tasks focus on languages with sufficiently large available data, while languages with low resources are often kept out of attention. Before reviewing the history of language models and introducing transfer learning in this subject, it is necessary to go over some basic concepts. In this section, we provide a brief overview of general concepts, including language models and transfer learning.

Language Modeling (LM) is one of the main parts of NLP, using various probabilistic techniques to predict a word or a sequence of words in a sentence. The importance of LM in NLP is undeniable, especially for multilingual models that underlie many NLP tasks, such as machine translation [101], question answering [12], speech recognition [67], or sentiment analysis [63]. From a statistical point of view, LM is a learning process that predicts the probability distribution of a sequence of words occurring in a sentence. In practice, an LM analyzes text by learning the features and characteristics of a language with suitable algorithms and then understands phrases and predicts the next words in a sentence through probabilistic analysis [74][92].
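As a concrete illustration of this probabilistic view, the following minimal sketch (our own toy example, not taken from any of the surveyed systems) estimates bigram probabilities from a tiny corpus and scores a sentence; real language models are trained on far larger corpora and use more sophisticated smoothing.

```python
from collections import Counter

# Toy corpus; a real language model would be trained on a large corpus.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

tokens = [["<s>"] + s.split() + ["</s>"] for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def bigram_prob(prev, word, alpha=1.0):
    """P(word | prev) with add-alpha smoothing to avoid zero probabilities."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

def sentence_prob(sentence):
    """Probability of a sentence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("the cat sat on the mat"))   # relatively likely under this toy corpus
print(sentence_prob("mat the on sat cat the"))   # much less likely
```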
Transfer Learning is a machine learning approach in which the information gained from pre-training a model on general tasks is reused in other related tasks to improve efficiency and allow faster fine-tuning [88]. This approach gained traction with the introduction of ImageNet in 2010 [36], a large-scale image dataset on which deep CNNs were successfully trained: by fine-tuning deep neural networks, more than 14 million images were classified into more than 20,000 categories. Transfer learning has been used in a large number of studies on different NLP applications and has produced state-of-the-art results in tasks such as sentiment analysis.

The terms multilingual and cross-lingual, which are usually used interchangeably in most works, can be defined as follows. Multilingual/cross-lingual learning is a part of transfer learning that focuses on transferring knowledge from one language, usually with higher available resources, to another language with lower resources. This can lead to better performance on many downstream tasks, especially in languages lacking valuable data. In general, we can look at these concepts from two perspectives. First, multilingual usually refers to models: a multilingual model is pre-trained on datasets in different languages and its performance is checked on related downstream tasks, whereas cross-lingual usually means learning a model on a high-resource language and then using and evaluating this model on a low-resource language for different NLP tasks [80].

(Figure 2: Evolution of transfer models in NLP.)

Second, in terms of embeddings, cross-lingual embeddings project similar words in different languages onto the same vectors from a semantic view, while multilingual embeddings simply use the same embedding space for different languages without any assurance of interaction between them. In addition, in the cross-lingual setting we have a query in one language and the aim is to retrieve documents in another language, whereas the multilingual setting additionally focuses on models that deal with multiple languages.

Zero-shot Learning (ZSL) is a type of classification problem in which a classifier is trained on a specific set of labels and classes and is then evaluated on samples from classes that have not been previously observed [51]. ZSL in multilingual tasks refers to classifying data with few or even no labeled examples in under-resourced languages after training on multiple languages with noticeable resources. Within NLP downstream tasks, ZSL plays an important role specifically in the cross-lingual field. For example, in [84] ZSL is used for text classification to generalize models to new unseen classes after training, by learning the relationship between a sentence and the embedding of its tags.
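As a hedged illustration of zero-shot text classification, the sketch below uses the Hugging Face zero-shot-classification pipeline with a multilingual NLI checkpoint; this is a common present-day recipe rather than the exact method of [84], and the checkpoint name and example sentences are only one possible, assumed choice.

```python
from transformers import pipeline

# A multilingual NLI model lets us score arbitrary candidate labels
# without any task-specific training (zero-shot classification).
# "joeddav/xlm-roberta-large-xnli" is one publicly available option (assumed here).
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")

# Labels the classifier has never been explicitly trained on.
labels = ["sports", "politics", "technology"]

# The same classifier can be applied to sentences in different languages.
print(classifier("El equipo ganó el partido en el último minuto.", candidate_labels=labels))
print(classifier("The new chip doubles battery life.", candidate_labels=labels))
```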
ZSL is also used for news sentiment classification by assigning sentiment categories to news in other languages, without requiring any training data in those languages, after training on Slovene news [77]. [61] and [7] used ZSL for the question-answering task to generalize it to unseen questions. Since intent detection plays a crucial role in question answering, [109] studied the zero-shot intent detection problem to detect user intents for utterances with no labeled data. For domain-portable entity recognition, [30] presents a zero-shot learning approach that recognizes entities in user utterances that were not annotated during training, and for dependency parsing, [97] analyzes a ZSL approach based on multilingual sentence representations.

The early methods used in NLP research were mainly based on probabilistic language models, such as n-grams [6]. These simple language models aim to predict the next word in a sequence by assigning a probability to a sequence of words. The first binding of neural networks with language modeling was proposed in 2001 [10]. By simultaneously learning a distributed representation and a probability function for each word, this model improves over n-gram models and can also use longer context as input. The influence of the main types of neural networks in NLP started with the introduction of recurrent neural networks (RNNs) [67]. In RNNs, the output of the previous step is used as an input for predicting the next word, and this method has shown remarkable performance on such problems using hidden layers. Since RNNs are difficult to train, long short-term memory (LSTM) networks [32] have become more popular for language modeling [29]. More recently, convolutional neural networks (CNNs) have also been used in NLP research. [40] proposed a dynamic k-max pooling network over linear sequences to extract a feature graph of a sentence for semantic modeling. The main advantages of these networks are their support for variable-length sentences as input and their applicability to any language. For sentence-level classification tasks, [45] used a CNN with little hyperparameter tuning and static vectors to improve performance on NLP tasks such as sentiment analysis and question classification. By using dilation in the convolutional layers to increase the receptive field, ByteNet proposed a mechanism to address the variable lengths of the source and the target in context [39]. The combination of CNNs and LSTMs has been used for dimensional sentiment analysis [103], and QRNNs were proposed over LSTMs for faster training and test times [13].

Since many machine learning algorithms in text processing are incapable of handling strings or plain text in their raw format, these inputs need to be converted into numerical vectors. The basic idea of word embeddings is to assign words with similar semantic and syntactic roles to nearby vectors; a distributed representation of words [11] learned the joint probability function of sequences of words and trained the word vectors within a neural language model. In 2013, [66] introduced one of the most popular techniques, called word2vec, to learn word embeddings using neural network methods, namely skip-gram and continuous bag-of-words (CBOW). Word2vec uses both algorithms to learn weights that act as word vector representations for different NLP tasks. Another word embedding method, called GloVe, is based on an unsupervised learning algorithm [35]. GloVe uses a different mechanism and set of equations to create the embedding matrix: instead of training on the entire sparse matrix or on individual context windows, the model trains on word-word co-occurrence probabilities in a large corpus. As a result, it performs strongly on word analogy, similarity, and named entity recognition tasks because of the vector-space structure it induces [35].
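To make the embedding idea concrete, here is a minimal, hedged sketch of training skip-gram word vectors with the gensim library on a toy corpus; real embeddings are trained on corpora of billions of tokens, and the hyperparameters shown are illustrative assumptions only.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice word2vec is trained on billions of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]

# sg=1 selects the skip-gram objective; sg=0 would use CBOW instead.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

vector = model.wv["cat"]                      # 50-dimensional vector for "cat"
print(vector.shape)
print(model.wv.most_similar("cat", topn=3))   # nearest neighbours in embedding space
```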
Collobert and Weston [19] proposed a single convolutional neural network architecture that can be referred to as a starting point for pre-trained models; the output of this architecture for a given sentence can be used for downstream NLP tasks. With the introduction of transfer learning, a revolution in language model architecture began, leading to significant improvements in downstream NLP task performance. Bidirectional training of transformers, which was the BERT model's innovation, enabled training on a text sequence either from left to right or jointly from left to right and right to left. In the transformer mechanism, the input text is first processed by an encoder, and a decoder then predicts the task's target. The encoder reads the entire input sequence at once, allowing the model to learn the context from all previous and following tokens, which often provides high accuracy. The reputation of transfer learning had a significant influence on pre-trained models: it made building NLP models easier, since a model can first be trained on one dataset and then perform different NLP tasks on another dataset. This approach is becoming ever more popular, especially for multilingualism, since the structure needed for it fits well with the transfer learning concept. We categorize the existing pre-trained language models into three main groups:
• Base Models: language models that introduce a new structure in LM
• Multilingual Models: language models that deal with multiple languages
• Language-specific Models: language models that focus on a specific language other than English
The term 'Base Model' refers to models that gained a huge amount of attention by introducing a new structure or changing a previous architecture. We focus mostly on BERT and post-BERT models, which are shown in Figure 2. In 2018, Google's AI language team introduced Bidirectional Encoder Representations from Transformers, called "BERT", which was a revolution in the field of pre-trained models [27]. This pre-training approach learns from unlabeled text jointly in both the left and right directions, and the innovation led to outstanding improvements in a wide range of NLP tasks. A year after BERT, Facebook introduced a new optimized method called "RoBERTa", based on the masking strategy used in BERT, that changed a number of key parameters of the model. In addition, increasing the dataset size and training time turned out to be critical for improving results. Another key change in comparison with BERT was the removal of "Next Sentence Prediction", which RoBERTa marked as an unnecessary task [59]. Another model that we consider a base model is "ERNIE" (Enhanced Representation through Knowledge Integration), which outperforms Google's BERT on multiple language tasks, concentrating on the Chinese language [116]. Since many language models focused on representing a single language, multilingual models have gained attention in this field.
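As a hedged illustration of the masked-language-modelling objective that BERT-style models (including the multilingual variants discussed next) are pre-trained with, the sketch below queries a publicly released multilingual BERT checkpoint through the Hugging Face fill-mask pipeline; the checkpoint name and example sentences are our own assumptions, not taken from the surveyed papers.

```python
from transformers import pipeline

# "bert-base-multilingual-cased" is the publicly released multilingual BERT checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# The model predicts the token hidden behind [MASK] using context on BOTH sides,
# which is exactly the masked-language-modelling pre-training objective.
for sentence in [
    "Paris is the capital of [MASK].",   # English
    "París es la capital de [MASK].",    # Spanish: one model, many languages
]:
    predictions = fill_mask(sentence, top_k=3)
    print([p["token_str"] for p in predictions])
```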
After Google's successful proposal of BERT, a multilingual version of BERT was released shortly after. This model, called "mBERT", supports sentence representations for 104 languages and outperformed previous work on many multilingual tasks. Concentrating on the semantic aspects of mBERT, [56] shows that the mBERT representation can be split into two separate components, a language-specific and a language-neutral one; the second component achieves high accuracy on less difficult tasks such as word alignment and sentence retrieval. Another model based on Transformers with a masked language modeling (MLM) objective like BERT is XLM, which is additionally trained with translation language modeling to learn similar representations across languages [49]. The XLM structure is based on BERT; but, as RoBERTa [59] demonstrated improved results over BERT by changing training parameters, a new multilingual model called XLM-R was published, which eliminated the translation language modeling task of XLM and instead trained RoBERTa on a bigger multilingual dataset covering 100 languages [2].

Although multilingual models show high performance on multilingual tasks, researchers have shown that concentrating on a specific language and fine-tuning for particular tasks in that language can lead to better results on subtasks. For example, the CamemBERT model, a French pre-trained model based on RoBERTa, showed that training on French data and fine-tuning only for French outperforms multilingual models like mBERT and UDify [64]. Table 2 lists some other language-specific models and shows that proposing a language-specific model can be worthwhile.

Increasing the efficiency of transformers and technology shifts in processing units have made it possible to provide language models that can handle multiple languages. In this section, we discuss the importance of multilingual models and analyze their maturity. We give a big picture of these models side by side with monolingual models, investigate research studies from two viewpoints, historical and model characteristics, and review the architecture, performance, hardware requirements, and language features of these models. Practical NLP applications are often developed for the English language because it is difficult to train large and precise language models for languages with only small labeled datasets. The importance of modeling such languages in unexpected situations has been investigated; however, language models for low-resource languages are not limited to emergencies, and they are necessary for providing a wide range of new NLP-dependent technology services, which are now primarily built on deep neural networks. Cross-lingual models use large unlabeled datasets of one language to build a language model that can be fine-tuned in another language with a small corpus and perform much better in the target language.

In this part, we review the studies that have examined the capabilities of multilingual models. Some focus on the strengths of these models and on applications where they perform well; others show NLP tasks in which the performance of multilingual models is inferior to monolingual models. Table 3 compares these studies. Pires et al. [81] examine the multilingual capabilities of the mBERT model; for this purpose, the model was pre-trained on Wikipedia data collected from more than 100 languages. The model was then fine-tuned with task-specific supervised data in one language and tested for performance in another.
The results of this study show that mBERT has remarkable performance on cross-lingual tasks. Lexical overlap and typological similarity are mentioned as factors influencing the model's performance; nevertheless, the model also performed well across languages with different scripts. Another study deals with the semantic features of the mBERT model and divides the resulting representation into two parts, a language-specific and a language-general one; the second part performed well in tasks such as word alignment and exact sentence retrieval but is not suitable for machine translation applications [56]. [108] evaluates mBERT as a zero-shot cross-lingual model on about 40 different languages and five different NLP tasks: natural language inference, document classification, NER, part-of-speech tagging, and dependency parsing, showing that with well-tuned hyperparameters mBERT performs excellently on these tasks. On the other hand, some studies have shown the inefficiency of multilingual models in some applications [87]. Some research studies have demonstrated BERT's strong performance in cross-lingual applications, which is even more surprising given the absence of a cross-lingual objective during the training phase. [44] examines the effect of different components of the BERT model on its cross-lingual performance along three aspects: linguistic features (lexical and structural), model architecture, and the format of the inputs and training objectives. They showed that the lexical similarity of languages (similar words or similar parts of words) has a negligible effect on performance, and that the depth of the network is much more influential. Another conclusion is that, among the architectural features of the model, the depth of the network and the total number of parameters were more critical than the number of multi-head attention layers. They also showed that NSP and tokenizing at the character or word level reduce cross-lingual performance. Some studies work on task-specific optimization of multilingual models; the CLBT model [104] focuses on dependency parsing, which depends on lexical properties.

The process of technology development in the field of multilingual models can be studied from a historical perspective (evolution in time) or from a model perspective, both of which are detailed in this section. In terms of historical evolution, multilingual models have gone through a challenging process (Figure 3). At first, models like ELMo [78] remained loyal to bidirectional LSTMs and performed well. Then, transformers [100] were introduced and led the architecture and performance of the models for some time. Transformers replaced the recurrent architecture with an attention mechanism, enabling parallel execution and improving performance on various benchmarks, at the cost of increased processing resources and training time. The BERT model was then released and later improved in many other works; for example, the ALBERT model uses methods to reduce the number of parameters and provide the heavy BERT model in lighter versions. Shortly after, researchers took a separate path in terms of design and architecture and proposed fresh models such as XLNet, which uses an autoregressive objective, and ELECTRA, which pre-trains the text encoder as a discriminator rather than a generator. From the model point of view, according to Figure 4, multilingual research studies fall into four categories. The first generation, which came before the introduction of BERT, such as ELMo, improved results by using the outputs of all the bidirectional LSTM inner layers.
The second generation comprises BERT and its minor improvements: using more extensive datasets, changing the pre-training tasks, and classifier changes and optimizations are some of the modifications seen in RoBERTa [59], UDify [46], ALBERT [50], and XLM [49]. In the post-BERT era, models underwent significant modifications, including XLNet [114] and ERNIE [116]; the first uses an autoregressive pre-training objective instead of an autoencoding one. In the next stage, we have models such as ELECTRA [62] that took a relatively different path from the BERT-based models; for example, in ELECTRA the encoder is trained as a discriminator instead of a generator. GPT [86] and mBERT [27] focus on learning contextual word embeddings; the learned encoders are still needed to represent words in context for downstream tasks. Besides, various pre-training tasks have been proposed to learn PTMs for different purposes.

UDify: This model uses over 120 Universal Dependencies [73] treebanks in more than 70 languages and fine-tunes BERT on all of these datasets as a single one, achieving state-of-the-art universal POS, UFeats, Lemmas, UAS, and LAS scores. Hence it can be considered a multilingual multi-task model [46].

mBERT: Multilingual BERT, released shortly after BERT, supports over 100 languages. Technically, it is just BERT trained on Wikipedia text in many languages. To counter the bias caused by the different content sizes of these languages, low-resource languages were oversampled and high-resource languages undersampled.

XLM: This study evaluates pre-trained cross-lingual models (XLMs) and suggests two methods for pre-training: the first is unsupervised pre-training based on monolingual data, and the second is pre-training based on multilingual data. Evaluations were performed on the XNLI [20] and WMT'16 tasks [76]. Another innovation of this research [49] is the introduction of several pre-training objectives. They used MLM and causal language modeling (CLM) for unsupervised learning and examined their performance. They also used a translation language modeling objective (TLM) alongside MLM, which is essentially an extension of BERT's MLM that uses pairs of parallel sentences instead of consecutive sentences.

XLM-R: This self-supervised model uses the RoBERTa objective on a CommonCrawl dataset (https://commoncrawl.org) containing unlabeled text in 100 languages, with about five times more tokens than the RoBERTa training data. The advantage of this model is that, unlike XLM, it does not require parallel data, so it is scalable [2].

XLNet: If we categorize unsupervised learning into two types, autoregressive and autoencoding, the XLNet model [114] follows the autoregressive family, which attempts to estimate the probability distribution of the text. In contrast, autoencoding models such as BERT try to reconstruct the original data from incomplete inputs created by masking some of the sentence tokens. The advantage of these models over autoregressive models is that they can advance the learning process in both directions; their disadvantage is that the guessed tokens are assumed independent of each other given the unmasked tokens, which is far from the behavior of natural language. The XLNet model tries to take advantage of both categories by using permutations of all possible factorization orders instead of the one-way learning of AR models; in this way, it uses context from both sides when guessing a token. Also, because it does not train on corrupted inputs, it avoids the mismatch between pre-training and fine-tuning tokens, which is a weakness of AE models.
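To give a feel for how such multilingual encoders expose a shared representation space, the following hedged sketch embeds the same sentence in two languages with the publicly released xlm-roberta-base checkpoint and compares the mean-pooled representations; the checkpoint choice, pooling strategy, and sentences are illustrative assumptions, not a procedure taken from the papers above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Publicly released XLM-R checkpoint covering 100 languages.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def embed(sentence):
    """Mean-pool the last hidden states into a single sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

en = embed("The weather is nice today.")
fr = embed("Il fait beau aujourd'hui.")            # French translation of the same sentence
de = embed("Die Aktienmärkte sind eingebrochen.")  # unrelated German sentence

cos = torch.nn.functional.cosine_similarity
print("en vs fr:", cos(en, fr, dim=0).item())  # often higher for translations
print("en vs de:", cos(en, de, dim=0).item())  # often lower for unrelated content
```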
From an architectural point of view, most models use an architecture similar to BERT-base or BERT-large [62][50][114]. This group of research studies uses a combination of transformer and attention layers, in which the attention layers play a vital role in embedding the meaning and context of words. On the other hand, there are other models that extended the BERT architecture [116] or acted differently [78][49]. Another aspect for comparing models is batch size: models similar to BERT use a batch size of 8,000, while other models, such as ERNIE, use a size of 512.

As shown in previous research, transfer learning depends heavily on computing resources. In this section, we show which models are more efficient in terms of hardware requirements. As shown in Table 4, several research teams have introduced basic models based on the Transformer architecture, which differ in terms of architecture, total number of model parameters, and the hardware used. In terms of hardware processing units, the models use TPUs or GPUs. Although different models use proprietary combinations of hardware, in some cases, such as XLNet, up to 512 TPUs have been used for less than three days, which, according to the CEO of Hologram AI, cost $245,000 and produced 5 tons of CO2 to defeat BERT in 18 of 20 tasks [94]. Regarding the number of model parameters, the models range from the small version of ELECTRA with 14 million parameters to ALBERT-Large with 235 million parameters.

In Table 5, the available multilingual datasets are introduced; for each dataset, in addition to a short description, we provide the evaluation metrics and the task for which it is used in previous studies. (Table 5: Available multilingual datasets for different tasks; one example entry is NewsQA [98], a reading comprehension dataset of over 100K human-generated question-answer pairs from over 10K CNN news articles, with answers consisting of spans of text from the corresponding articles, used for question answering in English [37][31][9].)

As shown in Figure 5 (Categorization of Multilingual Tasks), we can categorize the linguistic domains to be considered for multilingual tasks from several perspectives:
- From the morphology point of view: Since morphology deals with the formation of words and the relations between words, this category is meaningful for multilingual tasks because word formation varies across languages but can also share many common properties. The morphological structure of words usually involves prefixes/suffixes, singularization/pluralization, gender detection, and word inflection (modification of words to express grammatical categories).
- From the syntax point of view: The syntactic perspective in multilingual tasks refers to how words relate and combine to form larger language units such as sentences, clauses, and phrases. In everyday life, this view is more commonly known as the grammatical view. Alongside the relations between words, part-of-speech tags and dependency trees are considered in this category.
- From the semantics point of view: This view refers to the meaning of words and sentences. The semantic perspective is one of the main categories in linguistics for multilingual tasks because the semantic structure and relations of words and sentences are essential features of any language.
- From the pragmatics point of view: The pragmatic perspective in multilingual tasks deals with the contribution of context to meaning.
Several hot topics in NLP, such as topic modeling, coreference and anaphora resolution, summarization, and question answering, are considered under this perspective.

With the large amount of data being generated every day in different forms, such as unstructured data, emails, chats, and tweets, the role of NLP tasks and applications gains ever-increasing importance. Analyzing data with these applications helps businesses gain valuable insights. Trending topics like elections and Covid-19 often result in increased content generation on social media, requiring attention from the NLP community. For low-resource languages, some applications become more challenging to analyze. In general, however, NLP has been used successfully in different types of applications, such as virtual assistants, speech recognition, sentiment analysis, and chatbots [4]. As an example, Google Translate, a free multilingual machine translation service developed by Google, is powered by NLP behind the scenes. Amazon Alexa and Google Assistant use speech recognition and NLP techniques such as question answering, text classification, and machine translation to help users achieve their goals. Even in the digital marketing industry, analyzing data with these techniques helps practitioners understand customers' interests and generate accurate reports based on business needs. A key value of our study is that it reviews different NLP tasks and applications not just in English but in other languages as well. Since many NLP models and applications try to cover multiple languages, preliminary knowledge of the work on these applications and tasks from a multilingual perspective helps researchers position their own work. In this section, we analyze NLP applications and tasks from a multilingual point of view, one by one.

With the help of transfer learning, one may reach good performance on many NLP tasks, not only in a high-resource language like English but also in many low-resource languages. Since languages other than English are receiving more attention in academia and industry, new research studies concentrate more on the multilingual aspect of NLP across different tasks. In some cases, transfer learning from a multilingual model to a language-specific model can improve performance on many downstream tasks: [47] uses this approach for the Russian language, improving performance on reading comprehension, paraphrase detection, and sentiment analysis tasks; furthermore, the training time of this model decreased compared to multilingual models. One aspect of text analysis is the style of the text. Many factors, such as formality markers, emotions, and metaphors, influence the analysis of style. [43] provides a benchmark corpus (xSLUE) containing text in 15 different styles and 23 classification tasks as an online platform for cross-style language understanding and evaluation. This research shows that there are many ways to develop low-resource or low-performance styles and other applications such as cross-style generation. Another challenge in NLP applications, especially for low-resource languages, is detecting hate speech. [93] developed an architecture on top of pre-trained Transformers to examine cross-lingual zero-shot and few-shot learning; the model, with a novel attention-based classification block called AXEL, is applied to English and Spanish datasets.
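The cross-lingual zero-shot recipe behind studies like these is, at its core, "fine-tune a multilingual encoder on labeled data in one language, then apply it unchanged to another language". The hedged toy sketch below illustrates that recipe with a generic multilingual checkpoint and a handful of invented English training sentences; it is a schematic illustration of the general approach, not the architecture or data of [93] or [96].

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # any multilingual encoder would do; an assumed choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny invented English training set (label 1 = offensive, 0 = not offensive).
train_texts = ["I hate you, you are worthless", "Have a wonderful day",
               "You people are disgusting", "Thanks for the kind help"]
train_labels = torch.tensor([1, 0, 1, 0])

batch = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A few gradient steps on English only; a real setup trains for epochs on a large dataset.
model.train()
for _ in range(10):
    optimizer.zero_grad()
    loss = model(**batch, labels=train_labels).loss
    loss.backward()
    optimizer.step()

# Zero-shot evaluation on a Spanish sentence the model never saw labeled data for.
model.eval()
spanish = tokenizer("Eres una persona horrible", return_tensors="pt")
with torch.no_grad():
    pred = model(**spanish).logits.argmax(dim=-1)
print("predicted label:", pred.item())
```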
Also, [96] uses transfer learning with BERT and RNNs to address shared tasks on multilingual offensive language. The importance of translation in the field of NLP is undeniable, and this service is in the spotlight especially in the multilingual case. Most translation models train on a single language (mostly English) and try to translate into other languages. Facebook AI introduced "M2M-100" [28], a many-to-many multilingual translation model that translates directly between any pair of languages among a pool of 100 languages. Using zero-shot systems, [48] explores the closeness between languages, focusing on standard automatic metrics (BLEU and TER). Much research has been conducted in the field of speech recognition, mainly focusing on deep neural networks and RNNs [33][68][119]. With the increasing use of transformers in NLP, recent research studies in speech recognition also mainly use transformers in their architecture; for multilingual speech recognition, [118] proposed a sequence-to-sequence attention-based model with a single Transformer that uses sub-word units without any pronunciation lexicon. Sentiment analysis aims to identify and extract information such as feelings, attitudes, emotions, and opinions from a piece of text. Many businesses use this service to improve the quality of their products by analyzing customer comments. One of the main challenges for this task is reaching acceptable performance in languages with lower resources. To tackle this challenge, [14] trained a model on a high-resource language (English) and reused it for four other languages with more limited data (Russian, Spanish, Turkish, and Dutch) in the field of sentiment analysis. A novel deep learning method is proposed in [71], which discusses the significant challenges involved in multilingual sentiment analysis, and [58] introduced other methods for estimating sentiment polarity in multilingual settings to overcome the problem of excessive dependence on external resources. Also, [42] presents a technique that uses language-agnostic sentence representations to adapt a model trained on Polish texts (as a low-resource language) to recognize polarity in texts in other (high-resource) languages. Intent detection is the task of interpreting users' messages and assigning suitable labels in chatbots and intelligent systems, while slot filling tries to extract the values of certain types of attributes [34]. Studies show that there is a strong relationship between these two tasks, which can be exploited to achieve state-of-the-art performance [106]. Models in this field usually use joint deep learning architectures within attention-based recurrent frameworks. In [15], [117], a "recurrence-less" joint BERT-based model was proposed that showed strong performance on these tasks, and it reached similar performance for the Italian language without changing the model. Dependency parsing is another big challenge, especially in multilingual NLP. [104] used a BERT transformation approach to generate cross-lingual contextualized word embeddings; this linear transformation, learned from contextual word alignments, is trained across different languages, showed effectiveness on zero-shot cross-lingual transfer parsing, and outperforms static embeddings. NER is the task of extracting entities from text and categorizing them into predetermined categories.
Recent self-attention models have presented state-of-the-art performance on this task, especially for inputs consisting of several sentences, a property that becomes even more important when analyzing data in several languages. Using BERT in five languages, [60] explores the use of cross-sentence information for NER and shows improvements on all of the tested languages and models. For languages with little or no labeled data, [107] proposed a teacher-student learning method to address this problem in both single-source and multi-source cross-lingual NER. Evaluating different architectures such as LSTM, biLSTM, GRU, and Transformer for the task of name transliteration in a many-to-one multilingual paradigm, [69] shows improved accuracy with the transformer architecture for both the encoder and the decoder. Question Answering (QA) is the task of building an automatic system to answer questions posed by humans in natural language [90]. This task is gaining a lot of attention, especially in the field of multilingualism, but it is also very challenging, since different languages construct meaning in different ways. For example, the plural form of words in English is usually made by adding 's' at the end, while in Arabic the plural is not just a suffix and sometimes the whole structure of the word changes; other languages, like Japanese, do not use spaces between words [18].

This section presents some of the challenges in the domain of multilingual tasks and a set of ideas to be considered as future directions for this research line. We identified three groups of challenges in using transfer learning for multilingual tasks: challenges on (i) modeling, (ii) practical aspects, and (iii) applications. Next, we provide details on each group. The challenges of pre-trained models, due to the complexity of natural language processing, can be grouped as follows:
• Various objective tasks evaluate different features of models. A challenging objective task can help create more general models; however, these tasks should be self-supervised, because many collected corpora do not have tagged data.
• Due to the increasing use of and research on multilingual and cross-lingual models, their vulnerability and reliability have become very important. In Section 3.2, we reviewed some research in this area and noted that multilingual models remain less studied; nowadays, most of the research in this category is conducted on mBERT.
Research on the following problems is affected by the high cost of pre-training models:
• General-purpose models can learn a fundamental understanding of languages; however, they usually need deeper architectures, larger corpora, and innovative pre-training tasks.
• Recent studies have confirmed the performance of Transformers in pre-trained models. Nevertheless, the computational resource requirements of these models limit their application, so model architecture improvement needs more attention from the research community. Moreover, architecture improvements could lead to better contextual understanding by the language model, as it could deal with longer sequences and recognize context [120].
• Achieving the maximum performance of current models: most existing models can improve performance with increased model depth, a more comprehensive input corpus, or more training steps.
• In terms of multilingual tasks, many tasks do not have enough data resources to reach significant performance in a specific application.
• The next big challenge is to successfully execute NER, which is essential when training a machine to distinguish between ordinary vocabulary and named entities. In many instances, these entities are surrounded by dollar amounts, places, locations, numbers, times, etc.; it is critical to capture and express the connections between each of these elements, as only then may a machine fully interpret a given text.
• Another challenge to mention is extracting semantic meaning. Linguistic analysis of vocabulary terms might not be enough for a machine to correctly apply learned knowledge; to successfully apply learning, a machine must further understand the semantics of every vocabulary term within the context of the document.
Based on this study and our analysis of the existing efforts in the domain of multilingual tasks, the following can be considered future directions for this research domain:
• Vertical extension: improving the performance of current models by increasing the number of pre-training steps, the total number of model parameters, and the size of the input corpora, which further results in higher training costs; the requirement for processing power here is undeniable. Another suggestion is to analyze the relationship between these hyperparameters and the resulting performance of each model.
• Horizontal expansion: the performance of pre-trained language models is related to the corpora and their variety, so we suggest expanding current research studies with multilingual corpora for pre-training and evaluation on multilingual downstream tasks. As with vertical extension, such changes to pre-training usually require substantial processing units.
• One challenging field of study would be pre-training tasks, especially for cross-lingual models. Any progress in this direction would result in a more comprehensive evaluation of models.
• Optimization of model architecture design, or of methods such as the training process, is another deep research direction. Enhancements in this respect could yield models that pre-train on massive multilingual corpora with current computing resources.
• Recently we can see a tendency towards special-purpose models for specific domain applications, such as health advice, but there is already a gap in this direction, for example in low-resource or real-time computing; a need for newly designed models with task-specific pre-training objectives can be seen.
• Robustness of pre-trained models also needs more attention. Studies on this subject will provide good insight into the future of these models in industry.

This survey provides a comprehensive overview of the existing studies on leveraging transfer learning models to tackle multilingual and cross-lingual tasks. In addition to the models, we also reviewed the main datasets available in the community and investigated different approaches in terms of architectures and applications to identify the existing research challenges in the domain, and we provided a few potential future directions.

Mostafa Salehi is supported by a grant from IPM, Iran (No. CS1400-4-268).

References
ETC: Encoding Long and Structured Inputs in Transformers
Unsupervised Cross-Lingual Representation Learning at Scale
AraBERT: Transformer-based Model for Arabic Language Understanding
Natural Language Processing and Why It's So Important - Automated Dreams
Angela Fan, Marjan Ghazvininejad, et al. (Facebook)
Non-Autoregressive Semantic Parsing for Compositional Task-Oriented Dialog
N-gram-based Machine Translation
Self-supervised Knowledge Triplet Learning for Zero-shot Question Answering. arXiv
Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension
Longformer: The Long-Document Transformer
A neural probabilistic language model
Question Answering Systems: Survey and Trends
Quasi-recurrent neural networks
Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data
Multi-lingual Intent Detection and Slot Filling in a Joint BERT-based Model
CODAH: An adversarially-authored question answering dataset for common sense
BERT for Joint Intent Classification and Slot Filling
TyDi QA: A Multilingual Question Answering Benchmark
A unified architecture for natural language processing
XNLI: evaluating cross-lingual sentence representations
Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces
Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces
A survey of multilingual neural machine translation
BERTje: A Dutch BERT Model
RobBERT: a Dutch RoBERTa-based Language Model
Low-Resource Task-Oriented Semantic Parsing via Intrinsic Modeling
Pre-training of Deep Bidirectional Transformers for Language Understanding
Generating Sequences With Recurrent Neural Networks
Toward zero-shot entity recognition in task-oriented conversational agents
Annual Meeting of the Special Interest Group on Discourse and Dialogue - Proceedings of the Conference
DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications
Long Short-Term Memory
Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers
Improving slot filling performance with attentive neural networks on dependency structures
GloVe: Global Vectors for Word Representation
ImageNet: A large-scale hierarchical image database
SpanBERT: Improving Pre-training by Representing and Predicting Spans
XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection
Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural Machine Translation in Linear Time
A convolutional neural network for modelling sentences
Distilling Large Language Models into Tiny and Effective Students using pQRNN
Cross-lingual deep neural transfer learning in sentiment analysis
xSLUE: A Benchmark and Analysis Platform for Cross-Style Language Understanding and Evaluation
Cross-lingual ability of multilingual BERT: An empirical study
Convolutional Neural Networks for Sentence Classification
75 Languages, 1 Model: Parsing Universal Dependencies Universally
Adaptation of deep bidirectional multilingual transformers for Russian language. Komp'juternaja Lingvistika i Intellektual'nye Tehnologii
A comparison of transformer and recurrent neural networks on multilingual neural machine translation
Cross-lingual Language Model Pretraining
ALBERT: A Lite BERT for self-supervised learning of language representations
Zero-data learning of new tasks
FlauBERT: Unsupervised Language Model Pre-training for French
MTOP: A comprehensive multilingual task-oriented semantic parsing benchmark
Efficient contextual representation learning without softmax layer
AgglutiFiT: Efficient low-resource agglutinative language model fine-tuning
How Language-Neutral is Multilingual BERT?
Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning
Make it possible: Multilingual sentiment analysis without much prior knowledge
Exploring Cross-sentence Contexts for Named Entity Recognition with BERT. arXiv
Knowledge-driven Self-supervision for Zero-shot Commonsense Question Answering
Pre-Training Text Encoders As Discriminators Rather Than Generators. ICLR
The evolution of sentiment analysis - A review of research topics, venues, and top cited papers
CamemBERT: a Tasty French Language Model
of GPUs used for training 1 Billion Word Benchmark?
Efficient estimation of word representations in vector space
Recurrent neural network based language model
Multi-lingual speech recognition with low-rank multi-task deep neural networks
Effective Architectures for Low Resource Multilingual Named Entity Transliteration
Cross-domain sentiment classification with bidirectional contextualized transformer language models
Multilingual Sentiment Analysis
PhoBERT: Pretrained language models for Vietnamese
Universal dependencies 2.3
Learning NLP Language Models with Real Data
Cross-lingual name tagging and linking for 282 languages
Bleu: a method for automatic evaluation of machine translation
Zero-shot learning for cross-lingual news sentiment classification
Deep Contextualized Word Representations
Cross-Lingual Learning for Text Processing: A Survey. Expert Systems with Applications
Cross-lingual learning for text processing: A survey
How Multilingual is Multilingual BERT?
AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets
Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages
Train once, test anywhere: Zero-shot learning for text classification
Pre-trained Models for Natural Language Processing: A Survey
Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. OpenAI
Is Multilingual BERT Fluent in Language Generation?
Transfer learning in natural language processing
A survey of cross-lingual word embedding models
Question Answering in Natural Language Processing
Cross-lingual transfer learning for multilingual task oriented dialog
Probabilistic Language Models 1.0
Cross-lingual Zero- and Few-shot Hate Speech Detection Utilising Frozen Transformer Language Models and AXEL
The Staggering Cost of Training SOTA AI Models
Corpulyzer: A novel framework for building low resource language corpora
KEIS@JUST at SemEval-2020 Task 12: Identifying Multilingual Offensive Tweets Using Weighted Ensemble and Fine-Tuned BERT
Zero-shot dependency parsing with pre-trained multilingual sentence representations
Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset
Hakkani-Tür, Dilek, and Larry Heck. (Almost) zero-shot cross-lingual spoken language understanding
Decoding with large-scale neural language models improves translation
Multilingual is not enough: BERT for Finnish
Dimensional sentiment analysis using a regional CNN-LSTM model
Cross-lingual BERT transformation for zero-shot dependency parsing
A survey of transfer learning
A survey of joint intent detection and slot-filling models in natural language understanding
Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language
The surprising cross-lingual effectiveness of BERT
Zero-shot user intent detection via capsule neural networks
End-to-End Slot Alignment and Recognition for Cross-Lingual NLU
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Learning Contextualized Knowledge Structures for Commonsense Reasoning
SUPERB: Speech processing Universal PERformance Benchmark
Generalized Autoregressive Pretraining for Language Understanding
HotpotQA: A dataset for diverse, explainable multi-hop question answering
ERNIE: Enhanced Language Representation with Informative Entities
A joint learning framework with BERT for spoken language understanding
Multilingual end-to-end speech recognition with a single transformer on low-resource languages
Multilingual recurrent neural networks with residual learning for low-resource speech recognition
Neural architecture search with reinforcement learning