key: cord-0487859-06lclwje
title: Exploring Dimensionality Reduction Techniques in Multilingual Transformers
authors: Huertas-García, Álvaro; Martín, Alejandro; Huertas-Tato, Javier; Camacho, David
date: 2022-04-18
journal: nan
DOI: nan
sha: 2157e1ca6b074880b5249fc81df29aa3ad5eb69a
doc_id: 487859
cord_uid: 06lclwje

Both in the scientific literature and in industry, semantic and context-aware Natural Language Processing (NLP)-based solutions have been gaining importance in recent years. The possibilities and performance shown by these models when dealing with complex Language Understanding tasks are unquestionable, from conversational agents to the fight against disinformation in social networks. In addition, considerable attention is also being paid to developing multilingual models to tackle the language bottleneck. The growing need to provide more complex models implementing all these features has been accompanied by an increase in their size, without being conservative in the number of dimensions required. This paper aims to give a comprehensive account of the impact of a wide variety of dimensionality reduction techniques on the performance of different state-of-the-art multilingual Siamese Transformers, including unsupervised techniques such as linear and nonlinear feature extraction, feature selection, and manifold learning. To evaluate the effects of these techniques, we consider the multilingual extended version of the Semantic Textual Similarity Benchmark (mSTSb) and two different baseline approaches, one using the pre-trained version of several models and another using their fine-tuned STS version. The results evidence that it is possible to achieve an average reduction in the number of dimensions of $91.58\% \pm 2.59\%$ and $54.65\% \pm 32.20\%$, respectively. This work also considers the consequences of dimensionality reduction for visualization purposes. The results of this study will significantly contribute to the understanding of how different tuning approaches affect performance on semantic-aware tasks, of how dimensionality reduction techniques deal with the high-dimensional embeddings computed for the STS task, and of their potential for highly demanding NLP tasks.

Natural Language Processing (NLP) includes various disciplines that provide a system with the ability to process and interpret natural language, just as humans use language as a communication and reasoning tool [1]. Due to the recent increases in computational power, parallelization, the availability of large data sets, and recent advances in artificial intelligence, these semantic and context-aware NLP solutions have gained importance in recent years.

Dimensionality reduction techniques can be grouped according to two non-mutually exclusive criteria. Firstly, they can be divided into feature selection methods, which retain a subset of the original variables, and feature extraction methods, which derive new variables from combinations of the original variables [19, 17]. Additionally, feature extraction methods can be further subdivided into linear and nonlinear according to the variable combinations applied [21]. Secondly, according to the information available in the datasets, dimensionality reduction techniques can be classified as supervised and unsupervised [17, 18]. Supervised techniques require each data instance in the dataset to be labelled according to the task, whereas unsupervised techniques are task-agnostic approaches and do not require labelled data.

In broad terms, embeddings can be categorised as pre-trained or downstream fine-tuned embeddings [22], and as pre-computed or on-the-fly embeddings [23, 24]. The first criterion for categorising embeddings is whether they come from models pre-trained for general tasks or are task-specific. Pre-trained embeddings are widely used as a starting point for downstream applications.
A clear example is the current 'Pre-train and Fine-tune' paradigm of Transformer models [22]. Training these models from scratch is prohibitively expensive in many cases. Alternatively, it is common practice to use self-supervised trained checkpoints of these models and their pre-trained embeddings as a starting point to be later fine-tuned for supervised downstream tasks. Unlike previous works in the literature that have focused only on reducing pre-trained embeddings [25, 26, 27], in this work we are interested in evaluating the impact of dimensionality reduction on both types of embeddings, pre-trained and downstream fine-tuned.

Pre-computed embeddings are widespread in NLP, i.e., embeddings that may or may not be adjusted to a task but are generated beforehand rather than at each use time. A straightforward application of pre-computed embeddings is the semantic search NLP task, which locates relevant information in massive amounts of text data [28]. Semantic search uses semantic textual similarity to compare an input text against a large set of texts from a database in order to extract relevant related information; usually, a distance metric such as cosine similarity ranks which content should be retrieved for an input query. However, computing the database embeddings each time a query is introduced is infeasible. It is preferable to compute the embeddings once, store them, and use these pre-computed embeddings for subsequent requests [24]. With this in mind, it is important to note the usefulness of reducing embedding dimensions, which can improve their utility on memory-constrained devices and benefit several real-world applications. As Camastra & Vinciarelli mention [29], using more features than strictly necessary leads to several problems, one of the main ones being the space needed to store the data. As the amount of available information increases, compression for storage purposes becomes even more critical. Additionally, for the scope of this work, it cannot be ignored that applying dimensionality reduction techniques to pre-computed embeddings neither improves the runtime nor the memory requirements for running the models. It only diminishes the space needed to store the embeddings and increases the speed of subsequent computations (e.g., calculating the cosine similarity between two vectors), which also contributes to decreasing the considerable energy and carbon footprints generated during the production use of the models when pre-computed and stored embeddings are required [30].

Research has tended to focus on implementing bigger and more complex models rather than on analysing methods to adjust the vector space to the desired task while being conservative in the number of dimensions required [30]. An additional problem is that storing high-dimensional embeddings is challenging when dealing with large-volume datasets [31]. Recently, the studies of Raunak et al. [25, 26] have shed more light on the importance of reducing the size of the embeddings produced by Machine Learning and Deep Learning models. More specifically, these authors draw attention to reducing the size of classical GloVe [32] and FastText [33] pre-trained word embeddings using PCA-based post-processing algorithms, achieving similar or even better performance than the original embeddings.
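To ground the pre-computed setting described above, the following minimal sketch illustrates a semantic search workflow in which a corpus is encoded once, the embeddings are stored, and each incoming query is answered by ranking cosine similarities against the stored vectors. It assumes the sentence-transformers library and the LaBSE checkpoint (one of the models evaluated later in this work); the corpus, query, and file name are purely illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Encode the corpus once and store the embeddings (pre-computed setting).
model = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual siamese encoder
corpus = [
    "Los glaciares se derriten más rápido cada año.",
    "Vaccines reduce the severity of the disease.",
    "Die Zentralbank hat die Zinsen erneut erhöht.",
]
corpus_emb = model.encode(corpus)                                # shape: (n_docs, 768)
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)  # unit vectors for cosine
np.save("corpus_embeddings.npy", corpus_emb)                     # stored once, reused later

# At query time only the query is encoded; the stored vectors are loaded and ranked.
query_emb = model.encode(["Glaciers are melting faster every year"])
query_emb /= np.linalg.norm(query_emb, axis=1, keepdims=True)
scores = np.load("corpus_embeddings.npy") @ query_emb[0]         # cosine similarities
print(corpus[int(scores.argmax())], float(scores.max()))
```

Reducing the dimensionality of the stored vectors with a previously fitted transform, as explored in the rest of this work, shrinks the stored file and speeds up these dot products without modifying the encoder itself.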
Other works on the potential of reducing pre-computed embedding dimensions have explored the effect of Principal Component Analysis (PCA) [34, 35] and Latent Semantic Analysis (LSA) [36] as a post-processing step of pre-trained GloVe word embeddings for text classification tasks [27]. These authors also corroborated the usefulness of PCA for obtaining more accurate results at lower computational cost, concluding that PCA is more suitable than LSA for dimensionality reduction. In the same way, Shimomoto et al. [37] propose solving topic classification and sentiment analysis by using PCA to transform the pre-computed word embeddings of a text into a linear subspace. Their results showed the effectiveness of the PCA subspace representation for text classification. Other authors [38] have likewise shown that the storage, memory, and computation required by these large embeddings typically result in low efficiency in NLP tasks, pointing out the importance of other methods, such as manifold learning, to compress pre-computed and pre-trained GloVe and FastText embeddings. Additionally, researchers have explored dimensionality reduction techniques for visualizing semantic embeddings of small text corpora [39]. The authors explored four dimension reduction strategies on pre-computed embeddings based on PCA and t-distributed stochastic neighbor embedding (t-SNE) [40], concluding that both methods preserve a significant amount of the semantic information in the full embedding. Similarly, another study [41] focused on metaheuristic methods to find proper values for the t-SNE perplexity parameter and to optimize this word embedding visualization.

To the best of our knowledge, dimensionality reduction research on embeddings has focused on classical pre-computed word embeddings, including the popular GloVe and FastText embeddings. These classical word embeddings are static, word-level, and context-independent; their main limitation is that they do not consider the context in which a word is used. Moreover, the variety of embedding reduction techniques explored is limited, focusing mainly on PCA. Likewise, these studies do not include multilingualism in their analyses, being limited to the English language. Hence, the presented work follows the research line proposed by different authors [25, 26, 27, 39] but takes a step forward, including a broader range of techniques and evaluating the capability of dimensionality reduction techniques on both pre-trained and fine-tuned pre-computed embeddings from state-of-the-art contextual Transformer models from the increasingly demanded multilingual point of view.

The performance of Machine Learning (ML) models and, particularly, Deep Learning (DL) approaches is heavily dependent on the choice of data representation (or features) on which they are applied [42]. For that reason, much of the effort in deploying ML and DL solutions is dedicated to obtaining a data representation that can support effective learning, to which dimensionality reduction techniques can contribute. In the literature, different examples of combining dimensionality reduction techniques with complex Deep Learning models can be found. CNN models and PCA have already been combined for low-dimensional parameterization of complex 3D geomodels [43].
A hybrid architecture composed of PCA and Deep Neural Network models has also been proposed for time series forecasting to predict fine particulate matter concentrations in urban air pollution [44]. Additionally, dimensionality reduction techniques have been applied as a pre-processing step to improve text document clustering [45]. Furthermore, the combination of Deep Learning models and dimensionality reduction techniques has recently been explored in specific application domains such as health science for cancer classification, where gene expression data is dimensionality-reduced before training a Deep Belief Network (DBN) classifier [46].

Semantics has many applications in a wide range of domains and tasks. Recent developments in Information Retrieval [28, 47, 48] have demonstrated the potential of combining semantic-aware models with traditional baseline algorithms (e.g., BM25) [49]. Moreover, the use of semantic-aware models has proven to be an excellent approach to counteract informational disorders (i.e., misinformation, disinformation, malinformation, misleading information, or any other kind of information pollution) [50, 51, 52, 53] or to build automated fact-checking approaches [54]. Additionally, semantic similarity can be applied to organize data according to text properties, which is formally an unsupervised thematic analysis [55]. Following the same criterion, measuring the semantic similarity between a sentence and each word in the sentence can be used to extract the keywords with the highest semantic content from the sentence [56]. All these applications rely on measuring semantic textual similarity (STS), making STS a crucial task in NLP. A key limitation of these semantic-aware solutions is the language bottleneck [57]. Language constitutes one of the most significant barriers to be addressed, since a model's ability to handle multiple languages is essential for its widespread application.

Altogether, this work aims to broaden our knowledge of semantic-aware Transformer-based models by analysing the impact of different dimensionality reduction techniques on the performance of multilingual siamese Transformers on multilingual semantic textual similarity tasks. The results of this study will significantly contribute to understanding how different tuning approaches affect performance on semantic-aware tasks and how dimensionality reduction techniques deal with the high-dimensional embeddings computed for the STS task. The main goal of this research lies in providing a deep analysis of the use of different dimensionality reduction techniques to reduce the size of the output embeddings of different Transformer models, and of their impact on performance.

Figure 1: Representation of the approaches followed to evaluate the impact of different dimensionality reduction techniques in multilingual Transformers, where T represents the pre-trained Transformer model, T' the fine-tuned Transformer model, and R the dimensionality reduction technique.

We have followed four approaches, aiming to evaluate and quantify the reduction margin and its effect on performance while using different training methodologies. These four approaches are displayed in Figure 1 and are described as follows:
• Approach 1 - Pre-trained models. In the first approach, we employ and directly evaluate the pre-trained models on the mSTSb task without applying any dimensionality reduction. This approach is used as the baseline for general pre-trained models.
• Approach 2 - Fine-tuned models. In this second approach, the pre-trained models are fine-tuned downstream and evaluated on the mSTSb task without applying any dimensionality reduction technique. This approach is used as the baseline for fine-tuned models.
• Approach 3 - Reduced pre-trained models. In this approach, the embeddings generated by the pre-trained models from Approach 1 on the mSTSb train split are used to fit the different dimensionality reduction techniques, which are then evaluated on the mSTSb test split. Thus, comparing the results achieved in Approach 1 and Approach 3 will help to understand the impact of dimensionality reduction techniques on pre-trained models.
• Approach 4 - Reduced fine-tuned models. This approach is equivalent to Approach 3 but uses the fine-tuned models from Approach 2, allowing us to assess the impact of dimensionality reduction techniques on this type of model.

The next subsections describe in detail the techniques explored to reduce the embeddings. Then, the multilingual Transformer models used are presented.

Transformer-based models embed textual information into vectors of high dimensionality. The pre-trained multilingual models considered here usually generate embeddings with 768 dimensions, or 1,024 in the case of xlm-roberta-large. As previously mentioned, dimensionality reduction techniques can be grouped according to two non-mutually exclusive criteria. In this project, we have included a whole range of dimensionality reduction techniques, including linear and nonlinear feature extraction and feature selection techniques. Nevertheless, since the Transformer models included in this work are employed in a siamese fashion to determine the degree of similarity between a pair of sentences, they output a pair of separate (non-concatenated) embeddings between which the similarity is estimated using the cosine distance. Hence, for each labelled similarity score there are two separate embeddings. For this reason, even though the data is labelled, only unsupervised methods are explored. The dimensionality reduction techniques explored in this project are:

• Principal Component Analysis (PCA) [34, 35]: a powerful unsupervised linear feature extraction technique that computes a set of orthogonal directions from the covariance matrix that capture most of the variance in the data [58]. That is, it creates new uncorrelated variables that maximize variance while retaining most of the existing structure in the data.
• Independent Component Analysis (ICA) [59]: an unsupervised probabilistic feature extraction method for learning a linear transformation that finds components that are maximally independent of each other and non-Gaussian (non-normal), while jointly maximizing mutual information with the original feature space.
• Kernel Principal Component Analysis (KPCA) [60]: a kernel-based learning method for PCA. It uses kernel functions to construct a nonlinear version of the linear PCA algorithm by first implicitly mapping the data into a nonlinear feature space and then performing linear PCA on the mapped patterns [58]. The kernels considered in this project are the Polynomial, Gaussian RBF, Hyperbolic Tangent (Sigmoid), and Cosine kernels.
• Variance Threshold: an unsupervised feature selection approach that removes all features with a variance below a given threshold. This technique thus selects a subset of features with large variances, considered more informative, without taking the desired outputs into account.
• Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [61]: the authors of UMAP describe it as an algorithm for unsupervised dimension reduction based on manifold learning techniques and topological data analysis. In short, it first embeds the data points in a new nonlinear fuzzy topological representation using neighbor graphs. Secondly, it learns a low-dimensional representation that preserves as much information of this space as possible by minimizing cross-entropy. Compared to counterparts such as t-SNE, UMAP is fast, scalable, and allows better control of the desired balance between the local and global structure to be preserved. Two main parameters play a vital role in controlling this: (1) the number of sample points that defines a local neighborhood in the first step, and (2) the minimum distance between embedded points in the low-dimensional space in the second step. Larger values of the number of neighbors tend to preserve more global information in the manifold, as UMAP has to consider larger neighborhoods to embed a point. Likewise, larger minimum distance values prevent UMAP from packing points together and preserve the overall topological structure.

Regarding the preprocessing steps required before these techniques, it should be noted that PCA and KPCA assume a Gaussian distribution and that the features must be normalized; otherwise, the variances will not be comparable. Therefore, the StandardScaler is applied beforehand. Regarding ICA, a non-Gaussian distribution is assumed and the data is already whitened by the algorithm, so no previous preprocessing step is necessary. For the Variance Threshold, the most appropriate standardization method is the MinMaxScaler, as it transforms all features to the same scale without altering the initial variability. This allows the selected variance threshold to affect all dimensions equally. Finally, since there is no Gaussian assumption underlying UMAP and the cosine distance calculation benefits from scaling the features to a given range, the MinMaxScaler is applied before UMAP. A summary of the necessary considerations regarding the above scaling steps and the characteristics of the dimensionality reduction techniques applied in this project is given in Table 1. It is worth mentioning that these dimensionality reduction and preprocessing algorithms come from scikit-learn v1.0.2 [62], except for UMAP, which belongs to umap-learn v0.5.2 [61]. For the sake of reproducibility, the different parameters and values used in the experiments are presented in Table 2. Finally, the variance threshold filters tested for the Variance Threshold technique are obtained by first calculating the variance of each feature (i.e., 768 variances for 768 embedding dimensions), extracting the deciles, and including the maximum and minimum of these variances.

The effects of the dimensionality reduction and fine-tuning processes were explored on the following pre-trained multilingual models extracted from Hugging Face [63]:

• bert-base-multilingual-cased: BERT [4] Transformer model pre-trained on a large corpus of Wikipedia articles in 104 languages using the self-supervised masked language modeling (MLM) objective, with ∼177M parameters.
• distilbert-base-multilingual-cased: a distilled version of the previous model, on average twice as fast, totalling ∼134M parameters [64].
• xlm-roberta-base: base-sized XLM-RoBERTa [65] model totalling ∼125M parameters. XLM-RoBERTa is based on the RoBERTa model [66], a robustly optimized version of BERT, pre-trained on CommonCrawl data covering 100 languages.
• xlm-roberta-large: large-sized XLM-RoBERTa [65] model totalling ∼355M parameters.
• LaBSE: Language-agnostic BERT Sentence Embedding [67] model trained to encode and reduce the cosine distance between translation pairs with a siamese architecture based on BERT, a task related to semantic similarity. It was trained on over 6 billion translation pairs covering 109 languages. The authors also reported that it has zero-shot capabilities, as it was able to produce decent results for languages not seen during training.

The multilingual extended STS Benchmark (mSTSb) [48] train set is used for fine-tuning the multilingual Transformers and fitting the variety of dimensionality reduction techniques. This split consists of 16 languages combined into 31 monolingual and cross-lingual tasks with 5,749 pairs of sentences each. Likewise, the mSTSb test set is used to evaluate the performance of the models obtained from the different approaches. The mSTSb test set is also composed of 31 multilingual tasks, with 1,379 pairs of sentences per task. To evaluate performance on mSTSb, the sentence embeddings for each pair of sentences are computed and the semantic similarity is measured using the cosine similarity metric. Then, the Spearman correlation coefficient (ρ or r_s) is computed between the obtained scores and the gold standard scores, as it is recognised as an official metric for semantic textual similarity tasks [68, 69].

It is important to note that the mSTSb data available for fitting (i.e., the train split) totals more than 183k sentences (i.e., 16 languages with 5,749 pairs of sentences each). For linear PCA, this dataset is too large to fit in memory. To manage this situation, an Incremental PCA approach [70] is applied, which fits the PCA in batches, making it independent of the number of input data samples but still dependent on the number of input features. Similarly, KPCA and UMAP are computationally more expensive than their linear counterparts [60, 58]. For this reason, these dimensionality reduction techniques were fitted using a subset of 10k sentence pairs (i.e., 20k sentences), always ensuring that the number of data instances is larger than the number of dimensions. To perform this subsampling, the following requirements were taken into account: (1) all 16 languages must be equally represented, giving a total of 625 sentence pairs per language; (2) all sentences present in the original train split must appear at least once in some language; and (3) the assignment of the different sentence pairs must be as random as possible. Following these criteria, we sample by assigning sentence pairs to a randomly selected language until the maximum number of sampled data points is reached. To avoid any bias in the order in which sentences are assigned to languages, the different sentence pairs are shuffled randomly at each iteration. Once a language reaches its maximum number of samples, it is discarded. This ensures a random distribution of samples in each language while including the full range of sentences present in the original train data.
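To make this fitting and evaluation protocol concrete, the sketch below pairs each technique with the scaler listed in Table 1, fits it on pre-computed train embeddings, and scores the reduced test embeddings with the cosine-similarity/Spearman procedure, averaging the per-task coefficients through Fisher's z. It is a simplified illustration rather than the exact experimental code: the target dimensionality, the variance threshold, the UMAP parameters, and the random placeholder arrays standing in for the real mSTSb embeddings are all assumptions (the actual grids are those of Table 2).

```python
import numpy as np
import umap  # umap-learn
from scipy.stats import spearmanr
from sklearn.decomposition import FastICA, IncrementalPCA, KernelPCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

K = 64  # illustrative target number of dimensions

# Scaler/technique pairings following Table 1 (all unsupervised, fitted on train embeddings only).
reducers = {
    "PCA": make_pipeline(StandardScaler(), IncrementalPCA(n_components=K, batch_size=512)),
    "ICA": FastICA(n_components=K, max_iter=1000),  # no scaler; whitening is internal
    "KPCA-sigmoid": make_pipeline(StandardScaler(), KernelPCA(n_components=K, kernel="sigmoid")),
    "VarThreshold": make_pipeline(MinMaxScaler(), VarianceThreshold(threshold=0.01)),
    "UMAP": make_pipeline(MinMaxScaler(), umap.UMAP(n_components=K, n_neighbors=15, min_dist=0.1)),
}

def avg_spearman(tasks, reduce_fn):
    """Cosine similarity per sentence pair, Spearman per task, Fisher-z average across tasks."""
    zs = []
    for emb_a, emb_b, gold in tasks:  # one entry per mono- or cross-lingual task
        a, b = reduce_fn(emb_a), reduce_fn(emb_b)
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        rho = spearmanr(np.sum(a * b, axis=1), gold)[0]
        zs.append(np.arctanh(rho))
    return np.tanh(np.mean(zs))  # back-transform the averaged z value

# Placeholders: X_train stands for the pre-computed mSTSb train embeddings, and test_tasks for the
# per-task test embeddings of both sentences in each pair together with the gold similarity scores.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((2000, 768))
test_tasks = [(rng.standard_normal((100, 768)), rng.standard_normal((100, 768)), rng.random(100))
              for _ in range(3)]

for name, reducer in reducers.items():
    reducer.fit(X_train)
    print(name, round(avg_spearman(test_tasks, reducer.transform), 3))
```

With real embeddings, reducer.transform is also the function that would be applied to the stored vectors of Approaches 3 and 4 before computing similarities.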
In summary, the process of applying dimensionality reduction techniques (i.e., Approaches 3 and 4) has two steps for each model. First, each dimensionality reduction technique is fitted on the embeddings computed by the multilingual models for the mSTSb train split. Second, the mSTSb test set is used to evaluate each selected reduced number of features for each technique and each model. A wide range of numbers of dimensions is explored for each dimensionality reduction technique, as shown in Section 5.

The fine-tuning of the models based on the siamese training strategy for Approach 2 was performed following the methodology described by Reimers et al. [5, 57]. The following hyperparameters were optimized: number of epochs, scheduler, weight decay, batch size, warmup ratio, and learning rate. The hyperparameter values explored and the results of the experiments performed can be consulted in Weights & Biases.

Additionally, to test whether the use of reduced embeddings has a significant impact on performance compared to the baseline approaches, we compare the average Spearman correlation coefficient of the five multilingual siamese Transformer models (see Section 3.2) between each pair of baseline and reduced approaches (i.e., Approach 1 vs Approach 3, and Approach 2 vs Approach 4). For this purpose, as we are comparing the same set of models across approaches, a two-tailed paired T-test with a significance level of 0.05 is conducted to test the null hypothesis of identical average Spearman correlation coefficient scores.

This section aims to summarise the effect of a wide variety of dimensionality reduction techniques on the performance of multilingual siamese Transformers by comparing the baseline approaches (i.e., Approaches 1 and 2) with the reduced approaches (i.e., Approaches 3 and 4) for each model independently. It must be noted that this work does not aim to provide a comparative analysis of the different models presented in Section 3.2 or to identify the best model for this task. In contrast, this work focuses on applying dimensionality reduction techniques to reduce the dimensionality of the models' embeddings. Thus, the application of the different dimensionality reduction techniques does not affect the execution or the memory requirements for running the models; it only diminishes the space needed to store the embeddings and increases the speed of computing the cosine similarity between them. For space reasons, average results across the 31 monolingual and cross-lingual tasks are presented instead of a breakdown by language. The average of the Spearman correlation coefficients is computed by transforming each correlation coefficient to a Fisher's z value, averaging the z values, and then back-transforming the average to a correlation coefficient.

As can be seen in Figure 2, for every model, the pre-trained performance on mSTSb (i.e., Approach 1) is improved using different dimensionality reduction techniques. These results evidence that dimensionality reduction techniques are able, to some extent, to adjust the knowledge present in the pre-trained embeddings to the semantic similarity task. This fact becomes even more significant in the case of LaBSE, a model with zero-shot capabilities trained on a task close to semantic similarity, which also greatly benefits from the use of dimensionality reduction techniques, increasing the Spearman correlation coefficient by up to 0.4 points (see Figure 2e and Table 3).
For the rest of the models, it is equally remarkable that the dimensionality reduction techniques improve the pre-trained performance, almost doubling the score for the models with the XLM-RoBERTa architecture. Clearly, the best technique in Approach 3 is ICA, not only because it obtains the most remarkable improvement over the pre-trained performance for all models, as shown in Table 3, but also because it is the technique that surpasses the pre-trained models of Approach 1 most quickly and with the fewest dimensions (see Table 4). The two-tailed paired T-test comparing Approach 1 vs Approach 3 using the values from Table 3 resulted in p = 0.041, indicating that the improvement in performance is significant when using ICA as a dimensionality reduction technique for pre-trained embeddings. These findings corroborate the ideas of Raunak et al. [25], who maintained that reduced word embeddings can achieve similar or better performance than the original pre-trained embeddings.

The most likely explanation for these results is the difference between the objective of ICA and that of the other feature extraction techniques. Even though all of them transform the initial space through combinations of dimensions into a new space, the ICA technique is based on the optimization of mutual information [71, 59]. It tries to find a space where the new features (latent variables) are as independent of each other as possible but as dependent as possible on the initial space. Therefore, in the case of ICA, unlike other techniques such as PCA or KPCA, a higher number of components does not necessarily mean an increase in the information retained or an improvement in the result (as can be seen in Figure 2e, where there is a decrease from 169 dimensions onwards). This would explain why a low number of dimensions can outperform Approach 1. Likewise, the fact that ICA achieves the best results in Approach 3 for all models could be due to the assumptions and characteristics of both ICA and the pre-trained embeddings. First, the pre-trained embeddings probably include non-relevant and noisy variables, as these embeddings are not adjusted to the STS task. Second, since ICA is a technique in which the original variables are related linearly to the latent variables but the latent distribution is non-Gaussian, the noise present in pre-trained embeddings agnostic of the STS task could be handled appropriately. Interestingly, these results also emphasize that the issue of non-Gaussianity is more relevant than that of nonlinearity: non-Gaussianity appears to matter more than how the initial variables are combined, as the ICA technique outperforms both linear PCA and nonlinear KPCA. This is in good agreement with other studies comparing the performance of PCA and ICA as feature extraction methods in visual object recognition tasks [72, 73]. Additionally, the presence of noisy variables in the pre-trained embeddings is also corroborated by the low scores obtained with the Variance Threshold feature selection technique, which depends entirely on the original variables and cannot handle these noisy distributions. Consequently, ICA shows great potential for obtaining compact versions of pre-trained embeddings, with a large decrease in the number of dimensions that improves the results on the multilingual semantic similarity task. For all these reasons, we can understand unsupervised dimensionality reduction, especially ICA, as a method of fitting pre-trained models to downstream tasks.
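A compact way to exploit this behaviour is to sweep the number of ICA components and keep the smallest value that already beats the unreduced pre-trained baseline, in the spirit of Table 4. The sketch below makes the same assumptions as the previous one (random placeholder arrays instead of the real pre-computed embeddings, and an illustrative list of candidate dimensionalities), so its printed scores are meaningless; only the procedure is of interest.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import FastICA

def sts_score(emb_a, emb_b, gold):
    """Spearman correlation between gold similarities and cosine similarities."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return spearmanr(np.sum(a * b, axis=1), gold)[0]

# Placeholders for the pre-trained embeddings (Approach 1) and a held-out evaluation split.
rng = np.random.default_rng(0)
X_fit = rng.standard_normal((5000, 768))          # train-split embeddings used to fit ICA
dev_a = rng.standard_normal((500, 768))
dev_b = rng.standard_normal((500, 768))
gold = rng.random(500)

baseline = sts_score(dev_a, dev_b, gold)          # unreduced pre-trained baseline
for k in (16, 32, 64, 128, 256):                  # illustrative candidate dimensionalities
    ica = FastICA(n_components=k, max_iter=1000).fit(X_fit)
    score = sts_score(ica.transform(dev_a), ica.transform(dev_b), gold)
    print(f"k={k:3d}  spearman={score:+.3f}  beats baseline: {score > baseline}")
```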
As will be seen in the next section, and as might be expected, this unsupervised dimensionality reduction downstream fitting is not comparable to a supervised fitting such as the fine-tuning of Approach 2. However, downstream fitting through unsupervised dimensionality reduction techniques may offer interesting advantages: being unsupervised, it is task-agnostic, resulting in models with higher generalizability and a lower number of dimensions. In addition, these dimensionality reduction techniques do not require GPUs and rely on a more interpretable methodology than fine-tuning a Deep Learning model such as a Transformer.

Although it was stated at the beginning of this section that model comparison is not an objective of this paper, different versions of the same architecture (i.e., xlm-roberta-base & xlm-roberta-large, bert-base-multilingual-cased & distilbert-base-multilingual-cased) have been included in the study for a wider evaluation of the effects of dimensionality reduction techniques. Based on the complexity and learning potential of the models, one would expect the xlm-roberta-large model to perform better than the xlm-roberta-base model. Similarly, the bert-base-multilingual-cased model would be expected to be superior to the distilbert-base-multilingual-cased model. In Approach 1, however, the opposite is true. Only when fine-tuning on the task takes place (Approaches 2 and 4) is the performance of the models in line with the expected complexity (see Table 5). These results provide wider support for the importance of supervised fine-tuning.

Similarly, fine-tuning also alters the impact of dimensionality reduction techniques on the results of the multilingual models. Compared to Approach 3, after fine-tuning, feature selection techniques and nonlinearity become more important, ICA becomes less critical, and the minimum number of dimensions needed to outperform the baseline approach increases. As can be seen in Figure 3 and Table 5, the promotion of Variance Threshold feature selection to one of the best techniques for some models in Approach 4 could be attributed to the fact that fine-tuning adjusts the embeddings to the task, reducing the presence of noisy variables and thus benefiting the variable selection process. This is in line with the results obtained in Approach 3, where feature extraction techniques handled the presence of unadjusted variables more adequately. It further supports the argument that feature extraction techniques can reduce dimensions and generate new feature representations that help improve performance on learning problems. Furthermore, Table 6 shows the loss of ICA dominance and the emergence of the KPCA-sigmoid technique as the method that improves on the baseline Approach 2 with the fewest dimensions (a 59.32% ± 29.92% reduction in dimensions while retaining 99.00% ± 2.00% of the baseline's performance). This reveals that managing the non-Gaussianity issue is less relevant than the nonlinearity issue after fine-tuning. The fine-tuning process also impacts the reduction capabilities of the dimensionality reduction techniques: considering, for each model, the technique that retains the maximum performance with the fewest dimensions in Table 6 shows that the initial dimensions from the baseline Approach 2 are reduced by an average of 54.65% ± 32.20%.
Although this average reduction is lower than the one achieved earlier when comparing Approach 3 with the baseline Approach 1, it is still remarkable that, even after fine-tuning, the multilingual performance can be exceeded with half of the dimensions. Finally, the two-tailed paired T-test comparing Approach 2 vs Approach 4 using the values from Table 5 resulted in p = 0.255, revealing that there is no significant difference in performance when using these dimensionality reduction techniques.

We shall conclude this analysis with the results obtained with UMAP. The case of UMAP shows that, for the STS task, it is not a suitable technique for reducing the dimensionality of the embeddings, since it is the one that retains the lowest percentage of the baseline performance for both pre-trained and fine-tuned embeddings. Considering this fact, it is interesting to note that the potential of this technique resides in how quickly it saturates, i.e., the maximum retained performance is reached with a small number of dimensions: in Approach 3, an average of 94.65% ± 6.07% of the initial dimensions are reduced while retaining 49.00% ± 11.14% of the performance of the reference Approach 1, and, more notably, in Approach 4 an average of 98.42% ± 0.72% of the initial dimensions are reduced while retaining 76.00% ± 3.74% of the performance of the reference Approach 2.

In addition to embedding compression and better generalization capabilities, one of the primary uses of dimensionality reduction techniques is facilitating data visualization. In NLP, data visualization can be helpful for interpreting a model's learning process, its semantic capabilities, and its results. For instance, Mikolov et al. [74] explored 2D projections of high-dimensional word embeddings to extract multiple relationships between words. For this reason, in this work we have also applied the different techniques for 2D and 3D dimension reduction to study their usefulness for visualization and model interpretability. A straightforward example of the applicability of these techniques can be found in Figure 4. In this figure, we compare the 2D embedding representations of a random subset of sentence pairs from the STS dataset in English with different levels of semantic similarity. The comparison between Figure 4a and Figure 4b allows us to observe the effect of fine-tuning on the bert-base-multilingual-cased model. In the pre-trained case, we observe that the distances between pairs of sentences are similar and do not match the labelled similarity. After fine-tuning, however, we observe a better agreement between the spatial representation and the labelled similarity, demonstrating and supporting the improvement brought by downstream task fine-tuning. Besides, the numerical results of this analysis are presented in Table 7 for Approach 3 and Table 8 for Approach 4. In both cases, there is no clear dominance of any technique. The most likely explanation for this variety of results is that dimensionality reduction for visualization purposes is highly model-dependent. As anticipated, feature extraction techniques clearly prove to be more useful than feature selection in both approaches, since transforming the initial high-dimensional space into a new reduced latent space can extract more information than selecting specific variables, even if these variables are properly adjusted to the task. Regarding UMAP and visualization, it does not show noteworthy results either.
However, it is intriguing to note that UMAP turns out to be the best visualization technique for some models, which corroborates the saturation capacity hypothesis observed in the previous analyses. The discrepancy between models regarding the best visualization technique could be attributed to the previously mentioned model dependency. Another possible explanation is that the greatest potential of this technique resides precisely in visualization, as shown by other works where this capability is exploited, such as in health science for identifying cell types [75] or in Artificial Intelligence to analyze the activations throughout modern convolutional neural networks [76]. However, we acknowledge the limitations that may derive from the task and the models used in this study, given that the technique is highly dependent on the parameters used, which unfortunately could not be explored intensively. Altogether, it is advisable to explore all techniques for visualization purposes and select the one that best suits the desired task.

In this investigation, the goal was to assess the impact of a variety of dimensionality reduction techniques on the performance of pre-computed multilingual siamese Transformer embeddings on the semantic textual similarity tasks from mSTSb, in order to expand our knowledge of semantic-aware Transformer-based models. To this end, two different baseline approaches are reduced (i.e., Approach 1 and Approach 2), one using the pre-trained version of the models and the other further fine-tuning them on the downstream STS task. Particular attention is paid to analysing which techniques improve the performance of the baseline approaches best and with the fewest dimensions. From the research carried out, it is possible to conclude that dimensionality reduction techniques can help reduce the number of dimensions of the embeddings while improving the results when using the pre-trained embeddings from Approach 1, and while preserving the performance when using the fine-tuned embeddings from Approach 2. Nevertheless, the dimensionality reduction is more considerable in the pre-trained version, with an average of 91.58% ± 2.59% compared to the average of 54.65% ± 32.20% for the fine-tuned version. Special attention is given to ICA in the pre-trained scenario, which proved to handle adequately the noisy variables present in unadjusted embeddings. This technique also proved to be a reasonable alternative for fitting the models to the downstream task in an unsupervised way, leading to a generalized adjusted version of the models with downstream multitasking capabilities. Nevertheless, it has also been shown that this unsupervised fitting is not comparable to supervised fine-tuning. On the other hand, the fine-tuned scenario revealed the relevance of feature selection techniques and the significance of nonlinear KPCA techniques for dimensionality reduction. Our experimental results are consistent with the previous results obtained by Raunak et al. [25, 26], corroborating the hypothesis that embedding reduction can maintain or improve the performance of the original embeddings, and extending their evaluation to state-of-the-art contextual models from a multilingual perspective. In this way, we can establish that dimensionality reduction techniques can be leveraged for contextualized embeddings as well. This work has also considered the consequences of dimensionality reduction for visualization purposes, extending previous work in the literature [39].
The results corroborate that feature extraction methods are more practical for STS tasks than feature selection. However, selecting the best dimensionality reduction technique for visualization depends on the model, so it is recommended to explore different techniques for this purpose. To our knowledge, this is the first study to investigate the effect of dimensionality reduction techniques on Transformer models in a multilingual semantic-awareness scenario. Based on the promising findings presented in this paper, continued research into the impact of dimensionality reduction techniques on other highly demanded NLP tasks appears well justified. Furthermore, in future research we intend to concentrate on testing the reduced models presented in this work in real-world applications. As previously stated, these findings on multilingual semantic similarity have direct practical applicability. The combination of dimensionality reduction techniques with Transformer models could also help reduce embedding sizes and make ensemble approaches possible. Finally, further studies on the multitasking generalization capabilities of ICA for pre-trained models are still required.

References

[1] A survey of the usages of deep learning for natural language processing
[2] Efficient transformers: A survey
[3] Attention is all you need
[4] BERT: Pre-training of deep bidirectional transformers for language understanding
[5] Sentence-BERT: Sentence embeddings using siamese BERT-networks
[6] BERTuit: Understanding Spanish language in Twitter through a native transformer
[7] Natural language processing
[8] SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation
[9] Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring
[10] Correlation coefficients and semantic textual similarity
[11] Soft similarity and soft cosine measure: Similarity of features in vector space model
[12] Enhancing deep learning sentiment analysis with ensemble techniques in social applications
[13] Deep learning based fusion approach for hate speech detection
[14] A survey of the recent architectures of deep convolutional neural networks
[15] Barcelona (online): International Committee for Computational Linguistics
[16] What happened in CLEF
[17] Hyperspectral band selection: A review
[18] A review of unsupervised feature selection methods
[19] Implementation of machine-learning classification in remote sensing: An applied review
[20] Hands-on unsupervised learning using Python: How to build applied machine learning solutions from unlabeled data
[21] A review of feature selection and feature extraction methods applied on microarray data
[22] Rethinking network pruning - under the pre-train and fine-tune paradigm
[23] Learning to compute word embeddings on the fly
[24] Billion-scale similarity search with GPUs
[25] Effective dimensionality reduction for word embeddings. 4th Workshop on Representation Learning for NLP (RepL4NLP)
[26] On dimensional linguistic properties of the word embedding space
[27] Post-processing and dimensionality reduction for extreme learning machine in text classification
[28] An introduction to neural information retrieval
[29] Feature extraction methods and manifold learning methods
[30] Energy and policy considerations for deep learning in NLP. 57th Annual Meeting of the Association for Computational Linguistics (ACL)
[31] Billion-scale similarity search with GPUs
[32] GloVe: Global vectors for word representation
[33] Enriching word vectors with subword information
[34] On lines and planes of closest fit to systems of points in space
[35] Principal component analysis: A review and recent developments
[36] Indexing by latent semantic analysis
[37] Text classification based on the word subspace representation
[38] Embedding compression with right triangle similarity transformations
[39] A comparative study of methods for visualizable semantic embedding of small text corpora
[40] Stochastic neighbor embedding
[41] How optimizing perplexity can affect the dimensionality reduction on word embeddings visualization?
[42] Representation learning: A review and new perspectives
[43] 3D CNN-PCA: A deep-learning-based parameterization for complex geomodels
[44] Applying PCA to deep learning forecasting models for predicting PM2.5
[45] Textual data dimensionality reduction - a deep learning approach
[46] Probabilistic principal component analysis (PPCA) based dimensionality reduction and deep learning for cancer classification
[47] Document ranking with a pretrained sequence-to-sequence model
[48] Countering misinformation through semantic-aware multilingual models
[49] Simple BM25 extension to multiple weighted fields
[50] Information disorder: Toward an interdisciplinary framework for research and policy making
[51] Data citizenship: Rethinking data literacy in the age of disinformation, misinformation, and malinformation
[52] Unsupervised WhatsApp fake news detection using semantic search
[53] CIVIC-UPM at CheckThat! 2021: Integration of transformers in misinformation detection and topic classification
[54] FacTeR-Check: Semi-automated fact-checking through semantic similarity and natural language inference
[55] BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics
[56] KeyBERT: Minimal keyword extraction with BERT
[57] Making monolingual sentence embeddings multilingual using knowledge distillation
[58] An introduction to kernel-based learning algorithms
[59] Independent component analysis: Recent advances
[60] Nonlinear component analysis as a kernel eigenvalue problem
[61] UMAP: Uniform manifold approximation and projection
[62] Scikit-learn: Machine learning in Python
[63] Transformers: State-of-the-art natural language processing
[64] DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter
[65] Unsupervised cross-lingual representation learning at scale
[66] RoBERTa: A robustly optimized BERT pretraining approach
[67] Language-agnostic BERT sentence embedding
[68] Task-oriented intrinsic evaluation of semantic textual similarity
[69] GLUE: A multi-task benchmark and analysis platform for natural language understanding
[70] Incremental learning for robust visual tracking
[71] Pattern recognition and machine learning (Information Science and Statistics)
[72] Enhanced independent component analysis and its application to content based face image retrieval
[73] Multiresolution face recognition
[74] Efficient estimation of word representations in vector space
[75] The single-cell transcriptional landscape of mammalian organogenesis
[76] Activation atlas