title: Aesthetics, Personalization and Recommendation: A Survey on Deep Learning in Fashion
authors: Gong, Wei; Khalid, Laila
date: 2021-01-20

Machine learning is transforming the fashion industry. Brands large and small use machine learning techniques to improve revenue, attract customers, and stay ahead of trends. People care about fashion: they want to know what looks best and how they can refine their style and elevate their personality. Combining deep learning with computer vision techniques opens many directions within the fashion domain, including brain-inspired deep networks, neuroaesthetics, training GANs, working with unstructured data, and adopting the transformer architecture. The goal is to design systems that provide fashion-relevant information to meet the ever-growing demand. Personalization is a major factor in customers' spending choices, and this survey presents notable approaches to achieving it, examining how visual data can be interpreted and leveraged by different models. Aesthetics also play a vital role in clothing recommendation, since users' decisions depend largely on whether the clothing matches their aesthetics, yet conventional image features cannot capture this directly.
To that end, the survey also highlights notable models, such as the tensor factorization model and the conditional random field model, that acknowledge aesthetics as an important factor in apparel recommendation. These AI-inspired deep models can pinpoint which styles resonate best with customers and anticipate how new designs will be received by the community. With AI and machine learning, businesses can stay ahead of fashion trends.

Over the past decade, deep learning has achieved significant success in many popular industries and areas. Perception tasks, including visual object recognition, text understanding, and speech recognition, have been revolutionized, and deep learning's success there is unmatched. In the fashion industry, however, many opportunities and research areas remain open. Fashion is an ever-evolving industry: new trends emerge every second. Clothing design is among the most creative realms of the contemporary world [21], whether because of the considerable creative component of the design process or because of the equivocal information about clothing. Internet shopping has also grown enormously in recent years, creating immense opportunities for fashion. Exciting applications for image understanding, retrieval, and tagging are surfacing, and techniques such as text analysis, image analysis, and similarity retrieval can all be applied to fashion. Deep learning lets us train computers to perform human-like tasks such as recognizing speech, identifying images, and making predictions.
For example, results in apparel design and the fashion industry allow users to translate an image into text that can be interpreted as a description of a garment based on its sketch. Images matter because they both display content and convey emotions such as sadness, excitement, or anger, so effective image classification is valuable and widely used in computer vision and multimedia. Research on fashion that specifically targets aesthetic features or personalization, however, covers only a few directions. One of them describes images using features inspired by art theories, which are intuitive, discriminative, and easily understandable; image classification based on these features can achieve high accuracy compared with the state of the art. As an example, the authors of [63] develop an emotion-guided image gallery to demonstrate their proposed feature collection, mining interpretable visual features that directly affect human emotional perception from the viewpoint of art theories. Another paper [2] addresses content-based image retrieval in terms of people's appearance: a two-stage process composed of image segmentation and region-based interpretation, where an image is modelled as an attributed graph and a hybrid method follows a split-and-merge strategy. Much more is being explored in this area of computer vision, and image retrieval from databases is typically formulated in terms of descriptions that combine salient features such as colour, texture, and shape.
Today, more and more retail companies are trying to stay ahead of the trend curve and reshape their business with tech-forward approaches and solutions. Data analysis brings diverse opportunities, allowing companies to reach their customer goals and offer a smarter experience. The major challenge fashion retailers face, however, is the lack of profound insights based on reliable statistics. Computer vision technologies and deep learning can help here. Although computer vision is still an evolving technology, we can already point to specific optimization and cost-reduction techniques: information about what people wear, how customers match their garments, and who influences their taste is essential for fashion retailers. Instagram influencers, for example, inspire many followers who copy their trends. Image recognition technology also helps business owners collect and process data and gain actionable insights for effective trend forecasting. In [10], the dashboard the authors developed shows how frequently a specific type of garment appears per day, what type of apparel is popular within a particular age range, and how people match their attire; for instance, why a specific jacket is trending among teenagers, or why a scarf is popular among the elderly. They also developed a graph showing how prevalent certain types of garments are expected to be over the next seasons, which could broadly shape upcoming fashion trends. This kind of analysis aims to help fashion retailers and brands plan sales and avoid surplus.
The author suggests that in the visual search domain, with a focus on image similarity for e-commerce and online shops, understanding images of clothing means far more than classifying them into categories: without a meaningful description of the whole image being classified, much useful information is lost. With such descriptions, one can gain reliable and timely insights into fashion trends across any location. Those trends are defined by people's individual choices: how they choose something and what suits their personality. Personalization is one of the biggest game-changers in apparel recommendation, and by targeting it, businesses can attract more customers. Notably, deep learning is penetrating industries and activities where human creativity has traditionally dominated, adding a futuristic touch to fashion, art, architecture, music, and so on. A key finding of another paper [9] is that the representations of content and style in convolutional neural networks are separable: both representations can be manipulated independently to produce new, perceptually meaningful images. Fashion is thus an entirely new direction for machine learning. To design clothes, one should understand the underlying mechanisms: how certain styles become famous, what attracts millions of followers, and what causes fashion trends, principles, and patterns to spread and evolve, so that the task of designing or predicting trends can be simplified. The paper under discussion suggests that designing or predicting trends can now be simplified thanks to a new class of neural networks.
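The separability of content and style reported in [9] is often made concrete through the Gram matrix of a convolutional feature map, which summarizes style while the feature map itself carries content. The following is an illustrative pure-Python sketch of that computation on a tiny feature map, not the paper's actual code.

```python
# Style representation as a Gram matrix: G[i][j] = inner product of the
# flattened activations of channels i and j. Content would be the raw
# feature map itself. The 2-channel, 2x2 feature map below is toy data.

def gram_matrix(features):
    """features: list of C channels, each a flat list of H*W activations.
    Returns the C x C Gram matrix."""
    C = len(features)
    return [[sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(C)] for i in range(C)]

feats = [[1.0, 0.0, 2.0, 1.0],   # channel 0, flattened 2x2 map
         [0.0, 1.0, 1.0, 2.0]]   # channel 1
G = gram_matrix(feats)
```

Because the Gram matrix discards spatial arrangement, two images with different content can share the same style statistics, which is what allows the two representations to be manipulated independently.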
These networks can automatically identify shapes, elements, and types of clothing and combine them, offering a fresh way to manipulate patterns and see which ones are more influential than others. Aesthetics play a vital role in a user's pick, and although personalization is tricky to work with, aesthetics are less so: everyone appreciates eye-pleasing items, and a recommender that exploits the role of aesthetics can be highly effective. There are thus many aspects of fashion that deep learning can enhance, and multiple domains in which it can improve current practice and help predict and revolutionize the industry.

This survey is organized as follows. Sec. 2 reviews the leading fashion recommendation systems and approaches that form the basis for future work. Sec. 3 examines the role of aesthetics in fashion and its analysis through various approaches. Sec. 4 provides an overview of personalization in fashion, including top approaches based on deep neural networks, GANs, and unstructured data handling. Sec. 5 presents selected applications and future horizons. Concluding remarks are given in Sec. 6.

Compared with object recognition, fashion sense is more subtle and sophisticated and can require specific domain expertise in outfit composition. If we define an outfit as a set of clothes working together, typically for a desired style, then finding a good outfit composition requires not only following appropriate dressing codes but also a creative balancing of contrasting colours and styles.
Although a number of studies address clothes retrieval and recommendation, none of them consider the problem of fashion outfit composition. On the one hand, fashion concepts are often subtle and subjective, and it is non-trivial to obtain consensus from ordinary labellers who are not fashion experts. On the other hand, there may be a large number of attributes for describing fashion, and obtaining exhaustive labels for training is challenging. As a result, most existing studies are limited to the simple scenarios of retrieving similar clothes or choosing individual clothes for a given event. The paper reviewed here [39] proposes a data-driven approach to train a model that automatically composes a suitable fashion outfit, motivated by the surge of online fashion platforms, including Pinterest and YouTube, where teenagers eagerly create new cultural trends. The authors developed a fully automatic composition system based on a scorer that iteratively evaluates all possible outfit candidates. The model faced two main challenges. First, the complicated visual content of fashion images: there are potentially many different attributes, such as colour, texture, category, and spectrum, and it is impossible to label or even list them all. Second, the rich context of a fashion outfit: clothing outfits can reflect personality and interest, and a style acceptable to one group or culture may be offensive to another.
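The scorer-driven composition loop described for [39] can be pictured as exhaustively scoring candidate outfits and keeping the best one. The sketch below is a minimal stand-in: the scorer simply rewards colour agreement, whereas the paper's scorer is a learned engagement predictor, and the item data is invented for illustration.

```python
from itertools import product

def compose_best_outfit(tops, bottoms, shoes, scorer):
    """Iterate over all outfit candidates and return the best-scoring one."""
    best, best_score = None, float("-inf")
    for outfit in product(tops, bottoms, shoes):
        s = scorer(outfit)
        if s > best_score:
            best, best_score = outfit, s
    return best, best_score

def toy_scorer(outfit):
    """Illustrative compatibility score: rewards items sharing a colour."""
    colours = [item["colour"] for item in outfit]
    return sum(colours.count(c) for c in colours)

tops = [{"id": "t1", "colour": "navy"}, {"id": "t2", "colour": "red"}]
bottoms = [{"id": "b1", "colour": "navy"}]
shoes = [{"id": "s1", "colour": "navy"}, {"id": "s2", "colour": "white"}]
outfit, score = compose_best_outfit(tops, bottoms, shoes, toy_scorer)
```

Exhaustive evaluation like this is only feasible for small catalogs; the combinatorial growth the paper mentions is exactly why a reliable learned scorer matters.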
To infer such information, they take into account not only the pixel information but also the context information in the fashion outfit.

Approach. For the first challenge, they propose an end-to-end system that encodes visual features through a deep convolutional network, taking a fashion outfit as input and predicting user engagement levels. For the second challenge, they use a multi-modal deep learning framework that leverages context information alongside the image itself; their experiments show that the multi-modal approach significantly outperforms the single-modality model and provides a more suitable and reliable solution for the outfit scoring task, and hence for the full composition task. Their contributions are: an end-to-end trainable system that fuses signals from multi-level hybrid modalities, including images and metadata of fashion items; a large-scale database collected for fashion-outfit research; and a fashion outfit composition solution based on a reliable outfit quality predictor. Predicting fashion is never easy, since many interleaving factors, visible or hidden, contribute to the process, and the combinatorial nature of the problem makes it an interesting testbed for state-of-the-art machine learning systems. The fashion domain has several intriguing properties that make personalized recommendation even more difficult than in traditional domains; in particular, one wants to avoid the potential bias of explicit user ratings, which are also expensive to obtain.

2.2.1 Background.
This paper [46] approaches fashion recommendation by analyzing the implicit feedback users leave in an app. The design criterion is that the system must be completely unobtrusive: the recommender cannot rely on explicit ratings, but must instead be based on the rich history of interaction between the user and the app. In other words, it relies on implicit feedback, where user preference is automatically inferred from behavior. This approach raises several challenges. Most notably, every interaction a user has with an item is a sign of interest, so the system never receives negative feedback. Feedback is also multi-faceted (an item can be both clicked and loved), so the different types of feedback must be combined into a single numerical preference score for the recommendation algorithms. Finally, such a system is harder to evaluate than explicit-rating systems, because there is no target rating to compare predictions against. Success therefore relies on a well-defined strategy for inferring user preference from implicit feedback data, combining event types into implicit scores, and evaluating these scores and recommendations with a suitable metric.

Approach.
To build the recommendation system, the authors first generate implicit preference scores: the data captured from a user's interaction with an item is translated into a number, the implicit preference score, which can later be used to rank the item against others. The most important factor in creating such numbers is understanding the available data and its implications for user preference; once the data is analyzed, suitable generalizations can be chosen. The second step is defining the penalisation functions. Important properties of the fashion domain that the system must capture include seasons and trends, price sensitivity, and popularity. In general, when a user triggers an event e (e.g., a click), there is a range of possible scores for that event. Let s_e denote this score, and let s_e_min and s_e_max denote the minimum and maximum score possible for event e. A penalisation function p(x) takes a feature value x (e.g., related to an item's price) and returns a number between zero and one that adjusts the score within the possible range. Since fashion is all about trend and timing, the recentness of an event is a natural feature for weighting its importance, penalising items the user has not considered recently. For this, they look at the number of days t since the user triggered the event in question and compare it with the age t_max of the oldest event the user has in the database. This suggests a linear penalisation p(t) = t / t_max, but it did not fit well with the idea of seasons.
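The scoring formalism above can be sketched in a few lines: an event's score is adjusted inside its [s_min, s_max] range by a penalisation function returning a value in [0, 1]. The linear recency penalty p(t) = t / t_max is shown; all concrete numbers are illustrative, not values from [46].

```python
def linear_recency_penalty(days_since_event, days_since_oldest_event):
    """p(t) = t / t_max: 0 for an event from today, 1 for the oldest event."""
    return days_since_event / days_since_oldest_event

def adjusted_score(s_min, s_max, penalty):
    """Map a penalty in [0, 1] onto the event's score range; a larger
    penalty (an older event) pushes the score towards s_min."""
    return s_max - (s_max - s_min) * penalty

p = linear_recency_penalty(30, 120)   # event is 30 days old, oldest is 120
score = adjusted_score(1.0, 5.0, p)   # click scores assumed to range 1..5
```

The same scaffolding accepts any penalty in [0, 1], which is how the sigmoid recency penalty and the price and popularity penalties discussed next slot into the design.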
As an example, even if a store sells warm clothes from November to March, they wanted the recommendations to shift to summer clothes when the season changes. To reproduce this behavior, they chose a sigmoid function that treats recentness in a way that avoids obscuring the preferences of users who have been absent from the app for some time (a linear penalisation, by contrast, would make the difference in penalisation between the two most recent events equal to the difference between the two oldest ones). For price, different users have different price ranges because they tend to be price sensitive, so an item's price should also be taken into account: the user's typical price range was used to create a personalized score, penalizing items outside the price range preferred by the user. This was done in two simple steps. First, they computed the average price of all items related to a user; second, they calculated the difference between the price of the item that triggered the event and this average, and used it to penalize the item. The third aspect is popularity. They treated popularity as a feature by comparing a user's behavior to that of the rest of the population: if a user's activities conform to common standards, those standards are likely to reflect his or her taste, while deviations are more unique and give significant clues about which items to recommend. They therefore judged each user's behavior against the overall popularity of the items.
Specifically, they looked at the items each user interacted with and applied a linear penalisation to them. Lastly, they combined the different penalisations into a single sum, which required setting different weights for the different factors. To validate their approach, they confirmed that the scores were built using features relevant to the fashion domain, and that the scores were distributed over the full range of valid values with an average consistent with their hypothesis.

Online shopping technology has been developing rapidly, and online fitting and other intelligent clothing equipment have been introduced in the fashion industry. Many advanced algorithms have been developed and many more are in progress. For example, CRESA [44] combines textual attributes, visual features, and human visual attention to compose the clothes profile used in recommendation. Content-based recommendation is applicable in many settings: for new products, items can be recommended according to a user's individual browsing records, and the results have proven explicit and accessible. However, content-based recommendation is often inappropriate when applied in industry, since new users who sign up have no browsing record on which to base recommendations.

Approach. The paper [72] observes that the classification process usually needs to consider quarterly sales, clothing styles, and other factors. The authors therefore divide fashion into four levels, where the fashion level is a subjective measure that normally requires subjective evaluation of image characteristics by an expert group, involving the experts' background knowledge and psychological motivation.
For researchers of visual psychological characteristics, there has been no quantitative description method by which objectively evaluated results can represent the subjective evaluation results. The four fashion levels are: first level, wonderful; second level, great; third level, good; fourth level, common. The aim is thus to find a set of objective indexes that can be used to assess the fashion level, considering all the factors that usually affect personal scoring. The paper regards weak appearance features as an important index influencing the fashion level. There are many weak appearance features related to an individual's fashion level, but the three major categories are make-up, accessories, and hair color, covering blush, lip color, eyebrow color, hats, accessories on the hands and neck, and so on. Using these features, an SVM classification method evaluates whether the human body exhibits weak appearance features. There was previously no effective way to establish a fashion-level database, but the one established in this paper provides a basis for follow-up studies by future researchers; an image database is of great significance for the training and testing of algorithms. For the extraction of the weak feature index, current face detection methods fall into two categories: knowledge-based and statistics-based. To extract weak facial features, they locate facial feature points and then apply facial recognition. The paper adopts the adaptive boosting (AdaBoost) method for facial feature positioning.
The idea is that the algorithm tolerates large numbers of unsuccessful training samples, focusing learning on the difficult samples in subsequent rounds; finally, the weak classifiers selected by the algorithm are weighted and summed into a strong classifier. In summary, the paper uses weak appearance features to characterize consumers' fashion level and draws its conclusions by comparing the scientific experiment with expert evaluation, so both categories of evaluation are involved. The weak feature index comprises: make-up (eyebrow, blush, lips, eye shadow); accessories (neck accessories, hand accessories, brooch, nails, hat); and hair color (red, yellow, green, blue, brown, black, gray, white). The fashion level of users is determined from their make-up, the accessories they wear, and their hair color. A person with red-dyed hair or heavy make-up, for instance, can be assessed as more fashion-conscious; a person wearing dark eye shadow and dark lip color with streaked hair may indicate a higher fashion level, and products are recommended accordingly.

A fashion product is built up from multiple semantic attributes, for example sleeves and collars. When making decisions about clothes, preferences for different semantic attributes, such as a V-neck collar, a deep neck, pointed-toe shoes, or high heels, are considered.
Semantic attributes not only allow a comprehensive representation of products but also help us understand how user preferences work. There are, however, unique challenges inherent in designing efficient solutions that integrate semantic attribute information for fashion recommendation. It is difficult to obtain semantic attribute features without manual attribute annotation, especially at large e-commerce scale. Moreover, user preferences are fine-grained and sophisticated, while traditional methods usually transform an item image directly into a single vector. These two aspects make it very difficult to generate explainable recommendations with the current recommendation models [29, 64, 65] used in industry.

Approach. The paper [15] proposes a novel semantic attribute explainable recommendation system (SAERS) that introduces a fine-grained interpretable space, the semantic attribute space, in which each dimension corresponds to a semantic attribute. Users and items are projected into this space, and users' fine-grained preferences make it possible to generate explainable recommendations. Specifically, they first develop a semantic extraction network to extract region-specific attribute representations; each item is then projected into the semantic attribute space, where the diversity of semantic attributes is easily captured. The design includes a fine-grained preference attention (FPA) module, which automatically matches user preferences to each attribute in the space and aggregates all attributes with different weights.
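An attention-weighted aggregation of this kind can be sketched as follows: per-attribute match scores are turned into softmax weights, and the attribute representations are aggregated with those weights. This is a toy illustration of the mechanism, not the paper's learned FPA module, and all vectors are invented.

```python
import math

def softmax(xs):
    m = max(xs)                         # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fpa_aggregate(user_pref, attr_feats):
    """user_pref: preference vector; attr_feats: one vector per semantic
    attribute. Returns the attention-weighted aggregation of attributes."""
    scores = [sum(u * a for u, a in zip(user_pref, f)) for f in attr_feats]
    weights = softmax(scores)
    dim = len(attr_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, attr_feats))
            for d in range(dim)]

item_repr = fpa_aggregate([1.0, 0.0],
                          [[1.0, 0.0],    # e.g. a "collar" attribute vector
                           [0.0, 1.0]])   # e.g. a "sleeve" attribute vector
```

Because each attribute keeps its own weight, the weights themselves can serve as the explanation ("recommended mainly for its collar"), which is the interpretability argument made for SAERS.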
Each attribute thus has a weight of its own. Finally, they optimize the SAERS model in the Bayesian personalized ranking (BPR) framework, which not only significantly outperforms several baselines on the visual recommendation task but also provides interpretable insights by highlighting attribute semantics in a personalized manner. Previous attempts to capture users' visual preferences could only produce intuitive explanations for recommendations at the item level; this paper takes a further step and addresses user preferences at the visual attribute level. With their semantic attribute explainable recommendation system, they bridge this gap using a new semantic attribute visual space in which each dimension represents an attribute corresponding to a region: the different regions of a clothing item are split into several semantic attributes via the extraction network and then projected into the visual space. Users are then projected according to their fine-grained preferences for clothing attributes. This makes it easy to obtain the fashion-item projection in the semantic feature space, and the FPA (fine-grained preference attention) then projects users into the same space. The item representation is learned jointly in both the global visual space and the semantic attribute visual space under a pairwise learning framework, enabling explainable recommendations.

Traditional procedures for complementary product suggestions depend on behavioral and non-visual data such as customer co-views or co-buys. However, certain domains, such as fashion, are inherently visual.
Recommendation algorithms are important to many business applications, especially online shopping. In domains such as fashion, customers seek apparel recommendations that visually complement their current outfits, styles, and wardrobe, which conventional methods do not cater to.

Methods. There are traditional content-based and collaborative recommendation algorithms [1, 38]. Among these, collaborative filtering approaches [35, 45] are common: they rely primarily on behavioral and historical data, such as co-purchases, views, and past purchases, to suggest new items to customers. This work instead focuses on providing complementary item recommendations for a given query item based on visual cues. [20] proposes a framework that harnesses visual cues in an unsupervised manner to learn the distribution of co-occurring complementary items in real-world images. The model learns nonlinear transformations between two manifolds of source and target complementary item categories, for example a top and a bottom in an outfit. Training on a large dataset, they train a generative transformer network directly on the feature representation space by casting it as an adversarial optimization problem. Such a conditional generative model can produce multiple novel samples of complementary items in the feature space for a given query item. To this end, they develop an unsupervised learning approach for Complementary Recommendation using Adversarial Feature Transform (CRAFT) by learning the co-occurrence of item pairs in real images. The assumption is that the co-occurrence frequency of item pairs is a strong indicator of their complementary relationship.
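The co-occurrence assumption behind CRAFT can be illustrated directly: the more often two item categories appear together in real outfit images, the more likely they are complementary. The sketch below counts category pairs over a handful of made-up outfits; CRAFT itself learns this distribution in feature space rather than counting categories.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(outfits):
    """Count how often each unordered pair of categories co-occurs in an
    outfit. Higher counts suggest a stronger complementary relationship."""
    counts = Counter()
    for outfit in outfits:
        for a, b in combinations(sorted(set(outfit)), 2):
            counts[(a, b)] += 1
    return counts

outfits = [["blazer", "jeans"],
           ["blazer", "jeans", "loafers"],
           ["t-shirt", "jeans"],
           ["blazer", "chinos"]]
counts = cooccurrence_counts(outfits)
```

Counting discrete categories like this loses the visual nuance (a specific blazer, not blazers in general), which is exactly the gap the adversarial feature transform addresses by operating on image features.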
The paper describes an adversarial process to train a conditional generative transformer network, which learns the joint distribution of item pairs by observing samples from the real distribution. The approach is novel in that it uses generative adversarial training with several advantages over traditional generative adversarial network (GAN) [11] image generation. Although the quality of visual image generation using GANs has improved considerably, it still lacks the realism required for many real-world applications, fashion apparel among them. More importantly, the goal of recommendation systems in such applications is often not to generate synthetic images but to recommend real images from a catalog of items: an approach that generates synthetic images still needs to perform a search, typically in feature space, to find the most visually similar image in the catalog. CRAFT directly generates the features of the recommended items, bypassing synthetic image generation and enabling a simpler, more efficient algorithm. By working in feature space, they can use a simpler network architecture that improves stability during training and avoids common pitfalls such as mode collapse [4]. The encodings, which are fixed feature representations, are generally derived from pre-trained CNNs. Application-specific feature representations are typically advisable, for example apparel feature embeddings for clothing recommendations, but general representations such as those trained on ImageNet [7] or MS-COCO [40] offer efficient alternatives.
As shown in the figure, the source and target feature encoders, E_s and E_t respectively, are fixed and are used to generate feature vectors for training and inference. The architecture resembles traditional GAN designs, with two main components: a conditional feature transformer and a discriminator. The role of the feature transformer is to transform the source feature s into a complementary target feature t̂. The input to the transformer also includes a random noise vector z sampled uniformly from a unit sphere in a d-dimensional space; by design, the transformer is generative, since it can sample various features in the target domain. The transformer consists of several fully connected layers, each followed by batch normalization [22] and leaky ReLU [42] activation layers. The discriminator is commensurate with the transformer in capacity, consisting of the same number of layers; this balances the power between the transformer and the discriminator in the two-player game, leading to stable training and convergence. Given a query image, the query feature is extracted by the source encoder E_s, and multiple samples of transformed features t̂ are generated by sampling random vectors {z}. This allows a diverse set of complementary recommendations to be generated by sampling the underlying conditional probability distribution. A nearest-neighbor search is then performed within a set of pre-indexed target features extracted with the same target encoder E_t used during training, and the actual recommendation images are retrieved by a reverse lookup that maps the selected features to the original target images. Fig. 6. Illustration of the proposed scheme.
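The transformer/discriminator pair described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: layer sizes, feature dimensions, and the noise dimension are assumptions.

```python
import torch
import torch.nn as nn

class FeatureTransformer(nn.Module):
    """Conditional generator: maps a source feature plus a noise vector
    to a synthetic complementary target feature. Fully connected layers,
    each followed by batch norm and leaky ReLU, as the text describes."""
    def __init__(self, feat_dim=128, noise_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + noise_dim, hidden),
            nn.BatchNorm1d(hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden),
            nn.BatchNorm1d(hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim),
        )
    def forward(self, src_feat, z):
        return self.net(torch.cat([src_feat, z], dim=1))

class Discriminator(nn.Module):
    """Scores (source, target) feature pairs: real co-occurring vs. generated.
    Same number of layers as the transformer, to balance the two players."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )
    def forward(self, src_feat, tgt_feat):
        return self.net(torch.cat([src_feat, tgt_feat], dim=1))

src = torch.randn(8, 128)
z = torch.nn.functional.normalize(torch.randn(8, 16), dim=1)  # noise on the unit sphere
gen = FeatureTransformer()(src, z)
score = Discriminator()(src, gen)
print(gen.shape, score.shape)
```

Sampling several noise vectors z for the same source feature yields the diverse set of complementary candidates the paper relies on.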
They employed a dual autoencoder network to learn the latent compatibility space, jointly modeling the coherent relation between visual and contextual modalities and the implicit preference among items via Bayesian personalized ranking. The feature transformer in CRAFT samples from a conditional distribution to generate diverse and relevant item recommendations for a given query, and the recommendations generated by CRAFT are preferred by domain experts over those produced by competing approaches. Fashion communities are now online, and many fashion experts publicly share their fashion tips by showing how their outfit compositions work, where each item, a top or a bottom, usually has an image and contextual metadata (title and category). With such rich information, fashion data offers an opportunity to investigate the code of clothing matching. Color, material, and shape are some of the aspects that affect the compatibility of fashion items; each fashion item involves multiple modalities; and the composition relation between fashion items is rather sparse, which makes matrix factorization methods inapplicable. 2.6.1 Previous Methods. Recent advances have been made on these fashion aspects, but previous models [18, 24, 41, 43] were lacking in how they approached the subject. Approach. The paper [55] proposes a content-based neural scheme that models the compatibility between fashion items based on the Bayesian personalized ranking (BPR) framework. The scheme jointly models the coherent relation between the modalities of items and their implicit matching preference. In essence, it focuses on modeling the sophisticated compatibility between fashion items by seeking a nonlinear latent compatibility space with neural networks.
They also aggregate the multimodal data of fashion items and exploit the inherent relationship between different modalities to comprehensively model compatibility. Because fashion items are heterogeneous, their compatibility cannot be measured directly in distinct feature spaces. The authors therefore assume that there exists a latent compatibility space that bridges the gap between heterogeneous fashion items, where highly compatible items share a similar style, material, or functionality. For example, a wide casual T-shirt goes well with black jeans but not with a black suit, while a pair of high boots prefers skinny jeans over flared pants. They further assume that these subtle compatibility factors lie in a highly nonlinear space that can be learned by advanced neural network models, and they employ autoencoder networks, which have proven effective for latent space learning [62] . To fully exploit the implicit relation between tops and bottoms, they adopt the BPR framework and assume that bottoms from the positive set B+ are more favorable to a top t than the unobserved neutral bottoms. Following BPR, they build a training set of triples (t, b+, b-), each indicating that bottom b+ is more compatible with top t than bottom b-. Following [49] , the objective function sums -ln σ(m(t, b+) - m(t, b-)) over the training triples, where m denotes the predicted compatibility score; taking the modality consistency into consideration adds further terms weighted by non-negative trade-off hyperparameters. Θ refers to the set of network parameters (i.e., W and Ŵ), and the last regularizer term is designed to avoid overfitting.
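The BPR objective described above can be made concrete with a small NumPy sketch. The scores and the regularization weight here are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(score_pos, score_neg, params, lam=0.01):
    """BPR objective for triples (t, b+, b-): the compatibility score of
    the positive bottom b+ with top t should exceed that of the sampled
    negative b-. `params` is a flat vector standing in for the network
    parameters Theta; `lam` is the regularization trade-off."""
    rank_term = -np.log(sigmoid(score_pos - score_neg)).sum()
    reg_term = 0.5 * lam * np.sum(params ** 2)
    return rank_term + reg_term

# Two toy triples with compatibility scores m(t, b+) and m(t, b-)
pos = np.array([2.0, 1.5])
neg = np.array([0.5, 0.2])
loss = bpr_loss(pos, neg, params=np.zeros(4))
print(round(float(loss), 3))
```

Minimizing this loss pushes the positive bottom's score above the negative one's for every observed top, which is exactly the implicit-preference assumption stated above.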
The word aesthetic [47] was introduced in the 18th century, where it came to designate, among other things, a kind of object, a kind of judgment, a kind of attitude or experience, and a kind of value. The concept of the aesthetic descends from the concept of taste: in the 18th century, the theory of taste emerged in part as a corrective to the rise of rationalism, particularly as applied to beauty, and the rise of egoism, particularly as applied to virtue. People usually describe clothing with words like informal, casual, formal, or party, but the recent focus on recognizing or extracting the features visually present in clothing is quite different. To bridge this gap, the authors of [27] describe a way to connect visual features to aesthetic words. They formulate a novel three-level framework, visual features (VF) - image-scale space (ISS) - aesthetic words space (AWS), leveraging the image-scale space from the art field as an intermediate layer. They first propose a stacked denoising auto-encoder guided by correlative labels (SDAE-GCL) to map the visual features to the image-scale space; then the semantic distance is computed with WordNet similarity [48] , and the aesthetic words most often used by people in online clothing shops are mapped to the image-scale space. They employ upper-body menswear images downloaded from several online shops as experimental data, and the proposed three-level framework helps capture the relationship between visual features and aesthetic words.
It is important for people to dress aesthetically and properly; given a user-input occasion such as a wedding, shopping, or a date, a system should be able to suggest the most suitable clothing from the user's own wardrobe. A similar idea appears in [41] , where two criteria are explicitly considered: wearing properly and wearing aesthetically, for example that a red T-shirt matches white pants better than green pants. To narrow the semantic gap between the low-level features of clothing and the high-level occasion categories, clothing attributes are treated as latent variables in a support vector machine based recommendation model. Nevertheless, such matching rules cannot reveal aesthetic effects holistically and lack interpretability. Approach. The paper [27] aims to bridge the gap between the visual features and the aesthetic words of clothing. To capture the intrinsic and holistic relationship between them, it introduces an intermediate layer, forming a novel three-level framework based on the theory proposed by Kobayashi [33] , in which a two-dimensional space with warm-cool and hard-soft axes is applied to art design. The contribution of the paper is to build an association between clothing images and aesthetic words through this three-level framework: the novel use of the 2D continuous image-scale space as an intermediate layer with strong descriptive ability facilitates a deep, high-level understanding of aesthetic effects.
Second, the paper proposes a stacked denoising auto-encoder guided by correlative labels (SDAE-GCL) to implement the mapping from visual features to the image-scale space. It can correct random errors in the initial input and make full use of both labeled and unlabeled data; moreover, stacking improves the representation capability of the model by adding more hidden layers. Kobayashi organized 180 keywords into 16 aesthetic categories and defined their coordinate values in the image-scale space. Some of these words, such as alert, robust, sad, or happy, are unrelated to fashion and are not usually used to describe clothing, so the authors first manually removed the rarely used words and established an aesthetic word space for clothing containing 527 words. To map an aesthetic word w in this space to the image-scale space, i.e., to determine its coordinate value (wc, hs), the authors first compute the semantic distances between w and each of the 180 Kobayashi keywords using WordNet::Similarity. They then pick the 3 keywords with the shortest distances d1, d2, and d3, whose coordinate values are (wc1, hs1), (wc2, hs2), and (wc3, hs3), and take the reciprocals of the distances as weights (e.g., rec1 = 1/d1); the weighted arithmetic mean of the three coordinate values is taken as the coordinate value (wc, hs) of w. In this way, each word in the aesthetic word space receives a coordinate value in the image-scale space. To label an input clothing image with an aesthetic word, they use the proposed SDAE-GCL to predict its coordinate value (wc, hs) in the image-scale space.
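The reciprocal-distance weighting step can be sketched as follows. The keywords, their coordinates, and the semantic distances below are made up for illustration; in the paper the coordinates come from Kobayashi's image-scale space and the distances from WordNet::Similarity.

```python
import numpy as np

# Hypothetical Kobayashi-style keywords with (warm-cool, hard-soft) coordinates
keywords = {"casual": (0.6, 0.4), "formal": (-0.5, -0.6), "chic": (-0.2, 0.3)}

def word_coordinate(semantic_dists):
    """Place a new aesthetic word in the image-scale space: take the 3
    keywords with the shortest semantic distance and average their
    coordinates, weighted by reciprocal distance."""
    nearest = sorted(semantic_dists, key=semantic_dists.get)[:3]
    w = np.array([1.0 / semantic_dists[k] for k in nearest])       # rec_i = 1/d_i
    coords = np.array([keywords[k] for k in nearest])
    return tuple((w[:, None] * coords).sum(0) / w.sum())           # weighted mean

# Made-up semantic distances from a new word to each keyword
dists = {"casual": 1.0, "formal": 4.0, "chic": 2.0}
wc, hs = word_coordinate(dists)
print(round(wc, 3), round(hs, 3))
```

Closer keywords get larger weights, so the new word lands nearest the keyword it is semantically most similar to.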
They then find the word in the aesthetic word space whose coordinate value has the shortest Euclidean distance to the predicted one; that word is regarded as the aesthetic word of the image. Most existing methods rely on conventional features to represent an image, such as features extracted by convolutional neural networks, the scale-invariant feature transform algorithm, or color histograms. One important type of feature, however, is the aesthetic feature; as discussed before, it plays an important role in clothing recommendation, since users' decisions depend largely on whether the clothing is in line with their aesthetics. Methods. Several papers [13, 16, 43, 59] recommend fashion garments for an unfinished outfit, but their goal differs from the one addressed here: they focus on clean per-garment catalog photos, and the recommendations are mostly restricted to retrieving garments from a specific dataset, so the only operation those systems support is adding a garment. Most prior fashion work addresses recognition problems, like street-to-shop matching [28, 32, 61, 68] . Some problems, however, demand going beyond seeking an existing garment and adding it: some garments are detrimental and should be taken off, jeans might be cuffed above the ankle, or the presentation and detail of garments within a complete outfit might be adjusted to improve its style. To bridge this gap there are several methods; we discuss one [71] that introduces aesthetic information, which is highly relevant to users' preferences, into the clothing recommendation system.
The aesthetic features are extracted by a pre-trained network, a brain-inspired deep structure trained for the aesthetic assessment task. Since aesthetic preference varies significantly from user to user, they propose a new tensor factorization model that incorporates the aesthetic features in a personalized manner. Their experiments demonstrate that the approach captures users' aesthetic preferences and significantly outperforms state-of-the-art recommendation methods. When shopping for clothing on the web, we look through product images before deciding to buy, and product images provide a wealth of information, including design, color scheme, pattern, and structure; we can even estimate the thickness and quality of a product from its images. As such, product images play a key role in the clothing recommendation task, and the authors leverage this information to enhance the performance of existing clothing recommendation systems. An important factor, however, is that aesthetics has received little attention in previous research, even though most users' primary concern is that the product looks good. The authors extract the relevant features with an aesthetic network combined with a CNN: they propose a brain-inspired deep network (BDN), a deep structure trained for image aesthetic assessment, which takes as input several raw features indicative of aesthetic feelings, such as hue, saturation, color, duotones, and complementary colors.
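As a toy stand-in for the raw inputs just listed, the sketch below computes mean hue, saturation, and value from an RGB image with the standard library's `colorsys`. This is a simplification for illustration only; the actual BDN consumes many more hand-crafted raw features.

```python
import colorsys
import numpy as np

def raw_aesthetic_features(img):
    """Average per-pixel hue/saturation/value of an RGB image in [0, 1],
    a toy version of the low-level aesthetic cues fed to a BDN-style
    aesthetic network."""
    h, s, v = np.array([
        colorsys.rgb_to_hsv(*px) for px in img.reshape(-1, 3)
    ]).T
    return {"mean_hue": h.mean(), "mean_sat": s.mean(), "mean_val": v.mean()}

# A solid red swatch: hue 0, full saturation and value
red = np.tile(np.array([1.0, 0.0, 0.0]), (4, 4, 1))
feats = raw_aesthetic_features(red)
print(feats)
```

In the paper, raw cues of this kind are the inputs from which the network learns high-level aesthetic representations.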
From these raw features it extracts high-level aesthetics. The BDN is utilized to extract a holistic feature representing the aesthetic elements of a clothing item. Since different people have different aesthetic tastes, they exploit tensor factorization as the basic model to capture the diversity of aesthetic preference across consumers and over time. There are several ways to decompose a tensor, but existing models [34, 50, 51] have certain drawbacks. To address the clothing recommendation task better, they propose a dynamic collaborative filtering (DCF) model trained with coupled matrices to mitigate the sparsity problem, combine it with the Bayesian personalized ranking optimization criterion, and evaluate performance on an Amazon clothing dataset. In short, they propose a novel DCF model that portrays purchase events in three dimensions, user, item, and time, and then incorporate the aesthetic features into DCF and train it. They leverage the novel aesthetic features in recommendation to capture consumers' specific aesthetic preferences and compare their effect with several conventional features to demonstrate the necessity of the aesthetic features. To illustrate the hybrid model that integrates image features into the basic model (DCFA), they first introduce the basic tensor factorization model, DCF, whose starting point is the impact of time on aesthetic preference: they propose a context-aware model as the basic model to account for the temporal factor. They use a P × Q × R tensor A to indicate purchase events along the user, clothing, and time dimensions: the entry for user p, item q, and time interval r is 1 if user p purchased item q in interval r, and 0 otherwise.
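The purchase tensor just described can be built in a few lines. A minimal sketch, with a made-up event list:

```python
import numpy as np

def purchase_tensor(events, n_users, n_items, n_intervals):
    """Binary user x item x time tensor A: A[p, q, r] = 1 iff user p
    bought item q during time interval r."""
    A = np.zeros((n_users, n_items, n_intervals))
    for p, q, r in events:
        A[p, q, r] = 1.0
    return A

# Two toy purchase events: (user, item, time interval)
A = purchase_tensor([(0, 2, 1), (1, 0, 0)], n_users=2, n_items=3, n_intervals=2)
print(A.sum(), A[0, 2, 1])
```

In practice this tensor is extremely sparse, which is why the paper trains the factorization with coupled matrices.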
Tensor factorization is widely used to predict the missing (zero) entries of A, which can then be used for recommendation. As previous models have limitations, they propose a new tensor factorization method based on the observation that a purchase decision involves two primary factors: whether the product fits the user's preference and looks appealing to that specific user, and whether the time is right, i.e., the item is in season and fashionable; winter clothing, for example, should not be recommended in summer. For user p, clothing q, and time interval r, they use scores s1 and s2 to indicate how much the user likes the clothing and how well the clothing fits the time, respectively: s1 = 1 when the user likes the clothing and s1 = 0 otherwise; similarly, s2 = 1 if the clothing fits the time and s2 = 0 otherwise. The consumer buys the clothing only if s1 = 1 and s2 = 1, so A = s1 & s2; to make the formula differentiable, they approximate it as A = s1 · s2, and they represent s1 and s2 in the form of matrix factorizations, from which the prediction follows. The latent features relating users and clothes are independent of those relating clothes and time: although the d1-dimensional vector V and the d2-dimensional vector W are both latent features of clothing q, V captures information about users' preferences while W captures the temporal information of the clothing. The model is thus more expressive in capturing the underlying patterns in purchases, and it is efficient and easy to train compared with the Tucker decomposition. We know that the physical attributes of a product strongly influence buying behavior [56] , and that aesthetics calls to us intuitively while we shop.
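A minimal numerical sketch of the factorized prediction described above, with random latent factors and illustrative dimensions (sigmoids and scaling omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
P, Q, R, d1, d2 = 3, 4, 2, 5, 6   # users, items, time intervals, latent dims

# Latent factors: (U, V) relate users to items, (W, X) relate items to time
U, V = rng.normal(size=(P, d1)), rng.normal(size=(Q, d1))
W, X = rng.normal(size=(Q, d2)), rng.normal(size=(R, d2))

def predict(p, q, r):
    """DCF-style score: s1 is the user-item preference, s2 the item-time
    fit, and the predicted purchase score is their product s1 * s2."""
    s1 = U[p] @ V[q]
    s2 = W[q] @ X[r]
    return s1 * s2

# One user's scores over all items at time interval 0
scores = np.array([[predict(0, q, 0) for q in range(Q)]])
print(scores.shape)
```

Note that V and W are separate latent vectors for the same item, matching the independence between user-item and item-time factors stated above.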
A shopper may not even be aware of making multiple decisions on every product, for example liking the style but not the color. Various aspects of our lives influence the style of how we dress, and every look we wear tells a different story about us: it communicates an image that is decoded by others within their own cultural context, so it is possible that the aesthetics of a garment is similar for everyone in a particular society. 3.3.1 Background. When we look at a garment, the usual questions are: can I wear it? What occasion would it suit? How does it make me feel? And how does it reflect my own personality? These are just a few of the questions we ask ourselves when shopping and when we want to wear clothes that are aesthetically pleasing. In this modern era, minimalism is entering every aspect of life, and people tend to move toward simpler but aesthetically pleasing versions. As Coco Chanel said: "before you leave the house look in the mirror and take one thing off". Fig. 9. Overview of the Fashion++ framework: latent features are obtained from texture and shape encoders; an editing module operates on the latent texture and shape features; after an edit, the shape generator decodes the updated shape feature back to a 2D segmentation mask, which is used to region-wise broadcast the updated texture feature into a 2D feature map; this feature map and the updated segmentation mask are passed to the texture generator to produce the final updated outfit. Minimal outfit edits can be applied to an existing outfit to improve its fashionability.
Whether it is removing an accessory, selecting a blouse with a higher neckline, tucking a shirt in, or simply changing the pants to a darker color, these small adjustments add up to a more stylish outfit that is more aesthetically pleasing, both to a large group of people and to oneself. Approach. Motivated by these observations, the authors of [17] pursue minimal edits for fashion outfit improvement: an algorithm must propose alterations to the garments and accessories that are slight yet visibly improve the overall fashionability. A minimal edit need not strictly minimize the amount of change; rather, it incrementally adjusts an outfit instead of starting from scratch. It can be a recommendation of which garment to replace, take off, or swap out, or simply how to wear the same garment in a better way. Clothing fashion is largely intuitive and often a habitual trend in the style in which an individual dresses, but it is not clear which visual stimuli place higher or lower influence on the evolution of clothing fashion trends. Another paper [74] employed machine learning techniques to analyze the influence that the visual stimuli of different clothing fashions have on fashion trends; specifically, they proposed a classification-based model that quantified the influence of each visual stimulus by its corresponding accuracy in fashion classification.
Their experimental results, quantifying style, color, and texture, demonstrated that for clothing fashion updates style holds a higher influence than color, and color a higher influence than texture; all of these matter for determining aesthetics as well. The main idea of the minimal-edit model is an activation maximization method that works on localized encodings from a deep image generation network. Given an original outfit, its composing pieces, for example the bag, boots, jeans, and blouse, are mapped to their respective codes. A discriminative fashionability model then performs the editing, gradually updating the encodings in the direction that maximizes the outfit score, hence improving its style. The update trajectory offers a range of edits, from the least changed to the most fashionable, from which users can choose a preferred endpoint. The approach provides its outputs in two formats: (1) retrieved garments from an inventory that would best achieve its recommendation, and (2) a rendering of the same person in the newly adjusted look, generated from the edited outfit's encoding. In short, they present an image generation framework that decomposes outfit images into their garment regions and factorizes shape/fit and texture in support of these objectives; in this framework, the coordination of all composing pieces defines an outfit's look.
The framework can control which parts change, such as the pants, the skirt, or the shirt, and which aspects, such as sleeve length, color, pattern, and neckline, while keeping identity and fashion-irrelevant factors unchanged. To model spatial locality explicitly and perform minimal edits, they need to control each piece's texture as well as its shape. Texture covers what an outfit is made of: denim with solid patterns gives a more casual look, while red leather gives a more street-style look. With the same material, color, and pattern, how garments are worn, tucked in or pulled out, skinny or baggy pants, a v-neck, turtleneck, or boatneck cut, makes them complement a person's silhouette in different ways. They account for all of these factors and devise an image generation framework that gives control over individual pieces, accessories, and body parts, and factorizes shape from texture. Computing an edit involves two main steps: calculating the desired edit, and generating the edited image. For the calculation, they take an activation maximization approach, iteratively altering the outfit's features to increase the activation of the fashionable label according to the fashionability classifier f. Formally, let z(0) := {t0, s0, ..., t_{n-1}, s_{n-1}} be the set of all texture and shape features in an outfit, and let z̃(0) ⊆ z(0) be the subset of features corresponding to the target regions or aspects being edited (e.g., the shirt region, the shape of the skirt, the texture of the pants).
The updated outfit representation is z̃(k+1) = z̃(k) + η ∂f(z(k))/∂z̃(k), where z̃(k) denotes the edited features after k updates, z(k) denotes substituting only the target features in z(0) with z̃(k) while keeping the other features unchanged, f = P(y = 1 | z(k)) denotes the probability of fashionability according to classifier f, and η denotes the update step size. Each gradient step yields an incremental adjustment to the input outfit. This approach makes slight yet noticeable improvements, outperforming baseline methods in both quantitative evaluation and user studies, and it communicates effectively to users through image generation, supporting all possible edits from swapping, adding, and removing garments to adjusting outfit presentation, as the qualitative examples show. Mark Twain said that the "finest clothing made is a person's skin", but of course society demands something more than this. Fashion has a tremendous impact on our society: clothing reflects a person's social status and thus puts pressure on how they dress to fit a particular occasion. For this reason the authors of [53] analyze the clothing fashion of a large social website, aiming to learn and predict how fashionable a person looks in a photograph and to suggest subtle improvements they can make to their image and appeal. Methods. Their approach is related to recent work [8, 12, 23, 31] on modeling the human perception of beauty, which addresses what makes a particular image memorable, interesting, or popular to viewers. That line of work typically mines large image datasets to relate visual cues to popularity scores. This paper instead tackles the problem of predicting fashionability.
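The gradient-ascent update above can be sketched with a toy differentiable classifier. Here a logistic model over a 3-dimensional feature vector stands in for the learned fashionability classifier (an assumption for illustration; the real f operates on shape and texture encodings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy fashionability classifier f(z) = P(y = 1 | z) = sigmoid(w . z)
w = np.array([1.0, -2.0, 0.5])

def minimal_edit(z, steps=10, eta=0.5):
    """Activation maximization: nudge the editable features z up the
    gradient of the fashionability probability, producing a trajectory
    of increasingly fashionable but minimally changed encodings."""
    traj = [z.copy()]
    for _ in range(steps):
        p = sigmoid(w @ z)
        z = z + eta * p * (1 - p) * w   # gradient of sigmoid(w . z) w.r.t. z
        traj.append(z.copy())
    return traj

traj = minimal_edit(np.zeros(3))
print(sigmoid(w @ traj[0]) < sigmoid(w @ traj[-1]))  # → True: score increases
```

As in the paper, the intermediate points of `traj` form a range of edits from least changed to most fashionable, from which a user could pick an endpoint.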
They go a step further than previous work by identifying the high-level semantic properties that cause a particular aesthetic score, which can then be conveyed to users so they can improve their outfit or look. This work is closest to [30] , which inferred whether faces are memorable and then modified them to become more so, although the present work differs in both domain and formulation. Approach. To model the perception of fashionability, they propose a conditional random field model that jointly reasons about several fashionability factors: the type of outfit and garments the individual is wearing, the type of user, the setting of the photograph (for example, the scenery), and the fashionability score. Based on this, they recommend to the user which garments or scenery to change in order to improve fashionability. The paper predicts how fashionable a person looks in a particular photograph. Fashionability is affected by the clothes the subject is wearing, but also by many other factors: how appealing the scene containing the person is, how the image was taken, how visually appealing the person is, and their age. The garments being fashionable is not by itself a perfect indicator of someone's fashionability, as people typically judge how well the garments align with someone's look, body, characteristics, or even personality. The proposed model therefore exploits several domain-inspired features, including beauty, age, and mood inferred from the image, the scene and type of photograph, and, if available, metadata such as where the user is from, how many online followers they have, and the sentiment of comments by other users.
For this they created their own dataset from different online sources. The impact fashion has on our daily lives is reflected in the growing interest in clothing-related applications in the vision community; early work [26, 52, 66, 67, 69] focused mainly on clothing parsing over a diverse set of garment types. The paper's objective was to predict the fashionability of a given post, but the authors also wanted a model that understands fashion at a higher level. For that purpose they built a conditional random field (CRF) over the different outfits, types of people, and settings, where a setting describes the location where the picture was taken, at both a scenic and a geographic level. They used their own Fashion144k dataset of images and metadata to produce accurate predictions of how fashionable a person is. More formally, let u ∈ {1, ..., U} be a random variable capturing the type of user, o ∈ {1, ..., O} the type of outfit, s ∈ {1, ..., S} the setting, and f ∈ {1, ..., 10} the fashionability of a post x. The energy of the CRF is the sum of unary energies for each variable and non-parametric pairwise potentials reflecting the correlations between the different random variables. Output. An exciting property of this model is that it can be used for outfit recommendation: taking a post as input, they estimate the outfit that maximizes fashionability while keeping the other variables fixed. In other words, they predict what the user should be wearing instead of the current outfit in order to improve their look, and this is just one example of the flexibility of the approach.
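The energy structure and the recommendation-by-conditioning trick can be illustrated with a reduced toy CRF. The potentials below are random placeholders, and only one pairwise term (outfit vs. fashionability) is kept; the actual model learns its potentials from data and includes more pairwise terms.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
NU, NO, NS, NF = 3, 4, 2, 10  # user types, outfit types, settings, fashionability levels

# Unary potentials for each variable, plus one pairwise table linking
# outfit type and fashionability (other pairwise terms omitted)
phi_u, phi_o, phi_s, phi_f = (rng.normal(size=n) for n in (NU, NO, NS, NF))
psi_of = rng.normal(size=(NO, NF))

def energy(u, o, s, f):
    """CRF energy: sum of unaries plus a pairwise outfit-fashionability term."""
    return phi_u[u] + phi_o[o] + phi_s[s] + phi_f[f] + psi_of[o, f]

# Outfit recommendation: fix the user type and setting, then pick the
# (outfit, fashionability) pair with minimal energy.
best = min(itertools.product(range(NO), range(NF)),
           key=lambda of: energy(0, of[0], 1, of[1]))
print(best)
```

Fixing u and s while optimizing over o mirrors the paper's recommendation mode: estimate the outfit that maximizes fashionability while the other variables stay as observed.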
They also suggested that the same model could answer other questions, such as which outfit would fit poorly, which location would best suit the current outfit, or which type of user the outfit fits the most. One of the key aspects of fashion is personalization: recommendations intended for a specific individual, based on their likes and dislikes and what they consider good for them. The fashion e-commerce industry worldwide was expected to hit the 35-billion-dollar mark by 2020, so there is a need for applications that help users make intelligent decisions on their day-to-day purchases, or systems that recommend items personalized to their liking. Deep neural networks are well suited to this challenge, and one such system is FashionNet [14], which consists of two components: a feature network for feature extraction and a matching network for compatibility computation. The former is a deep convolutional network; for the latter the authors adopt a multi-layer fully connected network structure, and they design and compare three alternative architectures for FashionNet. To achieve personalized recommendations, they develop a two-stage training strategy that uses fine-tuning to transfer a general compatibility model to a model that embeds personal preference. Methods. Existing recommender systems depend heavily on collaborative filtering (CF) techniques, which use the historical ratings given to items by users as the sole source of information for learning, and whose performance is therefore very sensitive to the sparsity of the user-item matrix.
The recent progress of deep neural networks provides a promising solution to the representation problem of image content [3, 36, 37, 57]. This paper explores the use of deep neural networks for outfit recommendation, and specifically for personalized outfit recommendation. Two key problems arise: modeling the compatibility among multiple fashion items, and capturing a user's personal interest. The former is solved by first mapping item images to a latent semantic space with a convolutional neural network; for the latter the authors adopt a multi-layer fully connected network structure. They also study alternative architectures that combine feature learning and compatibility modeling in different ways. For the personalization problem, they encode user-specific information in the parameters of the network. Although each user may have a unique personal taste, users follow some general rules when composing outfits, and the usually small number of training samples per individual user makes it important to borrow training data from other users who share similar tastes. With these observations in mind, the authors adopt a two-stage training strategy: the first stage learns a general compatibility model from the outfits of all users, and the second stage fine-tunes the general model with each user's own data. Fine-tuning is an important technique for training deep neural networks in applications with a limited number of training samples. Approach. In their approach the authors assume that heterogeneous fashion items can be grouped into n categories.
For example, the three most common categories are shoes, tops, and bottoms, and an outfit is a collection of fashion items usually coming from different categories; an outfit can thus consist of a bottom, a top, and a pair of shoes.

Fig. 11. Network architectures

Given historical data, the authors assign each user-outfit pair a rating score reflecting the level of affection the user has for the outfit: the higher the score, the more appealing the outfit is to the user, and the highest-scoring outfits are recommended. The rating for a user-outfit pair is determined by how well the items in the outfit go with each other; a red shirt with black slacks or tight jeans, say, may go well together, whereas a red shirt with a yellow skirt may not. The authors design appropriate deep neural network structures to model the interactions among these items, and achieve personalization through a two-stage training strategy that embeds user-specific preferences in the parameters of the network. They explore three network architectures, named FashionNet A, B, and C, and, without loss of generality, assume an outfit consists of three items: a top, a bottom, and a pair of shoes. In FashionNet A, the item images are first concatenated into a new image with nine color channels, and the compound image is forwarded to the widely used CNN model VGGNet. The output layer is a fully connected layer with a softmax activation. In this architecture, representation learning and compatibility measurement are fully integrated: the two are carried out simultaneously from the first convolutional layer.
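The input construction for FashionNet A can be sketched as follows: three RGB item images are stacked along the channel axis to form the nine-channel compound image that the CNN consumes. The helper name is hypothetical.

```python
import numpy as np

def stack_outfit_channels(top, bottom, shoes):
    """Concatenate three RGB item images (H, W, 3) along the channel
    axis to form the 9-channel input that FashionNet A feeds to its CNN."""
    for img in (top, bottom, shoes):
        assert img.ndim == 3 and img.shape[2] == 3, "expect H x W x 3 arrays"
    return np.concatenate([top, bottom, shoes], axis=2)  # shape (H, W, 9)
```

Because the channels are fused before the first convolution, every filter in the first layer sees all three garments at once, which is exactly the "fully integrated" property described above.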
In FashionNet B, representation learning and compatibility measurement are applied sequentially: the images are first mapped to feature representations through a feature network, with the same CNN model used for items from all categories. To model compatibility, the features of all items are concatenated and fed to three fully connected layers; the authors show that this network structure has the capacity to approximate the underlying compatibility among multiple features. FashionNet A and B both try to model the compatibility among multiple items directly, which runs into difficulties when capturing high-order relationships, since the dimensionality grows significantly when all items are concatenated. Because of this dimensionality issue, a huge number of training samples may be required to learn a good model, and although users on the internet have contributed many outfit ideas, their number is still small compared to the number of all possible outfits. To overcome this problem, the authors impose a prior constraint in FashionNet C: they assume that the compatibility of a set of items is determined mainly by how well each pair of items goes together. The outputs of the final layers, representing the probabilities that the item pairs match, are then added together to give a final score for the whole outfit. The learning task is formulated as a learning-to-rank problem. A training sample contains two outfits, e.g. $(t^+, b^+, s^+)$ and $(t^-, b^-, s^-)$, where the former is preferable to the latter. A two-tower structure is used to train the networks, with a rank loss that pushes the score of the preferred outfit above that of the other. In training, an individual user usually has only a small number of outfits, and each user may have their own preference.
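The FashionNet C scoring and the pairwise ranking objective above can be sketched numerically. The sum-of-pairwise-probabilities score follows the description in the text; the logistic rank loss below is one common choice for a two-tower ranking setup, and the paper's exact loss may differ.

```python
import numpy as np

def outfit_score(pair_probs):
    """FashionNet C style score: sum of pairwise match probabilities
    (top-bottom, bottom-shoes, top-shoes) for one outfit."""
    return float(np.sum(pair_probs))

def rank_loss(score_pos, score_neg):
    """Pairwise logistic rank loss: small when the preferred outfit
    (score_pos) outscores the less preferred one (score_neg)."""
    return float(np.log1p(np.exp(-(score_pos - score_neg))))
```

Minimizing this loss over many (preferred, less-preferred) outfit pairs drives the network to rank user-created outfits above neutral ones.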
There are also general rules that most people follow when making an outfit; for example, t-shirts are usually paired with jeans. With these observations, the authors design a two-stage procedure to train the deep network for personalized outfit recommendation. In the first stage they learn a general model for compatibility: they discard user information and mix together the outfits created by different users, and they create "neutral" outfits by combining randomly selected fashion items. It is reasonable to assume that the items in a user-created outfit are more compatible than those in a neutral outfit, so training samples can be made by pairing a user-generated outfit with a neutral one. They initialize the VGGNet parameters from a model pre-trained on ImageNet, initialize the other layers with random numbers drawn from a Gaussian distribution, and then optimize the whole network on the mixed dataset. In the second stage, the authors train a user-specific model for personalized recommendation: for each user, they first initialize the network with the parameters obtained from the general training and then fine-tune the parameters on that user's own data. Fine-tuning is very important here, as it alleviates the data-insufficiency problem common to many applications. For FashionNet A, the whole network is fine-tuned in this stage. For FashionNet B and C, two strategies were used. The first was to fine-tune the whole network, so that both the feature network and the matching network have personalized parameters; this results in different feature representations of each item for different users.
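The second-stage personalization step can be sketched as a single parameter update in which a boolean switch selects which parameter groups receive gradients. The parameter names and naming convention are hypothetical; they stand in for the feature-network and matching-network weights.

```python
import numpy as np

def fine_tune_step(params, grads, lr=0.01, freeze_features=True):
    """One personalization update. With freeze_features=True only the
    matching-network weights move, so item features stay shared across
    users; with False the whole network is personalized."""
    new = {}
    for name, w in params.items():
        frozen = freeze_features and name.startswith("feature_")
        new[name] = w if frozen else w - lr * grads[name]
    return new
```

Keeping the feature network frozen means item features can be computed once and cached for all users, which is the source of the test-time savings discussed next.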
The second strategy was to freeze the feature network and fine-tune only the matching network. The features then stay the same for all users and the user-specific information is carried only by the matching network, which saves a lot of computation during testing, a favorable property in practice. In the end, the authors found that FashionNet A performed worse than the other two architectures, FashionNet B and C. One possible reason B and C gained such an advantage is that representation learning and compatibility modeling were performed separately, so different network structures could be used to achieve the different functionalities; such networks are also easier to design and optimize. Another approach to personalization is generative adversarial training. In [70], a convolutional network is first used to map the query image into a latent vector representation; this latent representation, together with another vector characterizing the user's style preference, is fed as input to a generator network that generates the target item image. A few earlier works [18, 65] showed that a personalized model is more capable of picking suitable outfits, or generated new item images of some category for a particular user, but no query item was provided in their settings and they did not consider the compatibility between items. Method. Two discriminator networks are built to guide the generation process: one is the classic real/fake discriminator, and the other is a matching network that simultaneously models the compatibility between fashion items and learns the preference representations.
When the given inventory is limited, there may not be enough good items to complement the query; and when the inventory is too large, generating recommendations may face efficiency problems. The paper therefore suggests synthesizing images of new items that are compatible with a given one. This solves the deficit problem for small inventories, and for large inventories, where returning real items is necessary, one can search for items similar to the synthesized ones, which is far more efficient than exhaustive compatibility evaluation, since similarity search can be very fast with techniques like hashing. Aside from general compatibility, the authors also consider personalization, which, as already discussed, is an important trend: given the same query item, different people would choose different items that match their own personal style. While personalized recommendation is prevalent in areas like movies, songs, and books, fashion recommenders are still rarely user-specific. The proposed system is personalized using the generative adversarial training framework (GANs). Generative adversarial networks have achieved great success in synthesizing realistic images for different applications. Here, an encoder network first maps the query image into a latent vector representation; this representation, together with another vector characterizing the user's style preference, is given as input to the generator network that generates the target item. The approach is as follows: the task of personalized fashion design is to develop a fashion item for a specific individual, given an input query item.
There are two general requirements for this design. The first is realness: the designed item should look realistic. The second is compatibility: the designed item should be compatible with the query item. The generator uses an encoder-decoder architecture; one discriminator provides real/fake supervision, and the other performs compatibility prediction. Many challenges in e-commerce arise from the fact that new products are continuously being added to the catalog; the challenge is then to properly personalize the customer experience, forecast demand, and plan the product range. 4.3.1 Background. The paper [75] concerns a global e-commerce company that creates and curates clothing and beauty products for fashion lovers. Over the years it has accumulated more than 1 million unique styles. For each product, different divisions within the company produce and consume different product attributes; these attributes are mostly manually curated, so information is sometimes missing or wrongly labeled. Even incomplete information, however, carries a lot of potential value for the business: the ability to characterize a product systematically and quantitatively is one of the keys for the company to make data-driven decisions across a set of problems including personalization.

Fig. 13. Schematic view of the multi-task attribute prediction network

The paper shows how to predict a consistent and complete set of product attributes, and illustrates how this enables personalizing the customer experience by surfacing more relevant products. Approach. The proposed model extracts attribute values from product images and textual descriptions.
In terms of image processing: fashion is a predominantly visual business, and visual features are at the core of many data science products; the company uses image features in many of its applications. To minimize computational cost, they implemented a centralized visual-feature generation pipeline that uses a pre-trained convolutional neural network to extract product representations from images. For text processing: CNNs were originally applied to images, which are treated as matrices of pixel color values, but the same convolutions can be applied to other types of matrices, in particular paragraphs of text. So, just as they process images to produce product representations, they use the same technique on text descriptions. For multimodal fusion, the image and text representations are simply concatenated within a neural network that is trained to predict the product attributes; this is a common and straightforward way to fuse different inputs, and it works well in practice. The primary focus of the paper's design was a solution that deals with missing labels at scale. The authors argue that the foundational piece for solving all of these problems is having consistent and detailed information about each product, which is rarely available, and they show that a quantitative understanding of the products can be used to improve recommendations in a hybrid recommender system. They could have built a separate model for each attribute, but then they would have to maintain multiple models in production; independent models would also be oblivious to correlations between attribute values and would only work well for common attributes with enough training data.
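The concatenation-based fusion described above can be sketched as follows, with a single linear layer standing in for the shared trunk and one attribute head; the function and weight names are illustrative, not the paper's architecture.

```python
import numpy as np

def fuse_and_predict(img_vec, txt_vec, W, b):
    """Late fusion by concatenation: the image and text representations
    are stacked into one vector and passed through a linear layer plus
    softmax, standing in for one of the multi-task attribute heads."""
    fused = np.concatenate([img_vec, txt_vec])
    logits = W @ fused + b
    # softmax over the attribute's possible values
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

In a multi-task setup, each attribute head would apply its own (W, b) to the same fused vector, which is how parameters get shared across attributes.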
Alternatively, they could have built a single model to predict all attributes at once; however, few products are fully annotated, so there would not have been enough data to train such a model. For these reasons they chose to cast attribute prediction as a multi-task learning problem: training a neural network for each attribute while sharing most of the parameters between networks. Approach. The hybrid approach incorporates several state-of-the-art advances in recommender systems; it not only incorporates new products but also enhances the recommendations customers receive overall. It creates an embedding of products, i.e. a representation of all the products in the catalogue in a high-dimensional vector space, in which products with similar styles and attributes lie closer together than unrelated ones. When producing personalised recommendations, the algorithm also assigns a vector to every customer, and the items with the highest inner product with the customer vector are recommended. The positions of products and customers in this space are determined not only by customer-product interactions but also by the augmented product attributes, which ensures that newly added products are positioned correctly in the space and can be recommended to the right customers. Another paper [5] proposes a personalized outfit generation (POG) model that connects user preferences regarding individual items and outfits with the Transformer architecture. Extensive offline and online experiments provided strong quantitative evidence that the proposed method outperforms alternative methods on both compatibility and personalization metrics; it can generate compatible and personalized outfits based on a user's recent behavior.
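The inner-product scoring used by the hybrid recommender above can be sketched in a few lines: every product is scored against the customer vector and the top-k indices are returned. The function name is illustrative.

```python
import numpy as np

def recommend(customer_vec, product_vecs, k=3):
    """Score every product by its inner product with the customer
    vector and return the indices of the k highest-scoring products."""
    scores = product_vecs @ customer_vec  # one score per product row
    return np.argsort(-scores)[:k]
```

Because new products are embedded from their (predicted) attributes, they get meaningful scores immediately, without waiting for interaction history.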
Specifically, they use a Transformer encoder-decoder architecture that models both user-preference and outfit-compatibility signals; interestingly, this is one of the first studies to generate personalized outfits from user historical behavior within an encoder-decoder framework. They also developed a platform named IDA, where POG has been deployed to support outfit generation and recommendation at very large scale in the iFashion application. There are several methods for generating a fashion outfit that a user will like, and they usually fall into two types. The first type focuses on calculating a pairwise compatibility metric [43, 54, 60]. The second type models an outfit as a set or an ordered sequence; for instance, some models [39] classify a given outfit as popular or unpopular, while others train a bi-directional LSTM [13] to generate outfits sequentially. All these methods generally use a simple pooling of item vectors to represent an outfit, or rely on the order of the outfit's items. Methods in either category hardly consider all the interactions between the items in an outfit, and it is arguably unreasonable to treat an outfit as an ordered sequence, because shuffling the items in an outfit should make no difference to its compatibility. Approach. The authors want to incorporate this explicitly into their architecture by requiring that each item have a different interaction weight with respect to every other item in an outfit; for example, a red shirt should have a higher interaction weight with blue or black jeans, but a smaller weight with a pair of white gloves.
The model they propose is built as a three-step process: first, the items are embedded; second, they build an FOM that learns the compatibilities of items within an outfit; third, once training is complete, they use the pre-trained FOM to initialize the POG Transformer architecture. Items are represented using a multi-modal embedding model: for every fashion item $f$, a non-linear feature embedding $\mathbf{f}$ is computed. The concept of fashion relies on both visual and textual information, and previous models [13, 39] used image and text to learn multi-modal embeddings. Here, the multi-modal embedding model takes the following inputs for every item: (1) a dense vector encoding the white-background picture of the item from a CNN model; (2) a dense vector encoding the title of the item, obtained from a TextCNN network pre-trained to predict an item's leaf category from its title; (3) a dense vector encoding a collaborative-filtering signal for the item, using Alibaba's proprietary Behemoth graph-embedding platform, which generates item embeddings from the co-occurrence statistics of items in recorded user click sessions in the Taobao application. The generation model then produces personalized and compatible outfits by introducing user-preference signals: taking advantage of the encoder-decoder structure, it translates a user's historical behavior into a personalized outfit. Let U denote the set of all users and F the set of all outfits. A sequence of user behaviors $c = \{c_1, \ldots, c_i, \ldots, c_n\}$ characterizes a user, where the $c_i$ are the items clicked by the user, and $F = \{f_1, \ldots, f_t, \ldots, f_m\}$ is a clicked outfit from the same user, where the $f_t$ are the items in the outfit.
At each time step, the model predicts the next outfit item given the previous outfit items and the user's click sequence. Thus, for a pair $(c, F)$, the objective function of POG can be written as

$$\max_{\Theta} \sum_{t} \log \Pr(f_{t+1} \mid f_1, \ldots, f_t, c_1, \ldots, c_n; \Theta),$$

where $\Theta$ denotes the model parameters and $\Pr(\cdot)$ is the probability of seeing item $f_{t+1}$ conditioned on both the previous outfit items and the user's clicked items. In POG, the encoder takes the user's clicked items as input, and decoding starts from a special token [start]; the decoder then generates an outfit one item at a time, at each step autoregressively consuming the previously generated items as input. Generation stops when a special token [end] appears, and the output items are composed into the final outfit. In the figure, the encoder is termed the PER network and the decoder the Gen network: the PER network provides the user-preference signal, and the Gen network generates outfits based on both the personalization signal and the compatibility signal. The Gen network is initialized using the aforementioned pre-trained FOM. Social media has been a great source for fashion recommendation and fashion promotion, providing an open, new data source for personalized fashion analysis. 4.5.1 Background. The paper [73] studies the problem of personalized fashion recommendation with data gathered from social media: recommending new outfits to social media users that fit their fashion preferences. It presents an item-to-set metric learning framework that learns to compute the similarity between a set of a user's historical fashion items and a new fashion item.
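The decoding loop described above for POG can be sketched generically: starting from the [start] token, the next-item predictor (which in POG conditions on the user's clicks as well, folded into `next_item_fn` here as an assumption) is applied repeatedly until [end] appears.

```python
def generate_outfit(next_item_fn, start_token="[start]",
                    end_token="[end]", max_len=10):
    """Greedy autoregressive decoding: repeatedly feed the items
    generated so far to the predictor and stop at the end token."""
    generated = [start_token]
    for _ in range(max_len):
        nxt = next_item_fn(generated)
        if nxt == end_token:
            break
        generated.append(nxt)
    return generated[1:]  # drop the start token
```

The `max_len` cap guards against a predictor that never emits the end token, a standard safeguard in sequence generation.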
To extract features from a multi-modal street-view fashion item, the authors propose an embedding module that performs multi-modality feature extraction and cross-modality gated fusion. By studying personalized fashion recommendation with social media data, they seek to recommend new fashion outfits based on the activities of social network users. Many studies [18, 19, 24, 25, 32] address clothing retrieval and recommendation, but leveraging a user's social media interactions for fashion recommendation is still challenging and much less explored. What can typically be gathered from social media are online activities such as street-view selfies with accompanying text descriptions, so the granularity of such data is much coarser, and most existing models [39, 58] are not directly applicable to the task due to their need for supervision. 4.5.3 Proposed Approach. The paper proposes a self-supervised approach for effective and personalized fashion recommendation, dividing the pictures into two categories: the selfie posts of a user, a set that reveals their personal fashion preferences; and the outfit items to be recommended. They propose learning an item-to-set metric that measures the similarity between a set and an item for personalized recommendation: the item-to-set distance is minimized for the set and items of the same user, while being maximized for items of different users. Benefiting from this framework, they can perform personalized recommendation without requiring any additional supervision. While metric learning is well studied in the literature, learning such an item-to-set metric was previously unexplored.
It therefore poses new challenges: a user can be interested in more than one fashion style, not just the one depicted in a given picture, so item-to-set similarity cannot be captured by an oversimplified average of multiple item-wise similarities; likewise, a nearest-neighbor item-to-set metric is difficult to learn because it is susceptible to noise and outliers. In summary, the contribution is a fashion recommendation system built on personal social media data, which recommends personalized outfits using a few unconstrained street-view selfie posts of the user. The authors also propose a self-supervised scheme that enables training of the system. The approach is based on a novel item-to-set metric learning framework that needs only the user's selfie posts as supervision, and they design a multi-modal embedding module that better fuses social media data for the extraction of fashion features. Built upon an item-wise distance $d(\mathbf{f}, \mathbf{f}_i)$, they propose an item-to-set metric $D(\mathbf{f}, S)$ that measures how dissimilar an item $\mathbf{f}$ is to a set of items $S = \{\mathbf{f}_1, \cdots, \mathbf{f}_n\}$. The item-to-set metric aims to predict how similar an outfit candidate is to a set of user selfies for personalized fashion recommendation. To design a metric that better captures the multiple interests of a user while facilitating robust training, the paper proposes a generalized item-to-set distance: given a set $S$ and a query $\mathbf{f}$, an importance weight is first assigned to each item $\mathbf{f}_i \in S$ before feature averaging and distance computation.
The importance weight is computed by an importance estimator, $w_i = g(\mathbf{f}_i; S, \mathbf{f})$, and the item-to-set distance is then the distance from the query to the importance-weighted average of the set. To reduce the influence of noise and outliers when computing the distance, they further consider an intra-set importance weight, produced by an MLP that outputs a scalar from an input vector together with $\mathrm{stat}(S)$, a vector capturing the statistics of the set along all feature dimensions; in this way each item is compared with the set to eliminate outliers. Since different individuals focus on different aspects of fashion items, the item-to-set metric itself should be user-specific: for users of a minimalist fashion style, the distance should be more sensitive to the number of colors used, while for users of an artsy style it should focus more on unusual prints and the complexity of accessories. They therefore extend the similarity metric to a user-specific one by performing a user-specific space transformation before the distance computation. In particular, given the set $S$, they compute a scaling vector $s(S)$ indicating the scaling factor at each feature dimension, and define a user-specific item-to-set metric in which the query and set features are multiplied elementwise ($\odot$) by $s(S)$ before the distance is computed. This filters out the feature dimensions that a user focuses less on, helping the recommendation system be more user-specific. In the post-coronavirus era, one industry that is undoubtedly incorporating advanced technologies faster than ever before is fashion, and thanks to AI- and computer-vision-powered tools, new and engaging experiences are being born for both retailers and consumers.
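The item-to-set distance above can be sketched numerically. The softmax-of-negative-distances importance weights below are an illustrative stand-in for the learned estimator in the paper, and the optional `scale` vector plays the role of the user-specific scaling $s(S)$.

```python
import numpy as np

def item_to_set_distance(query, item_set, scale=None):
    """Weighted item-to-set distance sketch: importance weights
    down-weight set items far from the query (a stand-in for the
    learned estimator), the set is averaged with those weights, and an
    optional user-specific elementwise scaling is applied before a
    Euclidean distance."""
    dists = np.linalg.norm(item_set - query, axis=1)
    w = np.exp(-dists)          # illustrative importance weights
    w /= w.sum()
    centroid = (w[:, None] * item_set).sum(axis=0)
    if scale is not None:       # user-specific feature scaling
        query, centroid = scale * query, scale * centroid
    return float(np.linalg.norm(query - centroid))
```

Compared with a plain average, the weighting keeps a single outlier selfie from dragging the set representation away from the user's dominant style.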
The e-commerce customer experience now incorporates AI solutions such as online site navigation, search and retrieval, targeted marketing, labeling, personalized offers, size fitting, recommendations, online fitting rooms, and style-recommendation analytics. Using computer vision and AI, semantic data is generated automatically from image pixels, which is crucial for e-commerce stores. A basic requirement is product discovery: visual search should be easy enough for shoppers to find what they are looking for, and it should also benefit retailers, who can exploit user behavior to show recommendations and increase profit as stores move further online in the post-COVID era. AI technology thus enables fashion brands to gain insight into which product features their customers prefer. An interesting observation [6] is that the fashion industry, at over 3 trillion dollars, contributes a healthy portion of global GDP, and in the 21st century AI, machine learning, and specifically deep learning are changing every expectation of this forward-looking business. The use of AI in the fashion industry of 2020 is reportedly so entrenched that 44 percent of fashion retailers not using AI today face bankruptcy, and as a result global spending on AI technologies by the fashion and retail industry is expected to reach 7.3 billion dollars per year by 2022. AI-powered fashion design can draw on preferred customer colors, textures, and other style preferences, which can then be used to design the apparel and the textile itself.
Regarding the manufacturing process, AI tools can identify rapidly changing trends and supply the latest fashion accessories to retail shelves much faster than the traditional retailing system. Leading fashion brands such as Zara, Topshop, and Achieve already use this approach, and they are much quicker in providing instant gratification to retail customers by recognizing seasonal demand and manufacturing the right supply of the latest clothing. Virtual merchandising, enabled by technologies such as augmented reality and virtual reality, is now closing the gap between online and in-store shopping, and is also very popular in this context. This can likewise be applied to recommendation systems, since many people would like to experience virtual and augmented reality for clothes fitting, making the online buying experience more human-like. As advancements in deep learning, computer vision, and AI grow stronger by the day, their usage in the fashion industry has become a very popular topic. From product personalization to better design, there are multiple ways in which AI and machine-learning technologies are impacting the global fashion industry, and the increasing investment by leading fashion brands in these technologies is proof of their immense potential. They provide enhanced customer service, virtual merchandising, smart manufacturing processes, improved inventory management, reduced manpower through automation, and fewer returned products, which also improves customer satisfaction.
One of the biggest factors is personalization, which is key to business success; deep learning technologies such as AI and ML, together with business analytics, enable fashion businesses to keep track of fashion trends and the purchasing behavior of individual customers. Whether it is trend prediction or seasonal prediction, these powerful tools magnify what the fashion industry can do. This is a field with the potential to grow and keep expanding, so any future research along this line will pave the way for even more remarkable results.

References

Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions
High-Level Clothes Description Based on Colour-Texture and Structural Features
Return of the Devil in the Details: Delving Deep into Convolutional Nets
POG: Personalized Outfit Generation for Fashion Recommendation at Alibaba iFashion. 2662-2670
Countants. 2020. AI and Machine Learning For Fashion Industry - Global Trends and Benefits
ImageNet: A Large-Scale Hierarchical Image Database
High level describable attributes for predicting aesthetics and interestingness
Designing Apparel with Neural Style Transfer
Fashion and Technology: How Deep Learning Can Create an Added Value in Retail
Generative Adversarial Nets
The Interestingness of Images
Learning Fashion Compatibility with Bidirectional LSTMs
FashionNet: Personalized Outfit Recommendation with Deep Neural Network
Explainable Fashion Recommendation: A Semantic Attribute Region Guided Approach
Creating Capsule Wardrobes from Fashion Images
Fashion++: Minimal Edits for Outfit Improvement
Collaborative Fashion Recommendation: A Functional Tensor Factorization Approach
Cross-Domain Image Retrieval with a Dual Attribute-Aware Ranking Network
Ambrish Tyagi, and Amit Agrawal. AI and Machine Learning for Fashion
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
What Makes a Photograph Memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence
Fashion Coordinates Recommender System Using Photographs from Fashion Magazines
Large Scale Visual Recommendations From Street Fashion Images
Parsing Clothes in Unrestricted Images
Learning to Appreciate the Aesthetic Effects of Clothing
Getting the Look: Clothing Recognition and Segmentation for Automatic Product Suggestions in Everyday Photos
Visually-Aware Fashion Recommendation and Design with Generative Image Models
Modifying the Memorability of Face Photographs
What Makes an Image Popular?
Where to Buy It: Matching Street Clothing Photos in Online Shops
Tensor Decompositions and Applications
Advances in Collaborative Filtering. 77-118
ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems
Gradient-Based Learning Applied to Document Recognition
Content-based multimedia information retrieval: State of the art and challenges
Mining Fashion Outfit Composition Using An End-to-End Deep Learning Approach on Set Data
Microsoft COCO: Common Objects in Context
Hi, magic closet, tell me what to wear! 1333-1334
Rectifier Nonlinearities Improve Neural Network Acoustic Models
Image-Based Recommendations on Styles and Substitutes
Content-Based Filtering Enhanced by Human Visual Attention Applied to Clothing Recommendation. 644-651
Content-Boosted Collaborative Filtering for Improved Recommendations
Learning to Rank for Personalised Fashion Recommender Systems via Implicit Feedback
The Concept of the Aesthetic
WordNet::Similarity - Measuring the Relatedness of Concepts
BPR: Bayesian Personalized Ranking from Implicit Feedback
Pairwise Interaction Tensor Factorization for Personalized Tag Recommendation
Tensor Decomposition for Signal Processing and Machine Learning
A High Performance CRF Model for Clothes Parsing
Neuroaesthetics in fashion: Modeling the perception of fashionability
Neural Compatibility Modeling with Attentive Knowledge Distillation
NeuroStylist: Neural Compatibility Modeling for Clothing Matching
The Aesthetics of Fashion Part 2
Going Deeper with Convolutions
Recommending Outfits from Personal Closet
Learning Type-Aware Embeddings for Fashion Compatibility
Learning Visual Clothing Style with Heterogeneous Dyadic Co-Occurrences
Runway to Realway: Visual Analysis of Fashion
Structural Deep Network Embedding
Interpretable Aesthetic Features for Affective Image Classification
A Hierarchical Attention Model for Social Contextual Image Recommendation
Visually Explainable Recommendation
Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items
Parsing Clothing in Fashion Photographs
Street-to-Shop: Cross-scenario Clothing Retrieval via Parts Alignment and Auxiliary Set
Clothing Co-Parsing by Joint Image Segmentation and Labeling
Personalized Fashion Design
Aesthetic-based Clothing Recommendation
Jiaxun Tang, and Zhijun Fang. 2017. Fashion Evaluation Method for Clothing Recommendation Based on Weak Appearance Feature
Personalized Fashion Recommendation from Personal Social Media Data: An Item-to-Set Metric Learning Approach
Who Leads the Clothing Fashion: Style, Color, or Texture? A Computational Study
Product Characterisation towards Personalisation: Learning Attributes from Unstructured Data to Recommend Fashion Products