
Teaching CLIP Some Fashion: Training FashionCLIP
by Federico Bianchi



Photo by Domenico Loia on Unsplash.

This is a short blog post describing FashionCLIP. If you are a data scientist, you probably have to deal with both images and text. However, your data will be very specific to your domain, and standard models might not work well. This post explains how vision and language models can be adapted to a specific domain and why using them can be a promising way to build a search engine or a (zero-shot) classifier.

FashionCLIP is a new vision and language model for the fashion industry and supports practitioners in solving two tasks:

  • Categorization: zero-shot classification of product images;
  • Search: efficient retrieval of products given a query.

While FashionCLIP is the result of many people working hard, this blog post is mainly my summary and my personal view of the amazing experience I had while building this, and does not necessarily represent the view of all other authors and their organizations.

Models

We currently release the model in two different formats:

Fashion is one of those industries that can benefit the most from AI products. Indeed, due to the nature of the domain, the existence of different catalogs, and client-specific datasets, it is often difficult to build solutions that can be applied seamlessly to different problems.

Imagine two data scientists at a major fashion company: Mary and Luis. The two have to deal with an ever-changing system whose operations require constant care:

  • Mary is building a product classifier to help with categorization at scale: her model takes a product and selects one among a list of categories (shoes, dress, etc.);
  • Luis is working on product matching to improve the search experience: his model takes a query in one of the supported languages (e.g., “a red dress”), and gives back a list of products matching the query.

As every practitioner knows, any new model in production brings to life a complex life cycle and somewhat brittle dependencies:

  • Mary’s model needs to be constantly re-trained as inventory grows and categories shift;
  • Luis’ model depends on the quality of product meta-data.

Same company, different use-cases, different models.

What if there was another way?

Today we try to take a step forward, showing how we can build a general model for fashion data. We describe FashionCLIP, a fine-tuned version of the famous CLIP model, tailored to fashion data. Our recent paper on FashionCLIP has been published in Nature Scientific Reports.

Chia, P.J., Attanasio, G., Bianchi, F. et al. Contrastive language and vision learning of general fashion concepts. Sci Rep 12, 18958 (2022). https://doi.org/10.1038/s41598-022-23052-9

FashionCLIP came to life through a collaboration with Farfetch, a giant (and real) luxury e-commerce company traded on the NYSE. FashionCLIP is a joint work with people from both industry (Coveo, Farfetch) and academia (Stanford, Bocconi, Bicocca). Model weights are available online in HuggingFace format. An example of usage can be found in Patrick’s repo.

We will first go over the use case and then explain the model in more depth. Finally, we will share the code we used to train the model and explain how to access the weights.

FashionCLIP is a general model that embeds images of fashion products and their descriptions in the same vector space: each image and each description is represented by a single dense vector.

Why are we putting them in the same vector space? So that they can be compared. This principle is the key to the success of a model like CLIP.

FashionCLIP is derived from the original CLIP. The idea is pretty straightforward. If you take:

  • A ton of images with captions;
  • An image encoder (this could be a CNN or ViT);
  • A text encoder (this could be a transformers-based language model).

You can train a model (with a contrastive loss) to put the embedding of an image close to the embedding of its caption and far from irrelevant captions. The GIF below shows an example in 2 dimensions; the concept generalizes to N dimensions.

FashionCLIP embeds descriptions and images in the same vector space. This is useful for zero-shot classification and image retrieval. Image by the author using Farfetch catalog.
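To make the contrastive objective concrete, here is a minimal sketch of a CLIP-style loss in PyTorch. It is illustrative only and not the actual FashionCLIP training code: the batches of image and caption embeddings and the temperature value are placeholders.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so that the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities: entry (i, j) compares image i with caption j
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i sits on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric cross-entropy: image-to-caption and caption-to-image
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Example with random tensors standing in for the two encoders' outputs
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(image_emb, text_emb))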

The end result is a multi-modal space, allowing you to move between visual and textual interactions using novel images and novel text descriptions: if you have some text, you can retrieve corresponding images (as in product search); if you have some images, you can rank captions based on semantic similarity (as in classification).

To fine-tune CLIP, you need a good dataset. We jointly worked with Farfetch to train CLIP with high-quality images and captions. The dataset (soon to be openly released) comprises more than 800K samples.

We train the model for a couple of epochs and check the performance on several benchmarks encompassing zero-shot classification, probing, and retrieval. Before seeing the results, let’s take a deeper look at what we can do now that we have a trained FashionCLIP.

We will not delve deeper into CLIP itself. If you want to know more about CLIP, I have a dedicated blog post here:

The two key tasks that FashionCLIP can tackle are:

  • Image Retrieval
  • Zero-shot Classification

Retrieval: From Text to Image

We first move from text to image: we encode a search query (“A red dress”) with FashionCLIP’s text encoder and retrieve the closest image vectors through a simple dot product. The greater the value of the dot product, the more similar the text and the image are. In the GIF below, the search is run over 4 product images as an example.

For retrieval, we can pre-compute image embeddings on the target catalog. At runtime, we encode the query and rank images through a simple dot product. Image by the author using Farfetch catalog.
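As a rough sketch of how text-to-image retrieval could look in code, here is an example that assumes the released weights are loaded through the standard HuggingFace transformers CLIP classes; the patrickjohncyh/fashion-clip checkpoint name and the catalog image paths are assumptions made for illustration.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed HuggingFace checkpoint for FashionCLIP
model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

# Pre-compute (and normalize) image embeddings for a toy catalog
catalog_paths = ["img/red_dress.jpg", "img/blue_dress.jpg", "img/sneakers.jpg"]  # placeholder paths
images = [Image.open(p) for p in catalog_paths]
with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# At query time, embed the text and rank the catalog by dot product
with torch.no_grad():
    query_emb = model.get_text_features(
        **processor(text=["a red dress"], return_tensors="pt", padding=True)
    )
query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

scores = (query_emb @ image_emb.T).squeeze(0)
for i in scores.argsort(descending=True):
    print(catalog_paths[int(i)], float(scores[i]))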

While “red dress” is a simple query for which the search engine may not need additional input, things quickly get interesting with slightly more ambiguous queries, such as “light red dress” vs “dark red dress”, in which “light” and “dark” are modifiers of the same color:

FashionCLIP helps disambiguate geometric features. Image by the author using Farfetch catalog.

Even more interesting is FashionCLIP’s ability to capture items represented within clothes. Product descriptions often fail to explicitly mention figurative patterns; FashionCLIP, instead, is able to recognize printed items, even in a cartoonish shape, like the cat hanging on a bag on the t-shirt below:

FashionCLIP recognizes figurative items printed on t-shirts. Image by the author using Farfetch catalog.

While we have not evaluated this capability in detail, we believe this might come from the “knowledge” possessed by the original CLIP, which is partially kept during fine-tuning.

Of course, some information is better encoded in descriptions (e.g., brands are often mentioned explicitly) than in any semantic nuance FashionCLIP may capture. However, its ability to augment standard learning-to-rank signals without behavioral data may greatly improve the search experience, especially in cold-start scenarios.

Classification: From Image to Text

We now go from image to text for classification: we encode the image of a fashion item we want to classify with FashionCLIP’s image encoder and retrieve the closest label vectors through a dot product:

For zero-shot classification, we compute the image embeddings of the query item and the text embedding of the target labels. Image by the author using Farfetch catalog.

The trick of CLIP-like models is treating labels not as categorical variables, but as semantically meaningful text.

In other words, when “classifying”, we are asking the question “which of these texts is the best caption for this image?”.
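Here is a minimal sketch of this caption-ranking view of classification, again assuming the same HuggingFace checkpoint as in the retrieval example above; the candidate labels and the image path are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

# Candidate labels phrased as captions, plus a placeholder product image
labels = ["high-heel shoes", "flat shoes", "an elegant dress", "a streetwear hoodie"]
image = Image.open("images/example_product.jpg")  # placeholder path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarities; a softmax turns them
# into a probability distribution over the candidate labels
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.2f}")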

Thanks to CLIP pre-training and the infinite possibilities of natural language, we now have a classifier that is not confined to any specific set of labels, categories, or attributes. While, of course, the first application could be using this classifier on new products in the Farfetch catalog, we can re-use the same model on other datasets, with different labels or purposes, e.g.:

  • If a supplier doesn’t categorize shoes as “high-heel shoes” vs “flat shoes”, we can add that attribute;
  • If merchandisers are creating new views on the catalog — for example, matching items to styles — we can classify existing products according to new dimensions (“elegant”, “streetwear”, etc.).

The generalization abilities of CLIP come, of course, at the expense of some precision: if we train a new classifier in a supervised fashion for each of the use cases above, each will be a bit better than FashionCLIP. As usual, there is no one-size-fits-all in real-world ML, and the trade-off between one model and many can be assessed in different ways depending on the importance of the use case, training time, labeling costs, etc.

Performance

We compare FashionCLIP to CLIP on two different tasks across various datasets. More details about the setup can be found in the paper; the goal of this section is just to show that there is a boost in performance when using FashionCLIP in place of CLIP for fashion-related tasks.

For Zero-Shot Classification, we use three different datasets (KAGL, DEEP, and FMNIST) that serve as out-of-distribution datasets (we know from other experiments that our model works much better than CLIP on in-domain data, but that is expected).

Weighted Macro F1 score on different datasets (out-of-domain data). FashionCLIP shows a significant improvement over CLIP on these datasets.

Zero-shot results confirm that our model works as expected!

For Image Retrieval, we use a portion of the original dataset that we left out during training. Note that this obviously gives us an advantage over CLIP, as this data is in-domain for us. However, it is still an interesting experiment. The following results confirm that our model performs best:

Precision at 5 and at 10 on our internal test set (in-domain data). FashionCLIP has a much better retrieval performance.

Torch Implementation and HuggingFace Weights

Thanks to Patrick’s work, FashionCLIP is very easy to use. You can simply load the model and run zero-shot classification with a single method call, all in Python!

# Load FashionCLIP (the loading step is elided in the original post; with the
# fashion-clip package this is typically something like FashionCLIP('fashion-clip', ...))
fclip = [...load FCLIP ...]

# Candidate captions: the model will rank them for the given image
test_captions = [
    "nike sneakers", "adidas sneakers", "nike blue sneakers",
    "converse", "nike", "library", "the flag of italy",
    "pizza", "a gucci dress"
]
test_img_path = 'images/16790484.jpg'

# Zero-shot classification: which caption best describes the image?
fclip.zero_shot_classification([test_img_path], test_captions)

And you can also do image retrieval!

candidates = fclip.retrieval(['shoes'])
print(candidates)

The Conclusion of a Long Journey

Building FashionCLIP has been a long and fun adventure with old and new friends from some of the coolest places on earth. The results always taste better when you get them with your friends. Also, some of us have been working together for years and have never actually met in real life!

On a more pragmatic note, we hope that FashionCLIP can open up unprecedented opportunities for companies quickly iterating on internal and external fashion use cases: for example, while you may end up building a dedicated style classifier, using FashionCLIP for your proof of concept will go a long way in proving the value of the feature without investing upfront in the life-cycle support a new model requires.

When we consider the growing number of SaaS players offering intelligent APIs for retail — Coveo, Algolia, Bloomreach — the importance of vertical models cannot be overstated: since B2B companies grow account by account, robustness and re-usability matter more than pure precision. We envision a near future in which FashionCLIP — and DIYCLIP, ElectronicsCLIP, etc. — will be a standard component of B2B Machine Learning players, enabling quick iteration, data standardization, and economies of scale on a completely different level than what has been possible so far.

I also gave a talk last year at Pinecone about FashionCLIP:

The talk I gave at Pinecone about building models like FashionCLIP.

An Additional Demo

What’s the power of Open Source? Pablo saw the model and reached out with a UI to help us test the difference between the standard HuggingFace CLIP and the FashionCLIP we just released. I then used Kailua to test the search using FashionCLIP with a couple of queries:

Using FashionCLIP for search. GIF by author, images from the H&M dataset. The demo is available here.

Cool, isn’t it?

Limitations, Bias, and Fairness

We acknowledge certain limitations of FashionCLIP and expect that it inherits limitations and biases present in the original CLIP model. We do not expect our fine-tuning to significantly amplify these limitations. We also acknowledge that the fashion data we use makes explicit assumptions about the notion of gender, as in “blue shoes for a woman”, that inevitably associate aspects of clothing with specific people.

Our investigations also suggest that the data used introduces certain limitations in FashionCLIP. From the textual modality, given that most captions derived from the Farfetch dataset are long, we observe that FashionCLIP may be more performant on longer queries than on shorter ones.

From the image modality, FashionCLIP is also biased towards standard product images (centered, white background). This means that the model might underperform on images that do not have the same structure.

FashionCLIP has been a long journey, and there are a couple of things we did while waiting for the official release.

GradedRecs

We built on top of our work in FashionCLIP to explore recommendations by traversing the latent space. Check out our paper if you’re interested!

GradedRec. Image by the author.

Fairness in Recommender System Evaluation

If you are interested in related industry tasks, such as recommendations, we ran a challenge last year on a well-rounded evaluation of recommender systems.

The challenge was aimed at understanding how we can build evaluations that are not focused only on point-wise metrics (e.g., accuracy). You can find some details and an introductory blog post here.

