Build A Text Recommendation System with Python

Kshitij Verma
11 min read · Jan 27, 2022

Natural Language Processing is one of the most exciting fields of Machine Learning. It enables a computer to understand very dense corpora, analyze them, and provide us with the information we are looking for.

In this article, we’ll create a recommendation system that acts like a vertical search engine [3]: it searches within a constrained collection of documents that are tightly bound to a single topic. To leverage the power of NLP, we’ll combine search methodology with semantic similarity.

This article includes the dataset, theory, code, and results for different NLP models.

Why should we use machine learning?

  1. Machine Learning and Deep Learning are good at providing representations of textual data that capture word and document semantics, allowing a machine to tell which words and documents are semantically similar. Using deep learning yields more relevant results for end users, increasing user satisfaction and the efficacy of the product. In this article, we’ll build a solution that is tailored to your data and can later compare that data with the user query to provide well-ranked results.
  2. For most common programming languages, like Python, many open-source libraries provide tools to easily create and train fairly complex machine learning models on your own data. In this article, we’ll see how quick it is to build and train models on your own machine.

Use case

Imagine you want to watch a movie, but you’ve already watched all the ones on your bucket list. Today you are feeling like watching a movie where a beautiful woman is involved in a crime. The recommendation system we’ll build will match your ideal movie description with a database of movie descriptions and suggest the top three movies that match your description.

This is very simple. To build this pipeline, you’ll need:

  1. a dataset that contains the collection of text items you want to recommend,
  2. a sentence-cleaning algorithm,
  3. a matching algorithm.

Preparing the data

The dataset as a lookup table

For this toy experiment, we use the movie dataset https://www.kaggle.com/rounakbanik/the-movies-dataset.

It contains the metadata of 44,512 movies released before July 2017. Fortunately, we won’t have to watch all of them, because we will only pick the ones that match our precise desires.

In our experiment, we’ll use only the table ``movies_metadata.csv``, which contains attributes such as budget, genre, webpage link, original title, description overview, release date, spoken languages, IMDb vote average, and so on.

Our recommendation system will take each movie’s description overview and apply a machine learning model to represent each sentence as a numerical feature vector. After applying the model to each sentence, we can stack these feature vectors into an embedding matrix that represents our whole dataset. This matrix is very important: it will be our lookup table for every query we make to the system.

Image by author

Let’s say we have 100 movie description sentences and our embedding vector size is 300; then our embedding matrix has shape (100 x 300). When the user inputs a sentence, we embed the query into a 300-dimensional vector with the same model and compute the cosine distance between each of the 100 rows and the embedded query vector.
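As a minimal sketch of this lookup step (the shapes and variable names here are illustrative, not the article’s code):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy lookup table: 100 movie descriptions embedded into 300-dim vectors
embed_mat = np.random.rand(100, 300)
# the user query embedded into the same 300-dim space with the same model
embed_query = np.random.rand(1, 300)

# cosine similarity between the query and every row of the lookup table
scores = cosine_similarity(embed_query, embed_mat)[0]   # shape (100,)
top3 = np.argsort(scores)[::-1][:3]                     # indices of the 3 closest descriptions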

Loading the dataset

Download the .csv file and load it as a data frame in your Python script or notebook.
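A minimal loading sketch with pandas (assuming the file sits next to your notebook; the column names follow the Kaggle table, and renaming overview to sentence is just a convenience for the rest of the article):

import pandas as pd

# low_memory=False avoids mixed-type warnings on this file
df = pd.read_csv('movies_metadata.csv', low_memory=False)
df = (df[['original_title', 'overview']]
      .rename(columns={'overview': 'sentence'})
      .dropna()
      .reset_index(drop=True))
print(df.shape)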

Clean sentence

We’ll do some basic pre-processing. We want to give uniform text to the model, so we clean the sentences before embedding them into vectors. This helps the model focus on the content rather than the formatting when looking for relevant patterns in the data.

Let’s review our cleaning methodology below.

  • remove non-alphanumeric characters / punctuation,
  • remove sentences that are too long or too short,
  • remove stopwords,
  • lemmatize,
  • tokenize.

We use regex pattern matching to remove all non-alphanumeric characters from the movie descriptions.

We create our own tokenizer. The tokenizer transforms a string into a list of strings where each element is one word.

word_tokenizer('the beautiful tree has lost its leaves')
>>> ['the', 'beautiful', 'tree', 'has', 'lost', 'its', 'leaves']

We use a lemmatizer that converts a word into a generic form:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stemmer = WordNetLemmatizer()
sentence = 'the beautiful tree has lost its leaves'
[stemmer.lemmatize(w) for w in word_tokenize(sentence)]
>>> ['the', 'beautiful', 'tree', 'ha', 'lost', 'it', 'leaf']

We remove sentences that are too short or too long from our dataset by filtering with MIN_WORDS and MAX_WORDS. We also remove common words (STOPWORDS), which won’t help to capture what is specific about a given sentence.
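Here is a hedged sketch of such a cleaning step. The constants (MIN_WORDS, MAX_WORDS) and the helper name clean_text are illustrative choices, not the article’s exact code:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

MIN_WORDS = 4
MAX_WORDS = 200
STOPWORDS = set(stopwords.words('english'))
PATTERN = re.compile(r'[^a-zA-Z0-9 ]')   # keep only alphanumeric characters and spaces

stemmer = WordNetLemmatizer()

def clean_text(sentence):
    """Lowercase, strip non-alphanumeric characters, tokenize, drop stopwords, lemmatize."""
    cleaned = PATTERN.sub(' ', str(sentence).lower())
    return [stemmer.lemmatize(w) for w in word_tokenize(cleaned)
            if w not in STOPWORDS and len(w) > 1]

df['clean_sentence'] = df['sentence'].apply(lambda s: PATTERN.sub(' ', str(s).lower()))
df['tok_lem_sentence'] = df['clean_sentence'].apply(clean_text)
# drop descriptions that are too short or too long
df = df[df['tok_lem_sentence'].str.len().between(MIN_WORDS, MAX_WORDS)]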

Result of cleaning

We converted the column “sentence” into two columns:

  • clean_sentence, where the sentence is reduced to mainly alphanumeric text,
  • tok_lem_sentence, where the sentence is lemmatized and tokenized.

Loading, Training, Prediction

Now our dataset is ready to be processed, and we can compare different models that offer different trade-offs:

  • the baseline, basic TF-IDF,
  • Word2Vec, a simple feed-forward network,
  • spaCy, a very flexible and powerful NLP library,
  • transformers, the state-of-the-art Deep Learning models.

I introduce below a function that I will use to rank the best recommendations given the cosine distance between vectors.

Notice that we average over the distance matrix. Its shape is (number of words in the query sentence, number of sentences in our vocabulary): we compute the distance from each word of the query to each sentence of our database and take the average over the whole query. Then we keep the three sentences with the lowest distance to the query.
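Here is a sketch of what such a ranking function can look like; the name extract_best_indices and the exact operations are assumptions based on the description above, not the article’s gist:

import numpy as np

def extract_best_indices(similarity, top_k=3, mask=None):
    """
    similarity: array of shape (n_query_words, n_sentences), or (n_sentences,) if
    the model already returns one score per sentence. Average over the query words,
    optionally mask out sentences to ignore, and return the top_k best indices.
    """
    if similarity.ndim > 1:
        scores = similarity.mean(axis=0)      # average over the words of the query
    else:
        scores = similarity
    if mask is not None:
        scores = scores * mask                # zero out sentences we want to ignore
    # assuming similarity scores (higher is better); flip the order for distances
    return np.argsort(scores)[::-1][:top_k]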

Baseline model, TF-IDF

What is TF-IDF?

A text corpus often contains words that appear frequently but don’t carry any useful discriminatory information. TF-IDF is designed to down-weight these frequently occurring words in the feature vectors. It is defined as the product of the term frequency (how often a word appears in a given document) and the inverse document frequency (which decreases as the word appears in more of the documents) [1]. This weight measures the relevance of a term in a given document. In our case, one document is one sentence.
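As a rough worked example of that definition (this is the classic formulation; scikit-learn’s TfidfVectorizer adds smoothing and L2 normalization on top of it):

import math

def tfidf_weight(term_count_in_doc, n_docs, n_docs_containing_term):
    """Classic TF-IDF: term frequency times inverse document frequency."""
    tf = term_count_in_doc
    idf = math.log(n_docs / n_docs_containing_term)
    return tf * idf

# a word appearing 3 times in one sentence and present in 5 of 100 sentences
print(tfidf_weight(3, 100, 5))   # 3 * log(20) ≈ 8.99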

Applying TF-IDF

Once we fit the TF-IDF model on our data, we can generate a 22,180-dimensional embedding vector for each movie description. These features are stored in a feature matrix tfidf_mat, where each row is a movie description embedded into a feature vector.

When we get a query from user input, we embed it into the same vector space and compare the query feature vector embed_query one by one with the sentence vectors of the embedding matrix tfidf_mat.
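A minimal sketch of this step with scikit-learn (the tok_lem_sentence column comes from the cleaning step above; the rest is an assumed setup, not the article’s exact code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# join the token lists back into strings and fit the vectorizer on the corpus
corpus = df['tok_lem_sentence'].apply(' '.join)
vectorizer = TfidfVectorizer()
tfidf_mat = vectorizer.fit_transform(corpus)      # shape (n_movies, vocabulary_size)

# in practice, clean the query with the same pipeline before embedding it
query = 'a crime story with a beautiful woman'
embed_query = vectorizer.transform([query])       # same vector space as the corpus
scores = cosine_similarity(embed_query, tfidf_mat)[0]
best_index = scores.argsort()[::-1][:3]
print(df['original_title'].iloc[best_index])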

The function that finds the best indices from the distance matrix takes the average of the cosine distance for each embedded sentence and ranks the results. Don’t pay attention yet to the masking argument; we will use it for another model.

This model is very simple to use and can be set up with only a handful of lines of code; training is very fast as well.

On the query sentence: ‘a crime story with a beautiful woman’,

Top-3 results

- A beautiful vampire turns a crime lord into a creature of the night. Innocent Blood

- A vampire lures beautiful young women to his castle in Europe. Requiem pour un vampire

- The story of a young woman clinging on to her dream to become a beauty contest queen in a Mexico dominated by organized crime. Miss Bala.

A model for your data, Word2Vec

Word2Vec is a powerful and efficient algorithm that can capture the semantics of the words in your corpus. During training, Word2Vec takes a corpus containing many documents/sentences and outputs a series of vectors, one for each word in the text. A word vector encodes the semantic meaning of that word, so two words that are close in the vector space share a similar meaning: king and queen are close, just like cake and coffee are. Word2Vec can easily be used to find synonyms, for example. During training, each word vector is weighted by the neural network according to the probability of occurrence inferred from the training dataset. How this probability is computed depends on the architecture you choose (Continuous Bag Of Words or skip-gram). In the end, the word2vec model is in fact a very simple 2-layer neural network, but we don’t care about the output layer: we extract the hidden state, where the information is encoded [3]. The advantage of Word2Vec is that it takes a high-dimensional sparse word representation (like TF-IDF or one-hot encoded vectors) and maps it into a dense representation, hopefully in a smaller dimension.

Applying word2vec

We use the gensim library to train a Word2Vec model on our corpus. We first need to provide the vocabulary (the list of words we want to vectorize) and then train the model for a few epochs. By default, the model trains with the CBOW architecture.
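A training sketch with gensim (version 4 API; the hyperparameters are illustrative, not the article’s exact configuration):

from gensim.models import Word2Vec

# each element of the corpus is a list of tokens (the tok_lem_sentence column)
sentences = df['tok_lem_sentence'].tolist()
w2v_model = Word2Vec(
    sentences=sentences,
    vector_size=300,   # dimension of the word vectors
    window=5,
    min_count=2,       # ignore words seen fewer than 2 times
    workers=4,
    sg=0,              # 0 = CBOW (the default), 1 = skip-gram
    epochs=10,
)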

The main downside of Word2Vec is that it can’t produce vectors for words that were not in the vocabulary used for training. This is why we need a utility function is_word_in_model that removes from the query sentence the words that were unseen in the training set.

We use gensim’s n_similarity function to efficiently compute the similarity between the query and the dataset sentences.
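A querying sketch (is_word_in_model mirrors the helper described above; clean_text and extract_best_indices are the helpers sketched earlier in this article):

import numpy as np

def is_word_in_model(word, model):
    """Return True if the word has a vector in the trained Word2Vec model (gensim >= 4)."""
    return word in model.wv.key_to_index

query_tokens = [w for w in clean_text('a crime story with a beautiful woman')
                if is_word_in_model(w, w2v_model)]

scores = []
for tokens in df['tok_lem_sentence']:
    known = [w for w in tokens if is_word_in_model(w, w2v_model)]
    # cosine similarity between the set of query words and the known words of the sentence
    scores.append(w2v_model.wv.n_similarity(query_tokens, known) if known else 0.0)

best_index = extract_best_indices(np.array(scores), top_k=3)
print(df['original_title'].iloc[best_index])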

Top-3 results

- A tragic love story set in contemporary Shanghai. The film stars Zhou Xun in a dual role as two different women and Jia Hongsheng as a man obsessed with finding a woman from his past. 苏州河

- A glowing orb terrorizes a young girl with a collection of stories of dark fantasy, eroticism and horror. Heavy Metal.

- Fritz Lang’s psycho thriller tells the story of a woman who marries a stranger with a deadly hobby and through their love he attempts to fight off his obsessive-compulsive actions. Secret Beyond the Door

Good compromise model, spaCy

spaCy is an open-source Python library that contains many pre-computed models for a variety of languages (see the list of 64+ languages here). Because we need to load the vector associated with each word, we use a trained pipeline for which we have access to the feature vectors, en_core_web_lg. They call their network Tok2Vec, and we can use pre-trained weights that contain 685k keys mapped to 685k unique vectors of dimension 300, trained on a web corpus. spaCy has its own deep learning library and models; you can read more about their default CNN + hash embedding model here. Their pipeline includes a tokenizer, a lemmatizer, and word-level vectorization, so we only need to provide the sentences as strings.

We use the masking argument of the ranking function in case a word does not have a vector in the pretrained vocabulary.
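A minimal sketch of this step (assuming en_core_web_lg has already been downloaded with python -m spacy download en_core_web_lg; the variable names are illustrative):

import numpy as np
import spacy
from sklearn.metrics.pairwise import cosine_similarity

nlp = spacy.load('en_core_web_lg')

# doc.vector is the average of the token vectors: one 300-dim vector per description
embed_mat = np.array([nlp(text).vector for text in df['clean_sentence']])

query_doc = nlp('a crime story with a beautiful woman')
# mask out tokens that have no pretrained vector before averaging
query_vectors = np.array([tok.vector for tok in query_doc if tok.has_vector])
embed_query = query_vectors.mean(axis=0).reshape(1, -1)

scores = cosine_similarity(embed_query, embed_mat)[0]
best_index = scores.argsort()[::-1][:3]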

Top-3 results:

Even though spaCy is expected to be more powerful than TF-IDF, for this query it gives the same results as TF-IDF.

The most powerful, Transformers

If we want a robust and accurate method, we can use Deep Learning. Fortunately, many libraries have released their own pretrained weights, so we don’t need to spend months training a very deep model on the whole internet.

We’ll use the BERT architecture. The computational cost is higher than for the previous models, but the understanding of context is sharper too.

Sentence_transformers library

The SentenceTransformers team developed a high-level pipeline to facilitate the use of transformers in Python. You can explore their list of models and recommendations here. I personally chose paraphrase-MiniLM-L6-v2, which they describe as a quick model with high quality.
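With the sentence_transformers library, the whole embedding step fits in a few lines (a sketch reusing the clean_sentence column from earlier):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# one 384-dimensional vector per movie description
embed_mat = model.encode(df['clean_sentence'].tolist(), show_progress_bar=True)
embed_query = model.encode(['a crime story with a beautiful woman'])

scores = cosine_similarity(embed_query, embed_mat)[0]
best_index = scores.argsort()[::-1][:3]
print(df['original_title'].iloc[best_index])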

Top-3 results

- The story of a young woman clinging on to her dream to become a beauty contest queen in a Mexico dominated by organized crime. Miss Bala.

- Following the gruesome murder of a young woman in her neighborhood, a self-determined woman living in New York City — as if to test the limits of her own safety — propels herself into an impossibly risky sexual liaison. Soon she grows increasingly wary about the motives of every man with whom she has contact — and about her own. In the Cut.

- Story of the love between a struggling American artist and a beautiful Chinese prostitute in Hong Kong. The World of Suzie Wong

We can see the improvement in the results compared with previous models.

Code your own BERT pipeline with HuggingFace

SentenceTransformers uses the HuggingFace architecture in the backend. HuggingFace leads the research as an open-source provider of NLP technologies: they expose models, datasets, and code bases for free, and anyone can train a model and push the result to their hub.

We’ll use the tokenizer (AutoTokenizer) and model (AutoModel) that are retrievable as sentence-transformers/paraphrase-MiniLM-L6-v2.

We can choose to train on CPU or GPU, and we use fit_transform to generate our embedding matrix in batches so as not to overload the memory. Again, in the transform method, we use a mean_pooling function to infer a vector for each token and average them for each sentence. This pooling is more complex than before because the attention mechanism in BERT uses masks, and we apply them to the result before averaging.
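Here is a sketch of the core of such a pipeline; it follows the mean-pooling recipe published with the model, while the batching loop and the fit_transform wrapper are left out for brevity:

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = 'sentence-transformers/paraphrase-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def mean_pooling(model_output, attention_mask):
    """Average the token embeddings, ignoring padding tokens via the attention mask."""
    token_embeddings = model_output.last_hidden_state        # (batch, seq_len, hidden)
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

sentences = ['a crime story with a beautiful woman']
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    output = model(**encoded)
embed_query = mean_pooling(output, encoded['attention_mask'])  # shape (batch, 384)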

The code is slightly longer than with SentenceTransformer, but we have more flexibility too, and the inference method we design ourselves is also more transparent.

Training and Predicting (CPU or GPU)

Training takes time, so if you have access to a GPU, use it to speed things up. For this dataset, training on 20,000 sentence rows took 12 minutes on CPU on my 16 GB, 8-CPU local machine, while it took only 6 minutes when I used my local GPU (GeForce MX150) with the same training configuration.
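Switching between CPU and GPU is a one-liner with PyTorch (a sketch, reusing the model and encoded inputs from the HuggingFace snippet above):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
encoded = {k: v.to(device) for k, v in encoded.items()}
with torch.no_grad():
    output = model(**encoded)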

Example queries:

QUERY: ‘A dog is playing with his friends’

BERT MODEL: ‘The story of a man who rescues a German shepherd and how the two become fast friends.’ [My Dog Tulip]

QUERY: ‘a heroe movie in europe’

BERT MODEL: ‘A cop in a dystopian Europe investigates a serial killings suspect using controversial methods written by his now disgraced former mentor.’ [Forbrydelsens element]

(we may conclude that it did not exist in the database)

QUERY: ‘the story of a waitress’

BERT MODEL: ‘A waitress, desperate to fulfill her dreams as a restaurant owner, is set on a journey to turn a frog prince back into a human being, but she has to face the same problem after she kisses him.’ [The Princess and the Frog]

Conclusion

Once we loaded and vectorized an initial dataset, we can use a new sentence as a query and retrieve the top-3 closest items from the dataset that match this query sentence.

For that, we used the simple TF-IDF model, which uses the frequency of the words in the sentence to create vectors. We used the Word2Vec model, a simple neural network. And we used spaCy, which can process and compute vectors for many different languages with state-of-the-art performance.

Finally, we used transformer models to leverage the latest deep learning methods for vectorization. Deep Learning techniques achieve the best (subjectively) retrieval results, but they are slower. Pick the method that fits your trade-off best!
