Movie Recommendations with Neo4j. Building a simple movie recommender… | by Dimitris Panagopoulos | Feb, 2023

By Jessie Hobb On Feb 25, 2023

Building a simple movie recommender with Python and Neo4j

Image created by author using stable diffusion and code described in https://bytexd.com/get-started-with-stable-diffusion-google-colab-for-ai-generated-art/

Creating recommendations is a common use case of machine learning. In this post, we will demonstrate how to use a graph database to create a simple movie recommendation system. The proposed methods are not state-of-the-art, but graph-based methods are easy to implement and easy to explain. They could form the starting point for a simple recommender that serves results fast, acts as a baseline for evaluating more complex systems, or both.

Readers who would like to experiment can use Neo4j’s sandbox and Google’s Colab to get a system ready in a minute or two. For this article, we will be using data from GroupLens.org (i.e., the “1M Dataset”). We will also use a small data set to create a minimal graph with only a few nodes, so that we can easily check the calculations. All code and data for the minimal graph can be found in the author’s GitHub.

Please note that:

1. The Neo4j Graph Data Science plug-in should be installed in Neo4j (it is already installed in Neo4j’s sandbox).
2. In Python, the “neo4j-driver” and “graphdatascience” libraries should be installed.

To install the Python libraries in (2), you can use pip:

!pip install neo4j-driver
!pip install graphdatascience

After loading the necessary libraries, the first step is to connect to Neo4j. This is done with the following snippet

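import numpy as np  # used later to split dataframes into chunks
from graphdatascience import GraphDataScience

# Replace the placeholders with your own connection details
DB_URL = 'bolt://xxxxx:xxxx'
DB_USER = 'neo4j'
DB_PASS = 'xxxxx'
gds = GraphDataScience(DB_URL, auth=(DB_USER, DB_PASS))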

If we are using Neo4j’s sandbox, we can find the URL and password in the “Connect via drivers” tab.

As mentioned in the introduction, we are going to use data on movie ratings. In particular, we are going to use the MovieLens 1M dataset. This dataset contains about 1 million ratings from roughly 6,000 users on roughly 4,000 movies. It consists of three separate text files:

movies.dat: data on movies in the form of MovieID::Title::Genres
users.dat: data on users in the form of UserID::Gender::Age::Occupation::Zip-code
ratings.dat: data on ratings in the form of UserID::MovieID::Rating::Timestamp

First five rows of the movies, users and ratings text files
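
As a rough illustration of the `::`-delimited layout described above, the sketch below parses such lines into a list of dictionaries, which is the same "records" shape that is later passed to Cypher as `$data`. The helper name `parse_dat_lines`, the `INT_COLUMNS` constant and the two sample lines are illustrative assumptions and not part of the original code; with pandas, `read_csv` with `sep='::'` and `engine='python'` is a common alternative.

```python
# Column names as described for the MovieLens 1M files.
MOVIE_COLUMNS = ["MovieID", "Title", "Genres"]
USER_COLUMNS = ["UserID", "Gender", "Age", "Occupation", "Zip-code"]
RATING_COLUMNS = ["UserID", "MovieID", "Rating", "Timestamp"]

# Columns converted to integers (hypothetical helper constant).
INT_COLUMNS = {"UserID", "MovieID", "Rating", "Timestamp", "Age", "Occupation"}


def parse_dat_lines(lines, columns):
    """Turn 'a::b::c' lines into a list of dicts keyed by column name."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        values = line.split("::")
        record = {}
        for name, value in zip(columns, values):
            record[name] = int(value) if name in INT_COLUMNS else value
        records.append(record)
    return records


# Illustrative sample lines in the documented formats (made-up, not read from disk).
sample_movies = ["1::Toy Story (1995)::Animation|Children's|Comedy"]
sample_ratings = ["1::1::5::978824268"]

print(parse_dat_lines(sample_movies, MOVIE_COLUMNS))
print(parse_dat_lines(sample_ratings, RATING_COLUMNS))
```

For the real files one would iterate over an opened file object instead of the sample lists.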

We will create two kinds of nodes. One that represents users and another one representing movies. We will also create a relationship between user nodes and movie nodes to represent the fact that a user has rated a movie. As an attribute of this relationship, we will use the rating score. The graph database schema is shown below.

Using the Graph Data Science library, it is straightforward to load pandas dataframes into Neo4j. For example, the code below loads users.dat

# Make the user id a node key, then create one User node per dataframe row
gds.run_cypher('create constraint if not exists for (n:User) require (n.id) is node key')
create_users_res = gds.run_cypher('''
    unwind $data as row
    merge (n:User{id: row.UserID})
    set n.Gender = row.Gender
    set n.Age = row.Age
    return count(*) as users_created
    ''', params = {'data': users.to_dict('records')})

The ratings.dat file is quite big and cannot be loaded all at once. Hence, we need to split the dataframe and load it in chunks.

# Load the ratings in 200 chunks, printing progress every 10 chunks
for i, chunk in enumerate(np.array_split(ratings, 200)):
    if i % 10 == 0:
        print(i)
    create_rated = gds.run_cypher('''
        unwind $data as row
        match (u:User{id: row.UserID}), (m:Movie{id: row.MovieID})
        merge (u)-[r:RATED]->(m)
        set r.Rating = row.Rating
        return count(*) as create_rated
        ''', params = {'data': chunk.to_dict('records')})

Minimal example graph

To help the reader understand the methods we are going to use, we will use the following minimal graph as an example. It has:

three user nodes numbered 1, 2 and 3
four movie nodes
seven rating relationships (the actual rating is shown in parentheses)

Using Cypher, it is easy to find movies that are similar to a given one. Given a movie m1, one can find all users that have rated it with the top score (5) and then return all other movies those users have also rated with the top score. Using the number of paths that connect m1 to each of the other movies, we can calculate a similarity score.

For example, the Cypher query for finding movies similar to “Toy Story (1995)” is the following.

# Check similar movies
similar_movies = gds.run_cypher('''
    MATCH (m1:Movie)-[r1]-(u:User)-[r2]-(m2:Movie)
    WHERE m1.Title CONTAINS 'Toy Story (1995)'
    AND m2.Title <> 'Toy Story (1995)'
    AND r1.Rating = 5 AND r2.Rating = 5
    RETURN m2.Title, m2.Genres, count(DISTINCT(u)) AS number_of_paths
    ORDER BY number_of_paths DESC
    ''')
similar_movies.head()

In our minimal graph example, this will return “Jumanji (1995)”, which is connected to “Toy Story (1995)” by two paths: one passing through user 1 and the other passing through user 2.
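
To check the path counting by hand, the following plain-Python sketch mirrors the logic of the query on the minimal graph. Which user rated which movie, and the ratings of users 1 and 2, are inferred from the results quoted in this article; the two ratings of user 3 are an assumption (any value below 5 gives the same result), and the function name `similar_movies` is only illustrative.

```python
from collections import Counter

# (user, movie) -> rating for the minimal graph.
# Ratings of users 1 and 2 are inferred from the article's results;
# the two ratings of user 3 are ASSUMED values.
RATINGS = {
    (1, "Toy Story (1995)"): 5,
    (1, "Jumanji (1995)"): 5,
    (1, "Waiting to Exhale (1995)"): 3,
    (2, "Toy Story (1995)"): 5,
    (2, "Jumanji (1995)"): 5,
    (3, "Toy Story (1995)"): 4,   # assumed
    (3, "GoldenEye (1995)"): 4,   # assumed
}


def similar_movies(ratings, title, top_score=5):
    """Count, per movie, the users who gave top_score to both it and `title`."""
    # Users who rated the given movie with the top score.
    fans = {u for (u, m), r in ratings.items() if m == title and r == top_score}
    # Other movies those same users also rated with the top score.
    counts = Counter(
        m
        for (u, m), r in ratings.items()
        if u in fans and m != title and r == top_score
    )
    return counts.most_common()  # sorted by number of paths, descending


print(similar_movies(RATINGS, "Toy Story (1995)"))
```

Each counted user corresponds to one path of the form movie-user-movie in the graph.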

When we use our regular graph with one million ratings, the five most similar movies to “Toy Story (1995)” are

Star Wars: Episode IV — A New Hope (1977)
Toy Story 2 (1999)
Raiders of the Lost Ark (1981)
Star Wars: Episode V — The Empire Strikes Back…
Shawshank Redemption, The (1994)

While for “Matrix, The (1999)” the top 5 is

Star Wars: Episode IV — A New Hope (1977)
Star Wars: Episode V — The Empire Strikes Back…
Raiders of the Lost Ark (1981)
American Beauty (1999)
Sixth Sense, The (1999)

Some readers might have mixed feelings about those results, which is understandable. Popular movies with high ratings tend to dominate when we use this method, and one million ratings are not really enough to build a recommender. Experience with MovieLens and various recommendation methods has shown that increasing the number of ratings improves recommendations. Still, one should note that we are able to find similar movies with nothing more than a simple graph query. A more sophisticated approach would be to adapt the method described in the next section for finding similar users.

Using Neo4j, we can apply collaborative filtering to recommend movies to a user. At a high level, collaborative filtering recommends new movies to a user in two steps:

1. we find users similar to our user,
2. we use the ratings of the users found in step (1) to suggest new movies.

Calculating user similarity

We are going to use Jaccard similarity to detect similar users. In our graph-theoretic setting, the Jaccard similarity between two nodes is the number of nodes both of them are connected to, divided by the number of nodes at least one of them is connected to (excluding the two nodes whose similarity we are calculating).

In our minimal graph example, user nodes 1 and 2:

are both connected to “Toy Story (1995)” and “Jumanji (1995)” (two movies in common)
are, between them, connected to “Toy Story (1995)”, “Jumanji (1995)” and “Waiting to Exhale (1995)” (three movies in total)

Hence Jaccard similarity of users 1 and 2 is 2/3.

Similarly, users 1 and 3:

are both connected to “Toy Story (1995)” (one movie in common)
are, between them, connected to “Toy Story (1995)”, “Jumanji (1995)”, “Waiting to Exhale (1995)” and “GoldenEye (1995)” (four movies in total)

Hence Jaccard similarity of users 1 and 3 is 1/4.
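
The two hand calculations above can be reproduced with a few lines of plain Python. This is only a sketch for checking the arithmetic: the `RATED` mapping is the set of movies each user rated, as inferred from the examples in this section, and the function name `jaccard` is illustrative (it is not the Graph Data Science implementation).

```python
from fractions import Fraction

# Movies rated by each user in the minimal graph (inferred from the text).
RATED = {
    1: {"Toy Story (1995)", "Jumanji (1995)", "Waiting to Exhale (1995)"},
    2: {"Toy Story (1995)", "Jumanji (1995)"},
    3: {"Toy Story (1995)", "GoldenEye (1995)"},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return Fraction(0)  # convention for two empty neighbour sets
    return Fraction(len(a & b), len(union))


print(jaccard(RATED[1], RATED[2]))  # users 1 and 2
print(jaccard(RATED[1], RATED[3]))  # users 1 and 3
```

Using `Fraction` keeps the results exact, so they can be compared directly with the ratios 2/3 and 1/4 given above.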

Neo4j’s Graph Data Science library can calculate Jaccard similarity. First, we need to create a subgraph (or projection, as Neo4j calls it) of the nodes and relationships we want to take into consideration when calculating Jaccard similarity.

# Create projection
create_projection = gds.run_cypher('''
CALL gds.graph.project(
'myGraph',
['User', 'Movie'],
{
RATED: {properties:  'Rating'}
} 
);
''')

Then, we calculate Jaccard similarity and store the results in a pandas dataframe.

# Get user similarity
users_similarity = gds.run_cypher('''
CALL gds.nodeSimilarity.stream('myGraph')
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).id AS UserID1, gds.util.asNode(node2).id AS UserID2, similarity
ORDER BY similarity DESCENDING, UserID1, UserID2
''')

First five rows of the pandas dataframe containing user similarities for our minimal example graph

Finally, we create a new relationship between user nodes that has the calculated similarity between them as an attribute.

# Create SIMILAR relationships
# Keep each pair of users once (UserID1 > UserID2) and load in 10 chunks
pairs = users_similarity.query('UserID1>UserID2')
for i, chunk in enumerate(np.array_split(pairs, 10), start=1):
    print(i)
    create_similar = gds.run_cypher('''
        unwind $data as row
        match (u1:User{id: row.UserID1}), (u2:User{id: row.UserID2})
        merge (u1)-[r:SIMILAR]->(u2)
        set r.Similarity = row.similarity
        return count(*) as create_similar
        ''', params = {'data': chunk.to_dict('records')})

To recommend movies to a user (user1), we score each movie the user has not rated by the weighted average of the ratings given to it by similar users, where the weight is the Jaccard similarity between user1 and each of those users.

The formula for calculating the weighted average rating
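
Read off from the Cypher query used in this section, the weighted average for a target user $u_1$ and a movie $m$ can be written as follows. The notation is an assumption made here for illustration: $S(u_1, m)$ is the set of users similar to $u_1$ who have rated $m$, $J(u_1, u)$ is the Jaccard similarity between the two users, and $r_{u,m}$ is the rating user $u$ gave to $m$.

```latex
\[
\bar{r}(u_1, m) \;=\;
\frac{\sum_{u \in S(u_1, m)} J(u_1, u)\, r_{u,m}}
     {\sum_{u \in S(u_1, m)} J(u_1, u)}
\]
```

When every similar user gave the same rating, the weights cancel and the weighted average equals that rating.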

We also add the logarithm of the number of paths that connect the user (user1) to a movie, because we want to boost movies that are connected to user1 through more than one similar user. The corresponding Cypher query is

# Recommend movies for a user
similar_movies_for_user = gds.run_cypher('''
    MATCH (u1:User)-[r1:SIMILAR]-(u2)-[r2:RATED]-(m:Movie)
    WHERE u1.id = $id
    AND NOT ( (u1)-[]-(m) )
    RETURN m.Title, m.Genres, sum(r1.Similarity*r2.Rating)/sum(r1.Similarity) + log(count(r2)) AS score
    ORDER BY score DESC
    ''', params = {'id': 3})  # matches the UserID stored in the id property

For our minimal graph example, the result for user 3 is:

Jumanji (1995) with a score of 5.69
Waiting to Exhale (1995) with a score of 3.00
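
The two scores above can be verified without a database. The sketch below recomputes them in plain Python for the minimal graph: Jaccard-weighted average rating plus the natural logarithm of the number of contributing users, which is what the Cypher `log` function computes. The graph contents are inferred from the examples in this article, the two ratings of user 3 are assumed (they do not affect user 3's own recommendations), and the function names are illustrative.

```python
import math

# (user, movie) -> rating for the minimal graph (inferred from the article;
# the two ratings of user 3 are ASSUMED values).
RATINGS = {
    (1, "Toy Story (1995)"): 5,
    (1, "Jumanji (1995)"): 5,
    (1, "Waiting to Exhale (1995)"): 3,
    (2, "Toy Story (1995)"): 5,
    (2, "Jumanji (1995)"): 5,
    (3, "Toy Story (1995)"): 4,   # assumed
    (3, "GoldenEye (1995)"): 4,   # assumed
}


def movies_of(user):
    return {m for (u, m) in RATINGS if u == user}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user):
    """Score unseen movies: weighted average rating + log(number of paths)."""
    seen = movies_of(user)
    weighted, weights, paths = {}, {}, {}
    for other in {u for (u, _) in RATINGS} - {user}:
        sim = jaccard(seen, movies_of(other))
        if sim == 0:
            continue  # no SIMILAR relationship between the two users
        for movie in movies_of(other) - seen:
            weighted[movie] = weighted.get(movie, 0.0) + sim * RATINGS[(other, movie)]
            weights[movie] = weights.get(movie, 0.0) + sim
            paths[movie] = paths.get(movie, 0) + 1
    scores = {m: weighted[m] / weights[m] + math.log(paths[m]) for m in weighted}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


for title, score in recommend(3):
    print(f"{title}: {score:.2f}")
```

For “Jumanji (1995)”, both similar users rated the movie 5, so the weighted average is 5 and the boost is log(2), about 0.69; “Waiting to Exhale (1995)” is reached through a single user, so its boost is log(1) = 0.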

Top 10 movie recommendations (left) for a user and top 10 rated movies (right) for a user

Conclusions

Hopefully, this article has demonstrated the benefits of using a graph database to quickly create a recommendation engine. While this is not state-of-the-art, it is easily implemented and maintained. As a bonus, I hope the article has also provided some useful tricks for combining Python with Neo4j.

Citation of dataset used:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872