In the k-nearest neighbors theory lesson, we also read an introduction to recommender systems. In this guided project, we will learn how to build a simple movie recommender system using the k-nearest neighbors algorithm.
This project contains 2 datasets with different features for the same 5000 movies, so you should merge them.
You will not be forking this time. Please take some time to read these instructions:
Once you are finished creating your movie recommender system, make sure to commit your changes, push to your repository and go to 4Geeks.com to upload the repository link.
Movie recommender system
Can we predict which films will be highly rated, even if they are not a commercial success?
This dataset is a subset of the much larger TMDB movie database and contains only 5000 of its movies.
tmdb_5000_credits zip file to download: https://github.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/blob/main/tmdb_5000_credits.zip
Import the necessary libraries and import the dataset.
Explore the dataset by looking at the first rows and the number of rows and columns.
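As a minimal sketch of the loading and exploring steps, the snippet below reads a CSV with pandas and prints its first rows and its shape. The inline CSV text and its column names are made up so the example runs on its own; in the project you would pass the path of each downloaded file to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for one of the project CSV files.
csv_text = """movie_id,title,overview
1,Movie A,A space adventure.
2,Movie B,A heist goes wrong.
3,Movie C,Two friends travel.
"""

movies = pd.read_csv(io.StringIO(csv_text))

print(movies.head())  # first rows (up to five)
print(movies.shape)   # (number of rows, number of columns)
```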
Merge both dataframes on the 'title' column.
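The merge step could look like the sketch below. The two toy dataframes and every column other than 'title' are illustrative assumptions, not the real dataset contents.

```python
import pandas as pd

# Toy frames standing in for the two datasets (contents are made up).
movies = pd.DataFrame({"title": ["Movie A", "Movie B"],
                       "genres": ["Action", "Drama"]})
credits = pd.DataFrame({"title": ["Movie A", "Movie B"],
                        "cast": ["Actor X", "Actor Y"]})

# Inner join on the shared 'title' column.
merged = movies.merge(credits, on="title")

print(merged.shape)
print(merged.head())
```

Keep in mind that if two different movies share the same title, a merge on 'title' produces extra rows, so it is worth checking the row count after merging.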
We will work only with the following columns:
As there are only 3 missing values in the 'overview' column, drop those rows.
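One possible way to count and drop the missing overviews is sketched below on a toy dataframe (the rows are made up). Using `subset` restricts the drop to rows where 'overview' itself is missing.

```python
import pandas as pd

# Toy data with one missing overview (values are illustrative).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "overview": ["A space adventure.", None, "Two friends travel."],
})

print(movies.isnull().sum())  # missing values per column

# Drop only the rows whose 'overview' is missing.
movies = movies.dropna(subset=["overview"])
print(movies.shape)
```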
As you can see, some columns are stored in JSON format. With the following code, you can view which genres are included in the first row.
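A sketch of how the first row can be inspected, assuming the dataframe is named `movies` (a hypothetical name) and using a made-up genres string in the same list-of-dictionaries shape:

```python
import pandas as pd

# One toy row; the JSON string imitates the shape of the 'genres' column.
movies = pd.DataFrame({
    "title": ["Movie A"],
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'],
})

# The value is still a plain string, not a Python list.
first_genres = movies.iloc[0]["genres"]
print(first_genres)
print(type(first_genres))
```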
We will start converting these columns with a function that extracts only the genre names and discards the JSON structure. We are only interested in the values of the 'name' keys.
Repeat the process for the 'keywords' column.
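A minimal sketch of such a conversion function follows. The name `names_only` and the sample strings are assumptions for illustration; it parses each string with `json.loads` and keeps only the 'name' values.

```python
import json

import pandas as pd


def names_only(text):
    """Return the 'name' value of every dictionary in a JSON list string."""
    return [item["name"] for item in json.loads(text)]


# Toy row imitating the JSON-formatted columns (values are made up).
movies = pd.DataFrame({
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'],
    "keywords": ['[{"id": 1, "name": "space war"}]'],
})

# Same function for both columns.
movies["genres"] = movies["genres"].apply(names_only)
movies["keywords"] = movies["keywords"].apply(names_only)

print(movies.loc[0, "genres"])
print(movies.loc[0, "keywords"])
```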
For the 'cast' column we will create a new but similar function. This time we will limit the number of items to three.
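The three-item limit can be sketched by slicing the parsed list before extracting the names. The function name `first_three_names` and the sample cast string are hypothetical.

```python
import json


def first_three_names(text):
    """Return the 'name' values of the first three entries in a JSON list string."""
    return [item["name"] for item in json.loads(text)[:3]]


# Made-up cast string with four entries.
cast_json = ('[{"name": "Actor A"}, {"name": "Actor B"}, '
             '{"name": "Actor C"}, {"name": "Actor D"}]')

top_cast = first_three_names(cast_json)
print(top_cast)
```

On the dataframe, this would be applied with something like `movies['cast'].apply(first_three_names)`.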
You can see how our dataset is coming along:
The only columns left to modify are 'crew' and 'overview'. For 'crew', we will create a new function that returns only the values of the 'name' keys for entries whose 'job' value is 'Director'. In short, we want the name of the director.
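A possible shape for the director function is sketched below. The name `director_names` and the sample crew string are illustrative assumptions.

```python
import json


def director_names(text):
    """Return the names of crew members whose 'job' is 'Director'."""
    return [member["name"]
            for member in json.loads(text)
            if member.get("job") == "Director"]


# Made-up crew string with one director.
crew_json = ('[{"name": "Person A", "job": "Editor"}, '
             '{"name": "Person B", "job": "Director"}]')

directors = director_names(crew_json)
print(directors)
```

Returning a list (rather than a single string) keeps the 'crew' column in the same format as 'genres', 'keywords' and 'cast'.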
Finally, let's look at the first row of the 'overview' column:
For the 'overview' column, we will convert it into a list of words using the 'split()' method.
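As a small sketch on a made-up overview, `split()` with no arguments breaks the text on whitespace:

```python
import pandas as pd

# Toy overview text (illustrative only).
movies = pd.DataFrame({"overview": ["Two friends travel across the country."]})

# Turn each overview string into a list of words.
movies["overview"] = movies["overview"].apply(lambda text: text.split())

print(movies.loc[0, "overview"])
```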
So that the recommender system does not get confused, for example between 'Jennifer Aniston' and 'Jennifer Connelly', we will remove the spaces between words with a function. Each full name then becomes a single token.
Now let's apply our function to the 'genres', 'cast', 'crew' and 'keywords' columns.
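One way to write and apply the space-removing function is sketched here. The name `remove_spaces` and the toy row are assumptions; each column is expected to already hold lists of strings.

```python
import pandas as pd


def remove_spaces(items):
    """Remove spaces inside every string of a list."""
    return [item.replace(" ", "") for item in items]


# Toy row whose columns already contain lists (values are illustrative).
movies = pd.DataFrame({
    "genres": [["Science Fiction"]],
    "cast": [["Jennifer Aniston", "Jennifer Connelly"]],
    "crew": [["Person B"]],
    "keywords": [["space war"]],
})

for column in ["genres", "cast", "crew", "keywords"]:
    movies[column] = movies[column].apply(remove_spaces)

print(movies.loc[0, "cast"])
```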
We will reduce our dataset by combining all the previously converted columns into a single new column named 'tags'. At first this column holds one list per movie, with the items separated by commas; applying lambda x: " ".join(x) turns each list into a single space-separated string.
See how it looks now by displaying the first tag:
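The combination step can be sketched as follows, on a toy row whose columns already hold lists. The values are made up, and adding two list columns with `+` concatenates the lists row by row.

```python
import pandas as pd

# Toy row with already converted columns (values are illustrative).
movies = pd.DataFrame({
    "title": ["Movie A"],
    "overview": [["A", "space", "adventure."]],
    "genres": [["Action"]],
    "keywords": [["spacewar"]],
    "cast": [["ActorA"]],
    "crew": [["PersonB"]],
})

# Concatenate the lists of each row into one list.
movies["tags"] = (movies["overview"] + movies["genres"] + movies["keywords"]
                  + movies["cast"] + movies["crew"])

# Join every list into a single space-separated string.
movies["tags"] = movies["tags"].apply(lambda x: " ".join(x))

print(movies.loc[0, "tags"])
```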
We will use the KNN algorithm to build the recommender system. Before building the model, let's vectorize the text, as you already learned in the NLP lesson.
If you wish to see the 5000 most frequently used words, you can use cv.get_feature_names_out() (the older cv.get_feature_names() was removed in scikit-learn 1.2).
Let's compute the cosine_similarity between the movies. Go ahead and run the following code lines in your project to see the results.
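As a sketch, `cosine_similarity` from scikit-learn takes the matrix of movie vectors and returns a square matrix in which entry (i, j) is the similarity between movie i and movie j. The small count vectors below are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy word-count vectors for three movies (values are illustrative).
vectors = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
])

similarity = cosine_similarity(vectors)

print(similarity.shape)  # one row and one column per movie
print(similarity[0])     # similarity of the first movie to every movie
```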
Finally, create a recommendation function based on the cosine_similarity. This function should recommend the 5 most similar movies.
Check your recommender system by entering a movie title. Run it to see the recommendations.
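One possible recommendation function is sketched below. The name `recommend`, its parameters, the toy titles and the hand-built similarity matrix are all assumptions for illustration; in the project, the matrix would come from cosine_similarity and the titles from your dataframe.

```python
import numpy as np
import pandas as pd

# Toy titles and a made-up similarity matrix: movies with closer
# indices are more similar (1.0 on the diagonal).
titles = pd.Series([f"Movie {i}" for i in range(7)])
idx = np.arange(7)
similarity = 1 - np.abs(idx[:, None] - idx[None, :]) / 10


def recommend(title, titles, similarity, n=5):
    """Return the n titles most similar to the given title."""
    matches = titles[titles == title].index
    if len(matches) == 0:
        raise ValueError(f"Unknown movie: {title}")
    movie_idx = matches[0]
    # Sort all movies by similarity, highest first, and skip the movie itself.
    order = np.argsort(similarity[movie_idx])[::-1]
    picks = [i for i in order if i != movie_idx][:n]
    return [titles.iloc[i] for i in picks]


recommendations = recommend("Movie 0", titles, similarity)
for movie in recommendations:
    print(movie)
```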
As always, use your notebook to experiment and make sure you are getting the results you want.
Use your app.py file to save your defined steps, pipelines or functions in the right order.
In your README file write a brief summary.