4Geeks Coding Projects tutorials and exercises for people learning to code or improving their coding skills
Difficulty: beginner
Average duration: 2 hrs
When reading the k-nearest neighbors theory lesson, we also read an introduction to recommender systems. In this guided project, we will learn how to build a simple movie recommender system using the k-nearest neighbors algorithm.
This project contains 2 datasets with different features for the same 5000 movies, so you should merge them.
You will not be forking this time; please take some time to read these instructions:
Once you are finished creating your movie recommender system, make sure to commit your changes, push to your repository and go to 4Geeks.com to upload the repository link.
Movie recommender system
Can we predict which films will be highly rated, even if they are not a commercial success?
This dataset is a subset of the much larger TMDB movie database and contains only 5000 of its movies.
Dataset links:
tmdb_5000_movies: https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv
tmdb_5000_credits zip file to download: https://github.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/blob/main/tmdb_5000_credits.zip
Step 1:
Import the necessary libraries and import the dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

movies = pd.read_csv('../data/tmdb_5000_movies.csv')
credits = pd.read_csv('../data/tmdb_5000_credits.csv')
Step 2:
Explore the dataset by looking at the first rows and the number of rows and columns.
movies.head()

movies.shape

credits.head()

credits.shape
Step 3:
Merge both dataframes on the 'title' column.
movies = movies.merge(credits, on='title')
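Merging on 'title' assumes that each title appears only once in both files. The following is a minimal sketch with two tiny made-up frames (the rows and the names movies_demo and credits_demo are hypothetical, not taken from the dataset) showing how a repeated title multiplies rows, and how comparing the two id columns can reveal it.

```python
import pandas as pd

# Hypothetical toy data: "Batman" appears twice in each frame.
movies_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Avatar", "Batman", "Batman"],
})
credits_demo = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Avatar", "Batman", "Batman"],
    "cast": ["[]", "[]", "[]"],
})

merged_demo = movies_demo.merge(credits_demo, on="title")

# 1 row for Avatar plus 2 x 2 rows for the duplicated title.
print(merged_demo.shape)

# Rows where the two id columns disagree are spurious pairings.
mismatched = merged_demo[merged_demo["id"] != merged_demo["movie_id"]]
print(len(mismatched))
```

If the merged frame in your notebook has more rows than either input, duplicate titles are a likely cause.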
Step 4:
We will work only with the following columns:
-movie_id
-title
-overview
-genres
-keywords
-cast
-crew
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]
Step 5:
As there are only 3 missing values in the 'overview' column, drop them.
movies.isnull().sum()

movies.dropna(inplace = True)
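As a small illustration of this step, the sketch below uses a made-up three-row frame (the name demo and its values are hypothetical). It drops only the rows with a missing 'overview' and renumbers the remaining rows, which keeps index labels equal to row positions; that can matter when rows are later looked up by position.

```python
import pandas as pd

# Hypothetical toy frame with one missing overview.
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["first plot", None, "third plot"],
})

# Missing values per column.
missing = demo.isnull().sum()
print(missing)

# Drop rows whose 'overview' is missing, then renumber the rows from 0.
clean = demo.dropna(subset=["overview"]).reset_index(drop=True)
print(clean.shape)
```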
Step 6:
As you can see, there are some columns in json format. With the following code, you can view what genres are included in the first row.
movies.iloc[0].genres

>>>>[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]
We will start by converting these columns with a function that extracts only the genre names and drops the JSON structure. We are only interested in the values of the 'name' keys.
import ast

def convert(obj):
    L = []
    # Parse the stringified list of dicts and keep each 'name' value
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L
movies.dropna(inplace = True)

movies['genres'] = movies['genres'].apply(convert)
movies.head()
Repeat the process for the 'keywords' column.
movies['keywords'] = movies['keywords'].apply(convert)
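To see what the conversion does on a single value, here is a minimal sketch that applies the same idea to the genres string shown above. The function name names_from is hypothetical; it is a list-comprehension equivalent of convert.

```python
import ast

def names_from(obj):
    """Return the 'name' values from a stringified list of dicts."""
    return [item["name"] for item in ast.literal_eval(obj)]

raw = ('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
       '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

genres = names_from(raw)
print(genres)
```

Note that ast.literal_eval parses Python literals, so it would fail on JSON-only values such as null, true or false; in that case json.loads would be the safer parser.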
For the 'cast' column we will create a new but similar function. This time we will limit the number of items to three.
def convert3(obj):
    L = []
    count = 0
    for i in ast.literal_eval(obj):
        # Keep only the first three names
        if count < 3:
            L.append(i['name'])
        count += 1
    return L
movies['cast'] = movies['cast'].apply(convert3)
You can see how our dataset is coming along:
movies.head(1)
The only columns left to modify are 'crew' and 'overview'. For 'crew', we will create a new function that returns only the 'name' value of the entry whose 'job' value is 'Director'. In short, we are extracting the director's name.
def fetch_director(obj):
    L = []
    for i in ast.literal_eval(obj):
        # Keep only the first crew member listed as director
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L
movies['crew'] = movies['crew'].apply(fetch_director)
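The director lookup can be tried on a single value as follows. This is only a sketch: the crew string and the names in it are invented, and first_director is a hypothetical variant of fetch_director that also shows what comes back when no director is listed.

```python
import ast

def first_director(obj):
    """Return a one-item list with the first crew member whose job is 'Director'."""
    for member in ast.literal_eval(obj):
        if member.get("job") == "Director":
            return [member["name"]]
    return []  # no director listed

# Hypothetical crew string shaped like the dataset's 'crew' column.
crew_raw = ('[{"job": "Editor", "name": "Jane Doe"}, '
            '{"job": "Director", "name": "John Roe"}, '
            '{"job": "Director", "name": "Ann Poe"}]')

director = first_director(crew_raw)
empty = first_director("[]")
print(director, empty)
```

Returning a list rather than a plain string keeps the column in the same shape as 'genres', 'keywords' and 'cast', so the lists can be concatenated later.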
Finally, let's look at the first row of the 'overview' column:
movies.overview[0]
For the 'overview' column, we will convert each value into a list of words using the split() method.
movies['overview'] = movies['overview'].apply(lambda x : x.split())
Step 7:
So that the recommender system does not confuse, for example, 'Jennifer Aniston' with 'Jennifer Connelly', we will remove the spaces inside each name with a function.
def collapse(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1
Now let's apply our function to the 'genres', 'cast', 'crew', and 'keywords' columns.
movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)
movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)
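The reason for removing the spaces becomes clearer with a small sketch. Once the text is split into words, two different actors who share a first name would share a token; after collapsing, each full name is its own token. The variables below are illustrative only.

```python
def collapse(names):
    """Remove spaces inside each name so it becomes a single token."""
    return [name.replace(" ", "") for name in names]

cast_a = ["Jennifer Aniston"]
cast_b = ["Jennifer Connelly"]

# Without collapsing, splitting on whitespace yields a shared token.
tokens_a = set(" ".join(cast_a).split())
tokens_b = set(" ".join(cast_b).split())
shared_before = tokens_a & tokens_b

# After collapsing, each full name is one token and nothing overlaps.
shared_after = set(collapse(cast_a)) & set(collapse(cast_b))

print(shared_before)
print(shared_after)
```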
Step 8:
We will reduce our dataset by combining all the previously converted columns into a single new column named 'tags'. At first each value in this column is one list holding ALL the items (displayed separated by commas); we then turn it into a single space-separated string using lambda x :" ".join(x).
movies['tags'] = movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']
# .copy() gives an independent frame and avoids pandas' SettingWithCopyWarning
new_df = movies[['movie_id','title','tags']].copy()

new_df['tags'] = new_df['tags'].apply(lambda x :" ".join(x))
See how it looks now by showing the first tag:
new_df['tags'][0]

>>>>'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'
Step 9:
We will use the KNN algorithm to build the recommender system. Before fitting the model, let's vectorize the text, which you already learned about in the NLP lesson.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000 ,stop_words='english')

vectors = cv.fit_transform(new_df['tags']).toarray()

vectors.shape
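If you wish to see the (up to) 5000 most frequent words kept by the vectorizer, you can use cv.get_feature_names_out(). In scikit-learn versions before 1.0 the method is called cv.get_feature_names().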
Step 10:
Let's find the cosine_similarity among the movies. Go ahead and run the following code lines in your project to see the results.
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(vectors).shape

similarity = cosine_similarity(vectors)

similarity[0]

sorted(list(enumerate(similarity[0])),reverse =True , key = lambda x:x[1])[1:6]
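For intuition about what cosine_similarity computes, the sketch below implements the formula by hand in plain Python on three made-up word-count vectors, then ranks them the same way as the sorted(...) call above. It is an illustration of the formula, not a replacement for the scikit-learn function.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical word-count vectors for three movies.
vecs = [
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [2, 0, 1, 0, 0],
]

# Similarity of movie 0 to every movie, itself included.
scores = [cosine(vecs[0], v) for v in vecs]

# Sort by score, highest first; position 0 is the movie itself.
ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
print(ranked[1:])
```

The slice skips the first entry because a movie is always most similar to itself, which is why the tutorial code uses [1:6] to get five recommendations.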
Step 11:
Finally, create a recommendation function based on the cosine_similarity. This function should recommend the 5 most similar movies.
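def recommend(movie):
    # Use the row position, not the index label: rows were dropped earlier,
    # so labels may no longer match positions in the similarity matrix
    matches = np.flatnonzero(new_df['title'].to_numpy() == movie)
    if matches.size == 0:
        print(f"Movie not found: {movie}")
        return
    movie_index = matches[0]
    distances = similarity[movie_index]
    # Skip the first result, which is the movie itself
    movie_list = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:6]

    for i in movie_list:
        print(new_df.iloc[i[0]].title)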
Step 12:
Check your recommender system by passing it a movie title. Run the cell to see the recommendations.
recommend('choose a movie here')
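The steps above rank movies by sorting a precomputed cosine similarity matrix. The same neighbours can also be obtained with scikit-learn's NearestNeighbors using metric="cosine", which is closer to the KNN framing of this project. The following is only a sketch on toy data (the titles and vectors are invented), not the method used in the solution guide.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical titles and word-count vectors standing in for new_df and vectors.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
X = np.array([
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [2, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
])

# Cosine distance is 1 - cosine similarity; brute-force search supports it.
knn = NearestNeighbors(n_neighbors=3, metric="cosine", algorithm="brute")
knn.fit(X)

# Query with the first movie; the nearest hit is the movie itself.
distances, indices = knn.kneighbors(X[[0]])
neighbours = [titles[i] for i in indices[0][1:]]
print(neighbours)
```

With this approach there is no need to hold the full movie-by-movie similarity matrix in memory, since neighbours are computed per query.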
Step 13:
As always, use your notebook to experiment and make sure you are getting the results you want.
Use your app.py file to save your defined steps, pipelines or functions in the right order.
In your README file, write a brief summary.
Solution guide:
https://github.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/blob/main/solution_guide.ipynb