K-nearest neighbors Project Tutorial

Difficulty: beginner

Average duration: 2 hrs

  • When reading the k-nearest neighbors theory lesson, we also read an introduction to recommender systems. In this guided project, we will learn how to build a simple movie recommender system using the k-nearest neighbors algorithm.

  • This project contains 2 datasets with different features for the same 5000 movies, so you should merge them.

🌱 How to start this project

You will not be forking this time. Please take some time to read these instructions:

  1. Create a new repository based on the machine learning project by clicking here.
  2. Open the newly created repository on Gitpod using the Gitpod button extension.
  3. Once Gitpod VSCode has finished opening, start your project by following the instructions below.

🚛 How to deliver this project

Once you are finished creating your movie recommender system, make sure to commit your changes, push to your repository and go to 4Geeks.com to upload the repository link.

📝 Instructions

Movie recommender system

Can we predict which films will be highly rated, even if they are not a commercial success?

This dataset is a subset of the much larger TMDB movie database and contains only 5000 of its movies.

Dataset links:

tmdb_5000_movies: https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv

tmdb_5000_credits zip file to download: https://github.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/blob/main/tmdb_5000_credits.zip
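
The credits file is distributed as a zip archive, so it has to be extracted before Step 1 can read it from ../data/. A minimal sketch of reading a CSV out of a zip with Python's standard library follows. The tiny in-memory archive and its two rows are invented stand-ins for the downloaded file, not real data.

```python
import csv
import io
import zipfile

# Build a tiny zip archive in memory as a stand-in for tmdb_5000_credits.zip.
# The rows below are invented for illustration only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "tmdb_5000_credits.csv",
        "movie_id,title\n1,Example Movie A\n2,Example Movie B\n",
    )

# Open the archive and read the CSV member without unpacking it to disk.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    member_names = archive.namelist()
    with archive.open("tmdb_5000_credits.csv") as raw:
        rows = list(csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8")))

print(member_names)
print(rows[0]["title"])
```

With the real archive on disk, zipfile.ZipFile(path).extractall(target_folder) unpacks it. pandas can also read a zip that contains a single CSV directly through pd.read_csv.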

Step 1:

Import the necessary libraries and import the dataset.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    movies = pd.read_csv('../data/tmdb_5000_movies.csv')
    credits = pd.read_csv('../data/tmdb_5000_credits.csv')

Step 2:

Explore the dataset by looking at the first rows and the number of rows and columns.

    movies.head()

    movies.shape

    credits.head()

    credits.shape

Step 3:

Merge both dataframes on the 'title' column.

    movies = movies.merge(credits, on='title')
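
One thing worth checking after a merge on 'title' is the row count: if two different movies share a title, each copy on the left is paired with each copy on the right. The sketch below shows the effect on invented toy rows. The 'id' column name for the movies frame is an assumption here; only 'movie_id' and 'title' appear in this tutorial.

```python
import pandas as pd

# Toy frames with invented rows; two different movies share the title "Remake".
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["Original", "Remake", "Remake"]})
credits_toy = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Original", "Remake", "Remake"]})

# Merging on 'title' pairs every "Remake" on the left with every "Remake" on the right.
merged_on_title = movies_toy.merge(credits_toy, on="title")

# Merging on the id columns keeps one row per movie.
merged_on_id = movies_toy.merge(credits_toy, left_on="id", right_on="movie_id")

print(len(merged_on_title))  # 5 rows: 1 + 2 * 2
print(len(merged_on_id))     # 3 rows
```

Comparing movies.shape before and after the merge is a quick way to see whether this happened in your data.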

Step 4:

We will work only with the following columns:

  • movie_id
  • title
  • overview
  • genres
  • keywords
  • cast
  • crew

    movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

Step 5:

As there are only 3 missing values in the 'overview' column, drop them.

    movies.isnull().sum()

    movies.dropna(inplace = True)
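
Note that dropna() removes rows but keeps the original index labels, so the labels no longer match row positions. This matters whenever row labels are later used as positions in a NumPy array, such as a similarity matrix. A small sketch on invented rows:

```python
import pandas as pd

# Invented rows; the middle one has a missing overview.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["first plot", None, "third plot"],
})

cleaned = df.dropna()
print(list(cleaned.index))     # labels keep their old values: [0, 2]

# reset_index(drop=True) renumbers the rows so label == position again.
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```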

Step 6:

As you can see, some columns are in JSON format. With the following code, you can see which genres are included in the first row.

    movies.iloc[0].genres

    >>>> [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]

We will start converting these columns with a function that extracts only the genre names from the JSON. We are only interested in the values of the 'name' keys.

    import ast

    def convert(obj):
        L = []
        for i in ast.literal_eval(obj):
            L.append(i['name'])
        return L

    movies.dropna(inplace = True)

    movies['genres'] = movies['genres'].apply(convert)
    movies.head()
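
To see what convert does on a single value, here is a small standalone sketch using a shortened copy of the genres string shown above. It also compares ast.literal_eval with json.loads, which is an alternative and not what this tutorial uses.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def convert(obj):
    # Same idea as the function above, written as a list comprehension.
    return [item["name"] for item in ast.literal_eval(obj)]

print(convert(raw))  # ['Action', 'Adventure']

# json.loads gives the same result here and also accepts JSON-only literals
# such as null, true and false, which ast.literal_eval rejects.
print([item["name"] for item in json.loads(raw)])

try:
    ast.literal_eval('[{"name": null}]')
except ValueError as error:
    print("literal_eval failed:", type(error).__name__)
```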

Repeat the process for the 'keywords' column.

    movies['keywords'] = movies['keywords'].apply(convert)

For the 'cast' column we will create a new but similar function. This time we will limit the number of items to three.

    def convert3(obj):
        L = []
        count = 0
        for i in ast.literal_eval(obj):
            if count < 3:
                L.append(i['name'])
            count +=1
        return L

    movies['cast'] = movies['cast'].apply(convert3)
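
The same "first three names" idea can also be written with a slice instead of a counter. The helper name first_three_names and the cast string below are invented for illustration.

```python
import ast

def first_three_names(obj):
    # Hypothetical helper: parse the string, then keep only the first three entries.
    return [item["name"] for item in ast.literal_eval(obj)[:3]]

# Invented cast string with four entries.
raw_cast = (
    '[{"name": "Actor One"}, {"name": "Actor Two"}, '
    '{"name": "Actor Three"}, {"name": "Actor Four"}]'
)
print(first_three_names(raw_cast))  # ['Actor One', 'Actor Two', 'Actor Three']
```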

You can see how our dataset is coming along:

    movies.head(1)

The only columns left to modify are 'crew' and 'overview'. For 'crew', we will create a new function that returns the value of the 'name' key only for the entry whose 'job' value is 'Director'. In short, we are extracting the name of the director.

    def fetch_director(obj):
        L = []
        for i in ast.literal_eval(obj):
            if i['job'] == 'Director':
                L.append(i['name'])
                break
        return L

    movies['crew'] = movies['crew'].apply(fetch_director)
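
A quick way to check fetch_director is to run it on a hand-written crew string. The sketch below uses invented names and an equivalent early-return version of the function. It also shows that only the first director is kept and that a movie without a director yields an empty list.

```python
import ast

def fetch_director(obj):
    # Same logic as above: stop at the first crew entry whose job is 'Director'.
    for member in ast.literal_eval(obj):
        if member["job"] == "Director":
            return [member["name"]]
    return []  # no director listed

# Invented crew string with two directors.
raw_crew = (
    '[{"job": "Editor", "name": "Person A"}, '
    '{"job": "Director", "name": "Person B"}, '
    '{"job": "Director", "name": "Person C"}]'
)
print(fetch_director(raw_crew))  # ['Person B']
print(fetch_director('[]'))      # []
```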

Finally, let's look at the first row of the 'overview' column:

    movies.overview[0]

For the 'overview' column, we will convert it into a list of words by using the 'split()' method.

    movies['overview'] = movies['overview'].apply(lambda x : x.split())

Step 7:

So that the recommender system does not confuse, for example, 'Jennifer Aniston' with 'Jennifer Connelly' (the vectorizer would otherwise count 'Jennifer' as a word the two share), we will remove the spaces between words with a function.

    def collapse(L):
        L1 = []
        for i in L:
            L1.append(i.replace(" ",""))
        return L1

Now let's apply our function to the 'genres', 'cast', 'crew', and 'keywords' columns.

    movies['cast'] = movies['cast'].apply(collapse)
    movies['crew'] = movies['crew'].apply(collapse)
    movies['genres'] = movies['genres'].apply(collapse)
    movies['keywords'] = movies['keywords'].apply(collapse)
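
The effect of collapsing can be checked without any library. In this sketch, tokens is a hypothetical helper that splits on whitespace, which only roughly approximates what a bag-of-words tokenizer does.

```python
def tokens(names):
    # Lowercase and split on whitespace, roughly as a bag-of-words tokenizer would.
    return {word for name in names for word in name.lower().split()}

def collapse(names):
    # Remove the spaces inside each name so it becomes a single token.
    return [name.replace(" ", "") for name in names]

cast_a = ["Jennifer Aniston"]
cast_b = ["Jennifer Connelly"]

print(tokens(cast_a) & tokens(cast_b))                      # {'jennifer'}
print(tokens(collapse(cast_a)) & tokens(collapse(cast_b)))  # set()
```

Before collapsing, the two casts share the token 'jennifer'; afterwards they share nothing.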

Step 8:

We will reduce our dataset by combining all the previously converted columns into a single new column named 'tags'. Because each of these columns holds a list, adding them concatenates the lists. We then turn each list into one space-separated string with lambda x :" ".join(x).

    movies['tags'] = movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']

    # .copy() gives new_df its own data, so the assignment below does not
    # trigger pandas' SettingWithCopyWarning
    new_df = movies[['movie_id','title','tags']].copy()

    new_df['tags'] = new_df['tags'].apply(lambda x :" ".join(x))

See how it looks now by showing the first tag:

    new_df['tags'][0]

    >>>> 'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'
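
The two operations behind the 'tags' column, list concatenation and " ".join, can be tried on plain Python lists. The values below are a shortened, partly invented example rather than a real row.

```python
overview = "A marine is sent to a distant moon.".split()
genres = ["Action", "ScienceFiction"]
cast = ["SamWorthington"]

# '+' on lists concatenates them; " ".join turns the result into one string.
tags = " ".join(overview + genres + cast)
print(tags)
```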

Step 9:

We will use the KNN algorithm to build the recommender system. Before building the model, let's vectorize the text, which you already learned about in the NLP lesson.

    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer(max_features=5000 ,stop_words='english')

    vectors = cv.fit_transform(new_df['tags']).toarray()

    vectors.shape

If you wish to see the 5000 most frequently used words, you can use cv.get_feature_names_out() (cv.get_feature_names() in scikit-learn versions before 1.0).
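
Conceptually, count vectorization builds a vocabulary and then counts how often each vocabulary word occurs in each document. The sketch below does this by hand on two invented documents. It is a simplification: CountVectorizer also lowercases text, removes stop words when asked, keeps only the max_features most frequent words, and returns a sparse matrix.

```python
from collections import Counter

docs = ["action space alien", "romance space space"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, values are counts.
vectors = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

print(vocabulary)  # ['action', 'alien', 'romance', 'space']
print(vectors)     # [[1, 1, 0, 1], [0, 0, 1, 2]]
```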

Step 10:

Let's find the cosine_similarity among the movies. Go ahead and run the following code lines in your project to see the results.

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_similarity(vectors).shape

    similarity = cosine_similarity(vectors)

    similarity[0]

    sorted(list(enumerate(similarity[0])),reverse =True , key = lambda x:x[1])[1:6]
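
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A hand-written sketch on two small invented count vectors shows the arithmetic, and also why the sorted list is sliced with [1:6]: a vector's similarity with itself is 1, so the top entry is normally the movie itself.

```python
import math

def cosine(u, v):
    # dot(u, v) / (|u| * |v|); assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1, 1, 0, 1]
b = [0, 0, 1, 2]

print(round(cosine(a, a), 3))  # 1.0
print(round(cosine(a, b), 3))  # 0.516
```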

Step 11:

Finally, create a recommendation function based on the cosine_similarity. This function should recommend the 5 most similar movies.

    def recommend(movie):
        # Use the row position, not the index label: 'similarity' is indexed by
        # position, and the labels have gaps after dropna()
        movie_index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]
        distances = similarity[movie_index]
        movie_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]

        for i in movie_list:
            print(new_df.iloc[i[0]].title)

Step 12:

Check your recommender system by passing in a movie title. Run the cell to see the recommendations.

    recommend('choose a movie here')
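
The function above ranks movies with a precomputed similarity matrix. As an alternative sketch, not the method used in this tutorial, scikit-learn's NearestNeighbors can do the neighbor search directly with cosine distance. The vectors and titles below are invented toy data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Invented count vectors for four movies (rows) over three words (columns).
toy_vectors = np.array([
    [2, 0, 1],
    [2, 0, 0],
    [0, 3, 0],
    [1, 0, 1],
])
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]

# Cosine distance = 1 - cosine similarity; brute force is fine for small data.
knn = NearestNeighbors(n_neighbors=3, metric="cosine", algorithm="brute")
knn.fit(toy_vectors)

# Ask for the 3 nearest rows to "Movie A"; the first hit is the movie itself.
distances, indices = knn.kneighbors(toy_vectors[[0]])
neighbours = [toy_titles[i] for i in indices[0][1:]]
print(neighbours)  # ['Movie D', 'Movie B']
```

This avoids storing the full movie-by-movie similarity matrix, which can matter for much larger catalogs.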

Step 13:

As always, use your notebook to experiment and make sure you are getting the results you want.

Use your app.py file to save your defined steps, pipelines or functions in the right order.

In your README file, write a brief summary.

Solution guide:

https://github.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/blob/main/solution_guide.ipynb
