
NLP Project Tutorial

Goal

4Geeks Coding Projects tutorials and exercises for people learning to code or improving their coding skills

Difficulty

beginner

Repository

Click to open

Video

Not available

Live demo

Not available

Average duration

2 hrs

Technologies

  • In our previous NLP exploration notebook, we built an email spam detector using Natural Language Processing techniques and the Support Vector Machine (SVM) algorithm for classification.

  • In this project, we will build a spam detector again, but this time using URLs instead of emails.

🌱 How to start this project

You will not be forking this time. Please take some time to read these instructions:

  1. Create a new repository based on the machine learning project by clicking here.
  2. Open the newly created repository on Gitpod by using the Gitpod button extension.
  3. Once Gitpod VSCode has finished opening, start your project by following the Instructions below.

🚛 How to deliver this project

Once you have finished creating your URL spam classifier, make sure to commit your changes, push them to your repository, and go to 4Geeks.com to upload the repository link.

📝 Instructions

URL Spam detector

We will use a URL dataset, which you can find at the following link: https://raw.githubusercontent.com/4GeeksAcademy/NLP-project-tutorial/main/url_spam.csv

Step 1:

Load your dataset and do the necessary transformations on your target variable.
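
As a rough illustration of this step, here is a minimal sketch that removes duplicate rows and encodes the target as 0/1. The column names `url` and `is_spam` come from the snippet in Step 2; the assumption that `is_spam` holds True/False values, the helper name `prepare_target`, and the sample rows are hypothetical, so check them against the real CSV.

```python
import pandas as pd

DATA_URL = "https://raw.githubusercontent.com/4GeeksAcademy/NLP-project-tutorial/main/url_spam.csv"


def prepare_target(df: pd.DataFrame, target: str = "is_spam") -> pd.DataFrame:
    """Drop duplicate rows and encode the target column as 0/1 integers."""
    out = df.drop_duplicates().reset_index(drop=True)
    # Assumption: the target holds booleans or their string forms.
    # Any other value would become NaN and make astype(int) fail loudly.
    mapping = {"true": 1, "false": 0, "1": 1, "0": 0}
    out[target] = out[target].astype(str).str.strip().str.lower().map(mapping).astype(int)
    return out


# Made-up rows standing in for the CSV so the sketch runs offline.
sample = pd.DataFrame({
    "url": [
        "https://example.com/news",
        "http://promo.example.net/subscribe?id=1",
        "https://example.com/news",  # exact duplicate of the first row
    ],
    "is_spam": [False, True, False],
})
clean = prepare_target(sample)
print(clean)

# With network access, the real data could be loaded the same way:
# df = prepare_target(pd.read_csv(DATA_URL))
```

Encoding the target as integers up front keeps the later train/test split and the classifier free of type surprises.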

Step 2:

Use NLP techniques to preprocess the data. As an additional idea, you can create new columns that flag specific words and characters in each URL:

    from sklearn.model_selection import train_test_split

    # df is the DataFrame loaded in Step 1, with a 'url' column and an 'is_spam' target.
    df['len_url'] = df['url'].apply(len)
    df['contains_subscribe'] = df['url'].apply(lambda x: 1 if "subscribe" in x else 0)
    df['contains_hash'] = df['url'].apply(lambda x: 1 if "#" in x else 0)
    df['num_digits'] = df['url'].apply(lambda x: sum(c.isdigit() for c in x))
    # 1 when the URL does not contain "https".
    df['non_https'] = df['url'].apply(lambda x: 0 if "https" in x else 1)
    # Counts the segments separated by "/", a rough proxy for the number of words.
    df['num_words'] = df['url'].apply(lambda x: len(x.split("/")))

    target = 'is_spam'
    features = [f for f in df.columns if f not in ["url", target]]
    X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=0)
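
The feature columns above are hand-crafted. For the text side of this step, one possible approach is to split each URL into tokens, drop tokens that carry little signal, and vectorize the rest with TF-IDF. The sketch below is only an illustration: the `URL_STOPWORDS` set, the `tokenize_url` helper, and the sample URLs are assumptions, not part of the project.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: tokens that appear in almost every URL and add little signal.
URL_STOPWORDS = {"http", "https", "www", "com"}


def tokenize_url(url: str) -> list:
    """Lowercase a URL, split it on non-alphanumeric characters, and filter tokens."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    # Drop empty or single-character tokens and the URL stop words.
    return [t for t in tokens if len(t) > 1 and t not in URL_STOPWORDS]


# Made-up URLs for illustration only.
urls = [
    "https://www.example.com/subscribe?id=42",
    "https://example.org/blog/python-tips",
    "http://promo.example.net/free-offer",
]

# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = TfidfVectorizer(tokenizer=tokenize_url, lowercase=False, token_pattern=None)
X_text = vectorizer.fit_transform(urls)

print(tokenize_url(urls[0]))
print(X_text.shape)
```

The resulting sparse matrix can be used on its own or combined with the numeric columns, depending on what works best in your experiments.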

Step 3:

Use a Support Vector Machine to build a URL spam classifier.
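
A minimal sketch of this step, assuming scikit-learn's `SVC` and a handful of numeric columns like the ones suggested in Step 2. The sample rows, the scaling step, the RBF kernel, and `C=1.0` are illustrative choices rather than the project's prescribed setup.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up rows; replace with the real DataFrame from Step 1.
df = pd.DataFrame({
    "url": [
        "https://example.com/news/today",
        "http://win-prizes.example.net/subscribe?id=12345",
        "https://example.org/blog/python-tips",
        "http://free-offer.example.net/subscribe/777",
        "https://example.com/docs/getting-started",
        "http://cheap-deals.example.net/click?ref=9981",
        "https://example.org/about",
        "http://promo.example.net/subscribe?code=4242",
    ],
    "is_spam": [0, 1, 0, 1, 0, 1, 0, 1],
})

# A few of the numeric features described in Step 2.
df["len_url"] = df["url"].str.len()
df["num_digits"] = df["url"].apply(lambda u: sum(c.isdigit() for c in u))
df["contains_subscribe"] = df["url"].str.contains("subscribe").astype(int)

features = ["len_url", "num_digits", "contains_subscribe"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["is_spam"], test_size=0.25, random_state=0, stratify=df["is_spam"]
)

# SVMs are sensitive to feature scale, so the features are standardized first.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred)
```

On the real dataset, it is worth comparing kernels and values of `C` before settling on a final model.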

Step 4:

As always, use your notebook to experiment and make sure you are getting the results you want.

Use your app.py file to save your defined steps, pipelines or functions in the right order.
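
One hypothetical way to lay out app.py is to keep each step in its own function and call them in order from `main()`. The function names, the TF-IDF plus linear SVM pipeline, and the sample rows below are assumptions for illustration; use whatever steps your notebook experiments settled on.

```python
"""app.py: hypothetical layout for the URL spam classifier steps."""
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_pipeline() -> Pipeline:
    """Vectorize the raw URL text, then classify it with a linear SVM."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("svm", SVC(kernel="linear")),
    ])


def train(df: pd.DataFrame, target: str = "is_spam") -> Pipeline:
    """Fit the pipeline on the 'url' column against the target column."""
    pipeline = build_pipeline()
    pipeline.fit(df["url"], df[target])
    return pipeline


def main() -> Pipeline:
    # Made-up rows; in the real project this would be the loaded dataset.
    df = pd.DataFrame({
        "url": [
            "https://example.com/news/today",
            "http://promo.example.net/subscribe?code=4242",
            "https://example.org/blog/python-tips",
            "http://free-offer.example.net/subscribe/777",
        ],
        "is_spam": [0, 1, 0, 1],
    })
    return train(df)


if __name__ == "__main__":
    fitted = main()
    print(fitted.predict(["http://promo.example.net/subscribe?code=1"]))
```

Keeping the steps in small functions makes it easier to move code from the notebook into the script without changing the order in which it runs.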

In your README file write a brief summary.

Solution guide:

https://github.com/4GeeksAcademy/NLP-project-tutorial/blob/main/solution_guide.ipynb
