4Geeks Coding Projects tutorials and exercises for people learning to code or improving their coding skills
Difficulty: beginner
Video: Not available
Live demo: Not available
Average duration: 2 hrs
In our last Exploring NLP notebook, we built an email spam detector using Natural Language Processing (NLP) techniques and the Support Vector Machine (SVM) algorithm for classification.
In this project, we will again build a spam detector but this time using URLs instead of emails.
You will not be forking this time. Please take some time to read these instructions:
Once you are finished creating your URL spam classifier, make sure to commit your changes, push to your repository and go to 4Geeks.com to upload the repository link.
URL Spam detector
We will use a URL dataset, which you can find at the following link: https://raw.githubusercontent.com/4GeeksAcademy/NLP-project-tutorial/main/url_spam.csv
Step 1:
Load your dataset and do the necessary transformations on your target variable.
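A minimal sketch of this step, assuming the target column is named is_spam (as in the Step 2 snippet) and holds booleans or the strings "True"/"False". The helper name encode_target and the sample rows are made up for illustration; check the real values with df['is_spam'].unique() before relying on the mapping.

```python
import pandas as pd

DATA_URL = "https://raw.githubusercontent.com/4GeeksAcademy/NLP-project-tutorial/main/url_spam.csv"


def encode_target(df, target="is_spam"):
    """Return a copy of df with the target column mapped to 0/1 integers."""
    out = df.copy()
    # Assumption: labels are True/False (bool or string); adjust the map otherwise.
    as_text = out[target].astype(str).str.strip().str.lower()
    out[target] = as_text.map({"true": 1, "false": 0})
    return out


# Made-up rows standing in for the real file.
# With network access you would load it with: df = pd.read_csv(DATA_URL)
sample = pd.DataFrame({
    "url": [
        "http://win.example.biz/subscribe",
        "https://www.example.com/news",
        "http://deals.example.info/free#now",
    ],
    "is_spam": [True, False, True],
})
clean = encode_target(sample)
print(clean["is_spam"].tolist())
```

Any label that is not covered by the mapping becomes NaN, which makes unexpected values easy to spot with clean['is_spam'].isna().sum().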
Step 2:
Use NLP techniques to preprocess the data. Here is another idea: besides excluding some words, you can create new feature columns from the URL:
from sklearn.model_selection import train_test_split

# Assumes df is the DataFrame loaded in Step 1, with 'url' and 'is_spam' columns.
df['len_url'] = df['url'].apply(len)
df['contains_subscribe'] = df['url'].apply(lambda x: 1 if "subscribe" in x else 0)
df['contains_hash'] = df['url'].apply(lambda x: 1 if "#" in x else 0)
df['num_digits'] = df['url'].apply(lambda x: sum(ch.isdigit() for ch in x))
# 1 when the URL does not contain "https"
df['non_https'] = df['url'].apply(lambda x: 1 if "https" not in x else 0)
# Number of "/"-separated segments in the URL
df['num_words'] = df['url'].apply(lambda x: len(x.split("/")))

target = 'is_spam'
features = [f for f in df.columns if f not in ["url", target]]
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=0)
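For the word-exclusion side of preprocessing, one possible approach is to split each URL into lowercase tokens and drop the ones that appear in almost every URL. This is only a sketch: the function name tokenize_url and the URL_STOPWORDS set are hypothetical choices, not part of the project, and the example URL is made up.

```python
import re

# Hypothetical stop list: tokens that carry little signal in a URL.
URL_STOPWORDS = {"http", "https", "www", "com"}


def tokenize_url(url):
    """Split a URL on non-alphanumeric characters and drop uninformative tokens."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    # Skip empty strings, single characters and stop words.
    return [t for t in tokens if len(t) > 1 and t not in URL_STOPWORDS]


tokens = tokenize_url("https://www.example.com/news/subscribe?id=42")
print(tokens)  # ['example', 'news', 'subscribe', 'id', '42']
```

Joining the tokens back with spaces (" ".join(tokens)) gives a cleaned text column that a vectorizer can consume.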
Step 3:
Use a Support Vector Machine (SVM) to build a URL spam classifier.
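One way this could look, sketched with scikit-learn: a TF-IDF vectorizer feeding a linear-kernel SVC inside a pipeline. The URLs and labels below are made up so the example runs on its own; the token pattern and the linear kernel are assumptions to experiment with, not the settings of the solution guide.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy data: 0 = not spam, 1 = spam.
urls = [
    "https://www.example.com/news/today",
    "https://www.example.org/about/team",
    "https://docs.example.net/guide/install",
    "http://win-prizes.example.biz/subscribe/free#claim",
    "http://cheap-deals.example.info/subscribe/now123",
    "http://free-gift.example.biz/click/9981#win",
]
labels = [0, 0, 0, 1, 1, 1]

# The vectorizer turns each URL into alphanumeric tokens weighted by TF-IDF;
# the SVC then learns a linear boundary over those weights.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[A-Za-z0-9]+"),
    SVC(kernel="linear", random_state=0),
)
model.fit(urls, labels)

preds = model.predict(urls)
print(preds)
```

On the real dataset you would fit on the training split only and evaluate on the held-out test split.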
Step 4:
As always, use your notebook to experiment and make sure you are getting the results you want.
Use your app.py file to save your defined steps, pipelines or functions in the right order.
In your README file, write a brief summary.
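As a rough idea of how app.py might be organized, the sketch below keeps each step in its own function and calls them in order. The function names (prepare, build_model, run) and the toy rows are hypothetical, and the column names url and is_spam are taken from the Step 2 snippet.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def prepare(df):
    """Step 1: cast the boolean target to 0/1 integers."""
    out = df.copy()
    out["is_spam"] = out["is_spam"].astype(int)
    return out


def build_model():
    """Steps 2 and 3: vectorize the URL text, then classify with an SVM."""
    return make_pipeline(
        TfidfVectorizer(token_pattern=r"[A-Za-z0-9]+"),
        SVC(kernel="linear", random_state=0),
    )


def run(df):
    """Run the steps in order and return the fitted model and test accuracy."""
    df = prepare(df)
    X_train, X_test, y_train, y_test = train_test_split(
        df["url"], df["is_spam"], test_size=0.2, random_state=0, stratify=df["is_spam"]
    )
    model = build_model()
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


# Made-up rows so the sketch runs without downloading the dataset.
toy = pd.DataFrame({
    "url": [
        "https://www.example.com/news/today",
        "https://www.example.org/about/team",
        "https://docs.example.net/guide/install",
        "https://blog.example.com/posts/python",
        "https://shop.example.org/cart/view",
        "http://win-prizes.example.biz/subscribe/free#claim",
        "http://cheap-deals.example.info/subscribe/now123",
        "http://free-gift.example.biz/click/9981#win",
        "http://bonus.example.biz/subscribe/777",
        "http://promo.example.info/free/offer#42",
    ],
    "is_spam": [False] * 5 + [True] * 5,
})
model, acc = run(toy)
print(f"test accuracy: {acc:.2f}")
```

The accuracy on ten toy rows means nothing by itself; the point is only the order of the steps.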
Solution guide:
https://github.com/4GeeksAcademy/NLP-project-tutorial/blob/main/solution_guide.ipynb