Follow the instructions below:
Once you have finished solving the exercises, be sure to commit your changes, push them to your repository, and go to 4Geeks.com to upload the repository link.
Naive Bayes models are very useful when we want to analyze sentiment, classify texts into topics, or make recommendations, since the characteristics of these problems fit the model's theoretical and methodological assumptions very well.
In this project you will practice with a dataset to create a review classifier for the Google Play store.
The dataset can be found in this project folder under the name playstore_reviews.csv. You can load it into your code directly from the link (https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv) or download it and add it by hand to your repository. In this dataset you will find the following variables:
package_name. Name of the mobile application (categorical)
review. Comment about the mobile application (categorical)
polarity. Class variable (0 or 1), where 0 is a negative comment and 1 is a positive one (numeric)
In this case, we have only 3 variables: 2 predictors and a dichotomous label. Of the two predictors, we are really only interested in the comment part, since classifying a comment as positive or negative will depend on its content, not on the application from which it was written. Therefore, the package_name variable should be removed.
When we work with text, as in this case, it does not make sense to do an EDA; the process is different, since the only variable we are interested in is the one containing the text. In other cases, where the text is part of a larger set with other numeric predictor variables and the prediction objective is different, applying an EDA does make sense.
However, we cannot work with plain text; it must first be processed through several preprocessing steps. Once we have finished, we will have the predictors ready to train the model.
Start solving the problem by training a Naive Bayes model, choosing which of the three implementations (GaussianNB, MultinomialNB, or BernoulliNB) to use according to what we have studied in the module. Then train it with the other two implementations and confirm whether the model you chose is the right one.
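Comparing the three implementations can be sketched like this. The tiny texts and labels below are synthetic stand-ins for the review/polarity columns; note that GaussianNB expects a dense array, while the other two accept the sparse count matrix directly. With word-count features, MultinomialNB is usually the natural fit, but the point of the exercise is to verify that empirically.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Tiny synthetic stand-in for the review/polarity columns
texts = ["love this app", "great and useful", "awful crashes",
         "worst app ever", "very helpful tool", "terrible waste of time"]
labels = [1, 1, 0, 0, 1, 0]

X = CountVectorizer().fit_transform(texts)

for Model in (MultinomialNB, BernoulliNB, GaussianNB):
    # GaussianNB needs a dense array; the sparse matrix works for the other two
    features = X.toarray() if Model is GaussianNB else X
    clf = Model().fit(features, labels)
    print(Model.__name__, clf.score(features, labels))
```

In your project you would score each model on a held-out test split rather than on the training data, and keep the implementation with the best test metric.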
After training the model in its three implementations, choose the best one and, if possible, try to optimize its results with a random forest.
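A random forest can be trained on the same vectorized features for comparison. This is a hedged sketch on synthetic data; the hyperparameters shown (n_estimators, random_state) are illustrative defaults, and in the project you would tune them and compare test-set scores against your chosen Naive Bayes model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Same synthetic stand-in data as before
texts = ["love this app", "great and useful", "awful crashes",
         "worst app ever", "very helpful tool", "terrible waste of time"]
labels = [1, 1, 0, 0, 1, 0]

X = CountVectorizer().fit_transform(texts)

# Random forests accept the same sparse count matrix as input
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, labels)
print(forest.score(X, labels))
```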
Store the model in the appropriate folder.
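Saving the trained model can be done with pickle from the standard library. The "models/" folder name below is an assumption; use whatever layout your repository expects.

```python
import pickle
from pathlib import Path

from sklearn.naive_bayes import MultinomialNB

# A trivially trained model standing in for your final classifier
model = MultinomialNB().fit([[1, 0], [0, 1]], [0, 1])

# "models/" is an assumed folder name for this sketch
Path("models").mkdir(exist_ok=True)
with open("models/naive_bayes.pkl", "wb") as f:
    pickle.dump(model, f)

# Later: reload the model and reuse it for predictions
with open("models/naive_bayes.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored.predict([[1, 0]]))
```

Note that to classify new raw text you must also persist the fitted vectorizer (or bundle both in a Pipeline), since the model only understands the count matrix the vectorizer produces.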
Which other models of the ones we have studied could you use to try to beat the results of a Naive Bayes? Argue your choice and train the model.