
Pandas Data Cleaning

In data science and machine learning, one of the most important tasks when working with large amounts of data is data cleaning: the process of removing typing errors, inconsistencies, and missing values from a dataset. In this article, we'll see how to clean a dataset with pandas, step by step.

What is Data Cleaning?

Data cleaning is the process of removing errors and inconsistencies from a dataset. To clean a dataset, we have to make sure that there are no missing values, typing errors, type inconsistencies, duplicate rows, None, NaN or empty values, unnecessary columns, and more. The pandas library offers a powerful toolkit for data cleaning, with a variety of functions that help us detect and fix these problems.
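
As a quick illustration of that toolkit, two built-in pandas methods are usually enough for a first look at any dataset (a minimal sketch, assuming you already have a DataFrame called df loaded):

df.info()       # column names, data types and non-null counts
df.describe()   # basic statistics for the numeric columns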

In the following example, you will see how to clean a mock dataset of information step by step with the pandas library. This dataset is a simulation of a customer list.

Data cleaning tutorial step by step

First, you need to import pandas in Python

import pandas as pd

After that, use the following code to create the users dataset:

df_users = pd.DataFrame({
    "user_id": [234, 235, 236, 237, 237, 238, 239, 240, 241, 242, 242],
    "Name": ["Tom", "Alex--", "..Thomas", "John", "John", "Paul/", "Emma9", "Joy", "Samantha_", "Emily", "Emily"],
    "Last_name": ["Smith", "johnson", "brown", "Davis", "Davis", "None", "wilson", "Thompson", "Lee", "Johnson", "Johnson"],
    "age": [23, 32, 45, 22, 22, 50, 34, 47, 28, 19, 19],
    "Phone": ["555/123/4567", "333-234-5678", "444_456_7890", "111-222-3333", "111-222-3333", None, "333/987/4567", "222/345_987", "(777) 987-6543", "777-888-9999", "777-888-9999"],
    "Email": ["smith@email.com", "johnson@hotmail.com", "brown@email.com", "davis@mail.com", "davis@mail.com", "John@gmail.com", "wilson@mail.com", "thompson@email.com", "lee@email.com", "emily@hotmail.com", "emily@hotmail.com"],
    "Not_Useful_column": [None, None, None, None, None, None, None, None, None, None, None]
})

print(df_users)

The users dataset will be displayed like this:

    user_id       Name  Last_name  age           Phone                Email  Not_Useful_column
0       234        Tom      Smith   23    555/123/4567      smith@email.com               None
1       235     Alex--    johnson   32    333-234-5678  johnson@hotmail.com               None
2       236   ..Thomas      brown   45    444_456_7890      brown@email.com               None
3       237       John      Davis   22    111-222-3333       davis@mail.com               None
4       237       John      Davis   22    111-222-3333       davis@mail.com               None
5       238      Paul/       None   50            None       John@gmail.com               None
6       239      Emma9     wilson   34    333/987/4567      wilson@mail.com               None
7       240        Joy   Thompson   47     222/345_987   thompson@email.com               None
8       241  Samantha_        Lee   28  (777) 987-6543        lee@email.com               None
9       242      Emily    Johnson   19    777-888-9999    emily@hotmail.com               None
10      242      Emily    Johnson   19    777-888-9999    emily@hotmail.com               None

Here we use the pandas DataFrame() function to create a mock dataset with 7 columns and 11 rows. The columns are user_id (the user's unique id), Name, Last_name, age, Phone, Email, and finally a non-useful column called Not_Useful_column, which we will use as an example of how to delete an unnecessary column from a dataset.

As you can see in the example dataset, the data has some inconsistencies: a few unnecessary symbols in the Name column, some values in the Last_name column that are not capitalized, and phone numbers in the Phone column that follow different formats, which makes them difficult to work with.
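
Before fixing anything, it can help to confirm these problems programmatically; the exact checks depend on your data, but as a rough sketch:

print(df_users.duplicated().sum())   # number of fully duplicated rows
print(df_users["Phone"].unique())    # shows the inconsistent phone formats
print(df_users.isna().sum())         # missing values per column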

1. Delete duplicated rows

The first thing to do when cleaning a dataset is to check for and delete duplicated rows. As you can see in the example dataset, the rows at index 3 and 4, as well as at index 9 and 10, have the same values. To delete these duplicated rows, use the following code:

df_users = df_users.drop_duplicates()

df_users

   user_id       Name  Last_name  age           Phone                Email  Not_Useful_column
0      234        Tom      Smith   23    555/123/4567      smith@email.com               None
1      235     Alex--    johnson   32    333-234-5678  johnson@hotmail.com               None
2      236   ..Thomas      brown   45    444_456_7890      brown@email.com               None
3      237       John      Davis   22    111-222-3333       davis@mail.com               None
5      238      Paul/       None   50            None       John@gmail.com               None
6      239      Emma9     wilson   34    333/987/4567      wilson@mail.com               None
7      240        Joy   Thompson   47     222/345_987   thompson@email.com               None
8      241  Samantha_        Lee   28  (777) 987-6543        lee@email.com               None
9      242      Emily    Johnson   19    777-888-9999    emily@hotmail.com               None

As you can see, the drop_duplicates() function removes all the duplicated rows; in our example dataset, it removed the rows at index 4 and 10.
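
Note that drop_duplicates() also accepts optional parameters. For example, if you only wanted to treat rows that share the same user_id as duplicates (an assumption about your data, not something this example requires), you could write something like:

# keep the first occurrence of each user_id and drop the rest
df_users = df_users.drop_duplicates(subset=["user_id"], keep="first")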

2. Delete unnecessary columns

After deleting the duplicated rows, we have to delete all the columns that we are not going to need, in our case the Not_Useful_column column. To delete this column, use the following code:

df_users = df_users.drop(columns="Not_Useful_column")

df_users

   user_id       Name  Last_name  age           Phone                Email
0      234        Tom      Smith   23    555/123/4567      smith@email.com
1      235     Alex--    johnson   32    333-234-5678  johnson@hotmail.com
2      236   ..Thomas      brown   45    444_456_7890      brown@email.com
3      237       John      Davis   22    111-222-3333       davis@mail.com
5      238      Paul/       None   50            None       John@gmail.com
6      239      Emma9     wilson   34    333/987/4567      wilson@mail.com
7      240        Joy   Thompson   47     222/345_987   thompson@email.com
8      241  Samantha_        Lee   28  (777) 987-6543        lee@email.com
9      242      Emily    Johnson   19    777-888-9999    emily@hotmail.com

The drop() method with the columns="Not_Useful_column" parameter removes the Not_Useful_column column from our example dataset. Note that the result must be assigned back to the df_users variable, because drop() returns a new DataFrame instead of modifying the original.
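
If you needed to remove several columns at once, drop() also accepts a list of column names; a small sketch (another_column is just a placeholder here, and errors="ignore" makes pandas skip names that don't exist):

# drop more than one column in a single call
df_users = df_users.drop(columns=["Not_Useful_column", "another_column"], errors="ignore")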

3. Correct syntax errors in columns

Now that we have removed all the duplicated rows and unnecessary columns, we have to check column by column to see if any of them need to be corrected. Some of the values in the Name column have unnecessary symbols at the beginning or end of the name; we can correct these errors with the following code:

1df_users["Name"] = df_users["Name"].str.strip("[-_./0-9]{1}")

df_users

   user_id      Name  Last_name  age           Phone                Email
0      234       Tom      Smith   23    555/123/4567      smith@email.com
1      235      Alex    johnson   32    333-234-5678  johnson@hotmail.com
2      236    Thomas      brown   45    444_456_7890      brown@email.com
3      237      John      Davis   22    111-222-3333       davis@mail.com
5      238      Paul       None   50            None       John@gmail.com
6      239      Emma     wilson   34    333/987/4567      wilson@mail.com
7      240       Joy   Thompson   47     222/345_987   thompson@email.com
8      241  Samantha        Lee   28  (777) 987-6543        lee@email.com
9      242     Emily    Johnson   19    777-888-9999    emily@hotmail.com

The strip() method removes whitespace at the beginning and at the end of a string; if we pass it a string of characters, it removes any of those characters from both ends instead (note that strip() takes a plain set of characters, not a regular expression). In this example, we call the strip() method on the Name column with df_users["Name"].str.strip("-_./0123456789"), passing it the symbols and digits we want to remove, and then assign the cleaned values back to df_users["Name"].
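
If you prefer an explicit regular expression instead of strip()'s character set, a roughly equivalent alternative is str.replace(), which does accept regex patterns:

# remove any leading or trailing -, _, ., / or digit characters
df_users["Name"] = df_users["Name"].str.replace(r"^[-_./0-9]+|[-_./0-9]+$", "", regex=True)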

Now that we have cleaned the Name column, we have to clean the Last_name column. As you can see in the example dataset, some of the values in the Last_name column do not have the first letter capitalized; to fix this, use the following code:

1df_users["Last_name"] = df_users["Last_name"].apply( 2 lambda element: element[0:1].upper() + element[1:] if type(element) == str else element 3)

df_users

   user_id      Name  Last_name  age           Phone                Email
0      234       Tom      Smith   23    555/123/4567      smith@email.com
1      235      Alex    Johnson   32    333-234-5678  johnson@hotmail.com
2      236    Thomas      Brown   45    444_456_7890      brown@email.com
3      237      John      Davis   22    111-222-3333       davis@mail.com
5      238      Paul       None   50            None       John@gmail.com
6      239      Emma     Wilson   34    333/987/4567      wilson@mail.com
7      240       Joy   Thompson   47     222/345_987   thompson@email.com
8      241  Samantha        Lee   28  (777) 987-6543        lee@email.com
9      242     Emily    Johnson   19    777-888-9999    emily@hotmail.com

The apply() function applies a function (here a lambda) to each element of a column. In this example, it is used to capitalize the first letter of each string in the Last_name column with the expression element[0:1].upper() + element[1:]. We also have to check that the current value is actually a string; if it is not, we return it unchanged, which is what the if type(element) == str else element part does.
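
Since every last name in this dataset is a single word, a simpler alternative would be the string accessor method str.capitalize(); keep in mind that it also lowercases everything after the first letter, which may not be what you want for names like "McDonald":

# capitalize the first letter of each last name; missing (NaN) values are left untouched
df_users["Last_name"] = df_users["Last_name"].str.capitalize()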

4. Set a unique pattern for a column

After cleaning the Name and Last_name columns, we have to set a single pattern for the Phone column. As shown in the example dataset, the values in this column follow different formats, and we want a single one for all of them. To do this, use the following code:

1df_users["Phone"] = df_users["Phone"].str.replace(r"[^0-9]", '', regex=True) 2 3df_users["Phone"] = df_users["Phone"].apply( 4 lambda item: f"{item[0:3]}-{item[3:6]}-{item[6:]}" if type(item) == str and len(item) == 10 else None 5)

df_users

   user_id      Name  Last_name  age         Phone                Email
0      234       Tom      Smith   23  555-123-4567      smith@email.com
1      235      Alex    Johnson   32  333-234-5678  johnson@hotmail.com
2      236    Thomas      Brown   45  444-456-7890      brown@email.com
3      237      John      Davis   22  111-222-3333       davis@mail.com
5      238      Paul       None   50          None       John@gmail.com
6      239      Emma     Wilson   34  333-987-4567      wilson@mail.com
7      240       Joy   Thompson   47          None   thompson@email.com
8      241  Samantha        Lee   28  777-987-6543        lee@email.com
9      242     Emily    Johnson   19  777-888-9999    emily@hotmail.com

Here we want to establish the pattern 000-000-0000 for all the values in the Phone column. First, we delete every character that is not a number with the replace() function and the arguments r"[^0-9]", '', regex=True. After that, we apply the pattern to each phone number using a lambda function and the f-string (format string) expression f"{item[0:3]}-{item[3:6]}-{item[6:]}", which embeds expressions inside a string literal. We only do this if the current value is a string with exactly ten digits: at index 5 of the Phone column we have a None value, and at index 7 we have a number with only 9 digits, so neither meets the condition and both are replaced by None, to be removed later.
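
The same formatting step can also be written without a lambda, using str.extract() and capture groups; a sketch of that variant (it produces NaN for anything that isn't exactly ten digits, just as the version above produces None):

# keep only the digits, then split well-formed 10-digit numbers into three groups
digits = df_users["Phone"].str.replace(r"[^0-9]", "", regex=True)
parts = digits.str.extract(r"^(\d{3})(\d{3})(\d{4})$")   # non-matching rows become NaN
df_users["Phone"] = parts[0] + "-" + parts[1] + "-" + parts[2]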

5. Delete rows with None values

Now that we have cleaned all the columns in the dataset, we have to delete the rows that contain None values and reset the index of the rows.

# Delete rows with None values
df_users = df_users.dropna()

# Reset the index of the rows
df_users = df_users.reset_index(drop=True)

df_users

   user_id      Name  Last_name  age         Phone                Email
0      234       Tom      Smith   23  555-123-4567      smith@email.com
1      235      Alex    Johnson   32  333-234-5678  johnson@hotmail.com
2      236    Thomas      Brown   45  444-456-7890      brown@email.com
3      237      John      Davis   22  111-222-3333       davis@mail.com
4      239      Emma     Wilson   34  333-987-4567      wilson@mail.com
5      241  Samantha        Lee   28  777-987-6543        lee@email.com
6      242     Emily    Johnson   19  777-888-9999    emily@hotmail.com

The dropna() function deletes every row that contains a None (or NaN) value, and the reset_index() function with the drop=True parameter resets the row index and discards the old one.
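
dropna() can also be restricted with its subset parameter if only some columns matter to you; a hedged sketch (which columns to require is a decision about your data, not a pandas rule):

# drop only the rows where Phone or Last_name is missing
df_users = df_users.dropna(subset=["Phone", "Last_name"])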

Now we have a clean and consistent dataset to start working with. This was a simple example of how to clean a dataset.

Conclusion

Data cleaning is a very important step before starting to work with a dataset in data science or machine learning; it ensures that the data has no syntax errors, None or NaN values, duplicated rows, unnecessary columns, and so on. A dataset can be cleaned in many more ways than the ones covered in this article, and pandas offers a wide variety of functions that help with this process. If you want to learn more about pandas in Python, I recommend visiting the following article, a pandas tutorial with coding exercises and video tutorials that will help you learn and improve your skills with the pandas library.