lesson 3.13.3 cleaning the data

Handling Missing Values and Inconsistent Formatting

Just like a detective solving a mystery with smudged or incomplete clues, a Data Scientist must often work with real-world data that is messy, incomplete, or contains errors. Before they can uncover the amazing stories hidden within a dataset, they first have to clean it up. In this lesson, you will become a Data Detective, learning how to spot and fix these messy clues—such as missing information, typos, and odd data points. You will learn the spreadsheet skills needed to clean up a dataset, ensuring it's perfect for analysis, which is a vital first step on the Data Science & Analytics pathway.

Learning Outcomes

The Building Blocks (Factual Knowledge)

Recall that real-world data is often "messy".

Describe what is meant by missing values, inconsistent formatting, and outliers in a dataset.

The Connections and Theories (Conceptual Knowledge)

Explain why data cleaning is a crucial step for ensuring accurate analysis.

Analyse how incomplete or biased training data can lead to unfair outcomes in an AI system.

The Skills and Methods (Procedural Outcomes)

Apply spreadsheet tools like sorting and filtering to identify inconsistencies and outliers in a dataset.

Use spreadsheet tools like 'Find and Replace' to correct inconsistent data.

Digital Skill Focus: Sorting, filtering, find and replace using spreadsheets.

In our last lesson, we saw how to import a huge dataset from a CSV file. But here's a secret that all Data Scientists know: real-world data is almost never clean and perfect! It's often messy, with missing pieces, spelling mistakes, and strange entries.

This is where data cleansing comes in. It's the process of finding and fixing errors in a dataset to ensure the data is accurate and consistent. This is probably the most important step in the entire Data Science Lifecycle, because if you start with bad data, you will always end up with bad conclusions. This is a famous idea in computing called...

"Garbage In, Garbage Out" (GIGO)

The clues to dirty data

There are three main types of messy data you'll need to hunt for:

Missing Values

These are simply empty cells where data should be. Imagine a survey where someone forgot to answer a question. This can mess up calculations like averages.

To find missing values, use the sort function, lowest to highest. Any blank cells will be grouped together either at the top or the bottom of the column.

Inconsistent Formatting

This is when the same piece of data is entered in different ways. For example, in a 'Location' column, you might find "London", "london", and "LDN". A computer treats these as three completely different places, which will ruin any attempt to sort or count the data correctly.

To identify inconsistent formatting, use the filter function. Look for separate categories which should be the same. Use find and replace to replace the 'non-standard'.

Outliers

These are data points that are wildly different from all the others. If you were recording the ages of students in your class and one entry was "150", that would be an outlier! It's likely a typo and would completely skew your results.

To find outliers, use the sort function. Outliers will either be at the top or the bottom of the list depending on whether they are too big or too small.

Task The Cleaner 🧹

Your teacher has shown you how to spot problems in a little tiny dataset but it's way too easy to find inconsistencies in something that small. Let's have a little more challenge...

Get Organised

Download the video games spreadsheet. It should end up in your downloads folder.

Open the spreadsheet. Remember to click Enable Editing on the yellow bar if it appears.

Missing Values

Use the tools you have seen to find the missing values in the dataset. Don't just look for them - try to use the built in tools. Check your answers when you are finished.

Solutions 🤫

Fortnite (Row 8): The Global Sales (millions) cell is empty.

Wii Sports (Row 9): The Release Year cell is empty.

BioShock (Row 17): The Release Year cell is empty.

Uncharted 4: A Thief's End (Row 21): The Global Sales (millions) cell is empty.

League of Legends (Row 51): The Global Sales (millions) cell is empty.

Dota 2 (Row 52): The Global Sales (millions) is 0, which is technically a missing/misleading value as sales aren't tracked this way for a free-to-play game.

Among Us (Row 56): The Global Sales (millions) is 0, another misleading value for a free-to-play game.

Now, fix all the errors.

Inconsistent Formatting

This category covers capitalization, abbreviations, and spelling variations. Use the spreadsheet tools to find the mistakes. Check your answers when you are finished.

Solutions 🤫

Platform Column

N64 (Row 2): Inconsistent with Nintendo 64 used in Row 1.

ps3 (Row 5): Lowercase. Should be PlayStation 3.

pc (Row 8): Lowercase. Should be PC.

switch (Row 12): Lowercase. Should be Nintendo Switch or Switch.

Playstation 4 (Row 22): The 's' in 'station' is lowercase. Inconsistent with other PlayStation entries.

snes (Row 25): Lowercase. Should be SNES.

Playstation (Row 38): The 's' in 'station' is lowercase.

xbox 360 (Row 60): Lowercase. Should be Xbox 360.

Genre Column

sports (Row 9): Lowercase. Should be Sports.

first-person shooter (Row 17): Inconsistent with the abbreviation FPS used elsewhere (e.g., Row 26).

first-person shooter (Row 27): Inconsistent with FPS.

Survival horror (Row 42): 'h' is lowercase. Inconsistent with Survival Horror in Row 40.

action-adventure (Row 44): Lowercase. Inconsistent with Action-Adventure.

Battle royale (Row 55): 'r' is lowercase. Inconsistent with Battle Royale.

Now, fix all the errors.

Outliers & Factual Errors

These are values that are clearly typos or factually incorrect.

Solutions 🤫

Super Mario Bros. (Row 11): The Release Year is 1885. This is a significant outlier and a typo; the correct year is 1985.

Halo: Combat Evolved (Row 59): The release year is correct, but the listed platform is "Xbox". While technically true, for consistency with the rest of the dataset that specifies the generation (e.g. Xbox 360), this could be considered an inconsistency.

Now, fix all the errors.

Outcome: I can identify different types of errors in a simple dataset.

What's the big deal?

Let's think about those missing values for a moment. What if we hadn't found the release year for 'Wii Sports'? If we calculated the average release year of all the games, the missing value would mean our result was wrong.

Now, what if we used this messy data to train an AI? Imagine a company wants to build an AI to predict which new games will be successful. They feed it our dataset. The AI will learn from the data we give it.

Discussion Question: If our data is missing the sales figures for all action games, what will the AI 'learn' about them? How would this create a biased, unfair system for developers who want to make action games?

This is a really important idea called algorithmic bias. If the data we use to teach an AI is incomplete, messy, or unfair, the AI's decisions will also be incomplete, messy, or unfair. Cleaning data isn't just about being tidy; it's about being fair.

Today you have learnt how to identify and fix common errors in real-world datasets, and you've seen why this is a critical step for any data analyst and for building fair AI systems.