lesson 3.13.3 cleaning the data
Handling Missing Values and Inconsistent Formatting


Just like a detective solving a mystery with smudged or incomplete clues, a Data Scientist must often work with real-world data that is messy, incomplete, or contains errors. Before they can uncover the amazing stories hidden within a dataset, they first have to clean it up. In this lesson, you will become a Data Detective, learning how to spot and fix these messy clues—such as missing information, typos, and odd data points. You will learn the spreadsheet skills needed to clean up a dataset, ensuring it's perfect for analysis, which is a vital first step on the Data Science & Analytics pathway.
Learning Outcomes
The Building Blocks (Factual Knowledge)
The Connections and Theories (Conceptual Knowledge)
The Skills and Methods (Procedural Outcomes)
Recall that real-world data is often "messy".
Describe what is meant by missing values, inconsistent formatting, and outliers in a dataset.
The Connections and Theories (Conceptual Knowledge)
Explain why data cleaning is a crucial step for ensuring accurate analysis.
Analyse how incomplete or biased training data can lead to unfair outcomes in an AI system.
The Skills and Methods (Procedural Outcomes)
Apply spreadsheet tools like sorting and filtering to identify inconsistencies and outliers in a dataset.
Use spreadsheet tools like 'Find and Replace' to correct inconsistent data.
Digital Skill Focus: Sorting, filtering, find and replace using spreadsheets.
In our last lesson, we saw how to import a huge dataset from a CSV file. But here's a secret that all Data Scientists know: real-world data is almost never clean and perfect! It's often messy, with missing pieces, spelling mistakes, and strange entries.
This is where data cleansing comes in. It's the process of finding and fixing errors in a dataset to ensure the data is accurate and consistent. This is probably the most important step in the entire Data Science Lifecycle, because if you start with bad data, you will always end up with bad conclusions. This is a famous idea in computing called...
"Garbage In, Garbage Out" (GIGO)
The clues to dirty data
There are three main types of messy data you'll need to hunt for:
1
Missing Values
To find missing values, use the sort function, lowest to highest. Any blank cells will be grouped together either at the top or the bottom of the column.
2
Inconsistent Formatting
To identify inconsistent formatting, use the filter function. Look for separate categories which should be the same. Use find and replace to replace the 'non-standard'.
3
Outliers
To find outliers, use the sort function. Outliers will either be at the top or the bottom of the list depending on whether they are too big or too small.

Task The Cleaner 🧹
Your teacher has shown you how to spot problems in a little tiny dataset but it's way too easy to find inconsistencies in something that small. Let's have a little more challenge...
1
Get Organised
Download the video games spreadsheet. It should end up in your downloads folder.
Open the spreadsheet. Remember to click Enable Editing on the yellow bar if it appears.
2
Missing Values
Use the tools you have seen to find the missing values in the dataset. Don't just look for them - try to use the built in tools. Check your answers when you are finished.
Solutions 🤫
Fortnite (Row 8): The Global Sales (millions) cell is empty.
Wii Sports (Row 9): The Release Year cell is empty.
BioShock (Row 17): The Release Year cell is empty.
Uncharted 4: A Thief's End (Row 21): The Global Sales (millions) cell is empty.
League of Legends (Row 51): The Global Sales (millions) cell is empty.
Dota 2 (Row 52): The Global Sales (millions) is 0, which is technically a missing/misleading value as sales aren't tracked this way for a free-to-play game.
Among Us (Row 56): The Global Sales (millions) is 0, another misleading value for a free-to-play game.
Now, fix all the errors.
3
Inconsistent Formatting
This category covers capitalization, abbreviations, and spelling variations. Use the spreadsheet tools to find the mistakes. Check your answers when you are finished.
Solutions 🤫
Platform Column
Genre Column
N64 (Row 2): Inconsistent with Nintendo 64 used in Row 1.
ps3 (Row 5): Lowercase. Should be PlayStation 3.
pc (Row 8): Lowercase. Should be PC.
switch (Row 12): Lowercase. Should be Nintendo Switch or Switch.
Playstation 4 (Row 22): The 's' in 'station' is lowercase. Inconsistent with other PlayStation entries.
snes (Row 25): Lowercase. Should be SNES.
Playstation (Row 38): The 's' in 'station' is lowercase.
xbox 360 (Row 60): Lowercase. Should be Xbox 360.
Genre Column
sports (Row 9): Lowercase. Should be Sports.
first-person shooter (Row 17): Inconsistent with the abbreviation FPS used elsewhere (e.g., Row 26).
first-person shooter (Row 27): Inconsistent with FPS.
Survival horror (Row 42): 'h' is lowercase. Inconsistent with Survival Horror in Row 40.
action-adventure (Row 44): Lowercase. Inconsistent with Action-Adventure.
Battle royale (Row 55): 'r' is lowercase. Inconsistent with Battle Royale.
Now, fix all the errors.
4
Outliers & Factual Errors
These are values that are clearly typos or factually incorrect.
Solutions 🤫
Super Mario Bros. (Row 11): The Release Year is 1885. This is a significant outlier and a typo; the correct year is 1985.
Halo: Combat Evolved (Row 59): The release year is correct, but the listed platform is "Xbox". While technically true, for consistency with the rest of the dataset that specifies the generation (e.g. Xbox 360), this could be considered an inconsistency.
Now, fix all the errors.
Outcome: I can identify different types of errors in a simple dataset.

What's the big deal?
Let's think about those missing values for a moment. What if we hadn't found the release year for 'Wii Sports'? If we calculated the average release year of all the games, the missing value would mean our result was wrong.
Now, what if we used this messy data to train an AI? Imagine a company wants to build an AI to predict which new games will be successful. They feed it our dataset. The AI will learn from the data we give it.
Discussion Question: If our data is missing the sales figures for all action games, what will the AI 'learn' about them? How would this create a biased, unfair system for developers who want to make action games?
This is a really important idea called algorithmic bias. If the data we use to teach an AI is incomplete, messy, or unfair, the AI's decisions will also be incomplete, messy, or unfair. Cleaning data isn't just about being tidy; it's about being fair.
Out of Lesson Learning
Last modified: October 17th, 2025
