lesson 3-13-2 getting the data
Importing a Large CSV File for Analysis
Welcome, Data Detectives! 🕵️♀️ Last lesson, we learned about the mountains of 'Big Data' that exist in the world. But a detective can't solve a case without getting the evidence into the lab! Today, you'll learn a crucial skill for any Data Scientist: how to take a raw data file from the outside world and bring it into your 'digital laboratory' — a spreadsheet. We'll be working with a special type of file called a CSV, which is how professionals share huge amounts of data. Let's get our first case file open and see what clues we can find!
Learning Outcomes
The Building Blocks (Factual Knowledge)
The Connections and Theories (Conceptual Knowledge)
The Skills and Methods (Procedural Outcomes)
Recall that structured data has a pre-defined format.
Describe a CSV (Comma-Separated Values) file as a plain text format for storing table-like data.
Describe a delimiter as a character used to separate data fields.
The Connections and Theories (Conceptual Knowledge)
Explain why the CSV format is useful for exchanging data between different software applications.
Explain where "Getting the Data" fits within the Data Science Lifecycle (the 'Acquire' or 'Collect' stage).
Digital Skill Focus: Data Import
The Skills and Methods (Procedural Outcomes)
Apply the process of importing data from a CSV file into a spreadsheet.
Apply simple sorting and filtering tools to perform an initial exploration of a large dataset.
Understanding CSVs
In the world of data, evidence doesn't always come in a neat and tidy box. One of the most common ways that data professionals share large amounts of information is by using a CSV file.
CSV stands for Comma-Separated Value and a CSV File is a Comma-Separated Value File. You can tell a file is a CSV file because of it's .csv file extension.
Actually, it's a simple text file, but it has a special power. If you were to open a CSV file in a basic text editor like Notepad, you'd see something like this:
Film,Director,Year,Genre
Inception,Christopher Nolan,2010,Sci-Fi
The Dark Knight,Christopher Nolan,2008,Action
Parasite,Bong Joon Ho,2019,Thriller
As you can see, it's just text where each piece of data is separated by a comma. The character used to separate the data is called a delimiterI have no idea what this means. While commas are the most common, other delimiters like tabs or semicolons can also be used. This simple, structured data format means that almost any data analysis software can understand it, making it the perfect way to share datasets.

Now that we know what a CSV file is, let's get our hands on a real CSV file and import it into our own "data lab".

Task 1 Data Detective: Importing and Querying Your First CSV File
1
Gather your resources
Download this CSV file of the top 1000 films from IMDB. It will very likely end up in your 'Downloads' folder.
DO NOT click the file, either from the download notification or by double clicking it in your documents!
Your computer might default to opening .csv files in Microsoft Excel (or equivalent). Your teacher will show you how to change the default for CSV files so that they open in Notepad/Text Editor by default.
2
Import the data
Open up Microsoft Excel (if it's not already opened).
In the top menu, click on the Data tab.
Find and click the From Text/CSV button.
A window will open. Navigate to your 'Downloads' folder and select the CSV file you just downloaded.
Click Import.
An import preview window will appear. It should have correctly guessed that the delimiter is a comma. Check that the preview looks like a neat table with columns.
Click the Load button. The data will now appear in your spreadsheet!
Your spreadsheet software might also format your data like a table and add filter buttons.
Whoops!!
There may be one error in row 967. Can you figure out what the error is and fix it?
3
Initial Investigation
1
Scroll to the bottom of the dataset. How many rows are there?
2
Click the filter arrow on the 'Release Year' or equivalent date column and choose 'Sort Largest to Smallest' (or 'Sort descending') to find the most recent entries.
3
Let's do some searching!
In how many films has 'Al Pacino' been 'Star1'? (10)
How many Certificate 'A' films are there in the list? (197)
How many films have had more than or equal to 1,500,000 (1.5 million) votes? (12)
Top Tip!
Unless you want to combine filters, make sure you clear the filters from 'Data > Sort & Filter > Clear'. This is a very common mistake with Excel because it's actually quite difficult to see when the filters are active.
4
EXTENSION: Runtime?
There is an issue with the 'runtime' column because it's actually text and not number so we can't apply number filters to it.
1
Insert a column to the right of the 'Runtime' column.
2
Name the column 'Minutes'
3
In cell F2, type the following formula exactly...
=VALUE(SUBSTITUTE(E2," min",""))
4
Drag the formula down to the end of the data (row 1001).
5
Now you should be able to add number filters to column F:
How many films are there which are at least 200 minutes long? (17)
How many films are there which are shorter than 60 minutes long? (1)
Outcome: I have successfully imported a large CSV dataset into my spreadsheet. I have explored the data by sorting it and applying filters to find specific information.

Great. Now that we've managed to import and do some digging around, let's think about why this process is so useful.

Task 2 Debrief
With your shoulder partner, discuss and answer the following questions on this worksheet.
1
Why was it important to check the 'delimiter' in the import options?
2
Why is a CSV file a better format for sharing this data than a PDF or a Word document?
3
What was one interesting pattern or piece of information you discovered when you sorted or filtered the data?
4
This lesson was the 'Acquire' step of the Data Science Lifecycle. What do you think the next step, 'Clean', might involve?
Outcome: I have reflected on the data import process and can explain my initial findings from exploring the dataset with a partner.

Out of Lesson Learning
Last modified: October 2nd, 2025