CS44 : Saving it for later


Saving data in a file is OK, but you still have to store, retrieve, amend and search the data yourself. Luckily there is a better way, which we've touched on before - the database. Hang on though, what about unstructured data? Oh no, enter BIG DATA!

We are learning ...
  • About practical file handling techniques
  • About data modelling
  • About the concept and features of relational databases
  • How to use SQL to define, retrieve, update, insert and delete data 
  • About the functions of the database management system (DBMS)
So that we can ...
  • Have practical experience handling files; (NEW SECTION : What does data look like?)
    - Describe different access methods for files
    - Handle simple text files
    - Define a file in terms of records and fields (delimited) (NEW SECTION)
    - Distinguish between master and transaction files (NEW SECTION)
    - Describe serial, sequential / indexed sequential and direct (random) file access (NEW SECTION)
    - Experience handling binary files
  • Produce a data model from given data requirements for a simple scenario involving multiple entities
  • Produce entity relationship diagrams for a data model in standard format
    - entity ( PK, FK, NKA, ... )
  • Explain the concept of a relational database
    - Data consistency
    - Data redundancy
    - Data independence
  • Define terms associated with databases
    - Entity
    - Attribute
    - Primary key
    - Composite key
    - Foreign key
    - Multi-level indices
  • Understand why databases are normalised
    - Normalise relations to third normal form
  • Use SQL to :
    - Define relations
    - Retrieve, update, insert and delete data from multiple tables in a relational database
    - Manage issues of referential integrity
  • Describe the function of a database management system
  • Understand the benefits and drawbacks of client server databases
    - Simultaneous access for multiple clients
    - Update anomalies due to concurrent access and methods of resolution (record locks, serialisation, timestamp ordering, commitment ordering)
    - ACID (Atomicity, Consistency, Isolation, Durability)
  • Show familiarity with :
    - Fact-based models for representing big data
    - Graph schema for capturing the structure of a dataset (nodes, edges and properties)
    - Use of distributed systems for managing big data.


Activity 1 What is a database?

A database ...
  • Is an organised, persistent collection of data;
  • Is a dynamic data structure, not a flat one like a 'flat file' such as a CSV or TSV file.
Databases can be ...
  • Organised / sorted in different ways;
  • Displayed / printed in a variety of styles and layouts;
  • Searched for specific information to produce complex results;
  • Updated by adding / changing / deleting items in often quite complex ways;
  • Backed up easily and automatically;
  • Local or distributed;
  • Stored / moved easily.


Task 1.1
 What use are databases?

The database is quite possibly the most important large-scale computational 'device' that exists in the world today. Everything that you do in your life depends, on some level, on a database. Whether it's shopping, learning, relaxing or socialising, databases are key (pun intended).

In your notebook : Create a mindmap of as many real-world activities as you can that involve a database. You might want to read the information on this webpage to help you.

OUTCOME : List of database applications.


A database is actually a collection of objects. Some of the objects hold data, some define its structure, some aid sorting and searching the data and some determine how it is displayed.




Activity 2 Data models 

A data model is a method of describing the data, its structure, the way it is interrelated and the constraints that apply to its organisation. A conceptual data model is a representation of the data requirements of an organisation constructed in a way that is independent of any software used to construct the database.


Data models are generally constructed based on rules first proposed by Peter Chen in 1976, where the data model views the world as distinct entities, described by attributes and relationships between them ...
  • Entity : An object, person, event or thing of interest to an organisation about which data has to be stored.
  • Attribute : A relevant property or characteristic of an entity by which it can be identified and described.
  • Relationship : A link or association between entities based on their attributes.


Data models are represented using Entity Relationship Diagrams, or ERDs.

Task 2.1
 Generating a data model

Think about a school. Maybe it's the school you are in now or one that you went to in the past. What 'makes up' the school? Think about all the physical things and all the conceptual things as well. These are the entities of the school data model. Each entity has a set of attributes which describe it and a set of relationships which link it to the other entities.


In your notebooks : Write a list of all the Entities you could find in a school and describe their attributes and the relationships that exist between them.

OUTCOME : Written description of a school based on entities, attributes and relationships.


Entity occurrences

Any entities you identify in a complex system like this are always singular, i.e. 'Pupil', not 'Pupils'. You will see this written incorrectly in many places online and in textbooks (shame). The entity describes the characteristics of one instance of the entity - a so called entity occurrence.


Importantly, there has to be a way of uniquely identifying each entity occurrence, in this case by the pupil's name. In practice, something more complicated is often used, like an ID or Code (which would allow the pupil to change his / her name without compromising the 'uniqueness' of the entity occurrence).

More on relationships

There are three main types of relationship that can exist between entities ...
  • One to One (1:1)
  • One to Many (1:M)
  • Many to Many (M:M)
... plus all three relationships can be optional (zero, or zero or many) or mandatory (one, or one or many) at either end. Since the relationships are binary (i.e. have two sides), that gives us 36 different combinations. The relationships are shown using special arrow-like symbols; those for one end of a relationship are shown in the diagram.

https://drive.google.com/file/d/0B83yXMOilskaeGJ5UXpDdmNHUW8/view?usp=drive_web
Click to enlarge

The entities themselves are shown using boxes with at least the entity name and often with the attributes as well, though we'll look at that in a second. Each binary relationship can be 'read' from Left To Right (L2R) or Right To Left (R2L). Consider the following examples before attempting the task which follows.

https://drive.google.com/file/d/0B83yXMOilskaUVpzTWxVUVEyRlk/view?usp=drive_web
Click to enlarge

Task 2.2
 What relationships, brother?

In your notebooks : Sketch entity relationships for the following binary pairs.
  • Customer and order
  • Husband and wife (in western society)
  • Product and barcode
  • Network user and network permission
  • Film and film star
  • Household and main residence
  • GP and registered patient
  • Customer and product
  • Artist and album
  • Author and book
  • Student and course
OUTCOME : Simple, binary entity relationship diagrams for some common situations.


More about attributes

Data about each entity occurrence is stored as values in attributes within the entity itself. There are three 'types' of attribute, which we use both to describe each entity occurrence and to define the relationships between entities.
  • Primary Key Attributes (PK) uniquely identify each entity occurrence.
  • Foreign Key Attributes (FK) are found on the 'many' (or 'one' in a one-to-one) side of a relationship and always take the same values as the primary key attribute in a linked entity. They define the relationship.
  • Non-Key Attributes describe characteristics of the entity.


You choose a primary key based on any unique characteristic of the entity or, if one doesn't exist, you make one up. Primary keys can be single attributes or combinations of more than one attribute - whatever is simpler to work with. No matter what, however ...

The Primary Key must uniquely identify each entity occurrence and cannot be repeated!

Fingerprints are unique and
can be used as a primary key
for humans

Primary key attributes are often underlined whereas foreign key attributes are italicised; however, I'm likely to simply add the suffix [PK] for the primary key and [FK] for the foreign key, if that's OK?


What practical significance does this have? Read through the following examples, one at a time ...

https://drive.google.com/file/d/0B83yXMOilskaM1pTdnNhM1hUUjg/view?usp=drive_web
Click to enlarge

https://drive.google.com/file/d/0B83yXMOilskaRlRnYWFoNVh3QVU/view?usp=drive_web
Click to enlarge

https://drive.google.com/file/d/0B83yXMOilskaZVZQT3RJQktXTFk/view?usp=drive_web
Click to enlarge


Task 2.3
 Notes

In your notebooks : Make some notes on the significance of Primary, Foreign and Non-Key attributes in your notebooks. Use the examples you have just seen to help you.

OUTCOME : Notes on the importance of Primary, Foreign and Non-Key attributes.


Resolving the many to many relationship problem

As we have just seen, many to many relationships are bad because they are difficult to implement without breaking the rule of the Primary Key - that it should be unique.

I suppose that, in the example we looked at, we could have a combined primary key of TeacherID and PupilID in both entities, which would allow one entity occurrence for every teacher - pupil partnership. However, that would also mean that each entity occurrence held a copy of the teacher's forename and surname and the pupil's forename and surname, which introduces a whole raft of problems around data consistency and unproductive maintenance (see later).


To fix this, we introduce an associative entity which sits between the two other entities and holds details of a single occurrence of the relationship between the two entities. For instance ...

https://drive.google.com/file/d/0B83yXMOilskaa1FXRFhCa25KOXc/view?usp=drive_web
Click to enlarge

Here, the associative entity, Taught, holds each instance of a Teacher teaching a Pupil and, through the inclusion of a separate TaughtID, allows a teacher to teach a pupil multiple times (maybe in different classes). This model also allows an extra non-key attribute, Notes, to be used to keep notes on each Taught entity occurrence, such as which class they were in or how they are progressing.

Task 2.4
 Resolving the M:M Relationship

In your notebooks : You should have identified a number of many to many relationships from Task 2.2. Your task is to resolve them by introducing a suitable associative entity.

OUTCOME : Resolved many to many relationships.


Investigating complex relationships

Since life is complex, real life data models tend to be complex as well, often involving many entities and complex relationships between them. Before we move on, it's time to have a look at some complex relationships courtesy of Barry Williams!

Task 2.5
 Database Answers

In your notebooks : Visit the Database Answers website data models section and marvel at how many there are (over 1500!). Choose one, print it out and write about it in your notebooks.

http://www.databaseanswers.org/

OUTCOME : Annotated complex data model


Activity 3 Practical database design 




At about the same time as Peter Chen proposed the Entity Relationship Model, a computer scientist working for IBM called Edgar F. Codd invented a practical realisation of this theory when he designed the first Relational Database Management System, or RDBMS.

Both the Entity Relationship model and the RDBMS originally served to 
separate the data from the processes which managed it.

A simple, flat data structure

Let's consider a simple example of data storage - a customer list. A tabular representation of the customers is correctly called a relation (but often a table) and is a realisation of the customer entity which we have already met.

https://drive.google.com/file/d/0B83yXMOilskaRmVrdGQwOFJBTzA/view?usp=drive_web
Click to enlarge

In order for us to be able to refer to the relation, we must give it a name which is unique within the database. Each tuple / record must also have a unique value which can be used to identify it, which we refer to as the primary key. We can now define the structure of our relation using standardised table notation as follows, if we call the relation customer.

customer(ID,name,address,dob,phoneNumber)

This relation name refers to the entity in the data model. Notice that the table name is emboldened and the primary key is underlined. This structure definition is the heading and the data in the table is the body.  Remember that a tuple is a row of data in the table (similar to a record) and an attribute can be represented as a named column in the table.
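Looking ahead to the SQL we will meet in Activity 5, the customer relation could be declared something like this. This is only a sketch - the column types are assumptions, since the notation above doesn't specify them ...

CREATE TABLE customer (
  ID          INTEGER PRIMARY KEY,   -- the primary key uniquely identifies each tuple
  name        TEXT,
  address     TEXT,
  dob         TEXT,                  -- SQLite has no separate date type; dates are commonly stored as TEXT
  phoneNumber TEXT                   -- stored as TEXT so that leading zeros are not lost
);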

Task 3.1
 Worksheet time!

On the worksheet : Download and complete the worksheet 'Table Terminology' which requires you to match up, well, table terminology. You should find that this links up nicely with the conceptual data modelling work we have already done.

OUTCOME : Completed Table Terminology.docx worksheet


Note : if you do much work with Access databases, you will often use Leszynski / Reddick table notation to represent the names and structures of the database objects. Even though I used to do this when I worked with Access, I've chosen not to name the objects in this fashion to reduce (!) the complexity of the topic. However, feel free to prefix your relation names if you wish!

Composite keys

When no single attribute is sufficient to identify a tuple and we do not want to allocate a new primary key (for space constraints for instance), we can use a combination of attributes.  This type of identifier is called a composite key.  For instance, in the above relation, we could have used a combination of name and date of birth since it’s very unlikely (but not impossible!) that we would get two customers with the same name and the same date of birth ...
customer(name,address,dob,phoneNumber)

... which means that we can ditch the separate ID field. Yay!
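In SQL terms (again, only a sketch with assumed column types), a composite key is declared by listing more than one attribute in the PRIMARY KEY clause ...

CREATE TABLE customer (
  name        TEXT,
  address     TEXT,
  dob         TEXT,
  phoneNumber TEXT,
  PRIMARY KEY (name, dob)   -- composite key : no two customers may share both a name and a date of birth
);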

Task 3.2
 Choose a key!

In your notebooks : For each situation, choose a suitable primary key / unique identifier. Consider whether there is any single attribute or combination of attributes in the relation which you could use already, which can be guaranteed to always be unique, or whether you will have to introduce a contrived (separate) primary key.
  1. A car booked in for an MOT at a garage
  2. An item of stationery in a shop
  3. A patient in a hospital
  4. A single move in the solution of a Rubiks cube
  5. A student signing up to take a class in computing
  6. A single copy of a book in a library
  7. An order placed by a customer for raw materials to build a cat pen.
  8. A box of screws on the order placed for the cat pen.
OUTCOME : Identification of suitable primary keys for different data storage applications.


The problem with flat file databases

This model is all well and good for straightforward data (one entity) but very quickly, as the complexity of the system increases and the quantity of data mushrooms, the flat file approach becomes obsolete.  As soon as we introduce a relationship with another entity, things get complicated.

Consider storing data for a car hire company.  Imagine that every time somebody hired a car, a book was filled out containing details of the hire.


hire(title,forename,surname,telNo,address,town,country,postcode,reg,hired,due)
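To make the repetition obvious, here is a sketch of the hire relation in SQL with two hires by the same customer (the values are invented for illustration) ...

CREATE TABLE hire (
  title TEXT, forename TEXT, surname TEXT, telNo TEXT, address TEXT, town TEXT,
  country TEXT, postcode TEXT, reg TEXT, hired TEXT, due TEXT
);

-- The same customer hires two different cars ...
INSERT INTO hire VALUES ('Mr','John','Smith','01234 567890','1 High St','Anytown','UK','AB1 2CD','AB51 CDE','2024-01-05','2024-01-12');
INSERT INTO hire VALUES ('Mr','John','Smith','01234 567890','1 High St','Anytown','UK','AB1 2CD','XY12 ZZZ','2024-02-10','2024-02-17');
-- ... so every one of his personal details is typed in and stored twice.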

Look at how the same data is stored more than once in this data structure.  Storing data like this has some disadvantages and could lead to …

Data inconsistency Because of repeated records, the same data can be typed in twice and therefore errors can be made. If the data needs to be updated, we would need to remember to do this for every occurrence or inconsistency would be introduced.

Data redundancy Repeated data is redundant - it is not needed.

Inefficiency Typing in and storing repeated data is inefficient both in terms of time and space.

Program - data dependence The generic nature of a table makes the program used to manage it dependent on its structure and therefore hard to alter / develop. If an extra field needed adding to or removing from the table, this would mean the program would have to be altered. Sometimes this is known as unproductive maintenance.

Lack of flexibility If you want only specific information from the table, it takes a long time to write the routines you need.

Non-shareability Everyone has access to all the data in the table even though they may not want or need access to it. With all the data being held in one place, this may cause delays in access.

Wha?!

Task 3.3
 Flat files? Nah, mate!

Two parts to this ...
  1. Make notes / produce flashcards / score a symphony to help you remember the definitions for the six key concepts listed above.

  2. Download and complete the worksheet 'Car Hire' which is available from the lesson resources which asks you to give specific examples of these concepts in relation to the car hire example.
OUTCOME : Some awareness of the problems introduced by flat file data structures.


Normalisation

The process of normalisation involves converting a single flat file entity / relation / table into multiple separate entities / relations / tables through the introduction of relationships between them, implemented through the use of primary / foreign key pairs.


That's right! Designing relational data models is natural - there is no real reason why you would ever design a data model in any other way (unless you were crazy) so, really, all this is just academic.

Someone suggested that Kermit
created an un-normalised data model!

https://drive.google.com/file/d/0B83yXMOilskaNXRCTE1CSFV6bVk/view?usp=drive_web
Click to enlarge

Task 3.4
 Normalise these!

Stage 1

Before you start, download, print and carefully read the worksheet 'Normalisation Example'. It's a long document with lots of information and examples which you need to understand before you even consider trying the next part of the task ...

Stage 2

In the following examples, you are given hand written data which is un-normalised. Your task will be to normalise the data and place it in 3rd normal form (3NF). Represent each stage in the normalisation as an Entity Relationship Diagram and in standard table notation. Use the example from the worksheet to help you.

Local computer shop

Your local computer shop wishes to record its sales in a database. Here is a handwritten list of sales produced by the salesman. Put this data into third normal form so that it can be stored in a relational database.


Plumtree dog agility centre

Plumtree Dog Agility Club has asked you to put the entries for their summer agility show into a database. Below is a sample of the handwritten list of entries. Each handler can have many dogs. Each dog can compete in up to three classes. Each class can have many entries. Put this data into third normal form so that it can be stored in a relational database.


Bash street school

Bash Street School holds the following information about their A-level students. Students can take many courses and each course has many students. Teachers can teach more than one course. Put this data into third normal form so that it can be stored in a relational database.



OUTCOME : Normalised data models for three different contrived situations.


So, you think that you've got it do you? Have a go at this then! You might want to ...


Task 3.5 Car Fleet

Consider the following scenario.  Develop the E-R diagram, determine tables that are needed, determine the primary key for each table, and some of the attributes of each table. After doing this problem, you can then use your normalisation skills to develop third normal form. You could work individually, or get together in groups to discuss this problem.

“A company records information on its fleet of cars and the employees who are permitted to drive them. The database is used by three groups of people: the department managers, the finance department, and the service division.

For the managers, each car has a unique registration number and each driver has a unique employee number. Drivers may be authorised to drive a number of cars, and any car may be used by a number of drivers. Cars are allocated to departments within the company, although they may be used by drivers in other departments. Some classes of car require specialist driver qualifications. There are occasional accidents which may lead to the car being written off and/or the driver being disqualified from driving some or all classes of car.

For the finance department, each car, identified by registration number, has a current and a replacement value, was bought on a certain date and must be taxed on a certain date. For accounting purposes, the allocation of cars to departments is also required. Finally, details are recorded of any insurance claims associated with accidents, or repair costs if no insurance claim arose.

The service department is responsible for giving regular services to each car. Different types of service correspond to different mileage values for each class of car, with a short description documenting each type of service. Where cars have been involved in accidents, details of repairs are recorded.”

OUTCOME : Data model for Car Fleet


Activity 4 Functions of the RDBMS 

The RDBMS, or Relational Database Management System, sits between the user and the data itself, and manages both the structure of the data and access to it.

Consider the situation where different departments in a business deal with the same customer.  They each may keep a record of the customer details in their own book.  You can imagine all the repeated data – nightmare!  One possible solution is to keep all the data in a central place where everyone can access it.  Do you think that the customers would want the shop floor workers seeing their bank account details?  I don’t think so!

https://drive.google.com/file/d/0B83yXMOilskabmhiUk15a2Ruc28/view?usp=drive_web
Click to enlarge

The RDBMS fixes issues with ...
  • Unproductive maintenance
  • Data inconsistency
  • Data redundancy
  • Security
  • Shareability
  • Response time
  • Inflexibility
  • Maintenance

Task 4.1
 The RDBMS fixes ...

Download, print and read the 'Issues addressed by the RDBMS' infosheet from the lesson resources. Make sure it's stuck into your books so you can refer to it when you revise.

OUTCOME : Infosheet on RDBMS goodness stuck in your notebooks.


Maintaining data integrity

One of the main functions of a database management system is to ensure data integrity and manage an issue known as deadlock.  'Integrity' means that all the data is consistent, correct and well structured, and that any changes made are consistently applied.

Consider the following situation ...

https://drive.google.com/file/d/0B83yXMOilskaU0MxZ2hQU3liNTQ/view?usp=drive_web
Click to enlarge

Task 4.2
 Data integrity

Consider, and discuss, the situation above, then answer the following questions in your notebook.
  1. Whose changes, John or Sam, will be recorded in the database?
  2. Can you think of 5 different ways in which this situation could have been avoided?
OUTCOME : Appreciation of the ways in which a DBMS can maintain data integrity.


Deadlock

Watch the video of the Dr Seuss' Tale, 'The Zax'. What does this have to do with databases?

Dr Seuss' The Zax (3:43)

Task 4.3
 Deadlock in practice

Consider the situation where two users are attempting to access records from a database. Joanne accesses and locks Record 'A' and tries to access Record 'B'. In the meantime, Sandra has accessed and locked Record 'B' and is trying to access Record 'A' ...


You will carry out an exercise where you simulate a situation known as deadlock.  Write up what you have learnt from the exercise in your notebooks.

OUTCOME : Appreciation of the ways in which database management systems handle deadlock.


Functions of the RDBMS

Apart from addressing the issues highlighted above, the RDBMS performs a raft of other jobs.
  • Maintaining a data dictionary
  • Providing Open Database Connectivity (ODBC)
  • Sharing the database and the management of deadlock
  • Controlling access to the database
  • Providing client - server connections
  • Maintaining referential integrity
  • Providing indexing functions to improve search performance



Task 4.4
 Another infosheet - what?

Yes - I realise that you've already had an 'infosheet' but there really is very little alternative. Lots to learn and so little time to learn it. Download the 'Functions of the RDBMS' infosheet from the lesson resources and read through each section with a friend. Test each other on the functions if you like.

OUTCOME : Greater awareness of the functions of the RDBMS, no checkpoint.

Referential integrity

If a value appears in a foreign key in one table, it must also appear in the primary key in another table. If the primary key of a record in one table is changed, then it must be updated in all the tables where it is used as a foreign key. If a record in the table where the primary key exists is deleted, maintaining referential integrity would involve deleting all records in other tables where that primary key value exists as a foreign key.  Not doing so would lead to orphaned records.

To maintain referential integrity, one must cascade update
and delete records or else orphaned records will be created
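In SQLite (which we will use in Activity 5), cascading behaviour can be declared as part of the foreign key constraint itself. Here is a minimal sketch using two made-up relations rather than spoiling the task below - note that SQLite only enforces foreign keys once the foreign_keys pragma has been switched on ...

PRAGMA foreign_keys = ON;   -- SQLite ignores foreign key constraints until this is issued

CREATE TABLE department (
  deptCode TEXT PRIMARY KEY,
  deptName TEXT
);

CREATE TABLE employee (
  empID    INTEGER PRIMARY KEY,
  empName  TEXT,
  deptCode TEXT REFERENCES department(deptCode)
               ON UPDATE CASCADE    -- changing a deptCode automatically updates every matching employee
               ON DELETE CASCADE    -- deleting a department automatically deletes its employees (no orphans)
);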

Task 4.5
 Referential Integrity.

A data structure is set up to manage teachers ordering books. Three entities are required to manage this situation - Teacher, Book and Order (in its simplest form, a teacher can only order one book at a time and can only ever order a particular book once - what a crazy world!).

teacher(teacherID,forename,surname,phone,email)
book(bookID,title,author,price)
order(teacherID,bookID,orderComplete)
  • Construct a suitable Entity Relationship Diagram to represent this data model.

  • Imagine that there is some data in the relations created to represent this model. Explain what would happen in the following situations to maintain referential integrity.

    1. A teacher is removed from the database
    2. An order is cancelled
    3. A book has its ID changed.
Write about what you have found out in your notebooks.

OUTCOME : A description of what happens in order to maintain referential integrity during data alterations.


What is a 'schema'?

A schema is just a fancy word for 'view' (actually derived from the Greek for 'plan' or 'shape').  Different users (including the RDBMS itself) can have, and actually need, different views of the data in order to manage it effectively.
  • External or user schema : The way in which each user sees the database.
  • Conceptual or logical schema : Describes the entities, attributes and relationships.
  • Internal schema : Describes how the data will be stored and how it will be accessed and updated.



Task 4.6
 Schema, mate.

Just write yourself some notes on the different types of schema in your notebooks. That's it.

OUTCOME : Notes about the different database schema.


ACID Compliance

In the past few sections, we've really been discussing some of the issues surrounding client-server databases. You don't really get these problems if you are sitting in front of your computer and you are the only one using the system. With a client-server database, as the name suggests, there is one database and many clients. In Computer Science, ACID is a set of desirable properties of database transactions.
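As a taster of what this means in practice, here is a sketch of an atomic transaction in SQL, using a made-up account table - either both updates happen or neither does ...

BEGIN TRANSACTION;
UPDATE account SET balance = balance - 100 WHERE accountID = 1;
UPDATE account SET balance = balance + 100 WHERE accountID = 2;
COMMIT;      -- both changes become permanent together (atomicity and durability)
-- ROLLBACK; -- issued instead of COMMIT, this would undo both changes, leaving the data consistent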

Task 4.7 Four-Fold Learning

Take a piece of (coloured) paper and fold it into 4 segments. On each segment, write about A, C, I and D using the design below to help you. Use this website and this website to help you. Don't go into too much depth - that's why I've asked you to structure the paper like this!


Stick your definition sheet into your notebooks.

OUTCOME : Sheet with information about A, C, I and D.


Activity 5 Learn SQL, flipped fashion

So, how do we interact with our RDBMS? Certainly not with a standard imperative language like Python. Instead, we use a declarative language called SQL or Structured Query Language. I will only introduce this section in class but you must carry this out away from the classroom because it is likely to take you a while, so you better ...

"Get Ready"

Task 5.1
 SQL Tutorial

SQL, or Structured Query Language, was invented in the 1970s by Donald Chamberlin and Raymond Boyce of IBM as a concise way of querying their RDBMS. SQL was originally called SEQUEL (Structured English Query Language) but its name had to be changed to SQL because 'SEQUEL' was already a trademark of the UK-based Hawker Siddeley aircraft company.

Don Chamberlin talks about the origin of SQL (see this website for the full, interesting read)
  • Visit the W3Schools website SQL Tutorial section which starts with a little introduction to SQL and its relationship to the RDBMS and associated applications.

  • Carefully work through the following sections. I've labelled them with a * if they are essential. Try to work through as many sections as you can.

    SQL Syntax *
    SQL Select *
    SQL Distinct
    SQL WHERE *
    SQL AND & OR *
    SQL Order By *
    SQL Insert Into *
    SQL Update *
    SQL Delete *
    SQL Like
    SQL Wildcards
    SQL In
    SQL Joins *
    SQL Create DB
    SQL Create Table *
    SQL Not Null
    SQL Unique
    SQL Primary Key *
    SQL Foreign Key
    SQL Alter
    SQL Autoincrement


    ... and any others that you feel necessary and that you can access, including the SQL Functions section.
You DO NOT need to document your work on this. Also, be aware that the 'flavor' of SQL that you have learnt here is not necessarily universal. In later tasks, we will implement an SQLite database which uses a simpler set of commands.

OUTCOME : No specific recorded outcomes required!

http://xkcd.com/327/
XKCD : Exploits of a Mom (explained)

Technically, SQL comprises three different types of statement ...
  • SQL stands for Structured Query Language.
    SQL statements retrieve data or query the data structures. True SQL commands do not alter data structures. 

  • DDL stands for Data Definition Language.
    DDL instructions define data structure, create tables and attributes etc.

  • DML stands for Data Manipulation Language.
    DML instructions alter the data in the data structure, adding data, changing data, removing data.
Often, all three types of language come under the common heading of 'SQL'.  Be aware however that strictly, only certain commands qualify as pure SQL.
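For example (the table and column names here are invented), one statement of each type ...

CREATE TABLE pet (petID INTEGER PRIMARY KEY, petName TEXT);   -- DDL : defines a data structure
INSERT INTO pet (petName) VALUES ('Rex');                     -- DML : alters the data held in that structure
SELECT petName FROM pet;                                      -- pure SQL : queries the data without changing anything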



Task 5.2
 Classification

All of the following commands form part of the SQL language, however, only SELECT classifies as purely SQL. Your task is to classify the following commands as either DDL (Data Definition Language) or DML (Data Manipulation Language) statements. Write your answers in your notebooks as a table like this ...


    CREATE DATABASE (creates a new database)
    ALTER DATABASE (modifies a database)
    DELETE (deletes data from a database) 
    CREATE TABLE (creates a new table)
    UPDATE (updates data in a database)
    INSERT INTO (inserts new data into a database)
    ALTER TABLE (modifies a table)
    DROP TABLE (deletes a table)
    CREATE INDEX (create an index (search key))
    DROP INDEX (deletes an index)
    TRUNCATE (empty a table of its data)
    GRANT (grant permission on a table)

OUTCOME : Classification of SQL statements as DDL and DML


A practical database with SQLite

There are plenty of Relational Database Management Systems available - some commercial, some free and closed source, some free and open source. Most require installation, however, there is one very common 'zero-configuration' RDBMS called SQLite which is free, open source and requires no installation at all.

https://sqlite.org/
Click to visit the SQLite homepage

However, SQLite is command line only so, to make our lives a little easier, we would be better off downloading DB Browser for SQLite, which is a graphical tool which can be used to create and manage SQLite compatible database files. If you can't install software on the machine you are working on, you might want to use the portable version which is available from the download page.

http://sqlitebrowser.org/
Click to visit the DB Browser for SQLite Homepage

Task 5.3
 Setting up your World database

If you have not already downloaded DB Browser for SQLite, do it now! You will use this to create an SQLite compatible database. Choose 'New Database' from the toolbar and save your database as world.db in a suitable place in your documents. Close down the 'Create Table' window that appears.




We are going to create a database containing information about some of the World's major cities and my ratings of the holidays I have taken to them (as if). The database has the following data model. (Note : the Spoken table is used to resolve the horrible Many-To-Many relationship that exists between Country and Language (because more than one language can be spoken in one country and each language can be spoken in many countries). Yuk!)


In your notebook : Put a suitable title for this task and copy the data model into your notebooks.

Download the 'WorldDump' zip file from the lesson resources and unzip this into the same folder as you have just created your world.db file. Delete the .zip file when you have finished. The zip file contains seven .sql files which contain the data for the database. However, you can't import this data until you have set up the table structures, so let's do that now!

Set up the `Continent` table

This table holds the names of the six continents used in the data model. Each ContinentName is identified through a derived two-character primary key (ContinentCode). Use the 'Execute SQL' tab to create the table using the schema shown below.

CREATE TABLE `Continent` (
  `ContinentCode` TEXT UNIQUE,
  `ContinentName` TEXT,
  PRIMARY KEY(`ContinentCode`)
);

In your notebooks : Copy the table schema into your notebooks.

Now import the data for this table from the sql file. Choose 'File > Import > Database from SQL file', browse to the 01_ContinentDump.sql file, double click and choose 'No' on the dialogue box to prevent DB Browser from creating a new database. You should get a message saying 'Import Successful'. Inspect the contents of the table using the 'Browse Data' tab.

Set up the `Language` table

This table contains common languages together with their ISO 639-2B language code, which serves as the primary key for the record. Use the 'Execute SQL' tab to create the table using the schema shown below.

CREATE TABLE `Language` (
  `LanguageCode` TEXT UNIQUE,
  `LanguageName` TEXT,
  PRIMARY KEY(`LanguageCode`)
);

In your notebooks : Copy the table schema into your notebooks.

Now import the data for this table from the sql file. Choose 'File > Import > Database from SQL file', browse to the 02_LanguageDump.sql file, double click and choose 'No' on the dialogue box to prevent DB Browser from creating a new database. You should get a message saying 'Import Successful'. Inspect the contents of the table using the 'Browse Data' tab.

Set up the `Region` table

The region table holds the names of the regions into which the continents are split. Again, each region is identified by a derived, two-character primary key and contains a foreign key to the Continent table. The extra foreign key constraint is used to enforce referential integrity. Use the 'Execute SQL' tab to create the table using the schema shown below.

CREATE TABLE `Region` (
  `RegionCode` TEXT UNIQUE,
  `RegionName` TEXT,
  `ContinentCode` TEXT,
  PRIMARY KEY(`RegionCode`),
  FOREIGN KEY(`ContinentCode`) REFERENCES `Continent`(`ContinentCode`)
);

In your notebooks : Copy the table schema into your notebooks.

Now import the data for this table from the sql file. Choose 'File > Import > Database from SQL file', browse to the 03_RegionDump.sql file, double click and choose 'No' on the dialogue box to prevent DB Browser from creating a new database. You should get a message saying 'Import Successful'. Inspect the contents of the table using the 'Browse Data' tab.

Set up the `Country` table

Each of the countries listed in this table is identified by its unique ISO 3166-1 Alpha-3 country code, which acts as the primary key for the record. Each record contains the RegionCode as a foreign key which links it to a record in the Region table. The extra foreign key constraint is used to enforce referential integrity. Use the 'Execute SQL' tab to create the table using the schema shown below.

CREATE TABLE `Country` (
  `CountryCode` TEXT UNIQUE,
  `CountryName` TEXT,
  `RegionCode` TEXT,
  PRIMARY KEY(`CountryCode`),
  FOREIGN KEY(`RegionCode`) REFERENCES `Region`(`RegionCode`)
);

In your notebooks : Copy the table schema into your notebooks.

Now import the data for this table from the sql file. Choose 'File > Import > Database from SQL file', browse to the 04_CountryDump.sql file, double click and choose 'No' on the dialogue box to prevent DB Browser from creating a new database. You should get a message saying 'Import Successful'. Inspect the contents of the table using the 'Browse Data' tab.

Set up the `City` table

The city table holds the names and populations of 4514 (!) major cities in the world with populations exceeding 100,000 (Europe, excluding Russia, 90,000), including capital cities and the largest city in each country. Each city is identified by an autoincrement primary key (CityID) and includes the Population and two boolean Capital and Largest fields. The CountryCode field is included as a foreign key in each record to link the record to a record in the Country table, and the extra foreign key constraint is used to enforce referential integrity. Use the 'Execute SQL' tab to create the table using the schema shown below.

CREATE TABLE `City` (
  `CityID` INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE,
  `CityName` TEXT,
  `CountryCode` TEXT,
  `Population` INTEGER,
  `Capital` INTEGER,
  `Largest` INTEGER,
  FOREIGN KEY(`CountryCode`) REFERENCES `Country`(`CountryCode`)
);

In your notebooks : Copy the table schema into your notebooks.

Now import the data for this table from the sql file. Choose 'File > Import > Database from SQL file', browse to the 05_CityDump.sql file, double click and choose 'No' on the dialogue box to prevent DB Browser from creating a new database. You should get a message saying 'Import Successful'. Inspect the contents of the table using the 'Browse Data' tab.

Setting up the `Spoken` table

There are literally thousands of languages spoken in various parts of the world. This table lists 469 different records, each of which identifies a particular language spoken in a particular country. Each instance of a particular language spoken in a particular country is identified by a combined primary key of CountryCode and LanguageCode, which prevents the same language being entered alongside the same country twice. It also identifies whether the language is Official where possible. Use the 'Execute SQL' tab to create the table using the schema shown below.

CREATE TABLE `Spoken` (
  `CountryCode` TEXT,
  `LanguageCode` TEXT,
  `Official` INTEGER,
  PRIMARY KEY(`CountryCode`,`LanguageCode`),
  FOREIGN KEY(`CountryCode`) REFERENCES `Country`(`CountryCode`),
  FOREIGN KEY(`LanguageCode`) REFERENCES `Language`(`LanguageCode`)
);

In your notebooks : Copy the table schema into your notebooks.

Now import the data for this table from the sql file. Choose 'File > Import > Database from SQL file', browse to the 06_SpokenDump.sql file, double click and choose 'No' on the dialogue box to prevent DB Browser from creating a new database. You should get a message saying 'Import Successful'. Inspect the contents of the table using the 'Browse Data' tab.

Setting up the `Visited` table

Now we come to my holiday destinations. Each holiday has a unique, autoincrement field (VisitedID) for the primary key, a destination (referenced from City.CityID), a YearVisited and a Rating (from 0 to 10). Use the 'Execute SQL' tab to create the table using the schema shown below.

CREATE TABLE `Visited` (
  `VisitedID` INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE,
  `CityID` INTEGER,
  `YearVisited` TEXT,
  `Rating` INTEGER,
  FOREIGN KEY(`CityID`) REFERENCES `City`(`CityID`)
);

In your notebooks : Copy the table schema into your notebooks.

Now import the data for this table from the sql file. Choose 'File > Import > Database from SQL file', browse to the 07_VisitedDump.sql file, double click and choose 'No' on the dialogue box to prevent DB Browser from creating a new database. You should get a message saying 'Import Successful'. Inspect the contents of the table using the 'Browse Data' tab.

... and finally

Make sure you click 'Write Changes' to commit the changes you have made to the database to the actual database file - DB Browser holds the changes in a 'journal' file until you commit the changes like this. This is the final data model. You'll need this diagram for the next task ...


In your notebooks : Print out a copy of this data model and stick it in your notebooks. Don't copy it in - no time :)

OUTCOME : A completed World database, world.db together with ERD and CREATE TABLE schema for each table.


Before you start the next task, it may be useful for you to refer to the SQLite documentation on Tutorials Point or on the SQLite website (although this is not really very 'readable' to be honest, gov).


Task 5.4
 Querying the World database

If you haven't managed to create your world.db database, ask your teacher for a copy. You will also need to refer to the data model that you printed out from the last activity. In case you can't or won't do this, I've included another copy of it here for your pleasure (it's a beaut!)

Progress as far as you can through the graded challenges and don't worry if you get stuck. If you can get the EASY and MEDIUM challenges licked, then you are good to go.

Challenge
 
You have to construct suitable SELECT queries to find the solutions to the following questions. For each one, I've given the number of rows which should be returned from the query so it's easy to check whether you are correct or not. You will need to use the complete data model from the last task to help you. Here it is again, just in case ...


EASY (Single table)

For some of these questions, you may need to use the LIKE or the LIMIT operators. There is a worked example after this list.
  1. List out the names of all the continents (6 rows).
  2. List out the names of continents in reverse alphabetical order (6 rows).
  3. List out the names of all the capital cities (239 rows).
  4. List out the names of all the capital cities which are also the largest cities in the country (181 rows).
  5. List out the names of all the capital cities which are not the largest cities in the country (58 rows).
  6. State the name and population for Berlin (1 row).
  7. List out the names of all the regions which have the word 'Asia' in their name (6 rows).
  8. List out the name and population for all cities beginning with a letter 'D' and ending with the letter 's' (15 rows).
  9. List out the names of all Languages with a space in their name (10 rows).
  10. List out the names of the top 3 cities and their populations for those cities with populations greater than 8,000,000. This list should be given in decreasing order of population (3 rows).
Can you come up with one of your own?
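Here is a worked example of the kind of query being asked for (deliberately not an answer to any of the questions above). It lists up to five cities whose names begin with 'New' ...

SELECT CityName, Population
FROM City
WHERE CityName LIKE 'New%'   -- % is the wildcard for 'any sequence of characters'
LIMIT 5;                     -- return at most five rows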

MEDIUM (Multiple table)

For these examples, you can use an INNER JOIN or manually join the tables on the PK/FK pairs. There is a worked join after this list.
  1. List out the names of all the cities in Pakistan (51 rows).
  2. List out the names of the Regions in Latin America (3 rows).
  3. List out the year visited and city name for all cities that I have visited and given a rating of more than 5 out of 10 in ascending order of year visited (20 rows).
  4. Tell me the name of, and the year I visited, my favourite city in the 'United States' (1 row).
  5. List out the names of all the countries where the official language is Arabic (17 rows).
Can you come up with one of your own?
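Here is a worked join (again, not one of the answers above) which pairs the first ten cities with the names of their countries using the CountryCode PK/FK pair ...

SELECT City.CityName, Country.CountryName
FROM City
INNER JOIN Country ON City.CountryCode = Country.CountryCode   -- match each city to its country
LIMIT 10;                                                      -- just show the first ten pairings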

HARD (Craziness and optional)

If you've made it this far, you've probably learnt enough for the exam, so these extra challenges are optional. Because each one may require the use of different functions, I've given you hints on each question, and there is a small worked example after this list. Good luck!
  1. List out the names of the regions where Spanish is spoken (5 rows).
      HINT : Use the DISTINCT keyword.
  2. How many countries did I visit between 1980 and 1990 (inclusive) where 'English' was spoken (1 row).
      HINT : Use the COUNT() function and the DISTINCT keyword.
  3. List out the total populations of Germany, France and Belgium in decreasing order of total population (3 rows).
      HINT : Use the SUM() function, the AS alias function, the IN operator and the GROUP BY clause.
Can you come up with one of your own?
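To illustrate the hinted functions without giving anything away, this counts how many cities are stored for each country code ...

SELECT CountryCode, COUNT(*) AS NumberOfCities
FROM City
GROUP BY CountryCode            -- one output row per country code
ORDER BY NumberOfCities DESC;   -- countries with the most cities first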

OUTCOME : Correct output for Easy and Medium difficulty SQL SELECT queries for the World Database.



Altering data in the database

At this level, we only need to concern ourselves with inserting, deleting and updating data in the database rather than changing the database structure itself.

Try the following examples and then try to solve the extra problems. To prove that you have successfully manipulated the data, run a SELECT query before you execute each statement and then again afterwards.

Task 5.5
 INSERT INTO queries

It turns out that somebody (no idea who) forgot to put Manchester, England into the City table.

INSERT INTO City (CityName,CountryCode,Population,Capital,Largest)
VALUES ("Manchester","GBR",465900,0,0);
  • Insert a new city called 'Machala' in Ecuador (ECU) with a population of 241300. It is neither the largest nor the capital city of Ecuador.
  • Insert another language into the Languages table called 'Klingon' with a language code of 'kli'.
OUTCOME : Successfully executed INSERT INTO queries


Task 5.6
 DELETE FROM queries

St Asaph (Wales) was included as a city in the original list but its population is too low. Delete it.

DELETE FROM City
WHERE City.CityName = 'St Asaph'
AND City.CountryCode = 'GBR';
  • What am I thinking of? Klingon isn't really a language, is it? Delete it from the languages table.
  • 2010 was a bad year for me. Delete the details of all the visits I made in 2010.
OUTCOME : Successfully executed DELETE FROM queries


Task 5.7
 UPDATE queries

Since the city table was created, the population of London (England) has increased to 8674000.

UPDATE City
SET Population = 8674000
WHERE CityName = 'London'
AND CountryCode = 'GBR';
  • Change the population of Houma in China (CHN) from 122717 to 225123.
  • Correct the spelling mistake in 'Swedish' in the Language table. Its language code is 'swe'.
OUTCOME : Successfully executed UPDATE queries




Task 5.8
 Final challenges

To be honest, it's unlikely that any data models you are given will be anywhere near as complicated as this (thankfully). So, to finish up this mammoth topic, we have one more worksheet with some practice SQL-style questions (from past examination papers).

Download 'SQL Practice',  complete by clicking and replacing the blue text, print and relax!

OUTCOME : Mucho practice on SQL related stuff and a massive sense of satisfaction I shouldn't wonder.



Activity 6 SQLite and Python


Activity 7 Managing BIG data

Relational databases were conceived to digitise paper forms and automate well-structured business processes, and they still have their uses. But an RDBMS cannot model or store highly connected data and its relationships without increasing complexity, which means performance degrades as the number and depth of data relationships and the size of the data grow. Additionally, new types of data and data relationships require schema redesign, which increases time to market.


Our world is changing from one where we make computational models of it to one where we process the data it generates. Handling the ENORMOUS quantities of big data generated by the world's ICT systems presents its own special challenges, since Big Data is unstructured ...


Task 7.1
 What? Is that meant to be ironic?

What is that little green man talking about? Unscramble the words to produce a definition of 'Unstructured Data'. 

OUTCOME : Definition of 'Unstructured Data'.


Big Data can often be qualitative rather than quantitative in nature.

https://drive.google.com/file/d/0B83yXMOilskacThjM1p0NjU2bWc/view?usp=drive_web
Click to enlarge

Analysis of data of this type requires different techniques from those employed to query structured datasets using a traditional RDBMS. Machine learning and Artificial Intelligence systems are used to squeeze patterns and correlations from the data that may not be immediately obvious. One technique, called Predictive Analytics, is widely used in the financial and insurance sectors to predict risk.

Task 7.2
 Predictive Analytics

Visit this article on the Predictive Analytics Today website and read about the process and applications of predictive analytics. Produce a two slide presentation to summarise what you have found out which conforms to the following guidelines ...
  • No more than 25 words per slide
  • One relevant image per slide, animated if possible
  • One ACRONYM per slide

OUTCOME : Two slide presentation on the process and applications of predictive analytics


Fact Based Modelling

When dealing with unstructured data, rather than trying to impose structure on it, it is more efficient to use machine learning or AI to identify fundamental facts about it in order to pinpoint the entities present, their attributes and the relationships which exist between them. So-called Fact Based Modelling (FBM) can be used to represent these highly complex, interrelated facts in the form of a 'graph database schema' where ...
  • the nodes represent the entities identified in the dataset,
  • a property is relevant data related to the node and
  • an edge shows the link that describes the relationship between two nodes
An interesting example of a dataset represented as a graph is that contained within a social network like Facebook or Twitter, where each person, like or interest is a node with associated attributes and each edge is a relationship between entities.

A social network represented as a graph

An arbitrary graph database could have the following visual structure ...

A generic graph database

It should be apparent that, like in normal graphs, relationships can be directed or undirected. Simple relationships like 'is married' would be undirected whereas 'is the son of' would be directed. Common sense really.

Task 7.3
 A graph database of my very own

There are a number of graph database systems out there (Neo4j, GraphDB, OrientDB) each with their own, often very specific, methodologies. I am not suggesting under any circumstances that you even try to implement your own (unless you really want to!) but, thinking about practical applications of graph data structures is very useful and certainly very 'now' (what with Big Data and everything, init?)

In your notebooks : Using the image of a generic graph database shown above, construct your own graph schema for the following simple fact based model ...
  1. John is a 56 year old male who likes all pizzas.
  2. There are two types of pizza, one with a cheese and tomato topping, one with pepperoni.
  3. Leanne works with John and Sandy.
  4. Sandy is friends with John but does not work with him.
  5. Harry is John's son. He works with Sandy at 'JB Plumbing'.
  6. 'JB Plumbing' is a LTD company.
  7. John is married to Anne but she is not Harry's mother (it's complicated).
  8. Everyone shops at the local Tesco supermarket which is on High Street.
Now that you've designed your Graph schema, look at it carefully and try to predict some other facts from the relationships you have specified.


OUTCOME : A visual graph database schema for a simple (!) unstructured, fact based data model.


What's the big deal?

Now, a quick look at the advantages of graph databases for big data processing over traditional RDBMS solutions.

Task 7.4
 What's the big deal?

One at a time, visit the following websites. Read through the information contained therein.
Tweet what you have learnt to @ebacomputing and see your name in lights!

OUTCOME : Tweet!


Managing Big Data

The sheer volume of Big Data means that we cannot use traditional computing hardware to process it because there simply are not enough resources to handle it as quickly as it is being generated (think 3Vs). Also, the complexity of searching the unstructured dataset makes it difficult to accomplish without bespoke tools.


The solution to this problem is to split the data storage and the processing work over several (in some cases, thousands of) individual computers using an approach called Distributed Computing.


One such platform is called Apache Hadoop. Hadoop consists of ...
  • Hadoop Distributed File System (HDFS) which provides high speed access to distributed data,
  • Hadoop YARN for job scheduling and cluster resource management,
  • Hadoop MapReduce for parallel processing of big data,
  • Pig, a high level programming language for parallel computation and
  • ZooKeeper coordination service for distributed applications.

Task 7.5
 Hadoop

Watch the video about Hadoop and the comparison between Hadoop and SQL. Then watch our old friend Christopher Barnatt over at Explaining Computers teach us, like everything about Big Data.

What is Big Data and Hadoop? (8:04)

Explaining Big Data (8:32)


OUTCOME : 5 important messages, mate.


https://en.wikipedia.org/wiki/MapReduce

The MapReduce programming model (click to visit Wikipedia article)

Big Data is Immutable

Apart from the characteristics of Volume, Velocity and Variety, Big Data is also immutable. Immutable means unchanging : Big Data is write once, read many (WORM) - an enormous collection of unchanging data that it makes absolutely no sense to alter. After all, how can you see patterns in data over time if you delete or update it?

Task 7.6
 Immutability

Read the article from the lesson resources called 'Immutability changes everything' - it explains stuff (far better than I can).

An article from www.devx.com encourages us all to ditch mutable data sets (traditional relational databases) and 'embrace' immutability; thinking about converting our small, mutable data sets into larger immutable ones by continually collecting data and never deleting it. Maybe then we could see patterns emerging which could aid our understanding of the world - wow, that's deep!


In your notebooks : Explain why immutability is a good thing and why we should all be doing it.


OUTCOME : Explanation of why immutability is good.


Functional programming (and its connection to Big Data)

The map() and reduce() functions which are the cornerstone of Big Data processing come from the functional programming paradigm. One of the most crucial (and relevant) features of functional programming is that its functions take data as input and return new results without changing the original data (to look for patterns, for instance), rather than updating the data in place as an imperative language would. Hence, it's ideally suited to Big Data applications.

Imperative languages : How do I get the answer?
Functional / declarative languages : What is the answer?
The best uses for procedural programs are those where it is okay to change the values of the input data or where you need to compare computed values in one of the steps to determine whether you need to continue processing or exit the program and deliver the result.  Functional programs do not change the input data and are most often used when it is necessary to look at the data again and again for different patterns. For example, you could look through a list of all the counties in the United States that voted Republican in the last election and then go through the list for all Democratic counties. This will produce two distinct output lists.


  • The data that a functional language operates on is immutable
    There are no unwanted side effects of applying a mathematical function to the data (i.e. it is not changed) and therefore, if data is distributed across many nodes in a distributed system, there is no need to worry about concurrent access or synchronisation. The data is what it is.

  • Failure doesn't mean failure
    Since the data is immutable, if a function fails in one place, it can be repeated in another. If the functions mutated the data, the whole computation would have to be rerun if one part of it failed.

  • Map / Reduce is easily parallelised
    Since functional languages are declarative rather than imperative, the order of operations of the functions is less important as the language does not describe how to reach the solution, merely asks what the solution is.

  • MapReduce is functional anyway!
    The map() and reduce() functions are functional in nature. They operate on the data rather than processing it.

Task 7.7
 Why functional?

To be honest, this stuff is tricky enough, so ...

In your notebooks : Produce a colourful mindmap to summarise this section. You are very likely to be asked about the reasons behind the choice of functional programming in an exam - that's the stuff in red :)

OUTCOME : Mindmap


Thank the Lord that's over!

Extension Activities 

How about these?
  • Read about the connection between Edgar F. Codd and Peter Chen in this article

  • Did you know that there are over 60 images on this webpage? Can you find out how the filenames are structured? Is there a relationship between the position in the page and the filename?

  • Further reading about the connection between functional programming and Big Data ...
    - Functional programming and Big Data
    - Functional vs procedural models for Big Data

What's next?

Before you hand your book in for checking, make sure you have completed all the work required and that your book is tidy and organised. Your book will be checked to make sure it is complete and you will be given a spicy grade for effort.

END OF TOPIC ASSESSMENT