CS37 : John and Peter

Defining formal languages is a tricky business. We need another language to do it; a metalanguage. Crazy.

We are learning ...
  • About Backus-Naur Form
So that we can ...
  • Check language syntax be referring to BNF productions or syntax diagrams
  • Formulate simple production rules
  • Explain why BNF can represent some languages that cannot be representing with regular expressions

Backus-naur form (BNF) is another metasyntax / metalanguage used to express 'context-free' grammars; a formal way to describe formal languages where each element in the language is not constricted by it's context / surroundings. Context free grammars are easier to parse than context sensitive grammars.

Two computer scientists in the late 1950's developed BNF to help define the structure of the programming language Algol↗.

Backus-Naur Form is used because defining the syntax of a formal language using a regular expression is very tedious for languages with large alphabets. It would be virtually impossible to use RE to define the Python programming language for instance.

This is an example of a Backus-Naur form grammar.

Click to view

Considering the above example, a BNF grammar consists of 4 parts ...
  • A set of literals, terminals or tokens
    Terminals are the smallest units of syntax in a grammar. The alphabet. There are atomic and cannot be defined in terms of anything else;

  • Productions
    A statement written in the BNF metalanguage;

  • A set of meta-variables or non-terminals
    A non-terminal symbol is not part of the alphabet and is further defined in another production. Non-terminals take the place of larger pieces of syntax from a grammar; 

  • A start symbol
    There is a special non-terminal called the start symbol which represents the highest level of the grammar; the top of the tree as it were.

Activity 1 Metasymbols (30)

Non-terminals are defined using production rules. The components of a production rule are ...
  • <...> allows definition of metavariables (or non-terminal symbols) which may be further defined in terms of other meta-variables found elsewhere in the BNF and terminal symbols (components of the language alphabet which require no further definition);

  • ::= means 'is defined as';

    <LHS> ::= RHS means 'LHS is defined as RHS'

  • | (pipe character) represents alternation (OR);

    <LHS> ::= a|b means 'LHS is defined as a OR b'

    This is the same as two separate production rules ...

    <LHS> ::= a
    <LHS> ::= b

  • A meta-symbol can be defined in terms of itself, recursively;

    <LHS> ::= a | a <LHS> means 'LHS is defined as a OR a followed by LHS'

The left hand side of a BNF production is always a single non-terminal whilst the right hand side can consist of any combination of terminals and non-terminals (an expression).

<meta-variable> ::= expression

For example, a full definition of signed and unsigned numbers in BNF uses the following productions ...

<signed_real> ::= -<unsigned_real>|<unsigned_real>
<unsigned_real> ::= <integer>.<integer>
<integer> ::= <digit>|<digit><integer>
<digit> ::= 0|1|2|3|4|5|6|7|8|9

Task 1.1
Describing BNF in English

Can you 'describe' the above BNF productions in a natural language of your choice (probably English)?

OUTCOME : Write your 'descriptions' in your notes along with a copy of the BNF rule.

Activity 2 Parse trees (145)

The grammar is a set of rules which explain how to build a 'parse tree'. A parse tree can be used to generate valid sentences in the language or to check whether sentences are valid in the language. When writing parse trees, we always work from the start symbol down to the terminals. The resulting string is read from left to right.

Example : Writing a valid English sentence.

Task 2.1
 Constructing valid numbers

Can you use a parse tree to construct a range of valid, <signed_real> numbers from the BNF grammar you looked at in the last task?

OUTCOME : Draw the parse trees out in your notes along with the valid numbers you have generated.

Example : Checking the syntax of a mathematical expression - syntactical analysis.

The following BNF rule defines a simple mathematical expression.

<exp> ::= <exp>+<exp> | <exp>*<exp> | (<exp>) | a | b | c

This is equivalent to 6 separate productions ...

<exp> ::= <exp>+<exp>
<exp> ::= <exp>*<exp>
<exp> ::= (<exp>)
<exp> ::= a
<exp> ::= b
<exp> ::= c

If we check the validity of a string, we work from the outside of the string to the inside, choosing the most suitable production each time working down from the start string. If we cannot choose a suitable production or a terminal, the string is not valid.

Task 2.2
 Parse trees

Show parse trees for each of these strings and determine whether the given string is valid or not.
  1. a+b 
  2. a*b+c 
  3. (a+b) 
  4. (a-b) 
  5. (a+(b))
OUTCOME : Draw the parse trees in your notes, state whether the string is valid and if it is not, explain why.

An alternative is to use bottom-up or shift-reduce parsing to examine the string to see if it is indeed part of the language we have defined with the BNF. This is the reverse of the technique of drawing the parse tree we have just seen. We start with the sentence and work up to the start symbol.

Earlier on in this section, we met Pencil hardness. Here is a set of BNF rules which define the structure of the Pencil hardness grammar. I've numbered the productions so I can refer to them in the example which follows.
  1. <lead_hardness> ::= <simple_hardness>|<scale_of_hardness> 
  2. <scale_of_hardness> ::= <numeric_value><simple_hardness> 
  3. <simple_hardness> ::= H|B|F|HB 
  4. <numeric_value> ::= 1|2|3|4|5|6|7|8|9
Let's try to prove that 2H is a valid lead hardness using the shift-reduce technique.

Click to view

Here is another example using the simple mathematical expression grammar.

<exp> ::= <exp>+<exp> | <exp>*<exp> | (<exp>) | a | b | c

We'll prove that ((a+b)*c) is a valid mathematical expression using bottom up parsing.

Click to view

Now time for some more questions to see if you 'get it'!

Task 2.3
 BNF Challenges

Use parse trees or bottom-up parsing methods to answer the following set of questions.
  1. Backus Naur Form (BNF) is used by compiler writers to express the syntax of a programming language. The syntax for a part of one such language is written in BNF as follows:

    <expression> ::= <integer> | <expression> <operator> <expression>
    <integer> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    <operator> ::= + | - | * | /

    Do the following expressions conform to this grammar?

    a) 4*9
    b) 8+6/2
    c) -6*2
    d) (4+5)*5

  2. Identifiers in a particular programming language are valid if they are constructed based on the following BNF productions : (NEED TO CHECK PRODUCTION RULES)

    <letter> ::= A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
    <digit> ::= 1|2|3|4|5|6|7|8|9
    <character> ::= <letter>|<digit>|_
    <string> ::= <character>|<character><string>
    <identifier> ::= <letter>|<letter><string>

    Are the following identifiers valid?

    a) Z__
    b) 3AXY
    c) D_E
    d) D12
    e) PETER's
    f) _AbC
    g) ABC
OUTCOME : Draw the parse trees and bottom-up parsing tables in your notes, state whether the string is valid and if it is not, explain why.

Activity 3 Extended Backus-Naur Form (35)

For complex grammars, BNF is too restrictive. Extended BNF, or EBNF, adds regular expression style notation ...
  • Literals are enclosed by double quotes for example "word"; 
  • Optional items shall be enclosed in square brackets indicating that the item appears either zero or one times. 
  • Repetition is introduced using curly brace meta-symbols '{' and '}'. 
  • Curly braces followed by a * character indicate zero or more sequential instances of the item 
  • Curly braces followed by a + character indicate one or more sequential instances (i.e. not optional). 
  • Double period (..) shall appear within a literal definition as a shortcut denoting a set of ASCII characters between the letters on either side of them. For example "A..Z" representing the set of uppercase letters. 
For instance, the EBNF definitions of the signed and unsigned numbers uses the following productions ...

signed_real = ("-" | "+") unsigned_real .
unsigned_real = integer "." integer .
integer = digit {digit} .
digit = "0..9" .

Note that EBNF is no more powerful than BNF, just more convenient. Any EBNF production can be written in BNF, it just might look a lot more complicated.

Task 3.1
 Visit the BNF club
Web Browser

Visit the BNF club and spend a little time investigating the BNF rules for SQL. Do you see any you recognise from your work on databases? If not, it's probably because you haven't done about them yet!

OUTCOME : No specific outcome required

Activity 4 Syntax diagrams (100)

In the last task (the BNF club), you saw some diagrams that looked like railway sidings. These are called railroad diagrams or, more correctly, syntax diagrams. Syntax diagrams are an alternative way of defining the syntax rules of a language. They can even be used to describe a subset of the English language.

The following EBNF production ...

Syntax = S (Y (N (T (A (X | ctic | gma) | he (sis | tic)) | o (nym | (p (sis | tic)))) | stem [[at] ic]) | em (an | iom) tic) .

... could be used to generate this syntax diagram.

This is an example of a programming statement which is defined both as a BNF grammar and a syntax diagram. Look carefully at the popup.

Click to view

Modern programming languages still use syntax diagrams to define their structure. For instance, the following syntax diagram is from the JSON (JavaScript Object Notation) website ...

Click to visit http://www.json.org/

If you can get from the left hand side to the right hand side whilst parsing the number (in this case) and not leaving the lines, the the number is valid as shown in the following beautiful giffy example ...

How to use a syntax diagram to check the validity of a string

Task 4.1
 Advantages of syntax diagrams
  1. Using this number syntax diagram from the Json website, can you use it to define a set of 5 valid numbers?

  2. Use this number syntax diagram to determine whether the following numbers are valid in that language:

    a) +12.33E14
    b) -0.99
    c) 12.244E-12
    d) -1233.34e+45
    e) 0

  3. How much easier is it to use a Syntax diagram like this than the BNF grammar you used in the Task called 'Constructing valid numbers'?
OUTCOME : Make a note of the syntax diagram in your notebook and use it to write down at least 3 examples of valid numbers.  You could use a different coloured pen to 'trace' your way through the railroad diagram.  Include a statement justifying the advantage of syntax diagrams over BNF grammars.

We can't get away without setting some challenging questions can we?

Task 4.2
 Syntax diagram challenges
  1. A user ID for logging onto a computer system consists of ...

    ... two digits followed by 
    ... one of the abbreviations H, L, D, R, Pa, Ph, followed by 
    ... a surname consisting of one or more upper (A..Z) or lower case (a..z) letters followed by 
    ... a single initial chosen from the set of upper case letters, A..Z.  

    Draw a syntax diagram for the user ID's in this system.

  2. An identifier in a programming language consists of a letter followed by zero or more letters or digits.  Draw a syntax diagram for valid identifiers in this language.

  3. Draw a syntax diagram which defines the Roman Numerals from 1 to 10 ...

    I, II, III, IV, V, VI, VII, VIII, IX, X

    Your diagram should contain only 3 terminal symbols and should not consist simply of a series of alternatives. HINT : Start with I and build the alternatives up from there.
OUTCOME : Draw your syntax diagrams carefully in your notes and get your teacher to check them over. Ask for hints if you need to :)

Activity 5 EBNF Visualiser (30)

There is a free piece of software called 'EBNF Visualiser' which is available to download from the Software directory for this section as well as a number of '.ebnf' files from the resources for this section which define various grammars.

The user interface of EBNF Visualiser (Windows 7+ version)

Task 5.1
 EBNF visualiser - a challenge!
EBNF visualiser
Various *.ebnf files (grammars)

Can you get EBNF visualiser to work?  Can you find the case_statement syntax diagram from the screenshot? You can inspect the *.ebnf files using notepad to see what the flavour of BNF looks like.

OUTCOME : Nothing apart from the satisfaction of getting it to work!

Extension Activities 

Wasn't this challenging enough? Have a night off!

Railroad diagram generator : http://www.bottlecaps.de/rr/ui

What's next?

Before you hand your book in for checking, make sure you have completed all the work required and that your book is tidy and organised. Your book will be checked to make sure it is complete and you will be given a spicy grade for effort.