Introduction
In this post, I’ll go over formal languages and grammars, then describe an application of context-free grammars in natural language processing.
Formal Language
A formal language is a set of strings of symbols. For example, let Σ = {a, b, c} be our symbol set; then any language over this set satisfies L ⊆ Σ* (L is a subset of Σ*, the set of all strings formed by zero or more repetitions of elements of Σ). {}, {a, bb, ac}, and {a, aa, aaa, …} are all examples of languages defined on Σ.
The written English language can be described as a formal language over a symbol set of the English alphabet, numbers, and punctuation. Consider that while all sentences ever written in English constitute a finite set, there are infinite possible sentences that can still be written.
Languages can be large and complicated, or infinite like English, making explicitly writing them out infeasible or impossible. Thus we would like to define some formalism for specifying languages. In the above example, I made use of “…” to express an infinite list of arbitrary length strings of the letter a; this is already a type of specification, though not very formal.
Various formalisms exist for language specification, such as a language being a set of strings
- generated by a formal grammar
- captured by a regular expression
- accepted by a Turing Machine
- solving a constraint satisfaction problem
When comparing formalisms, we should consider
- expressive power (can formalism X describe more languages than formalism Y?)
- computational efficiency (what is the computational complexity of X to produce strings from language L or check if string s is in L?)
- spatial efficiency (how expressive is an n-bit formalism?)
- language equivalence (given X that describes L1 and Y that describes L2, does L1 = L2?)
It so happens that these questions are often difficult to answer and relate to deeper concepts in computability theory and computational complexity theory.
Context-free grammars, a type of formal grammar, are rather expressive yet easy to reason about and parse for practical purposes. For these reasons, we will further explore these grammars as a means to specify languages.
Formal Grammar
A grammar G is a set of production (or rewrite) rules for strings of a formal language L(G) (L is a function that maps a grammar to the language it describes). A grammar can generate strings of L(G), given a start symbol, or can drive a “recognizer” that determines whether some string s ∈ L(G).
We can define a grammar as a 4-tuple G = (N, Σ, P, S), where
- N is a set of non-terminal symbols (symbols that can be expanded via production rules)
- Σ is a set of terminal symbols, disjoint from N (akin to Σ that defines the characters of a language)
- P is a set of production rules of the form (N ∪ Σ)* N (N ∪ Σ)* → (N ∪ Σ)* (the left-hand side must contain at least one non-terminal; aside from this constraint, either side can contain any combination of terminals and non-terminals)
- S ∈ N is the start symbol
Example Grammar
Let N = {A, B, C}, Σ = {a, b}, P = {A → aA, A → B, B → bB, B → b, C → aA}, and S = C.
Let’s see what language we can generate by successively applying the production rules until no non-terminal symbols remain: {C} ⇒ {aA} ⇒ {aaA, aB} ⇒ {aaaA, aaB, abB, ab} ⇒ …
You may find this is the language {aⁿbᵐ | n, m ≥ 1} (n occurrences of a followed by m occurrences of b).
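The derivation above can be automated. Here is a minimal sketch (names and the length bound are my own) that breadth-first expands sentential forms of the example grammar and collects the fully terminal strings:

```python
import re
from collections import deque

# The example grammar: N = {A, B, C}, Sigma = {a, b}, start symbol C.
RULES = {
    "A": ["aA", "B"],
    "B": ["bB", "b"],
    "C": ["aA"],
}

def derive(start="C", max_len=6):
    """Breadth-first expansion of sentential forms; collect fully
    terminal strings no longer than max_len."""
    strings = set()
    queue = deque([start])
    seen = {start}
    while queue:
        form = queue.popleft()
        if len(form) > max_len:
            continue  # bound the search so it terminates
        spots = [i for i, sym in enumerate(form) if sym in RULES]
        if not spots:
            strings.add(form)  # only terminals left: a string of L
            continue
        i = spots[0]  # expand the leftmost non-terminal
        for rhs in RULES[form[i]]:
            new_form = form[:i] + rhs + form[i + 1:]
            if new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return strings

generated = derive()
# Every derived string matches a+b+, i.e. n a's then m b's, n, m >= 1.
assert all(re.fullmatch(r"a+b+", s) for s in generated)
```

Running this confirms that every generated string consists of one or more a’s followed by one or more b’s.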
Chomsky Hierarchy
In 1959 [1], Noam Chomsky described a hierarchy for formal grammars. Each lower-numbered type is more expressive than the types above it.
- Type 3: regular grammars. These grammars are accepted by finite state machines, and have production rules of the form A → a or A → aB, where A, B ∈ N and a ∈ Σ. Note that the example above is of this type. Type 3 grammars are strictly “right-expanding” and can capture any regular expression (they generate exactly the regular languages).
- Type 2: context-free grammars (CFGs). Rules are of the form A → α, where A ∈ N and α ∈ (N ∪ Σ)+ (where + indicates one or more), meaning a non-terminal can expand in any direction (i.e. A can produce aA – right; aAa – middle; Aa – left; or any combination of these). “Context-free” indicates that a production rule can be applied to a non-terminal no matter the symbols that surround it. This property makes CFGs computationally easier to handle than less constrained grammars. Think of a CFG that can generate the language {aⁿbⁿ | n ≥ 1}.
- Type 1: context-sensitive grammars (CSGs). Production rules take the form αAβ → αγβ, where A ∈ N, α, β ∈ (N ∪ Σ)*, and γ ∈ (N ∪ Σ)+ – that is, A can be rewritten as γ only in the context of the surrounding α and β, and the right-hand side is never empty. Try thinking of a CSG that can generate {aⁿbⁿcⁿ | n ≥ 1}.
- Type 0: unrestricted. These are formal grammars exactly as specified by the definition we initially introduced. They produce the recursively enumerable languages and thus can generate any string accepted by a Turing Machine.
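For the Type 2 exercise above, one CFG that works (an example grammar, not the only possibility) is S → aSb, S → ab. A minimal recognizer that mirrors those two rules:

```python
def in_anbn(s: str) -> bool:
    """Recognizer for {a^n b^n | n >= 1}, following the example CFG
    S -> aSb | ab (one grammar that generates this language)."""
    if s == "ab":  # base rule S -> ab
        return True
    if len(s) >= 4 and s[0] == "a" and s[-1] == "b":
        return in_anbn(s[1:-1])  # recursive rule S -> aSb
    return False

assert in_anbn("aabb")
assert not in_anbn("aab")
```

Note how the recognizer peels one a and one b per recursive step, exactly matching the nesting that no regular grammar can express.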
Recall the formalisms we introduced for language specification. We now know that formal grammars are much more general than regular expressions, and that any language produced by an unrestricted grammar can be enumerated by a Turing Machine. We have not discussed constraint satisfaction problems much (see [2] for more information), but they are in fact comparable in computational complexity to unrestricted grammars.
There are many interesting and deep connections between grammars, languages, and computation machines, a taste of which you can see in this table.
Modeling English
In order to model written English with a formal language, we first assume we are dealing with the standard written form, the kind the Oxford English Dictionary would accept.
As already mentioned, we can model standard English as an infinite language. It is then natural to wonder: might we be able to construct a grammar that generates the language? We should consider what type of formal language English is to determine which type of grammar is sufficiently expressive.
English is not a regular language [3]. Intuitively, this is because of non-right-branching features such as center embedding: “The cat ran”, “The cat the dog chased ran”, “The cat the dog the rat bit chased ran”, etc. This construction has the form {(The N)ⁿ V_tⁿ⁻¹ ran | N is a noun, V_t is a transitive verb, n ≥ 1}, which is similar to the context-free (but not regular) language {aⁿbⁿ | n ≥ 1}.
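The center-embedding pattern is easy to generate mechanically. A quick sketch, with illustrative placeholder word lists (not from the post):

```python
# Illustrative word lists; any nouns and transitive verbs would do.
NOUNS = ["cat", "dog", "rat"]
VERBS_T = ["chased", "bit"]

def center_embedded(n: int) -> str:
    """Build a depth-n center-embedded sentence of the form
    (The N)^n (V_t)^(n-1) ran."""
    nps = [("The" if i == 0 else "the") + " " + NOUNS[i % len(NOUNS)]
           for i in range(n)]
    # Verbs appear innermost-first, so reverse their order.
    vts = [VERBS_T[i % len(VERBS_T)] for i in range(n - 1)]
    return " ".join(nps + list(reversed(vts)) + ["ran"])
```

With these lists, `center_embedded(2)` reproduces “The cat the dog chased ran” from the examples above.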
There is much work on determining the complexity of natural languages and whether particular ones are context-free (e.g. Dutch and Swiss German have been argued to be non-context-free due to cross-serial dependencies), but we can quickly see that CFGs are expressive enough for everyday English grammar rules.
Our Approximation of English
Armed with the theoretical knowledge of languages and grammars, we can take a shot at constructing a CFG for generating English (well, a little version of it).
Below is a Python script that defines such a CFG.
- N = {S, NP, VP, PN, N, ADJ, DET, V_INTRANS, V_TRANS} corresponding to {sentence, noun phrase, verb phrase, proper noun, noun, adjective, determiner, intransitive verb, transitive verb}.
- Σ = {Sasha, Nikolai, Olga, dog, goat, tree, slim, green, the, a, every, runs, barks, grows, feeds, walks, bites}
- P is given in the code. Note that a rule A → a | b is shorthand for the two separate rules A → a and A → b.
- S = S; however, feel free to change the START variable in the code.
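The original script is not reproduced here, so below is a minimal sketch of such a uniform random generator. The production rules P are a plausible reconstruction from the listed non-terminals and terminals, not necessarily the post’s exact rules:

```python
import random

START = "S"

# Hypothetical production rules: a plausible reconstruction using the
# listed non-terminals and terminals, not the post's original P.
P = {
    "S": [["NP", "VP"]],
    "NP": [["PN"], ["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V_INTRANS"], ["V_TRANS", "NP"]],
    "PN": [["Sasha"], ["Nikolai"], ["Olga"]],
    "N": [["dog"], ["goat"], ["tree"]],
    "ADJ": [["slim"], ["green"]],
    "DET": [["the"], ["a"], ["every"]],
    "V_INTRANS": [["runs"], ["barks"], ["grows"]],
    "V_TRANS": [["feeds"], ["walks"], ["bites"]],
}

def generate(symbol=START):
    """Expand symbol by picking one right-hand side uniformly at
    random, recursing until only terminals remain."""
    if symbol not in P:
        return [symbol]  # terminal symbol: emit as-is
    rhs = random.choice(P[symbol])
    return [t for sym in rhs for t in generate(sym)]

print(" ".join(generate()))
```

Each run prints one random sentence, e.g. a noun phrase followed by a verb phrase built from the terminals in Σ.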
Enjoy!
Some Remarks
The above example is a random generative grammar that weighs each right-hand side of a production rule equally. We have implicitly introduced a probabilistic context-free grammar (PCFG).
Linguistics assumes compositionality of language. The principle of compositionality says that the meaning of any expression is determined by the structure and meaning of its constituents. Consider this claim: if some language is not compositional, one cannot construct a CFG to model it.
Bibliography
2. Russell, Stuart and Norvig, Peter (2010). "Artificial Intelligence: A Modern Approach", Chapter 5.
3. Chomsky, Noam (1957). "Syntactic Structures".