History of compiler construction facts for kids

A compiler is a special computer program. It takes code written by a programmer (called the source code) and changes it into another computer language. This new language is usually machine code, which computers understand directly. The main reason for this change is to make a program that the computer can run.

Imagine you write a story in English, but your friend only understands Spanish. A compiler is like a super-fast translator that turns your English story into a Spanish one, so your friend can read it!

Any program written in a "high-level" language (like Python or Java) needs to be translated into machine code before a computer can use it. So, all programmers use compilers or similar tools. Better compilers can make programs run faster or work better.

In the late 1970s, a project called PQCC (the Production-Quality Compiler-Compiler project at Carnegie Mellon University) helped establish the way compilers are still built today. It showed that compilers usually have two main parts: a "front end" that deals with the code's structure and meaning, and a "back end" that creates the final machine code.

The First Compilers

Early computers were programmed using very basic languages, or even directly with numbers (machine code). It's much easier for programmers to use "high-level" languages, which are more like human languages. Programs written in these languages can also be used on different types of computers.

However, it took a while for compilers to become popular. The code they made wasn't always as good as code written by hand. Also, making a compiler was a huge project, and early computers didn't have much memory, which made it hard.

Between 1942 and 1945, Konrad Zuse designed Plankalkül, the first high-level programming language for a computer. He imagined a device that would automatically translate programs into machine code, but the first actual compiler for his language was not built until much later.

In 1951, Corrado Böhm wrote the first practical compiler as part of his PhD thesis. This was one of the first computer science doctorates ever awarded!

The first compiler to actually be implemented was written by Grace Hopper, who also came up with the word "compiler." Her A-0 system worked more like a loader or linker, which is a bit different from what compilers do today.

The first compiler in the modern sense was made by Alick Glennie in 1952 for the Mark 1 computer in Manchester.

In 1957, the FORTRAN team at IBM, led by John W. Backus, released the first commercially available compiler. It took about 18 person-years of work to create!

By 1960, a FORTRAN compiler called ALTAC was available for another computer, the Philco 2000. This meant FORTRAN programs could be compiled for different computer designs. The first high-level language that worked on many different computers was COBOL. In December 1960, a COBOL program was compiled and run on two different computers: the UNIVAC II and the RCA 501.

Self-Hosting Compilers

Just like any other software, compilers can be written in a high-level language. A cool thing is that a compiler can be "self-hosted." This means it's written in the very same programming language it compiles!

Building a self-hosting compiler is a bit like a puzzle. The very first version of such a compiler has to be written in a different way. Maybe it's written in basic machine code, or compiled by a compiler for another language. But once it works, it can then compile itself!

Corrado Böhm's Idea

In his 1951 PhD paper, Corrado Böhm described a language, a machine, and a way to compile that language on the machine. He not only explained a full compiler but also defined it using its own language. This was a very clever idea!

NELIAC

NELIAC (the Navy Electronics Laboratory International ALGOL Compiler) was a dialect of the ALGOL 58 programming language together with its compiler. It was developed in 1958 by the Naval Electronics Laboratory.

NELIAC was the idea of Harry Huskey, a famous computer scientist. The first version was built for a computer called the Countess. It was the world's first self-compiling compiler. First, a simple "bootstrap" version was written in assembly language. Then, the compiler was rewritten in its own language and compiled by that bootstrap. Finally, it compiled itself, making the bootstrap version no longer needed.

Lisp

Another early self-hosting compiler was written for Lisp in 1962 by Tim Hart and Mike Levin at MIT. They wrote a Lisp compiler in Lisp itself. They tested it using an existing Lisp "interpreter" (a program that runs code directly without compiling it). Once the compiler was good enough to compile its own source code, it became self-hosting.

This method works when there's already an interpreter for the language you want to compile. It's like a program working on itself as input.

Forth

Forth is another example of a self-hosting compiler. Forth is an "extensible" language, meaning you can easily add new features to it. This extensibility allows Forth to create new versions of itself or move to new computer systems.

How Compilers Understand Code

A very important part of a compiler is the "parser." The parser reads the source code and turns it into an internal format that the computer can understand.

Programming languages are often described using "context-free grammars." This is a way to precisely define how different parts of a program fit together. Fast and efficient parsers can be built for these grammars. Parsers can be written by hand or created by special programs called "parser generators."

The idea of context-free grammars was developed in the mid-1950s by Noam Chomsky. The ALGOL project (1957–1960) introduced "block structure" into programming languages and used a context-free grammar to describe its syntax.

If a language designer sticks to certain rules for context-free grammars, even more efficient parsers can be made.

LR Parsing

The LR parser was invented by Donald Knuth in 1965. An LR parser reads the code from left to right (the "L") and reconstructs a "rightmost derivation" (the "R"), a way to trace how the code was put together from the grammar rules. The "k" in LR(k) means how many symbols the parser looks ahead to make decisions.

Knuth showed that LR grammars can be parsed very quickly. He also proved that any LR(k) grammar can be changed into an LR(1) grammar, meaning you only ever need to look ahead one symbol to parse it.

In 1969, Frank DeRemer came up with more practical LR techniques called Simple LR (SLR) and Look-ahead LR (LALR). This was a big step forward because Knuth's original LR parsers needed tables that were too big for the limited memory of computers back then.

LALR parsers are often used in compilers today to check the syntax of source code because they work well for many programming languages.

LL Parsing

An LL parser also reads the input from left to right. But it builds a "leftmost derivation" of the sentence. LL grammars are a bit more limited than LR grammars, but they are still very useful for compiler writers because they are simple and efficient to build.

LL(k) grammars can be parsed by a "recursive descent parser." These are usually coded by hand.
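
Here is a small sketch of the idea in Python (an illustration only, not any real compiler). It parses and evaluates a made-up grammar for sums and products, with one function per grammar rule and one token of look-ahead:

  # Grammar (hypothetical): expr -> term ('+' term)*   term -> factor ('*' factor)*   factor -> NUMBER
  def parse_expr(tokens, pos=0):
      value, pos = parse_term(tokens, pos)
      while pos < len(tokens) and tokens[pos] == '+':
          right, pos = parse_term(tokens, pos + 1)
          value += right
      return value, pos

  def parse_term(tokens, pos):
      value, pos = parse_factor(tokens, pos)
      while pos < len(tokens) and tokens[pos] == '*':
          right, pos = parse_factor(tokens, pos + 1)
          value *= right
      return value, pos

  def parse_factor(tokens, pos):
      return int(tokens[pos]), pos + 1   # expects a number token here

  print(parse_expr(['2', '+', '3', '*', '4']))   # prints (14, 5)

Each function only ever looks at the next token before deciding what to do, which is what makes this an LL(1)-style parser.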

The design of the ALGOL language led to the idea of recursive descent parsing. This concept was discussed in 1961 by A.A. Grau and Edgar T. Irons.

Niklaus Wirth made recursive descent popular with PL/0, a simple language used to teach how to build compilers in the 1970s.

LR parsing can handle more types of languages than LL parsing. It's also often better at finding errors in the code quickly.

Earley Parser

In 1970, Jay Earley invented the Earley parser. These parsers are good because they can parse all context-free languages quite efficiently.

Languages for Describing Grammars

John Backus suggested "metalinguistic formulas" to describe the rules for a new programming language called IAL (now known as ALGOL 58) in 1959. His work was based on ideas from Emil Post.

Later, for ALGOL 60, Peter Naur named Backus's notation "Backus normal form" (BNF) and simplified it. However, Donald Knuth argued it should be called Backus–Naur form, which is what we use today.

Niklaus Wirth later defined extended Backus–Naur form (EBNF), an improved version of BNF, in the early 1970s. Another version is Augmented Backus–Naur form (ABNF). Both EBNF and ABNF are widely used to describe the grammar of programming languages and for other things like communication protocols.

Parser Generators

A parser generator is a program that helps build part of a compiler. You give it a description of a programming language's grammar, and it creates a parser for that language. That parser can then be used inside a compiler. The source code is first broken into "tokens" (keywords, names, numbers, and symbols), and the parser reads these tokens and works out the structure of the program for the rest of the compiler to use.
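
To give a feel for what tokens look like, here is a tiny made-up tokenizer written in Python (only a sketch; real tools such as lex generate far more capable scanners from a specification):

  import re

  # Break a line of code into (kind, text) tokens: numbers, names, and single symbols.
  TOKEN_PATTERN = re.compile(r"\s*(?:(\d+)|(\w+)|(\S))")

  def tokenize(text):
      tokens = []
      for number, name, symbol in TOKEN_PATTERN.findall(text):
          if number:
              tokens.append(('NUMBER', number))
          elif name:
              tokens.append(('NAME', name))
          else:
              tokens.append(('SYMBOL', symbol))
      return tokens

  print(tokenize("count = count + 1"))
  # [('NAME', 'count'), ('SYMBOL', '='), ('NAME', 'count'), ('SYMBOL', '+'), ('NUMBER', '1')]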

The first program called a "compiler-compiler" was written by Tony Brooker in 1960. It was used to create compilers for the Atlas computer. However, it was different from modern parser generators.

In the early 1960s, Robert McClure invented a compiler-compiler called TMG. It was later moved to different computers.

The Multics project, which aimed to create an operating system in a high-level language, used TMG to develop their own version of the PL/I language called EPL in 1964.

Ken Thompson used TMG to write the compiler for the B language in 1970. B was the language that came right before C.

XPL

XPL is a version of the PL/I programming language. It was designed in 1967 by a team at Stanford University to help build compilers.

XPL included a simple "translator writing system" called ANALYZER. It was first run on the IBM System/360 computer.

Yacc

Yacc is a very famous parser generator. It was developed by Stephen C. Johnson at AT&T for the Unix operating system. Yacc stands for "Yet Another Compiler Compiler." It creates a parser from grammar rules written in a notation similar to Backus–Naur form.

Johnson worked on Yacc in the early 1970s. Because Yacc was the standard compiler generator on most Unix systems, it was used by many people. Versions of Yacc, like GNU Bison, are still used today.

The parser made by Yacc needs a "lexical analyzer" to break the source code into tokens for it. Lexical analyzer generators, like lex or flex, are often used together with Yacc.

Coco/R

Coco/R is a parser generator that creates LL(1) parsers. It was developed by Hanspeter Mössenböck in 1985.

ANTLR

ANTLR is another parser generator that creates LL(*) parsers. It was developed by Terence Parr in the early 1990s.

Metacompilers

Metacompilers are different from parser generators. They take a program written in a "metalanguage" as input. This metalanguage combines grammar rules with operations that build abstract syntax trees or simply create reformatted text.

Many metacompilers can be programmed in their own metalanguage, allowing them to compile themselves.

Dewey Val Schorre's META II compiler, released in 1964, was the first documented metacompiler. It could define its own language and others. META II also translated code into instructions for one of the earliest "virtual machines."

TREE-META, a second-generation metacompiler, appeared around 1968. It improved on META II by separating code production from grammar analysis.

CWIC, described in 1970, was a third-generation metacompiler. It added more features for analyzing grammar and generating optimized code.

Cross Compilation

A cross compiler runs on one type of computer but creates code for a different type of computer. Cross compilers are often used for "embedded development," where the target computer (like a chip in a smart device) has very limited power.

An early example was AIMICO, where a FLOW-MATIC program running on a UNIVAC II computer was used to create assembly language for an IBM 705 computer.

The ALGOL 68C compiler created "ZCODE" output. This ZCODE could then either be compiled into the local machine code or run directly by an interpreter. This made it easier to use ALGOL 68C on many different computer platforms.

Optimizing Compilers

Compiler optimization is about making the final machine code better without changing what the program does. This often means making it run faster or use less memory.

The people who made the first FORTRAN compiler wanted to create code that was even better than code written by hand. They often succeeded!

Later compilers, like IBM's Fortran IV, focused more on giving good error messages and compiling quickly, even if the final code wasn't perfectly optimized. It wasn't until the IBM System/360 series that IBM offered two different compilers: a fast one for checking code and a slower one for optimizing it.

Frances E. Allen introduced many important ideas for optimization. Her work in the 1960s and 70s showed how to use "graph data structures" to understand programs for optimization. She also developed ways to analyze how data flows through a program and created a list of optimizing changes. Her work helped create the modern optimizers we use today.

John Cocke and Jacob T. Schwartz's book in 1970 explained many optimization techniques that are still common, like removing unnecessary code and making calculations simpler.

In 1972, Gary A. Kildall introduced the theory of "data-flow analysis," which is used in optimizing compilers today.

Peephole Optimization

Peephole optimization is a simple but effective way to make code better. It was invented by William M. McKeeman in 1965. It works by looking at a small "peephole" of code (just a few instructions) and replacing them with more efficient ones.
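
Here is a small sketch of the idea in Python (purely illustrative, with made-up instructions). It slides a two-instruction "peephole" over the code and removes patterns that do nothing useful:

  # Remove useless "add 0" instructions and cancel a push immediately followed
  # by a pop of the same register.
  def peephole(instructions):
      result = []
      i = 0
      while i < len(instructions):
          window = instructions[i:i + 2]
          if window[0] == ('ADD', 0):                       # adding zero changes nothing
              i += 1
              continue
          if (len(window) == 2 and window[0][0] == 'PUSH'
                  and window[1] == ('POP', window[0][1])):  # push then pop cancels out
              i += 2
              continue
          result.append(window[0])
          i += 1
      return result

  code = [('LOAD', 'x'), ('ADD', 0), ('PUSH', 'r1'), ('POP', 'r1'), ('STORE', 'y')]
  print(peephole(code))   # [('LOAD', 'x'), ('STORE', 'y')]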

Capex COBOL Optimizer

In the mid-1970s, Capex Corporation developed the "COBOL Optimizer." This optimizer knew about "weaknesses" in the standard IBM COBOL compiler and replaced parts of the machine code with more efficient code. For example, it might swap a slow linear table lookup for a faster binary search.

Today, most compilers let programmers choose whether to use optimization or not.

Diagnostics and Error Messages

When a compiler finds a mistake in a program, a clear error message is very helpful. But it's often hard for the compiler writer to make good error messages.

The WATFIV Fortran compiler, developed at the University of Waterloo in the late 1960s, was designed to give better error messages than IBM's compilers at the time. WATFIV was also easier to use because it combined compiling, linking, and running into one step.

PL/C

PL/C was a programming language developed at Cornell University in the early 1970s. It was made specifically for teaching programming. The designers, Richard W. Conway and Thomas R. Wilcox, wanted it to be easy to use for students.

PL/C removed some complex features of PL/I and added many tools for finding and fixing errors. The PL/C compiler was special because it would never fail to compile a program. It would automatically fix many syntax errors, and any remaining errors would be turned into messages that the program would print out.

Just-in-Time Compilation

Just-in-time compilation (JIT) means creating executable code "on the fly," or right before it's needed. This allows the compiler to make smart decisions based on how the program is actually running, which can make it faster.

Intermediate Representation

Most modern compilers have a "lexer" and a "parser" that create an "intermediate representation" of the program. This intermediate representation is a simplified version of the code. It can then be used by an "optimizer" (to make the code better) and a "code generator" (to create instructions for the target computer). Because the code generator uses this intermediate representation, the same code generator can be used for many different high-level languages.

One common form of intermediate representation is "three-address code." This means each operation has an operator, two inputs, and a result.
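
For example (a hypothetical illustration, using made-up temporary names t1 and t2), the statement d = a + b * c could be broken into a list of three-address instructions like this:

  # Each instruction is (result, operator, input1, input2).
  three_address_code = [
      ('t1', '*', 'b', 'c'),       # t1 = b * c
      ('t2', '+', 'a', 't1'),      # t2 = a + t1
      ('d', 'copy', 't2', None),   # d = t2
  ]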

"Static Single Assignment" (SSA) was developed by IBM researchers in the 1980s. In SSA, a variable is given a value only once. If a variable needs to change, a new variable is created instead. SSA makes it easier to optimize code and generate the final machine instructions.

Code Generation

A code generator creates the final machine language instructions for the computer's processor.

Register Allocation

The Sethi–Ullman algorithm is a method to figure out the smallest number of "registers" (small storage areas in the processor) needed to evaluate an expression.
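
Here is a small sketch of the idea in Python (an illustration, not a full register allocator). A leaf of an expression tree needs one register; an inner node needs one extra register only when both of its sides need the same amount:

  # An expression tree node is either a leaf (a name) or a tuple (op, left, right).
  def registers_needed(node):
      if isinstance(node, str):      # a leaf: a variable or constant
          return 1
      _, left, right = node
      l = registers_needed(left)
      r = registers_needed(right)
      if l == r:
          return l + 1               # both sides tie: one extra register is needed
      return max(l, r)               # do the harder side first and reuse its registers

  # (a + b) * (c + d) needs 3 registers, but a + (b + (c + d)) needs only 2.
  print(registers_needed(('*', ('+', 'a', 'b'), ('+', 'c', 'd'))))   # 3
  print(registers_needed(('+', 'a', ('+', 'b', ('+', 'c', 'd')))))   # 2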

Famous Compilers

  • Amsterdam Compiler Kit by Andrew Tanenbaum
  • Berkeley Pascal, written by Ken Thompson in 1975.
  • GNU Compiler Collection (GCC): Originally by Richard Stallman in 1987, GCC is a very important modern compiler used for many free software projects, like Linux.
  • LLVM: Originally stood for "Low Level Virtual Machine," though the name is no longer used as an acronym.
  • Small-C by Ron Cain and James E Hendrix
  • Turbo Pascal: Created by Anders Hejlsberg, first released in 1983.
  • WATFOR: Created at the University of Waterloo. One of the first popular compilers for teaching, though not used much today.

See also

  • History of programming languages
  • Lex (and Flex lexical analyser), a tool often used with Yacc to break code into tokens.
  • BNF, a formal way to describe the grammar of programming languages.
  • Self-interpreter, a program that can run code written in its own language.
History of compiler construction Facts for Kids. Kiddle Encyclopedia.