Sungwon Chung

Sungwon Chung@sungpi

Showing all posts tagged "Compiler"

Top-down Parsing

Definition

Parser begins with the Sentence symbol S, expands a production for S, and so on recursively until words are reached. (Terminals) If the string of words matches the input, a parsing has been found.

Why Backup May Be Required

This approach to parsing might seem hopelessly inefficient. However, top-down filtering, that is testing whether the next word in the input string could begin the phrase about to be tried, can prune many failing paths early.

For What Kinds of Languages Does It Work Well

Languages with keywords, such as programming languages or natural language application. Easy to program.

Recursive Descent Parser

A parse for some context-free languages can be written using the technique of recursive descent. The basic idea is to write a procedure to parse each kind of statement or expression in the grammar. When such procedures are written in a recursive language, they can call each other as needed.
It works well for a well-structured language such as Pascal. In Pascal each statement other than assignment statements begin with a unique reserved word; thus, it is easy for a “big switch" program to determine which statement parser to call.

Problems with Left Recursion

It may be necessary to restructure a grammar to avoid left recursion, which can cause a recursive descent parser to go into a loop.

Left Factoring

A method of modifying a grammar to eliminate left recursion

Implementation of Top-down Parser by A Recursive Program

be able to write a procedure for a construct such as an IF statement

Avoiding Ambiguity while keeping the syntax

Pascal and C follows this convention.

  • If parser is SLR, LR(1), or LALR LR parser generator, the programmer will often rely on the generated parser feature of preferring shift over reduce whenever there is a conflict.
  • If the parser is produced by a pruning and deep pruning LR generator, one can issue directives that prune away the ambiguities completely.
  • If the parser is hand-written, the programmer may use a non-ambiguous context-free grammar. Alternatively, one may rely on a non-context-free grammar or a parsing expression grammar.
Avoiding Ambiguity By Changing the Syntax
  • Have a “end if" statement.
  • Disallow following then to be if.
  • Require a bracelet when else follows if
  • Require every if to be paired with an else

Citation:

  • Aho, Lam, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools
  • UT Austin CS375: Compilers by G.Novak.

Operator Precedence Grammar and Parser

Definitions

Expressions could be written in an unambiguous, fully parenthesized form. However, this is less convenient for the programmer.

Precedence numbers

Precedence specifies which operations in a flat expression are to be performed first.
Usually,

  • .
  • ^
  • unary -
  • * /
    • -
  • = <> >= > <= <
  • not
  • and
  • or
  • :=
Associativity

Associativity specifies which operations are to be performed first when adjacent operators have the same precedence.

How operator precedence parser works(give example)

Shift & Reduce

  • Operands
  • Operator
  • (
  • )
  • End

Citation:

  • Aho, Lam, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools
  • UT Austin CS375: Compilers by G.Novak.

Derivation From a Context-Free Grammar

Derivation is the process of deriving a sentence from the start symbol S according to a grammar.

Relationship to Parse Tree

Notations for a Derivation

Symbol => is used to denote a derivation step.
=>* is used to denote derivation in zero or more steps.
=>+ is used to denote derivation in one or more steps.

Definition of a language in Terms of Derivations

Leftmost derivation : the leftmost nonterminal is replaced in each step. If S=>*(lm) a then a is called a left-sentential form of the grammar G. A leftmost derivation is traced by a predictive, top-down parser.
Rightmost derivation canonical derivation : the right most nonterminal is replaced in each step. A shift-reduce parser traces a rightmost derivation “backwards"

Citation:

  • Aho, Lam, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools
  • UT Austin CS375: Compilers by G.Novak.

Context-free Languages

Grammar Definition

A - a
A $\in$ N
a $\in$ V*

Since left-hand-side of each production is a single nonterminal, every derivation is a tree.

Memory Needed for Parsing.

Recursive program = a stack for temporary storage

Notations for defining the syntax of a language :

BNF
Pascal-style transition diagrams.

Ambiguous grammars VS unambiguous grammars

Ambiguous : grammar for which some sentence has more than one parse tree is ambiguous. There’s two way to deal with this. Re-write the grammar and User disambiguating rules to guide the parser.

Augmented transition network;Yacc

  • Arbitrary tests : can be added to the arcs.
  • Structure-building actions : can be added to the arcs. These actions may save information in registers to be used later by the parser, or to build the representation of the meaning of the sentence. Transformations, can also be handled.
  • Phrase names : as well as part-of-speech names, may appear on arcs. This allows a grammar to be called as a subroutine.

What kinds of language constructs are context-free?

Arithmetic expressions, noun pharses.
Any language that requires counting must be at least context free.

Citation:

  • Aho, Lam, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools
  • UT Austin CS375: Compilers by G.Novak.

Floating Point

[http://steve.hollasch.net/cgindex/coding/ieeefloat.html]

Lexical Analyzer

What constructs are parsed

Type of grammar used

Use regular expressions.

Need for speed?

NFA to DFA, and converted to tables to control a table-driven parser.

lex

Citation:

  • Aho, Lam, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools
  • UT Austin CS375: Compilers by G.Novak.

Regular Expression

Definition

Regular expresssions are a more convenient way(than a regular grammar) to specify a regular language.

Construct a Regular Expression for a Language

in lex rule.
[chars] : Any number of the set of characters chars
[c1-c2] : Any character from c1 through c2
[^chars] : not chars
{category} : An instance of a previously named category
“string" : Exactly the specified string.
s1 | s2 : s1 or s2
a : zero or more a
a+ : one or more a
a? : zero or one a
* a{2,5} : 2 to 5 a

Construct the Language Denoted by a Regular Expression

Follow the rule.

Relation Between Regular Expression and Finite Automata

Always can change into DFA.

Automatic Construction of Finite Automata from Regular Expressions

Citation:

  • Aho, Lam, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools
  • UT Austin CS375: Compilers by G.Novak.

Regular Languages 정규 언어

Grammar?

Formal Definition of a Grammar

Grammar specifies the legal syntax of a language. A grammar specifies a set of productions; non-terminal symbols. A grammar describes the structure of the sentences of a language in terms of components or phrases. The mathematical description of phrase structure grammars is due to Chomsky.
문법은 언어의 syntax(the rules by which legitimate statements can be constructed)를 나타내는 규칙이다. 문법은 production으로 이루어져 있다.

Formally, a Grammar is a four-tuple G = (T, N, S, P) where:

  • T is the set of terminal symbols or words of the language.
  • N is the set of nonterminal symbols or phrase names that are used in specifying the grammar. We say V = T $\cup$ N is the vocabulary of the grammar.
  • S is a distinguished element of N called the start symbol.
  • P is a set of productions, P $\subseteq V^NV^ \texttimes V^*$. We write productions in the form a $\textrightarrow$ b where a is a string of symbols from V containing at lest one nonterminal and b is any string of symbols from V.
  • V = Vocabulrary
    문법은 언어의 구조가 올바른지를 판단하는 거다.

Chart about Grammar Classes.

  • General Phrase-structure
  • Context Sensitive
  • Context Free pushdown stack O($n^3$)
  • Regular finite storage O(n)
    We’ll focus on Context-Free and Regular Language.

Regular Languages

Grammar definitions

A - xB.
A - x.
A, B $\in$ N
x $\in T^*$
Only one nonterminal can appear in any derived string, and it must appear at the rightmost.

Uses

In compilers, we use to parse the stream of characters into tokens. Cause parser never has to back up or do search. O(n); Linear parsing time. Indentifiers, numbers, word forms.

Local Ambiguity

Examples

A - xB
B - x

Way to Handle

Make test cases to pre-process it.

Relationship of finite automata and regular grammars

Equivalent to a deterministic finite automaton

Linear nature of a derivation from a regular grammar

Since only one nonterminal can appear in any derived string, and the rightmost end, parse tree has only one step to evaluate it, produces right-biased tree.

Citation:

  • Aho, Lam, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools
  • UT Austin CS375: Compilers by G.Novak.

Major Phases of a Compiler 컴파일러의 주요 단계들

What the phases do 각 단계는 무엇을 하나?

Lexical Analyzer

or scanning
Reads the stream of characters and group the characters into meaningful sequences called lexemes.
글자들을 읽어서 글자들에 의미를 부여하는 단계; 문장에서 단어를 잘라준다고 생각하면 편하다.
ex a = 1 + 2 는 a,=,1,+,2 5개의 token이 된다.

Syntax Analyzer

or parsing
The parser uses the first components of the tokens produced by the lexical analyzer to produce a tree-like intermediate representation that describes the grammatical structure of the token stream.
Parser는 lexer가 만들어준 토큰을 가지고, 문장에 의미를 부여하기 시작한다. (컴퓨터 과학에서는 tree로 token을 엮어낸다.)

Semantic Analyzer

Semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
Parser가 만들어준 트리로, 이 것이 과연 문법적으로 맞는 놈인지 문법을 검사한다.

Intermediate Code Generator
Machine-independent Code optimizer
Code Generator
Machine-dependent Code Optimizer

Data flow between phases 각 단계별 데이터의 흐름

  • Character Stream : Before lexical analysis ; lexer는 character들을 받아서
  • Token : Before syntax analysis ; token을 생성하고
  • Syntax Tree : before intermediate code generation ; parser는 syntax tree를 만들며
  • Intermediate representation : before code generation ; 코드 제네레이터는 중간 코드를 생성하고
  • Target Machine Code : after code generation ; 코드 제네레이터는 컴퓨터가 읽을수 있는 형태로 뽑아준다.

Citation:

  • Aho, Lam, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools
  • UT Austin CS375: Compilers by G.Novak.

Lexical Analysis

Lexical Analysis
3.1 The Role of the Lexical Analyzer
  1. To read the input characters of the source program

  2. Group them into lexemes

  3. Produce as output a sequence of tokens for each lexeme in the source program

Scanning and Lexical analysis

3.1.1 Lexical Analysis Versus Parsing

Why do we separate lexer and parser?

  1. Simplicity of design is the most important consideration

  2. Compiler efficiency is improved

  3. Compiler portability is enhanced
    (only lexer has to deal with device-input)

3.1.2 Tokens, Patterns, and Lexemes

ex printf(“Hello World!");

printf and score are lexemes matching the pattern for token id.