Why do we use parsers

The listener interface also provides a visitTerminal method that is called when the walk reaches a leaf of the parse tree. Each of these methods has a parameter that provides information about the nonterminal or terminal node that the walk is currently visiting. The methods enterEveryRule and exitEveryRule are called on entering and exiting any nonterminal node, in case we want some generic behavior.

The method visitErrorNode is called if the input contained a syntax error that produced an error node in the parse tree. The interface requires us to implement all of these methods, but we can just leave their bodies empty. Next, we need to convert the parse tree into a recursive data type; if recursive data types are unfamiliar, review recursive data type definitions first. When a recursive data type represents a language this way, it is often called an abstract syntax tree.
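A minimal sketch of what that recursive data type might look like in Java; the variant names Number and Plus match the ones discussed below, but the exact definition depends on the grammar:

    public interface Sum {
        // A literal integer in the expression.
        record Number(int n) implements Sum { }
        // The sum of two subexpressions.
        record Plus(Sum left, Sum right) implements Sum { }
    }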

A Sum value captures the important features of the expression — its grouping and the integers in it — while omitting unnecessary details of the sequence of characters that created it. By contrast, the parse tree that we just generated with the Sum parser is a concrete syntax tree.

Each parse tree node will correspond to a Sum variant: sum nodes will create Plus objects, and addend nodes will create Number objects. When the walker exits a node of the parse tree, we have walked over the entire subtree under that node, so we create the next Sum object at exit time. But we have to keep track of all the children that were created during the walk over that subtree, and we use a stack to store them.
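A rough sketch of such a listener follows. It assumes an ANTLR grammar named Sum with rules sum and addend, so the generated names (SumBaseListener, SumParser.SumContext, SumParser.AddendContext) follow ANTLR's conventions but are not taken from the actual grammar file; it also assumes every addend is a plain integer literal.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    class MakeSum extends SumBaseListener {
        private final Deque<Sum> stack = new ArrayDeque<>();

        // Each addend is assumed to be a plain integer literal; push a Number for it.
        @Override
        public void exitAddend(SumParser.AddendContext ctx) {
            stack.push(new Sum.Number(Integer.parseInt(ctx.getText())));
        }

        // On exiting a sum node, all of its addends are already on the stack:
        // pop them and fold them into nested Plus objects.
        @Override
        public void exitSum(SumParser.SumContext ctx) {
            List<SumParser.AddendContext> addends = ctx.addend();
            Sum result = stack.pop();
            for (int i = 1; i < addends.size(); i++) {
                result = new Sum.Plus(stack.pop(), result);
            }
            stack.push(result);
        }

        // After the walk, the single remaining element is the AST for the whole input.
        public Sum getAST() {
            return stack.peek();
        }
    }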

By default, ANTLR parsers print errors to the console. In order to make the parser modular, however, we need to handle those errors differently. You can attach an ErrorListener to the lexer and parser in order to throw an exception when an error is encountered during parsing.
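A minimal sketch of such a listener, using ANTLR's standard BaseErrorListener API; this is not necessarily the same code as in the grammar file discussed below:

    import org.antlr.v4.runtime.BaseErrorListener;
    import org.antlr.v4.runtime.RecognitionException;
    import org.antlr.v4.runtime.Recognizer;
    import org.antlr.v4.runtime.misc.ParseCancellationException;

    // Replaces ANTLR's default console reporting with an exception.
    public class ThrowingErrorListener extends BaseErrorListener {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                                int line, int charPositionInLine,
                                String msg, RecognitionException e) {
            throw new ParseCancellationException("line " + line + ":" + charPositionInLine + " " + msg);
        }
    }

    // Attach it to both the lexer and the parser, removing the default listeners first:
    //     lexer.removeErrorListeners();
    //     lexer.addErrorListener(new ThrowingErrorListener());
    //     parser.removeErrorListeners();
    //     parser.addErrorListener(new ThrowingErrorListener());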

The Sum grammar file uses this technique, so if you copy it, you can call the corresponding setup methods on the lexer and parser; then, when you call the parser's entry rule, it will throw an exception as soon as it encounters a syntax error. This is a simplistic approach to handling errors; ANTLR offers more sophisticated forms of error recovery as well. To learn more, see Chapter 9 in the Definitive ANTLR 4 Reference.

Safe from bugs. A grammar is a declarative specification for strings and streams, which can be implemented automatically by a parser generator. These specifications are often simpler, more direct, and less likely to be buggy than parsing code written by hand.

Easy to understand. A grammar captures the shape of a sequence in a form that is compact and easier to understand than hand-written parsing code, communicating clearly with future programmers, including future you.

Ready for change. A grammar can be easily edited, then run through a parser generator to regenerate the parsing code, accommodating change without rewriting. If you run into trouble and need a deeper reference, you can look at the Definitive ANTLR 4 Reference.

The pragmatic approach is the foundation of the PEG formalism: it was created to describe programming languages more naturally.

That is because context-free grammars have their origin in work to describe natural languages. In formal terms, a CFG is a generative grammar, while a PEG is an analytic grammar: the first is a sort of recipe for generating only the valid sentences in the language described by the grammar, and it does not have to be related to the parsing algorithm.

The second kind, instead, directly defines the structure and semantics of a parser for that language. The practical implications of this theoretical difference are limited: PEG is more closely associated with the packrat algorithm, but that is basically it.

For instance, packrat parsing generally does not permit left recursion; although the algorithm itself can be modified to support it, doing so eliminates the linear-time parsing property. Also, PEG parsers are generally scannerless parsers. If there are many possible valid ways to parse an input, a CFG will be ambiguous, and a parser based on it will usually return an error.

By usually we mean that some parsers that adopt CFGs can deal with ambiguous grammars, for instance by providing all the possible valid results to the developer and letting them sort it out. PEG instead eliminates ambiguity altogether, because the first applicable choice is always chosen; thus a PEG can never be ambiguous.

The disadvantage of this approach is that you have to be more careful in listing the possible alternatives, because otherwise you could have unexpected consequences; that is to say, some choices could never be matched. For example, in a rule whose alternatives are, in order, dog and doge, doge will never be matched: since dog comes first, it will be picked every time.

It is an open area of research whether PEG can describe all the languages that can be defined with a CFG, but for all practical purposes it can.

In theory parsing is a solved problem, but it is the kind of problem that keeps being solved again and again. That is to say, there are many different algorithms, each one with strong and weak points, and they are still being improved by academics.

In this section we are not going to teach how to implement every one of the parsing algorithms, but we are going to explain their features. The goal is that you can choose with more awareness which parsing tool to use, or which algorithm to study further and implement for your custom parser.

There are two strategies for parsing: top-down parsing and bottom-up parsing. Both terms are defined in relation to the parse tree generated by the parser. In simple terms, a top-down parser builds the parse tree starting from its root, while a bottom-up parser starts from its leaves. The same tree would be generated in a different order by a top-down and a bottom-up parser: if you numbered the nodes in the order they are created, the numbers would differ between the two strategies.

Traditionally top-down parsers were easier to build, but bottom-up parsers were more powerful. Now the situation is more balanced, mostly because of advancements in top-down parsing strategies. The concept of derivation is closely associated with these strategies.

Derivation indicates the order in which the nonterminal elements that appear on the right-hand side of a rule are expanded to obtain the nonterminal symbol on the left. The two possibilities are leftmost derivation and rightmost derivation: the first expands the nonterminals from left to right, while the second does the opposite. A simple example: imagine that you are trying to parse a symbol result defined by a rule whose right-hand side contains two nonterminals, say expr_one followed by expr_two.

In the case of leftmost derivation you expand the first nonterminal (expr_one) first, while for rightmost derivation you expand the second one (expr_two) first. It is important to understand that the derivation is applied depth-first, or recursively.

That is to say, it is applied to the starting expression, and then applied again to the intermediate result that is obtained. Derivation is associated with the two strategies because bottom-up parsing corresponds to a rightmost derivation (traced in reverse), while top-down parsing corresponds to a leftmost derivation.

Note that this has no effect on the final parse tree; it just affects the intermediate steps and the algorithm used. Parsers built with top-down and bottom-up strategies share a few elements that we can talk about. The terms lookahead and backtracking do not have a different meaning in parsing than the one they have in the larger computer science field. Lookahead indicates the number of elements, following the current one, that are taken into consideration to decide which action to take next.

A simple example: a parser might check the next token to decide which rule to apply now. When the proper rule is matched, the current token is consumed, but the next one remains in the queue. Backtracking is an algorithmic technique.

It consists of finding a solution to a complex problem by trying partial solutions and continuing with the most promising one. If the one that is currently being tested fails, the parser backtracks, that is, it goes back to the last partial solution that worked and tries another alternative.

Lookahead is especially relevant to the LL, LR, and LALR parsing algorithms, because parsers for languages that need just one lookahead token are easier to build and quicker to run. The number of lookahead tokens used by such algorithms is indicated between parentheses after the name of the algorithm, e.g., LL(1), LR(k).

Chart parsers are a family of parsers that can be bottom-up (e.g., CYK) or top-down (e.g., Earley). Chart parsers essentially try to avoid backtracking, which can be expensive, by using dynamic programming. Dynamic programming, or dynamic optimization, is a general method for breaking down a larger problem into smaller subproblems. A common dynamic programming algorithm used by chart parsers is the Viterbi algorithm, whose goal is to find the most likely hidden states given a sequence of known events.

Basically, given the tokens that we know, we try to find the most probable rules that have produced them. The name chart parser derives from the fact that the partial results are stored in a structure called a chart (usually the chart is a table).

The particular technique of storing partial results is called memoization. Memoization is also used by other algorithms, unrelated to chart parsers, like packrat.

Before discussing parsing algorithms, we would like to talk about the use of automata in parsing algorithms. Automata are a family of abstract machines, among which is the well-known Turing machine. Since they define abstract machines, they are usually not directly related to an actual algorithm; rather, they describe in a formal way the level of complexity that an algorithm must be able to deal with. However, since DFAs are state machines, in the case of a lexer the distinction is frequently moot.

That is because state machines are relatively straightforward to implement. That is why we are going to briefly talk about DFAs and why they are frequently used for lexers. A DFA is a finite-state machine, a concept with which we assume you are familiar. Basically, a state machine has many possible states and a transition function for each of them.

These transition functions govern how the machine can move from one state to a different one in response to an event. When used for lexing, the machine is fed the input characters one at a time until it reaches an accepting state, which means it has found a token. This also makes a DFA-based lexer an online algorithm, one that does not need the whole input to work: in the case of a lexer, it means that it can recognize a token as soon as its characters are seen by the algorithm. The alternative would be needing the whole input to identify each token.
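As a rough sketch of this idea, the following toy recognizer (invented for illustration, handling only integer tokens) consumes characters one at a time and reports the token as soon as the digits end:

    class IntLexer {
        enum State { START, IN_INT }

        // Returns the length of the integer token at the start of the input, or 0 if there is none.
        static int matchInt(String input) {
            State state = State.START;
            int consumed = 0;
            for (char c : input.toCharArray()) {
                switch (state) {
                    case START:
                        if (Character.isDigit(c)) { state = State.IN_INT; consumed++; }
                        else return 0;            // no transition: not an integer token
                        break;
                    case IN_INT:
                        if (Character.isDigit(c)) consumed++;
                        else return consumed;     // accepting state reached: token ends here
                        break;
                }
            }
            return consumed;                      // end of input while in the accepting state
        }
    }

    // Example: IntLexer.matchInt("42+1") returns 2, without ever looking past the '+'.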

In addition to these properties, it is fairly easy to transform a set of regular expressions into a DFA, which makes it possible to express the rules in a simple way that is familiar to many developers, and then automatically convert them into a state machine that can process them efficiently. We provide a table to offer a summary of the main information needed to understand and implement a specific parser algorithm. You can find more implementations by reading our articles that present parsing tools and libraries for Java, C, Python and JavaScript.

To understand how a parsing algorithm works, you can also look at the syntax analytic toolkit. It is an educational parser generator that describes the steps that the generated parser takes to accomplish its objective. It implements an LL and an LR algorithm. The second table shows a summary of the main features of the different parsing algorithms and what they are generally used for.

The top-down strategy is the most widespread of the two, and there are several successful algorithms applying it. LL (Left-to-right read of the input, Leftmost derivation) parsers are table-based parsers without backtracking, but with lookahead. Table-based means that they rely on a parsing table to decide which rule to apply; the parsing table uses nonterminals and terminals as its rows and columns, respectively.
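As a rough sketch of what table-driven parsing means, the following toy LL(1) parser (the grammar and table are invented for illustration) uses a stack and a single-row table for the grammar S -> 'a' S | 'b':

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Map;

    class TableDrivenLL1 {
        // Table row for nonterminal S: maps the lookahead terminal to the right-hand side to expand.
        static final Map<Character, String> TABLE_S = Map.of('a', "aS", 'b', "b");

        static boolean parse(String input) {
            Deque<Character> stack = new ArrayDeque<>();
            stack.push('S');                                 // start symbol
            int pos = 0;
            while (!stack.isEmpty()) {
                char top = stack.pop();
                char lookahead = pos < input.length() ? input.charAt(pos) : '$';
                if (top == 'S') {                            // nonterminal: consult the table
                    String rhs = TABLE_S.get(lookahead);
                    if (rhs == null) return false;           // no table entry: syntax error
                    for (int i = rhs.length() - 1; i >= 0; i--) {
                        stack.push(rhs.charAt(i));           // push the expansion in reverse
                    }
                } else {                                     // terminal: must match the input
                    if (top != lookahead) return false;
                    pos++;
                }
            }
            return pos == input.length();                    // accepted only if all input was consumed
        }
    }

    // Example: TableDrivenLL1.parse("aab") returns true, parse("aa") returns false.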

The concept of an LL parser does not refer to a specific algorithm, but rather to a class of parsers. They are defined in relation to grammars; that is to say, an LL parser is one that can parse an LL grammar. In turn, LL grammars are defined in relation to the number of lookahead tokens that are needed to parse them.

This number is indicated between parentheses next to LL, in the form LL(k). An LL(k) parser uses k tokens of lookahead and thus can parse, at most, a grammar that needs k tokens of lookahead to be parsed. Effectively, the concept of an LL(k) grammar is more widely employed than the corresponding parser, which means that LL(k) grammars are used as a yardstick when comparing different algorithms. This use of LL grammars is due both to the fact that LL parsers are widely used and to the fact that they are a bit restrictive.

In fact, LL grammars do not support left-recursive rules. You can transform any left-recursive grammar into an equivalent non-left-recursive form (for example, a rule like expr : expr '+' term | term can be rewritten as expr : term ('+' term)*), but this limitation matters for a couple of reasons: productivity and power. The loss of productivity is due to the requirement that you have to write the grammar in a special way, which takes time.

The power is limited because a grammar that might need only 1 token of lookahead when written with a left-recursive rule might need more tokens of lookahead when written in a non-left-recursive way. So this limitation is not merely annoying: it limits the power of the algorithm, i.e., the grammars it can conveniently handle. The loss of productivity can be mitigated by an algorithm that automatically transforms a left-recursive grammar into a non-recursive one.

ANTLR is a tool that can do that, but, of course, if you are building your own parser, you have to do it yourself. In the past, LL(1) grammars were often the only ones considered practical, because it is easy to build efficient parsers for them, up to the point that many computer languages were purposely designed to be described by an LL(1) grammar.

The Earley parser is a chart parser named after its inventor, Jay Earley. The algorithm is usually compared to CYK, another chart parser, which is simpler but usually worse in performance and memory use.

The distinguishing feature of the Earley algorithm is that, in addition to storing partial results, it implements a prediction step to decide which rule it is going to try to match next. The Earley parser fundamentally works by dividing a rule into segments, like in the following example.
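Taking a made-up rule such as sum : addend '+' addend, the segments correspond to every possible position of a dot inside the rule:

    sum : • addend '+' addend
    sum : addend • '+' addend
    sum : addend '+' • addend
    sum : addend '+' addend •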

The parser then works on these segments, which can be connected at the dot. The appeal of an Earley parser is that it is guaranteed to be able to parse all context-free languages, while other famous algorithms (e.g., LL, LR) can parse only a subset of them. For instance, it has no problem with left-recursive grammars. More generally, an Earley parser can also deal with nondeterministic and ambiguous grammars. It can do that at the risk of worse performance, O(n³) in the worst case.

However, it has linear-time performance for normal grammars. The catch is that the set of languages parsed by the more traditional algorithms is the set we are usually interested in. There is also a side effect of the lack of limitations: by forcing a developer to write the grammar in a certain way, the more restrictive algorithms can make parsing more efficient.

With Earley you do less work, so the parser does more of it. In short, Earley allows you to use grammars that are easier to write, but that might be suboptimal in terms of performance. So Earley parsers are easy to use, but the advantage in terms of performance, in the average case, might be non-existent.

This makes the algorithm great for an educational environment or whenever productivity is more relevant than speed. In the first case it is useful, for example, because most of the time the grammars your users write work just fine. The problem is that the parser will throw at them obscure and seemingly random errors. Of course the errors are not actually random, but they are due to the limitations of an algorithm that your users do not know or understand.

So you are forcing the user to understand the inner workings of your parser in order to use it, which should be unnecessary. An example of when productivity is more important than speed might be a parser generator to implement syntax highlighting, for an editor that needs to support many languages. In a similar situation, being able to support new languages quickly might be more desirable than completing the task as soon as possible.

Packrat is often associated with the formal grammar PEG, since they were invented by the same person: Bryan Ford. The title of the paper that introduced packrat parsing says almost everything we care about: it has a linear time of execution, partly because it does not use backtracking. The other reason for its efficiency is memoization: the storing of partial results during the parsing process.
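A sketch of the memoization idea in Java (the MemoizedRule wrapper and its types are invented for illustration; a real packrat parser keys its cache by rule and position across the whole grammar):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;
    import java.util.function.BiFunction;

    // Caches the result of trying a rule at a given input position,
    // so the rule is evaluated at most once per position.
    class MemoizedRule {
        private final BiFunction<String, Integer, Optional<Integer>> rule;
        private final Map<Integer, Optional<Integer>> cache = new HashMap<>();

        MemoizedRule(BiFunction<String, Integer, Optional<Integer>> rule) { this.rule = rule; }

        // Returns the end position of the match at `from`, computing it only on a cache miss.
        Optional<Integer> parse(String input, int from) {
            return cache.computeIfAbsent(from, pos -> rule.apply(input, pos));
        }
    }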

The drawback, and the reason why the technique was not used until recently, is the quantity of memory it needs to store all the intermediate results. If the memory required exceeds what is available, the algorithm loses its linear time of execution. Packrat also does not support left-recursive rules, a consequence of the fact that PEG requires always choosing the first option.

Actually, some variants can support direct left-recursive rules, but at the cost of losing linear complexity. Packrat parsers can work with an infinite amount of lookahead, if necessary. This influences the execution time, which in the worst case can be exponential.

A recursive descent parser is a parser that works with a set of mutually recursive procedures, usually one for each rule of the grammar. Thus the structure of the parser mirrors the structure of the grammar. The term predictive parser is used in a few different ways: some people mean it as a synonym for top-down parser, some as a recursive descent parser that never backtracks. The opposite of this second meaning is a recursive descent parser that does backtrack: one that finds the rule matching the input by trying each of the rules in sequence, and then going back each time it fails.
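A sketch of a recursive descent parser for a small made-up grammar (expr : term ('+' term)* and term : NUMBER | '(' expr ')'): each rule becomes one method and the methods call each other recursively. For brevity the sketch evaluates the sum directly instead of building a tree, and it does no error handling.

    class RecursiveDescent {
        private final String input;
        private int pos = 0;

        RecursiveDescent(String input) { this.input = input.replace(" ", ""); }

        int parseExpr() {                       // expr : term ('+' term)*
            int value = parseTerm();
            while (pos < input.length() && input.charAt(pos) == '+') {
                pos++;                          // consume '+'
                value += parseTerm();
            }
            return value;
        }

        int parseTerm() {                       // term : NUMBER | '(' expr ')'
            if (input.charAt(pos) == '(') {
                pos++;                          // consume '('
                int value = parseExpr();
                pos++;                          // consume ')'
                return value;
            }
            int start = pos;
            while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
            return Integer.parseInt(input.substring(start, pos));
        }
    }

    // Example: new RecursiveDescent("1 + (2 + 3)").parseExpr() evaluates to 6.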

Typically, recursive descent parsers have problems parsing left-recursive rules, because the algorithm would end up calling the same function again and again. A possible solution to this problem is using tail recursion; parsers that use this method are called tail recursive parsers. Tail recursion per se is simply recursion that happens at the end of the function.

However, tail recursion is employed in conjunction with transformations of the grammar rules. The combination of transforming the grammar rules and putting the recursion at the end of the process makes it possible to deal with left-recursive rules.

A Pratt parser is a widely unused, but much appreciated by the few who know it, parsing algorithm defined by Vaughan Pratt in a paper called Top Down Operator Precedence. The paper itself starts with a polemic on BNF grammars, which the author argues have wrongly been the exclusive concern of parsing studies.

This is one of the reasons for its lack of success. In fact, the algorithm does not rely on a grammar but works directly on tokens, which makes it unusual in the eyes of parsing experts. The second reason is that traditional top-down parsers work great if you have a meaningful prefix that helps distinguish between different rules.

For example, if you get the token FOR you are looking at a for statement. Since this essentially applies to all programming languages and their statements, it is easy to understand why the Pratt parser did not change the parsing world. Where the Pratt algorithm shines is with expressions. In fact, the concept of precedence makes it impossible to understand the structure of the input simply by looking at the order in which the tokens are presented.

Basically, the algorithm requires you to assign a precedence value to each operator token, and a couple of functions that determine what to do according to what is on the left and right of the token. Then it uses these values and functions to bind the operations together while it traverses the input. While the Pratt algorithm has not been overly successful, it is used for parsing expressions.
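A compact sketch of this idea (the token handling and names here are simplified inventions, not Pratt's original formulation): each operator gets a binding power, and the parse function keeps absorbing operators that bind more tightly than the current minimum.

    class PrattParser {
        private final String[] tokens;
        private int pos = 0;

        PrattParser(String... tokens) { this.tokens = tokens; }

        private int bindingPower(String op) {
            switch (op) {
                case "+": return 10;
                case "*": return 20;
                default:  return 0;     // end of input or unknown token: stop absorbing
            }
        }

        // Parses an expression whose operators all bind tighter than minPower,
        // returning a parenthesized string that shows the resulting structure.
        String parseExpression(int minPower) {
            String left = tokens[pos++];                      // a number token
            while (pos < tokens.length && bindingPower(tokens[pos]) > minPower) {
                String op = tokens[pos++];                    // operator token
                String right = parseExpression(bindingPower(op));
                left = "(" + left + " " + op + " " + right + ")";
            }
            return left;
        }
    }

    // Example: new PrattParser("1", "+", "2", "*", "3").parseExpression(0)
    // returns "(1 + (2 * 3))", showing that * binds tighter than +.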

A parser combinator is a higher-order function that accepts parser functions as input and returns a new parser function as output. A parser function usually means a function that accepts a string and outputs a parse tree. Parser combinators are modular and easy to build, but they are also slower (they have O(n⁴) complexity in the worst case) and less sophisticated. They are typically adopted for easier parsing tasks or for prototyping.
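A small sketch in Java of what this looks like (the Parser interface and combinator names are invented for illustration): literal builds a primitive parser, while sequence and either are combinators that take parsers and return new parsers.

    import java.util.Optional;

    interface Parser {
        // Returns the end position of the match starting at `from`, or empty if it fails.
        Optional<Integer> parse(String input, int from);

        // Primitive parser: matches a fixed string.
        static Parser literal(String expected) {
            return (input, from) -> input.startsWith(expected, from)
                    ? Optional.of(from + expected.length())
                    : Optional.empty();
        }

        // Combinator: succeeds only if `first` matches and `second` matches right after it.
        static Parser sequence(Parser first, Parser second) {
            return (input, from) -> first.parse(input, from)
                    .flatMap(mid -> second.parse(input, mid));
        }

        // Combinator: tries `first`, falling back to `second` if it fails.
        static Parser either(Parser first, Parser second) {
            return (input, from) -> {
                Optional<Integer> r = first.parse(input, from);
                return r.isPresent() ? r : second.parse(input, from);
            };
        }
    }

    // Example: Parser pet = Parser.either(Parser.literal("dog"), Parser.literal("doge"));
    // pet.parse("doge", 0) returns Optional[3], because the first alternative "dog" already
    // matches, mirroring the ordered-choice behavior of PEG discussed earlier.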

In a sense, the user of a parser combinator builds the parser partially by hand, relying on the hard work done by whoever created the parser combinator library. Generally they do not support left-recursive rules, but there are more advanced implementations that do just that. See, for example, the paper Parser Combinators for Ambiguous Left-Recursive Grammars, which also manages to describe an algorithm that has polynomial time of execution.

Many contemporary implementations are called monadic parser combinators, since they rely on a structure from functional programming called a monad. Monads are a fairly complex concept that we cannot hope to explain here; basically, however, a monad can combine functions and actions relying on a data type.

The crucial feature is that the data type specifies how its different values can be combined. The most basic example is the Maybe monad: a wrapper around a normal type, like integer, that holds the value itself when the value is valid and a special Nothing value when it is not.
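Java's Optional type behaves much like the Maybe monad just described, so a rough sketch of the idea looks like this (parseInt is a helper invented for the example):

    import java.util.Optional;

    class MaybeExample {
        // Parses a string into an integer, returning empty instead of throwing or using null.
        static Optional<Integer> parseInt(String s) {
            try {
                return Optional.of(Integer.parseInt(s));
            } catch (NumberFormatException e) {
                return Optional.empty();
            }
        }

        public static void main(String[] args) {
            Optional<Integer> doubled = parseInt("21").map(n -> n * 2);   // Optional[42]
            Optional<Integer> nothing = parseInt("oops").map(n -> n * 2); // Optional.empty
            System.out.println(doubled + " " + nothing);
        }
    }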

Thus you can avoid using a null value and unceremoniously crashing the program. Instead, the Nothing value is managed normally, like any other value would be managed.

The main success of the bottom-up strategy is the family of the many different LR parsers. The reason for their relative unpopularity is that historically they have been harder to build, although LR parsers are also more powerful than traditional LL(1) parsers.

In order for code written in human-readable form to be understood by a machine, it must be converted into machine language. This task is usually performed by a translator: an interpreter or a compiler. The parser is commonly used as the component of the translator that organizes linear text into a structure that can be easily manipulated (a parse tree).

Lexical Analysis: A lexical analyzer is used to produce tokens from a stream of input string characters, which are broken into small components to form meaningful expressions.

Syntactic Analysis: Checks whether the generated tokens form a meaningful expression.

This makes use of a context-free grammar that defines rules for the components; these work together to form an expression and define the particular order in which tokens must be placed.

Semantic Parsing: The final parsing stage, in which the meaning and implications of the validated expression are determined and the necessary actions are taken.

A parser's main purpose is to determine whether input data may be derived from the start symbol of the grammar.

If yes, then in what ways can this input data be derived? This is achieved as follows.

Top-Down Parsing: Involves searching a parse tree to find the leftmost derivations of an input stream by using a top-down expansion.

Parsing begins with the start symbol, which is transformed into the input symbols until all symbols are derived and a parse tree for the input string is constructed. Examples include LL parsers and recursive-descent parsers. Top-down parsing is also called predictive parsing or recursive parsing.

Bottom-Up Parsing: Involves rewriting the input back to the start symbol.


