Re: gnubol: Re: which parser tool
>>>>> "Bob" == RKRayhawk <RKRayhawk@aol.com>
>>>>> wrote the following on Fri, 12 Nov 1999 14:09:27 EST
Bob> On 10/31/99 1:51:52 PM EST, Michael McKernan,
Bob> mck@tivoli.mv.com writes
Bob> in part << Pccts is the exception, .... It _does_ have
Bob> infinite look-ahead. That sounds expensive, but it's only used
Bob> when it's required so as a practical matter, it isn't a
Bob> problem
>>>
Bob> The issue is epsilon. Excellent syntactic sugar like curly
Bob> brackets {} as in
Bob> refmod : OPAREN arith_expr COLON { arith_expr } CPAREN
Bob> does not create a node for arith_expr and then a non-node that
Bob> costs nothing. No no no! It creates epsilon, which is actual.
Bob> There are a couple of ways to express these optionals in the
Bob> PCCTS language: such as zero or one - {}, or zero or more {}*.
Bob> (Don't get mad at me if my PCCTS is not perfectly presented).
Sure, that's syntactic sugar and quite helpful for the readers of the
grammar. If you would care to expose what's really happening here,
you are free to write ( arith_expr | ) and the generator will not be
at all upset. Zero or more, however, is the Kleene star operation,
and is essential in every parsing system.
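To make the cost concrete, here is a minimal hand sketch, in C, of
what a recursive-descent parser typically does with that optional
sub-rule. It is not actual pccts output; the token names and helper
functions (next_token, match, starts_arith_expr, and so on) are
assumptions made only for illustration:

    #include <stdbool.h>

    enum token { OPAREN, CPAREN, COLON /* , ... */ };

    extern enum token next_token(void);          /* peek at the next token     */
    extern void match(enum token t);             /* consume an expected token  */
    extern void parse_arith_expr(void);          /* parse an arithmetic expr   */
    extern bool starts_arith_expr(enum token t); /* is t in FIRST(arith_expr)? */

    /* refmod : OPAREN arith_expr COLON { arith_expr } CPAREN */
    void parse_refmod(void)
    {
        match(OPAREN);
        parse_arith_expr();
        match(COLON);
        /* The epsilon branch costs one comparison here: if the next
         * token cannot start an arith_expr, we simply fall through. */
        if (starts_arith_expr(next_token()))
            parse_arith_expr();
        match(CPAREN);
    }

In this simple case the price of the epsilon path is a single test of
the next token; the harder cases are the ones where one token is not
enough to decide, and those are where the trial parsing discussed
below comes in.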
Bob> These are generating an epsilon entry in the table each time.
Bob> The syntax is definitely pulling the wool over your eyes. The
Bob> part ...{rule_name}... is not one node, it is two nodes. The
Bob> problem is not one node or two nodes, the problem is that the
Bob> extra node is epsilon all over the place, in nearly any
Bob> grammar using this tool; but ESPECIALLY COBOL because of the
Bob> optionality of nearly every clause and phrase.
But that is the grammar! I can't avoid it. I think it's unduly
pessimistic to see each of these occurrences as adding another path
through the graph. There are two elements, and what I have done by
using the extended notation is to make it clear to the reader that
this is a well behaved construct that rejoins the parse at the same
point regardless of what happened in the box.
Bob> This problem is deep, but quite real. Any rule which is
Bob> optional is not a rule that is ambiguous. Including such a
Bob> rule optionally (such as zero or one times, or zero or more
Bob> times) does not in and of itself create an ambiguity either.
Bob> It is when the rule that includes another rule optionally is
Bob> included in yet another rule that the ambiguities burst
Bob> forth. In a COBOL grammar we do that a lot. When you use more
Bob> rigid parsers you get the chatter from the gen and you go back
Bob> and flatten rules, enumerating possible combinations, one of
Bob> which simply lacks the optional rule - which is categorically
Bob> different from epsilon.
I'm not sure I stayed with you completely through that, but pccts can
be as fussy as any other generator. If you bend it to your will
without understanding what it's telling you, there is little succor
in the sequelae. (Acknowledgement to Peter Ingerman, who could turn a
phrase.)
Bob> When you use PCCTS, you get either no feedback on what you
Bob> have done, or else you develop a style of tolerating numerous
Bob> informational diagnostics.
Pccts does give you the first option, but you'd better use it
carefully. The other generators that I have known leave you with
lots of harmless ambiguities without any reasonable way to get rid of
them. There are certainly cases in COBOL, where I would tolerate the
harmless ambiguities, although I could, with a little effort on my
part and on the processor's part, get pccts to shut up about them. It
seems to me that a lot of these cases come from the fine line that
the committee walks trying to introduce new features without
invalidating the language in previous standards, and I don't really
want them to change that to make the parser writer's life easier.
Bob> This is not intended to pick on any one person or a particular
Bob> phrase; there are numerous posts in the project with phrases
Bob> similar to Michael's, to the effect " ... it's only used when
Bob> it's required ... ". What may not be clear to some of you is
Bob> that you are using it very, very frequently. I think I did not
Bob> use enough very's in that statement.
OK, it sounds glib, but I believe it's true. I took an existing
grammar, which I understood was failing to parse at LL/4, and added
twenty to thirty syntactic predicates that permitted it to remain an
essentially LL/1 grammar. Most were utterly trivial, like looking
ahead to see if "equal" follows "greater". The example you cite at
the top of your message is exceptional because, of course, the
definition of expression admits of arbitrary complexity. If you are
a malicious programmer, you may be able to make a pccts parser run
more slowly, but nothing gets parsed more than twice. It is limited.
I can construct some not overly large files that can make diff run
overnight, but the existence of a pathological case detracts little
from the utility of the tool, because it works so well on all the
interesting cases.
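For what it's worth, here is the flavor of those trivial predicates,
again as a hand sketch in C rather than anything pccts actually
emits. I have flattened the COBOL noise words for brevity, so the
token names (GREATER, EQUAL) and the helpers (peek and the two parse
functions) are assumptions, not project code:

    enum token { GREATER, EQUAL /* , ... */ };

    extern enum token peek(int n);   /* look n tokens ahead (1-based) */
    extern void parse_greater_or_equal(void);
    extern void parse_greater(void);

    void parse_relational_operator(void)
    {
        if (peek(1) == GREATER) {
            /* One extra token of lookahead settles the choice; no
             * trial parse and no backtracking for this common case. */
            if (peek(2) == EQUAL)
                parse_greater_or_equal();
            else
                parse_greater();
        }
        /* ... the other relational operators ... */
    }

That is the shape of most of them: peek a token or two, decide, move
on.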
Your own suggestion of classifying identifiers (which has merit --
there are admittedly difficulties doing the same thing in a
predictive parser) might have met with more resistance had you dwelt
on the fact that the worst case identifier is a thirty character name
followed by forty-nine (topmost is FD) thirty character qualifiers
and is subscripted by forty-eight or forty-nine (depending on whether
occurs is permitted at level 1) similarly qualified identifiers.
The existence of this pathological case does not invalidate your
suggestion, either.
Bob> Whatever they did in PCCTS they cannot have avoided epsilon. I
Bob> am pretty confident that there are gobs of ambiguities in the
Bob> parser tables we are genning so far. Lookahead and
Bob> backtracking are engaged most of the time. The ambiguities
Bob> will increase as the project progresses.
Just to keep the record straight, pccts is an LL generator and
generates a reasonably readable recursive descent parser in C (or
C++ or Java, depending on the incarnation). Recursive descent
parsers are often considered suspect because of the overhead of the
calls themselves. Fortunately, Terence Parr and his colleagues are
more than theoretical computer scientists. Before a pccts parser
enters a sub-rule it checks a hash derived from the union of
permitted next tokens, so many rules and many trial parses can be
avoided immediately. In the case of the relation I mentioned
earlier, if the next token in the stream is not "greater" or ">" or
">=", neither the trial parse nor the rules themselves will be
entered at all.
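To illustrate the idea (this is a sketch of the technique only, not
of pccts internals; the bitset layout and every name here are
assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_TOKENS 512

    typedef struct {
        uint32_t bits[MAX_TOKENS / 32];   /* one bit per token code */
    } token_set;

    extern int peek_token(void);            /* code of the next token           */
    extern const token_set first_relation;  /* tokens that can begin "relation" */
    extern void parse_relation(void);

    static bool in_set(const token_set *s, int tok)
    {
        return (s->bits[tok / 32] >> (tok % 32)) & 1u;
    }

    void maybe_parse_relation(void)
    {
        /* One cheap membership test rules out the sub-rule, and any
         * trial parse of it, when the next token cannot begin it. */
        if (in_set(&first_relation, peek_token()))
            parse_relation();
    }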
Bob> What I am thinking of is large applications with numerous
Bob> large programs. Backtracking is an issue, and C intermediate
Bob> code is an issue. I am not saying that any contributor should
Bob> do anything different right now. It is so valuable to gather
Bob> the fruits of these efforts. Yet there is this step to the
Bob> next development phase; and there is an openness to discussing
Bob> parser tool selection. So I am trying to approach the issue
Bob> where it really counts.
Bob> There is an underlying unarticulated assumption that you are
Bob> not engaging the special aspects of PCCTS frequently. I am
Bob> confident that that is wrong.
I am equally confident that that is right, and I speak from the
position of having used pccts to generate a parser for a language
that is even more unruly than, if not as large as, COBOL. No, I can't
tell you what percentage of the time is spent in trial parsing, but I
can say that performance was quite acceptable, and better than I
expected it to be.
Bob> The collection of rules for COBOL is horrendously ambiguous
Bob> because of optionality. That is the nature of COBOL. But this
Bob> meaning of the term ambiguous is pedestrian; in and of itself
Bob> that does not map to 'ambiguous' as parser tool workers mean
Bob> it.
Since, in the standard, English is used extensively as the
metalanguage over COBOL, it is difficult to achieve the degree of
precision that we might like, but we can appeal to authority for
clarification or, failing that, depend on precedent. This is not
the first implementation of COBOL.
Bob> If you use a tool that can generate epsilon nodes, and you do
Bob> generate epsilon nodes, then the table _might_ contain
Bob> 'ambiguities' as the parser tool people mean the word. This
Bob> situation produces error messages in simpler parsers, but
Bob> advanced parsers encode it, and later recognize occurrences of
Bob> the situation and respond (at some expense).
Bob> This is happening to us when the epsilon is one of the
Bob> branches of a rule at level three in any given structure (a
Bob> top structure or a substructure).
Bob> (For those new to parsers permit me to emphasize that
Bob> 'ambiguities' in the technical sense can occur in many other
Bob> ways than epsilon).
In bottom-up parlance, when the parse reaches a point where a
reduction is required and more than one alternative is available (a
reduce/reduce or shift/reduce conflict), there is an ambiguity at
that point.
Bob> So, for example, unless I am really disoriented, because a
Bob> data reference can include an optional OF clause, then the
Bob> rule's optionality creates an epsilon on _path_from_ every
Bob> node that has a data reference. So, MOVE data_ref TO data_ref
Bob> has ambiguities on both data references for every such legal
Bob> statement (you know, like ADD and clauses, like VARYING
Bob> phrases), _every_time_ the higher level rules are coursed to
Bob> reduction. The fact that gargantuan portions of compilable
Bob> code do not have these OFIN clauses does not free you from the
Bob> expense of engaging the infinite lookahead and backtracking
Bob> potentiality.
So long as it remains potential, we ain't gettin' hurt. Maybe I
don't understand again.
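Here, at least, is why I think the potential stays cheap: a
predictive parser handles the optional qualification something like
the following hand sketch (not taken from any grammar in the project;
the token names and helpers are assumptions):

    enum token { OF, IN, USER_WORD /* , ... */ };

    extern enum token peek_next(void);   /* look at the next token    */
    extern void consume(void);           /* consume the next token    */
    extern void expect(enum token t);    /* consume an expected token */

    void parse_data_reference(void)
    {
        expect(USER_WORD);               /* the data-name itself      */
        /* Zero or more OF/IN qualifier phrases. */
        while (peek_next() == OF || peek_next() == IN) {
            consume();                   /* eat OF or IN              */
            expect(USER_WORD);           /* eat the qualifier name    */
        }
        /* optional subscripts and reference modification follow here */
    }

When MOVE simple-name-1 TO simple-name-2 carries no qualifiers, each
data reference in this sketch pays one failed comparison against OF
and one against IN; the potential for deeper lookahead is there, but
it is not exercised.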
Bob> Every such actual program statement MOVE simple-name-1 TO
Bob> simple-name-2. engages the thing you all are thinking is
Bob> free, engages it twice. (some VARYING clauses thrice, more
Bob> for arithmetic statements that have lists of data references).
Bob> And depending upon how you coded your rules, the full
Bob> engagement did not merely harness optional OFIN, but also
Bob> optional subscript as well as optional reference modification,
Bob> as well as the front part of the rule for subscript followed
Bob> by reference modification...
Bob> I am certain that that is not free.
What is required should not be expected to be free. I don't propose
that we should be profligate with these devices.
Bob> We need lookahead. But when you use the fancy device, the
Bob> algorithm must ratchet up the resource consumption when it
Bob> encounters ambiguous epsilon continuations (it is the
Bob> optionality that does that, although there are other ways into
Bob> the 'ambiguity' game with that tool).
Now, I'm sure I don't understand.
Bob> When you use the less happy tool, things are different. You
Bob> are forced to figure out solutions to the tool workers'
Bob> 'ambiguities'. I assure you COBOL remains ambiguous as we use
Bob> the term in English.
I understand that it is possible to parse COBOL in LALR/1 and that it
has even been done, although the examples are proprietary. I have no
interest in seeing how it's done, in any case. Struggling with a
picky tool may do a lot for your character development, but it seldom
produces the nice trees or other intermediate forms that make the rest
of the job tolerable. You can (sometimes) twist your language so
that it becomes acceptable to a certain parsing discipline, but you
have to do that by moving language elements into strange places in
the rules, and that bodes ill for the output.
Bob> I think I am warranted in discoursing on this because of the
Bob> performance implications. But there is another point
Bob> here. When the dumb parsers tell you they cannot handle your
Bob> 'ambiguity' they tell you something much more valuable. We
Bob> are in trouble in the grammar where we try that. Regrettably,
Bob> using PCCTS you are habituating yourselves to being oblivious
Bob> to these situations. We are suppressing valuable information
Bob> during our development effort if we tolerate technical
Bob> 'ambiguities' in the grammar work.
When you "force" a parse without understanding the origin of
the reported ambiguity, you've potentially bought a lot of trouble.
It was pointed out recently on this list that most COBOL parsers
were written by hand and the results have not been calamitous.
Many reported ambiguities are indeed harmless, however, and can be
attributed to weaknesses in the parsing algorithms in their present
state of development. Maybe when quantum computing... Nah, we better
not wait.
Bob> Let me say again this is not aimed at anyone. Picking your
Bob> editor or favorite programming language can be an emotional
Bob> thing that engages a person's identity. So I guess parser tool
Bob> preference might trigger similar sensitivity, and I sure do hope
Bob> to avoid that.
The emotion I felt was relief at not having to make ugly lexer hacks
or pass off some half-analyzed crap to be cleaned up in a
subsequent phase. I haven't seen them all, but this one ain't bad.
Bob> Also, PCCTS is an excellent tool. It represents not only a lot
Bob> of great work by a number of generous people, but it is
Bob> definitely a significant intellectual achievement. So we all
Bob> have reason to be grateful for PCCTS.
I am grateful.
Bob> Best Wishes Bob Rayhawk RKRayhawk@aol.com
And best wishes to you. If anyone has read to this point, why aren't
you coding?
--
This message was sent through the gnu-cobol mailing list. To remove yourself
from this mailing list, send a message to majordomo@lusars.net with the
words "unsubscribe gnu-cobol" in the message body. For more information on
the GNU COBOL project, send mail to gnu-cobol-owner@lusars.net.