[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: gnubol: Re: which parser tool



>>>>> "Bob" == RKRayhawk  <RKRayhawk@aol.com>
>>>>> wrote the following on Fri, 12 Nov 1999 14:09:27 EST

  Bob> On Date 10/31/99 1:51:52 PM EST Michael McKernan,
  Bob> mck@tivoli.mv.com writes

  Bob> in part << Pccts is the exception, .... It _does_ have
  Bob> infinite look-ahead. That sounds expensive, but it's only used
  Bob> when it's required so as a practical matter, it isn't a
  Bob> problem
  >>>


  Bob> The issue is epsilon. Excellent syntaxtic sugar like curly
  Bob> brackets {} as in

  Bob>     refmod : OPAREN arith_expr COLON { arith_expr } CPAREN

  Bob> does not create a node for arith_expr and then a non-node that
  Bob> is not anything but is instead FREE nothing.  No no no! It
  Bob> creates epsilon which is actual. There are a couple of ways to
  Bob> express these optionals in the PCCTS language: such as zero or
  Bob> one - {}, or zero or more {}*. (Don't get mad with me if my
  Bob> PCCTS is not perfectly presented).

Sure, that's syntactic sugar and quite helpful for the readers of the
grammar.  If you would care to expose what's really hapeening here
you are free to write ( arith_expr | ) and the generator will not be
at all upset.  Zero or more, however, is the Kleene star operation,
and is essential in every parsing system.   

  Bob> These are generating an epsilon entry in the table each time.
  Bob> The syntax is defintely pulling the wool over your eyes.  The
  Bob> part ...{rule_name}... is not a node, it is two nodes. The
  Bob> problem is not one node or two nodes, the problem is that the
  Bob> extra node is epsilon all over the place, in nearly any
  Bob> grammar using this tool; but ESPECIALLY COBOL because of the
  Bob> optionality of nearly every clause and phrase.

But that is the grammar!  I can't avoid it.  I think it's unduly
pessimistic to see each of these occurrences as adding another path
through the graph.  There are two elements, and what I have done by
using the extended notation is to make it clear to the reader that
this is a well behaved construct that rejoins the parse at the same
point regardless of what happened in the box.

  Bob> This problem is deep, but quite real. Any rule which is
  Bob> optional is not a rule that is ambiguous. Including such a
  Bob> rule optionally (such as zero or one times, of zero or more
  Bob> times), does not in and of itself create an ambiguity either.

  Bob> It is when the rule that includes another rule optionally is
  Bob> included in yet another rule that the ambiguities burst
  Bob> forth. In a COBOL grammar we do that alot. When you use more
  Bob> rigid parsers you get the chatter from the gen and you go back
  Bob> and flatten rules, enumerating possibile combinations, one of
  Bob> which simply lacks the optional rule - which is categorically
  Bob> different from epsilon.

I'm not sure I stayed with you completely through that, but pccts can
be as fussy as any other generator.  If you bend it to your will
without understanding what it's telling you, there is little succor
in the sequelae. (Acknowledgement to Peter Ingerman, who could turn a
phrase.) 

  Bob> When you use PCCTS, you get either no feed back on what you
  Bob> have done; or else you develop a style of tollerating numerous
  Bob> informational diagnostics.

Pccts does give you the first option, but you'd better use it
carefully.  The other generators that I have known leave you with
lots of harmless ambiguities without any reasonable way to get rid of
them.  There are certainly cases in COBOL, where I would tolerate the
harmless ambiguities, although I could, with a little effort on my
part and on the processors part, get pccts to shut up about them.  It
seems to me that a lot of these cases come from the fine line that
the committee walks trying to introduce new features without
invalidating the language in previous standards, and I don't really
want them to change that to make the parser writer's life easier.

  Bob> This is not intended to pick on any one person or a particular
  Bob> phrase; there are numerous posts in the project with phrases
  Bob> similar to Michael's, to the effect " ... it's only used when
  Bob> it's required ... ". What may not be clear to some of you is
  Bob> that you are using it very, very frequently. I think I did not
  Bob> use enough very's in that statement.

OK, it sound glib, but I believe it's true.  I took an existing
grammar, which I understood was failing to parse at LL/4, and added
twenty to thirty syntactic predicates that permitted it to remain an
essentially LL/1 grammar.  Most were utterly trivial, like looking
ahead to see if "equal" follows "greater".  The example you cite at
the top of your message is exceptional because, of course, the
definition of expression admits of arbitrary complexity.  If you are
a malicious programmer, you may be able to make a pccts parser run
more slowly, but nothing gets parsed more than twice.  It is limited.
I can construct some not overly large files that can make diff run
overnight, but the existence of a pathological case detracts little
from the utility of the tool, because it works so well on all the
interesting cases.  

Your own suggestion of classifying identifiers (which has merit --
there are admittedly difficulties doing the same thing in a
predictive parser) might have met with more resistance has you dwelt
on the fact that the worst case identifier is a thirty character name
followed by forty-nine (topmost is FD) thirty character qualifiers
and is subscripted by forty-eight or forty-nine (depending on whether
occurs is permitted at level 1) similarly qualified identifiers.
The existence of this pathological case does not invalidate your
suggestion, either.


  Bob> Whatever they did in PCCTS they cannot have avoided epsilon. I
  Bob> am pretty confident that there are gobs of ambiguities in the
  Bob> parser tables we are genning so far. Lookahead and
  Bob> backtracking is engaged most of the time.  The ambiguities
  Bob> will increase as the project progresses.

Just to keep the record straight, pccts is an LL generator and
generates a reasonably readable rescursive descent parser in C (or
C++ or Java, depending on the incarnation).  Recursive descent
parsers are often considered suspect because of the overhead of the
calls themselves.  Fortunately, Terence Parr and his colleagues are
more than theoretical computer scientists.  Before a pccts parser
enters a sub-rule it checks a hash derived from the union of
permitted next tokens, so many rules and many trial parses can be
avoided immediately.  In the case of the relation I mentioned
earlier, if the next token in the stream is not "greater" or ">" or
">=", neither the trial parse nor the rules themselves will be
entered at all.

  Bob> What I am thinking of is large applications with numerous
  Bob> large programs.  Backtracking is an issue, and C intermediate
  Bob> code is an issue. I am not saying that any contributor should
  Bob> do anything different right now. It is so valuable to gather
  Bob> the fruits of these efforts. Yet there is this step to the
  Bob> next development phase; and there is an openness to discussing
  Bob> parser tool selection. So I am trying to approach the issue
  Bob> where it really counts.

  Bob> There is an underlying unarticulated assumption that you are
  Bob> not engaging the special aspects of PCCTS frequently. I am
  Bob> confident that that is wrong.

I am equally confident that that is right, and I speak from the
position of having used pccts to generate a parser for a language
that is even more unruly, if not as large, as COBOL.  No, I can't
tell you what percentage of the time is spent in trial parsing, but I
can say that performance was quite acceptable, and better than I
expected it to be

  Bob> The collection of rules for COBOL are horrendously ambiguous
  Bob> because of optionality. The is the nature of COBOL. But this
  Bob> meaning of the term ambiguous is pedestrian; in and of itself
  Bob> that does not map to 'ambiguous' as parser tool workers mean
  Bob> it.

Since, in the standard, English is used extensively as the meta
language over COBOL, it is difficult to achieve the degree of
precision that we might like, but we can appeal to authority for
clarification or, failing that, depend on precedent.  This is not
the first implementation of COBOL.

  Bob> If you use a tool that can generate epsilon nodes, and you do
  Bob> generate epsilon nodes, then the table _might_ contain
  Bob> 'ambiguities' as the parser tool people mean the word. This
  Bob> situation produces error messages in simpler parsers, but
  Bob> advanced parsers encode it, and later recognize occurences of
  Bob> the situation and respond (at some expense).

  Bob> This is happening to us when the epsilon is one of the
  Bob> branches of a rule at level three in any given structure (a
  Bob> top structure or a substructure).

  Bob> (For those new to parsers permit me to emphasize that
  Bob> 'ambiguities' in the technical sense can occur in many other
  Bob> ways than epsilon).

In bottom up parlance, when the parse reaches a point where a
reduction is required and no alternative is available, or more than
one alternative is available, there is an ambiguity at that point.

  Bob> So, for example, unless I am really disoriented, because a
  Bob> data reference can include an optional OF clause, then the
  Bob> rule's optionality creates an epsilon on _path_from_ every
  Bob> node that has a data reference. So, MOVE data_ref TO data_ref
  Bob> has ambiguities on both data references for every such legal
  Bob> statement (you know, like ADD and clauses, like VARYING
  Bob> phrases), _every_time_ the higher level rules are coursed to
  Bob> reduction. The fact that gargantuum portions of compilable
  Bob> code do not have these OFIN clauses does not free you from the
  Bob> expense of engaging the infinite lookahead and backtracking
  Bob> potentiality.

So long as it remains potential, we ain't gettin' hurt.  Maybe I
don't understand again.

  Bob> Every such actual program statement MOVE simple-name-1 TO
  Bob> simple-name-2.  engages the thing you all are thinking is
  Bob> free, engages it twice.  (some VARYING clauses thrice, more
  Bob> for arithmetic statements that have lists of data references).

  Bob> And depending upon how you coded your rules, the full
  Bob> engagement did not merely harness optional OFIN, but also
  Bob> optional subsctipt as well as optional reference modification,
  Bob> as well as the front part of the rule for subsctipt followed
  Bob> by reference modification...

  Bob> I am certain that that is not free.

What is required, should not be expected to be free.  I don't propose
that we should be profligate with these devices.

  Bob> We need lookahead. But when you use the fancy device, the
  Bob> algorithm must ratchet up the resource consumption when it
  Bob> encounters ambiguous epsilon continuations (it is the
  Bob> optionality that does that, although there are other ways into
  Bob> the 'ambiguity' game with that tool).

Now, I'm sure I don't understand.

  Bob> When you use the less happy tool, things are different. You
  Bob> are forced to figure out solutions to the tool workers
  Bob> 'ambiguities'. I asure you COBOL remains ambiguous as we use
  Bob> the term in English.

I understand that it is possible to parse COBOL in LALR/1 and that it
has even been done, although the examples are proprietary.  I have no
interest in seeing how it's done, in any case.  Struggling with a
picky tool may do a lot for your character development, but it seldom
produces the nice trees or other intermediate forms that make the rest
of the job tolerable.  You can (sometimes) twist your language so
that it becomes acceptable to a certain parsing discipline, but you
havr to do that by moving language elements into strange places in
the rules, and that bodes ill for the output.

  Bob> I think I am warranted in discoursing on this because of the
  Bob> performance implications. But there is another point
  Bob> here. When the dumb parsers tell you they cannot handle your
  Bob> 'ambiguity' they tell you something much more valuable.  We
  Bob> are in trouble in the grammar where we try that. Regrettably,
  Bob> using PCCTS you are habituating yourselves to being oblivious
  Bob> to these situations. We are suppressing valuable information
  Bob> during our development effort if we tollerate techical
  Bob> 'ambiguities' in the grammar work.

When you "force" a parse without understanding the origin of
the reported ambiguity, you've potentially bought a lot of trouble.
It was pointed out recently on this list that most COBOL parsers
were written by hand and the results have not been calamitous.

Many reported ambiguities are indeed harmless, however, and can be
attributed to weaknesses in the parsing algorithms in their present
state of development.  Maybe when quantum computing... Nah, we better
not wait.

  Bob> Let me say again this is not aimed at anyone. Picking your
  Bob> editor or favorite programming language can be an emotional
  Bob> thing that engages a persons identity. So I guess parser tool
  Bob> preference might trigger similar sensitivity, and sure do hope
  Bob> to avoid that.

The emotion I felt was relief at not having to make ugly lexer hacks
or passing off some half analyzed crap to be cleaned up in a
subsequent phase.  I haven't seen them all, but this one ain't bad.

  Bob> Also, PCCTS is an excellent tool. It represents not only a lot
  Bob> of great work by a number of generous people, but it is
  Bob> definitely an significant intellectual achievement. So we all
  Bob> have reason to be greatful for PCCTS.

I am grateful.

  Bob> Best Wishes Bob Rayhawk RKRayhawk@aol.com

And best wishes to you.  If anyone has read to this point, why aren't
you coding?  




  Bob> -- This message was sent through the gnu-cobol mailing list.
  Bob> To remove yourself from this mailing list, send a message to
  Bob> majordomo@lusars.net with the words "unsubscribe gnu-cobol" in
  Bob> the message body.  For more information on the GNU COBOL
  Bob> project, send mail to gnu-cobol-owner@lusars.net.




--
This message was sent through the gnu-cobol mailing list.  To remove yourself
from this mailing list, send a message to majordomo@lusars.net with the
words "unsubscribe gnu-cobol" in the message body.  For more information on
the GNU COBOL project, send mail to gnu-cobol-owner@lusars.net.