[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: gnubol: How do we parse this language, anyway?
In a message dated 12/9/99 11:52:16 AM EST,
an extremely knowledge William M. Klein, wmklein@ix.netcom.com
writes most reluctantly:
<<
If the code needs to be there (to avoid infinite look-ahead), fine keep
"assuming" conditionals are OK, but Please PLEASE make it an error to put a
nested unterminated conditional statement where an imperative is required.
>>
And he also offers important perspective information in the phrasing "...getti
ng into the business of looking at Micro Focus extensions... will start you
down a
very slippery slope ..." . (Heavy snipping here. I wish to emphasize that
all of the posters in this group appear to respect this fine product; the
phrase relates to management of our own project.).
I am not in a position to challenge your assessment at all that there is
little code base issue. I do solicit others views on that however, with all
due respect. And I mean that most dererentially. Posts seem to indicate that
there was more than Micro Focus's compiler that could be supporting code base
that is 'extended'.
Your expertise is leading in the sense that we probably can begin to presume
that the code base issue is not large. I am trying to protect one interest of
those who would contribute to this project, and that is I know they want the
resulting compiler to be used as much as possible. So I hope that the
discussion can help avoid any surprises that would seriously disuade
potential migrators to our tool.
We may have backed into the more valuable part of this discussion, which is
that the
nesting of conditionals (to find rejectable code or to find extensions) is an
extremely challenging issue for the coders of grammar rules for this
language.
For example at
http://adam.wins.uva.nl/~x/grammars/vs-cobol-ii/
is the generous posting of a grammar that represents lots of work on the
language of interest here, but also some very VERY impressive reverse
engineering tool development behind the scenes. It is interesting that they
gen the parser from studying real code. That is, code base is the driving
material that their tools analyze. If I am not mistaken the raw material is
mainframe code. ((Thus of a 'reject' category, and the code is clean of
'extensions')).
But notice that you have these rules (some snipping done to get to one minor
point);
statement =
(
accept-statement
| add-statement
| alter-statement|
call-statement
... etc
)
statement-list = { statement }+
add-statement =
"ADD" { ( identifier | literal ) }+
[ "TO" ] ( identifier | literal )
"GIVING" { identifier [ "ROUNDED" ] }+
[ [ "ON" ] "SIZE" " ERROR" statement-list ]
[ "NOT" [ "ON" ] "SIZE" "ERROR" statement-list ]
[ "END-ADD" ]
((Of have done a lot of snipping. There grammar is quite good and very
extensive. I am just trying to get to one point. Please refer to that page
for a better perspective on the overall achievement, and to see their
generosity.))
From my own studies of what we may need to do to get the grammar rules to
work, this portion of the posted grammar actually won't work.
The inner most statement-list is actually an error, I think. I am refering to
the statement-list that is hanging on the
[ [ "ON" ] "SIZE" " ERROR" statement-list ]
clause. Now there may be a number of ways to solve the problem, but the
grammar posted does not deal with it.
Our basic discussion is leading us to either 'extend' or 'reject' when that
inner most statement list has some kind of a dangling unterminated
conditional. (There are other places where we must do that too). So fine,
everyone is bored with it all. But notice one further problem. The interior
statement-list block can not be allowed to end with a simple arithmetic
statement. Or you would have the situation that everyone agrees can not
possibly happen.
This situation actually represents a serious problem for the developers of
the impressive tools at the University of Amsterdam. By genning the parser
from pristine code base, they end up non challant about ambiguities that code
errors can present. For parsers intended for retrofitting, or remediating
otherwise valid code; that is not a failing. But this will not be sufficient
for real compilers which must see likely problems and stay robust.
Thus in a certain sense, there are several kinds of statement-list blocks: 1)
those that end with a closed statement (like explicitly scope terminated
forms), 2) those that end in some simple statement form capable generally of
extension with a conditional clause but not of the same family as the
enclosing block, and 3) those that end in a simple statment of the same
family as the enclosing statement that could be real ambiguous because of the
following conditional clause, and 4) those that are open unterminated
conditionals but the clauses of which do not relate to the enclosing
statement, and 5) those that are open unterminated conditionals that relate
to the enclosing statement, and have not yet exhausted both alternatives of
the conditional pair. Situation 5 becomes more interesting if we permit
transposition of alternating conditionals.
Existing code under existing COBOL '85 standards, hides these distinctions
from naive analysis and from mechanical scans; because the code base simply
lacks some of these block types. This is a classic horizon problem, in the
sense that the positive logic rule extractor can not quite see beyond the
horizon to the block type restrictions. It is only seeing
acceptable blocks in arithmetic statment interiors, and sees no other
statement blocking pattern that suggest a need for distinctions, because the
compiler managing their code base is stomping on those forms. I emphasize
that for the purposes that they have in mind this is not a failing. And even
if your idea is to just kick start a grammar development effort, the pristine
code an naive sanners are of inestimable value.
For a robust compiler in a sketch we _might_ have
rule_arith_open_code :
ARITH blah
{ ON SIZE ERROR ...
| ON SIZE ERROR ...R NOT ON SIZE ERROR ...
| NOT ON SIZE ERROR ...R ON SIZE ERROR ...
| NOT ON SIZE ERROR ...
}
Where ...R marks restrictions. If the entire structure is embedded in an
arithmetic
conditional block _interior_ then all of the loose ends are restricted: none
of them can have a simple arithmetic statment on the very end
rule_arith_nested_interior :
ARITH blah
{ ON SIZE ERROR ...R
| ON SIZE ERROR ...R NOT ON SIZE ERROR ...R
| NOT ON SIZE ERROR ...R ON SIZE ERROR ...R
| NOT ON SIZE ERROR ...R
}
Now all of the blocks must be constrained from ending in a simple arithmetic
or they will
compete for the subsequent conditional clause on the enclosing conditional
statement.
This is something like a shift/reduce conflict, but merely shifting will not
work, nor will reducing. The rule that is recursing must actually be coded to
_not_ look for the simple arithmetic as the last statement on the block. Use
of nice tools like PCCTS will hide this problem, I am convinced.
No matter what you do with a friendly tool you will be either intentionally
or unknowingly shifting or reducing. You do not want either. You do not want
the parser to look for simple arithmetics at the end of the interior blocks.
This requirement is distinct for each of the conditional families. Like, I/O
verb interiors do not want to be looking for simple I/O statements on the
right edge of interior blocks which would be followed out at the enclosing
rule by the subsequent conditional clause.
It is a deep situation; you do not want the compiler to look for the simple,
and you do not want either shift or reduce! The positive logic extractors can
not possibly see this when the foundation is pristine code. That is not a
criticism. It just characterizes the result.
Furthermore the interior blocks need some restrictions that when reducing
unterminated conditionals they look only for conditionals that are actually
fully duplex, and allow a larger external recursing concurrent context to
find the odd numbered clause sequences. This again to avoid ambiguity
between interior blocks looking for one clause forms and the exterior blocks
looking for one clause forms. I think that is an horizon problem too (but it
is beyond mere twilight). Again transposition of the alternates makes this
more hairy.
I hope you will agree that ateast some further comment on other compilers
might be useful as perspective on the code base issue. And I hope you can
see my sense that keeping the discussion open goes beyond mere reject/extend
preferences. This is a significant design issue for the grammar rules.
Aggressive parser, mere associativity, and bland statement-list blocks I
think will not cut it. But anyway we are nearing workable grammars. So this
chatter ends soon.
I believe it a useful thing to keep in mind that we can, if we choose,
externalize the diagnostic severity attribute. And that may be useful as a
generally available concept if we find any other
potential code base issue that we think large enough to lead us to provide
alternate behaviors.
So in general, if anyone out there is not completely enraged and exhausted by
all this, I would encourage comments yea or nea, does anyone else think we
have yet seen anything that suggest we will have a different responsibility
than vendors do because we are dealing with convergence?
If all of this is too analytical, let me know, I can get it out of your way.
Bob Rayhawk
--
This message was sent through the gnu-cobol mailing list. To remove yourself
from this mailing list, send a message to majordomo@lusars.net with the
words "unsubscribe gnu-cobol" in the message body. For more information on
the GNU COBOL project, send mail to gnu-cobol-owner@lusars.net.