[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: gnubol: How do we parse this language, anyway?



In a message dated 12/9/99 11:52:16 AM EST,
an extremely knowledge William M. Klein, wmklein@ix.netcom.com 
writes most reluctantly:

<< 
 If the code needs to be there (to avoid infinite look-ahead), fine keep
 "assuming" conditionals are OK, but Please PLEASE make it an error to put a  
nested unterminated conditional statement where an imperative is required.
  >>


And he also offers important perspective information in the phrasing "...getti
ng into the business of looking at Micro Focus extensions... will start you 
down a
very slippery slope ..." .  (Heavy snipping here. I wish to emphasize that 
all of the posters in this group appear to respect this fine product;  the 
phrase relates to management of our own project.).

I am not in  a position to challenge your assessment at all that there is 
little code base issue. I do solicit others views on that however, with all 
due respect. And I mean that most dererentially. Posts seem to indicate that 
there was more than Micro Focus's compiler that could be supporting code base 
that is 'extended'.  

Your expertise is leading in the sense that we probably can begin to presume 
that the code base issue is not large. I am trying to protect one interest of 
those who would contribute to this project, and that is I know they want the 
resulting compiler to be used as much as possible. So I hope that the 
discussion can help avoid any surprises that would seriously disuade 
potential migrators to our tool.

We may have backed into the more valuable part of this discussion, which is 
that the 
nesting of conditionals (to find rejectable code or to find extensions) is an 
extremely challenging issue for the coders of grammar rules for this 
language. 

For example at
  http://adam.wins.uva.nl/~x/grammars/vs-cobol-ii/
is the generous posting of a grammar that represents lots of work on the 
language of interest here, but also some very VERY impressive reverse 
engineering tool development behind the scenes. It is interesting that they 
gen the parser from studying real code. That is, code base is the driving 
material that their tools analyze. If I am not mistaken the raw material is 
mainframe code. ((Thus of a 'reject' category, and the code is clean of 
'extensions')).


But notice that you have  these rules (some snipping done to get to one minor 
point);

statement =  
 ( 
  accept-statement 
 | add-statement 
 | alter-statement| 
call-statement
 ... etc
 )


statement-list = { statement }+



add-statement = 
 "ADD" { ( identifier | literal ) }+ 
     [ "TO" ] ( identifier | literal )
      "GIVING" { identifier [ "ROUNDED" ] }+
    [ [ "ON" ] "SIZE" " ERROR" statement-list ]
    [ "NOT" [ "ON" ] "SIZE" "ERROR" statement-list ]
  [ "END-ADD" ]

((Of have done a lot of snipping. There grammar is quite good and very 
extensive. I am just trying to get to one point. Please refer to that page 
for a better perspective on the overall achievement, and to see their 
generosity.))

From my own studies of what we may need to do to get the grammar rules to 
work, this portion of the posted grammar actually won't work. 
The inner most statement-list is actually an error, I think. I am refering to 
the statement-list that is hanging on the
    [ [ "ON" ] "SIZE" " ERROR" statement-list ]
clause. Now there may be a number of ways to solve the problem, but the 
grammar posted does not deal with it.

Our basic discussion is leading us to either 'extend' or 'reject' when that 
inner most statement list has some kind of a dangling unterminated 
conditional. (There are other places where we must do that too). So fine, 
everyone is bored with it all. But notice one further problem.  The interior 
statement-list block can not be allowed to end with a simple arithmetic 
statement. Or you would have the situation that everyone agrees can not 
possibly happen. 

This situation actually represents a serious problem for the developers of 
the impressive tools at the University of Amsterdam. By genning the parser 
from pristine code base, they end up non challant about ambiguities that code 
errors can present. For parsers intended for retrofitting, or remediating 
otherwise valid code; that is not a failing. But this will not be sufficient 
for real compilers which must see likely problems and stay robust.

Thus in a certain sense, there are several kinds of statement-list blocks: 1) 
those that end with a closed statement (like explicitly scope terminated 
forms), 2) those that end in some simple statement form capable generally of 
extension with a conditional clause but not of the same  family as the 
enclosing block, and 3) those that end in a simple statment of the same 
family as the enclosing statement that could be real ambiguous because of the 
following conditional clause, and 4) those that are open unterminated 
conditionals but the clauses of which do not relate to the enclosing 
statement, and 5) those that are open unterminated conditionals that relate 
to the enclosing statement, and have not yet exhausted both alternatives of 
the conditional pair. Situation 5 becomes more interesting if we permit 
transposition of alternating conditionals.

Existing code under existing COBOL '85 standards, hides these distinctions 
from naive analysis and from mechanical scans; because the code base simply 
lacks some of these block types. This is a classic horizon problem, in the 
sense that the positive logic rule extractor can not quite see beyond the 
horizon to the block type restrictions. It is only seeing 
acceptable blocks in arithmetic statment interiors, and sees no other 
statement blocking pattern that suggest a need for distinctions, because the 
compiler managing their code base is stomping on those forms. I emphasize 
that for the purposes that they have in mind this is not a failing.  And even 
if your idea is to just kick start a grammar development effort, the pristine 
code an naive sanners are of inestimable value. 

For a robust compiler in a sketch we _might_ have 

rule_arith_open_code : 
 ARITH blah
  {   ON SIZE ERROR ...
 |    ON SIZE ERROR ...R         NOT ON SIZE ERROR ...
 |    NOT ON SIZE ERROR ...R ON SIZE ERROR ...
 |    NOT ON SIZE ERROR ...
}
 
Where ...R marks restrictions. If the entire structure is embedded in an 
arithmetic 
conditional block _interior_ then all of the loose ends are restricted: none 
of them can have a simple arithmetic statment on the very end

rule_arith_nested_interior : 
 ARITH blah
  {   ON SIZE ERROR ...R
 |    ON SIZE ERROR ...R         NOT ON SIZE ERROR ...R
 |    NOT ON SIZE ERROR ...R ON SIZE ERROR ...R
 |    NOT ON SIZE ERROR ...R
}
 
Now all of the blocks must be constrained from ending in a simple arithmetic 
or they will 
compete for the subsequent conditional clause on the enclosing conditional 
statement.

This is something like a shift/reduce conflict, but merely shifting will not 
work, nor will reducing. The rule that is recursing must actually be coded to 
_not_ look for the simple arithmetic as the last statement on the block. Use 
of nice tools like PCCTS will hide this problem, I am convinced. 

No matter what you do with a friendly tool you will be either intentionally 
or unknowingly shifting or reducing. You do not want either. You do not want 
the parser to look for simple arithmetics at the end of the interior blocks. 
This requirement is distinct for each of the conditional families. Like, I/O 
verb interiors do not want to be looking for simple I/O statements on the 
right edge of interior blocks which would be followed out at the enclosing 
rule by the subsequent conditional clause.

It is a deep situation; you do not want the compiler to look for the simple, 
and you do not want either shift or reduce! The positive logic extractors can 
not possibly see this when the foundation is pristine code. That is not a 
criticism. It just characterizes the result.

Furthermore the interior blocks need some restrictions that when reducing 
unterminated conditionals they look only for conditionals that are actually 
fully duplex, and allow a larger external recursing concurrent context to 
find the odd numbered clause sequences.  This again to avoid ambiguity 
between interior blocks looking for one clause forms and the exterior blocks 
looking for one clause forms. I think that is an horizon problem too (but it 
is beyond mere twilight). Again transposition of the alternates makes this 
more hairy.

I hope you will agree that ateast some further comment on other compilers 
might be useful as perspective on  the code base issue. And I hope you can 
see my sense that keeping the discussion open goes beyond mere reject/extend 
preferences. This is a significant design issue for the grammar rules. 
Aggressive parser, mere associativity, and bland statement-list blocks I 
think will not cut it. But anyway we are nearing workable grammars. So this 
chatter ends soon.

I believe it a useful thing to keep in mind that we can, if we choose, 
externalize the diagnostic severity attribute. And that may be useful as a 
generally available concept if we find any other 
potential code base issue that we think large enough to lead us to provide 
alternate behaviors.

So in general, if anyone out there is not completely enraged and exhausted by 
all this, I would encourage comments yea or nea, does anyone else think we 
have yet seen anything that suggest we will have a different responsibility 
than vendors do because we are dealing with convergence?

If all of this is too analytical, let me know, I can get it out of your way.

Bob Rayhawk


--
This message was sent through the gnu-cobol mailing list.  To remove yourself
from this mailing list, send a message to majordomo@lusars.net with the
words "unsubscribe gnu-cobol" in the message body.  For more information on
the GNU COBOL project, send mail to gnu-cobol-owner@lusars.net.