Re: gnubol: distinguishing data refs and conditional refs
The statement that "the tools used and discussed in this project
are generally 'context free' processing mechanisms", with 'context' in its
technical sense, was an attempt to focus on the tools,
not an attempt to classify the COBOL language itself.
My point can be made in different words. All of the languages
mentioned in this interaction are _not_ context free, because the validity of
a statement depends on declarations made elsewhere. Consequently you cannot
parse the procedures with a purely context-free algorithm and finish the
task. The idea is not to arrive at a theory but to say that you need the
symbol table. We must design the symbol table into the parse process.
My comments simply assert that you can plug symbol table access into
either the lexer or a filter between the lexer and the parser, and manifest
references to data items very precisely. Certain _type_ information can be
used to manifest terminal symbols distinctly, leading to the possibility of
productions that are distinct.
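To make that concrete, here is a rough sketch, in the C that a flex/bison
setting would use, of what that filter step might look like. Everything here
(the entry fields, sym_lookup, the token names) is hypothetical illustration,
not project code.

    enum sym_kind { SYM_GROUP_ITEM, SYM_ELEMENTARY_ITEM,
                    SYM_PARAGRAPH, SYM_SECTION };

    /* Hypothetical symbol table entry; real fields would be richer. */
    struct symtab_entry {
        const char   *name;
        enum sym_kind kind;
    };

    struct symtab_entry *sym_lookup(const char *name); /* provided elsewhere */

    /* Map a raw identifier to a type-specific terminal symbol. The token
       codes (GROUP_DATA_NAME and friends) would come from the generated
       parser header; UNDECLARED_NAME lets an error production do the
       diagnosing instead of the filter. */
    int classify_identifier(const char *name)
    {
        struct symtab_entry *e = sym_lookup(name);
        if (e == NULL)
            return UNDECLARED_NAME;
        switch (e->kind) {
        case SYM_GROUP_ITEM:      return GROUP_DATA_NAME;
        case SYM_ELEMENTARY_ITEM: return ELEM_DATA_NAME;
        case SYM_PARAGRAPH:       return PARAGRAPH_NAME;
        case SYM_SECTION:         return SECTION_NAME;
        }
        return UNDECLARED_NAME;
    }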
For example, in
stmt_move : MOVE data_ref_1 TO data_ref_2
data_ref_2 cannot be an elementary item (unless data_ref_1 is a 66 level).
If we have even a rudimentary symbol table that at least distinguishes data
from procedure types, then we can easily error out the following patterns
syntactically,
data_ref OF section_name
and
paragraph_name OF data_ref
This can be done with grammar rules. You will still need safeguards in the
semantic actions or in the code that interprets your AST or intermediate code.
But what is much more important is that if you do manifest references to
program components fairly precisely, then the positive productions and error
productions intended to find correct subscripting and correct reference
modification become much more reliable.
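For instance, with type-specific tokens in hand, the bad qualification
patterns above can be written as ordinary rules. This is only a sketch
(flattening the data tokens to a single DATA_NAME for brevity, and showing
OF only, though IN works the same way); the rule and token names are mine,
not the project grammar:

    data_ref
        : DATA_NAME
        | DATA_NAME OF data_ref
        | DATA_NAME OF SECTION_NAME
              { yyerror("data item qualified by a section name"); }
        | PARAGRAPH_NAME OF data_ref
              { yyerror("paragraph name qualified by a data item"); }
        ;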
That is, a non-reference-modifiable item should not be reference modified;
and a non-subscriptable item should not be subscripted; ... and ... a
non-qualifiable reference should not be qualified (with an OF or IN clause);
... and ... a reference that can only be qualified to one level (such as a
paragraph name in a section) should not be qualified to multiple levels.
The languages we are interested in are not context free; the tools we are
interested in _are_ context free. My point is that the symbol table is a
major fixer for that contradiction. Backtracking capabilities in the parser
are also relevant to the problem. I am merely saying that 'backtracking' in
fact reads 'slow' to me. Use a backtracker where actually necessary, but do
not allow that capability to mislead you into thinking you are saving
development effort.
Another major problem we have with the available tools is that we almost
always have two parallel interests when attempting to build
compilers or interpreters for the interesting well-known languages. We want
to respond to the symbol sequence by detecting structure, but we also want to
validate the type of the program components referenced. Type detection per se
is not built into most of the popular tools (the consequence is that grammar
coders frequently find themselves coding a type-detecting conditional in an
action, where it is too late to handle it properly).
If you do not deal with type in the grammar rules, then you relegate it to
the semantics code. It is not going to go away.
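The shape of that "too late" conditional is worth seeing once. In a
hypothetical grammar with one generic NAME token (and assuming $2 carries the
identifier text, with sym_lookup and the sym_kind values from the earlier
sketch), the test lands in the action, after the reduction has already been
chosen:

    perform_stmt
        : PERFORM NAME
            {
              /* Too late: this rule has already matched, whatever
                 NAME turns out to be. */
              struct symtab_entry *e = sym_lookup($2);
              if (e == NULL || e->kind != SYM_PARAGRAPH)
                  yyerror("PERFORM target is not a procedure name");
            }
        ;

With type folded into the terminal symbols, that conditional disappears into
the rule alternatives themselves.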
In the end you have one of two results. The first is a mixture of trapping
type issues in the syntax analysis (the parser(s)) with some trapping in the
semantics: conditional statements in the actions, or tests
for type validity in the modules that receive the AST or intermediate code
coming out of the parser. The second possible result of development efforts
is a near complete absence of type validation in the syntax analysis (that
is, the rules cannot tell what kinds of references are parsing into
reductions), and all type checking is done in the semantics (some in the
actions, some in downstream processes).
If the symbol table is pooh-poohed as not so essential to front end
processing, I know that you are relegating everything to the semantics. That
is not necessary (and it really does not save any development time; the type
checking requirement is not going to go away).
But what is more, we have good workers coming out of the phase of writing the
preprocessor, commenting on the tough parts, saying something like: the data
reference rules are especially hard. I accept that. So divide and conquer.
Which data references are hard? Which data references are errors
topologically? Write rules for these. The only way to permit that in the
grammar is to provide it with distinct tokens (terminal symbols).
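Concretely, that means a token enumeration along these lines in the grammar's
declarations; the particular list here is only a guess at the distinctions
that will matter:

    %token ELEM_DATA_NAME GROUP_DATA_NAME TABLE_DATA_NAME
    %token CONDITION_NAME INDEX_NAME FILE_NAME
    %token PARAGRAPH_NAME SECTION_NAME UNDECLARED_NAME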
All of the languages of interest are non-context-free. The declarations
present the available, suitable grammar tools with a basic challenge: valid
syntax is determined by portions of the program other than the part we are
currently scanning. (Context dependence is therefore present in C, FORTRAN
... ).
So, I don't think that too many would argue with a suggestion that we might
have a rule that says
diagnosed_error : PERFORM data_name
{ best to tell them not to do that }
rather than let the semantics detect it.
Likewise for
diagnosed_error2: MOVE section_name
{ best to tell them not to do that }
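Spelled out in yacc/bison form, with illustrative diagnostics (the token
names assume the type-distinguishing lexer sketched above):

    diagnosed_error
        : PERFORM DATA_NAME
            { yyerror("PERFORM of a data item; "
                      "a paragraph or section name is required"); }
        ;

    diagnosed_error2
        : MOVE SECTION_NAME
            { yyerror("MOVE of a section name; "
                      "a data item or literal is required"); }
        ;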
And the only difference is the type of the referent. Obviously you could try
to detect that in if statements in other rules, but the problem is that you
have already reduced something you would rather reduce in a different way
(because it is an error). Backtracking is a potential solution, but read
backtracking = slow.
Please, read COBOL = very large programs; not because the language is
verbose, deep, or wide, but because the real world has large programs, very
large programs, with major collections of very large programs. Read COBOL as
large programs.
Backtracking is a major issue if you are using it where you really don't need
to. And I think this category of tool is somewhat slower even when parsing
symbol sequences that do not actually need the feature in the parser
algorithm.
((Another way to see this is that backtracking is a kind of error recovery
modality. I am urging that we stay out of error recovery modality generally,
for the purpose of creating a more stable compiler. Those kind enough to
respond to some of my posts surely know the following, but for those not
familiar with compiler technology: an error production is not the same thing
as sending the parser into error recovery mode. To the parser an error
production just looks like another positive rule. The benefit of this
distinction is that there is no loss of position in the parse process. Some
tools will try to ignore symbols for a while when you get into error recovery
mode. That can be hard for the coder of the grammar to anticipate. For
example, if you have skipped code, then interpreting and validating an OF
clause might be pretty difficult. ))
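In bison terms the distinction looks like this. The first alternative uses
the special 'error' token and really does enter recovery mode; the second is
just another rule that happens to carry a diagnostic. The rule and token
names are illustrative:

    statement
        : error PERIOD
            /* genuine recovery mode: the parser discards tokens to
               resynchronize, and the skipped symbols can make later
               OF clauses hard to judge */
        | SUBTRACT SECTION_NAME FROM data_ref
            { yyerror("cannot SUBTRACT a section name"); }
            /* an error production: to the parser, just another positive
               rule; no tokens are skipped and no position is lost */
        ;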
COBOL from its inception has been a heavily typed language. The rigidity of
the standards committees has produced evolutionary enhancements that always
have the characteristic of isolating new features by type. (That is to
protect old code, so that new compiler functionality does not invade old
semantics. They do it, in effect, by requiring new features to have distinct
syntax.)
Stated otherwise, there is a great deal that we can do syntactically in
validating a program. More particularly, type attributes are already
known when we hit the procedure division, and it is good to use type
information as early as possible to simplify the task of code generation.
In COBOL it is very important to position yourself to unscramble data
references early in the compile. I really do accept the verdict that comes
from those of you who have been kind enough to work on the preprocessor:
untangling data references is a major undertaking. We do not engage that
issue by avoiding it in the parse rules, we simply delay it (and maybe
commence to replicate the problem, for surely MOVE will have to find invalid
references, as will ADD, as will COMPUTE).
You can't catch all type errors in the rules, but you can get many reference
problems in the rules. As near as possible, I recommend that we hand the
semantic code a doable set of AST nodes or intermediate code parts. That will
require substituting for nodes that are munged in the source code, or
suppressing the emission of the set to the semantic processor.
You really and truly do not want the semantic code to be a syntax checker.
However, that is also a dream not entirely achievable. There is a breakeven
point after which the parser is so heavily laden with type distinctions that
it falls out of the category of being acceptable. IMHO COBOL pushes you to
that brink faster than _any_ other language.
With a symbol table it is possible to encode certain type information in the
terminal symbols. That forestalls a need for backtracking functionality for
many rules.
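For example, if the lexer already knows which names are tables, a '(' after a
name never forces the parser to guess. A sketch with hypothetical tokens
(subscript_list elided, and reference modification ignored for brevity,
though it would need its own alternatives):

    data_ref
        : ELEM_DATA_NAME
        | TABLE_DATA_NAME '(' subscript_list ')'
        | ELEM_DATA_NAME '(' subscript_list ')'
              { yyerror("subscript applied to an item "
                        "that is not a table"); }
        ;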
There are quite a few folks associated with this project, as contributors or
occasional commentators, who have lots of expertise with compiler generation
technology and are steeped in their understanding of the fundamentals, so I
would certainly hesitate to be too philosophical, lest I look too
pretentious. But honestly, the difference between syntax and semantics is
actually entirely subjective. There is no such dividing line in the real
universe. We end up drawing the line where we want to.
In COBOL I believe we generally need to push as much type factoring as far up
toward the source code sensing as we possibly can. COBOL is a heavily typed,
indeed a rigidly typed, programming environment. We can detect _many_ type
errors syntactically. And we should. I think we cannot trap all of them
syntactically.
But, and here is the kicker, we are going to find that dealing effectively
with subscripting, reference modification, and intrinsic function invocation
will simply require considerable enumeration of types (in the set of terminal
symbols, tokens); and while we are at it we might as well be ruthlessly
thorough about it. I am very confident that you do not need backtracking for
that. Which is not to say that backtracking should be out, just don't set up
dependencies on it.
I do in fact believe that some of the things posted in some messages as
ambiguities are not ambiguous at all. And I do in fact believe that isolating
unbalanced parentheses is a crucial step toward keeping the parser in normal
reduction mode and out of error recovery mode.
With some protection from the acidic effect of unbalanced parens, I believe
there is a good chance that you can detect good references and detect
distinctly bad references in syntax, without backtracking, without ambiguity,
and leave dramatically less type checking work for the semantics (such as if
statements in the actions, or conditional code in downstream processes). But
you surely need a symbol table to do it.
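A minimal sketch of such protection in the token filter, assuming a
hypothetical buffered-statement interface (the token_t type and the token
codes are illustrative). The point is only that the imbalance is caught
before the parser ever sees it, so the filter can hand the parser a single
"bad statement" token rather than letting the imbalance drive it into
recovery mode:

    #include <stddef.h>

    typedef struct { int code; } token_t;  /* hypothetical filter token */
    enum { LPAREN = 1, RPAREN = 2 };       /* illustrative token codes  */

    /* Return 1 if the parentheses in one statement's worth of
       buffered tokens balance, 0 otherwise. */
    int parens_balanced(const token_t *toks, size_t n)
    {
        long depth = 0;
        for (size_t i = 0; i < n; i++) {
            if (toks[i].code == LPAREN)
                depth++;
            else if (toks[i].code == RPAREN && --depth < 0)
                return 0;                  /* closer with no opener */
        }
        return depth == 0;                 /* unclosed openers also fail */
    }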
When we get to the end, we will not have a lexical analyzer and a syntax
analyzer that happen to have a symbol table and an error handling mechanism
as add-ons. In the end we will have a symbol table and a vast error handling
mechanism that happen to be attached to lexical and syntactic processes.
A COBOL compiler is not a good-code handler. It is a bad-code handler first,
and a code generator second. Only a small subset of the possible programs
that the compiler will visit are valid. A COBOL compiler is an error handler.
A gigantic portion of the productions will have to deal with errors.
But really, we are going to be much better off if even the _positive_
productions cognize type distinctions.
My view on this is quite different from that of some other posters, who say
that COBOL doesn't really have types. Say what?
Or perhaps consider the suggestion that we should just make the whole data
division into a _single_ C structure. Life-saving memory management
functionality on chips just said ouch. And then there is that really
inconvenient SYNCHRONIZED dealybop. And the fact that many compilers
_implicitly_ synchronize 01s (gee, what a bother). And then there is
efficiency; we wouldn't want to worry about getting a table on a machine
boundary, or somehow permit the semantics to deal with cache alignment for
us. Oh gosh, that COBOL stuff, it doesn't have types, or any attributes that
are important. Let's just write the semantic process now. Surely the front
end will do what we need with this thing that does not have any types.
So many times I have been there. The ELSE clauses in my COBOL programs:
frequently I would rather not write them. But regrettably we need ELSEs and
OTHERs. And so it is with our parser. We need to deal with errors. That is
the principal work at hand. The lexer and the parser are not code generators.
I am certain that it will not work to write the positive productions, call it
a nearly complete parser, and predict that we can just add in the error traps
as we 'discover' the need. We already know that the error traps are
necessary, big time.
Another way of looking at the parser is that its job is to bring the task of
error management down to size, so that semantics become reasonably possible.
In COBOL, a parser that is blind to type cannot possibly bring the error
detection problem down to a manageable size.
This all fits my original post. The _tools_ are context free in the paradigm
that they presuppose. The language we are targeting is not. We need
compensations. Backtracking relates: it is expensive, and we become dependent
if we permit frequent reliance on it. The symbol table relates: I am
confident that type is a crucial aspect of COBOL. I urge you all to view
type as syntactic (it is semantic as well). I urge you to believe that if you
relegate type to semantics only, you will never finish the semantics.
I believe that there is a need for some design work. The symbol table and the
error processing should be designed.
It is a fact that the preprocessor can exploit opportunities to drive the
symbol table. Certainly the parse of the DATA DIVISION can drive major
portions of the symbol table. (I do not personally believe that we must
accept that the procedure division cannot start until after the data
division, but I am a wee bit alone on that.) Most references to program
components happen in the procedure division. I just can't see starting on
the procedure division with no symbol table. That cannot be a COBOL
procedure division syntax check. But that is because I draw the line with the
type information inside syntax as much as possible, to compensate for the
dual fact that the available tools are not context sensitive and COBOL, IMHO,
is the _most_ context sensitive computer language.
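As a starting point for that design, here is the flavor of entry the DATA
DIVISION parse might deposit and the procedure division token filter might
later consult, extending the earlier sketch. Every field here is an
assumption for discussion, not a proposal set in stone:

    /* Hypothetical entry; field names are for discussion only. */
    struct symtab_entry {
        const char          *name;
        enum sym_kind        kind;      /* group, elementary, paragraph... */
        int                  level;     /* 01-49, 66, 77, 88 */
        int                  occurs;    /* > 0 means the item is a table */
        int                  refmod_ok; /* usable with (offset:length)? */
        struct symtab_entry *parent;    /* for resolving OF/IN chains */
    };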
Certainly some design work would be nice at this stage. We should not just
code the grammars, presto. We ought to figure out the landscape a bit.
As a mental exercise, I would suggest that the procedure division parser for
COBOL be conceived as a type checker which incidentally checks to see if
symbols are in the right sequence. The procedure division parser is an error
handler that clearly communicates with the programmer about the types of
program components in references. A subset of its important task is to offer
the programmer code generation in the unlikely event that the syntax is
correct.
I would fold into the design process an effort to bring forth an agreement
from project participants that the data that COBOL programs reference has
extreme value, and that the way in which the reference is made is of the
essence. The data manipulations specified in a COBOL program are of great
importance, but that importance is secondary to the typing used to specify
the data.
COBOL is for data processing. The data is more important than the process. If
a program does not refer to data properly, the process should not be
associated with the data. COBOL is strongly typed, and heavily typed. We can
sense errors in COBOL syntax long before we bounce into inappropriate
production rules that we might backtrack out of, and long before strangeness
is handed to code generating code.
I honestly feel that this issue is very important, so I have written this
long expression. But truthfully, part of what needs to be said here is that
the DATA DIVISION is real. It is the designers of the language that put the
type into COBOL syntax (I honestly don't think it is me that draws the line
close to the front end). The great achievements of Backus and Naur postdate
the commitment of the COBOL language design to natural-language-like context
dependence.
The tools discussed are way cool, but when we are done we will be way into
type distinctions in the grammar rules.
Best Wishes
Bob Rayhawk
RKRayhawk@aol.com