Re: gnubol: distinguishing data refs and conditional refs
The statement that "the tools used and discussed in this project
are generally 'context free' processing mechanisms", with 'context' in its
technical sense, was an attempt to focus on the tools,
not an attempt to classify the COBOL language itself.
My point can be made in different words. All of the languages
mentioned in this interaction are _not_ context free, because the validity of
a statement depends on declarations made elsewhere. Consequently you cannot
parse the procedures with a purely context-free algorithm and finish the
task. The idea is not to arrive at a theory but to say that you need the
symbol table. We must design the symbol table into the parse process.
My comments simply assert that you can plug symbol table access into
either the lexer or a filter between the lexer and the parser, and manifest
references to data items very precisely. Certain _type_ information can be
used to manifest terminal symbols distinctly, leading to the possibility of
productions that are distinct.
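To make that concrete, here is a rough sketch, in the C that a flex/bison
setting would use, of what that filter step might look like. Everything here
(the entry fields, sym_lookup, the token names) is hypothetical illustration,
not project code.

    enum sym_kind { SYM_GROUP_ITEM, SYM_ELEMENTARY_ITEM,
                    SYM_PARAGRAPH, SYM_SECTION };

    /* Hypothetical symbol table entry; real fields would be richer. */
    struct symtab_entry {
        const char   *name;
        enum sym_kind kind;
    };

    struct symtab_entry *sym_lookup(const char *name); /* provided elsewhere */

    /* Map a raw identifier to a type-specific terminal symbol. The token
       codes (GROUP_DATA_NAME and friends) would come from the generated
       parser header; UNDECLARED_NAME lets an error production do the
       diagnosing instead of the filter. */
    int classify_identifier(const char *name)
    {
        struct symtab_entry *e = sym_lookup(name);
        if (e == NULL)
            return UNDECLARED_NAME;
        switch (e->kind) {
        case SYM_GROUP_ITEM:      return GROUP_DATA_NAME;
        case SYM_ELEMENTARY_ITEM: return ELEM_DATA_NAME;
        case SYM_PARAGRAPH:       return PARAGRAPH_NAME;
        case SYM_SECTION:         return SECTION_NAME;
        }
        return UNDECLARED_NAME;
    }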
For example, in
stmt_move : MOVE data_ref_1 TO data_ref_2
data_ref_2 cannot be an elementary item (unless data_ref_1 is a 66 level).
If we have even a rudimentary symbol table that at least distinguishes data
from procedure types, then we can easily error out the following patterns
syntactically,
data_ref OF section_name
and
paragraph_name OF data_ref
This can be done with grammar rules. You will still need safeguards in the
semantic actions or in the code that interprets your AST or intermediate code.
But what is much more important is that if you do manifest references to
program components fairly precisely, then the positive productions and error
productions intended to find correct subscripting and correct reference
modification become much more reliable.
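For instance, with type-specific tokens in hand, the bad qualification
patterns above can be written as ordinary rules. This is only a sketch
(flattening the data tokens to a single DATA_NAME for brevity, and showing
OF only, though IN works the same way); the rule and token names are mine,
not the project grammar:

    data_ref
        : DATA_NAME
        | DATA_NAME OF data_ref
        | DATA_NAME OF SECTION_NAME
              { yyerror("data item qualified by a section name"); }
        | PARAGRAPH_NAME OF data_ref
              { yyerror("paragraph name qualified by a data item"); }
        ;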
That is, a non-reference-modifiable item should not be reference modified;
and a non-subscriptable item should not be subscripted; ... and ... a
non-qualifiable reference should not be qualified (with an OF or IN clause);
... and ... a reference that can only be qualified to one level (such as a
paragraph name in a section) should not be qualified to multiple levels.
The languages we are interested in are not context free; the tools we are
interested in _are_ context free. My point is that the symbol table is a
major fixer for that contradiction. Backtracking capabilities in the parser
are also relevant to the problem. I am merely saying that 'backtracking' in
fact reads 'slow' to me. Use a backtracker where actually necessary, but do
not allow that capability to mislead you into thinking you are saving
development effort.
Another major problem we have with the available tools is that we almost
always have two parallel interests when attempting to build
compilers or interpreters for the interesting well-known languages. We want
to respond to the symbol sequence by detecting structure, but we also want to
validate the type of the program components referenced. Type detection per se
is not built into most of the popular tools (the consequence is that grammar
coders frequently find themselves coding a type-detecting conditional in an
action, where it is too late to handle it properly).
If you do not deal with type in the grammar rules, then you relegate it to
the semantics code. It is not going to go away.
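The shape of that "too late" conditional is worth seeing once. In a
hypothetical grammar with one generic NAME token (and assuming $2 carries the
identifier text, with sym_lookup and the sym_kind values from the earlier
sketch), the test lands in the action, after the reduction has already been
chosen:

    perform_stmt
        : PERFORM NAME
            {
              /* Too late: this rule has already matched, whatever
                 NAME turns out to be. */
              struct symtab_entry *e = sym_lookup($2);
              if (e == NULL || e->kind != SYM_PARAGRAPH)
                  yyerror("PERFORM target is not a procedure name");
            }
        ;

With type folded into the terminal symbols, that conditional disappears into
the rule alternatives themselves.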
In the end you have one of two results. The first is a mixture of trapping
type issues in the syntax analysis (the parser(s)) with some trapping in the
semantics: conditional statements in the actions, or tests
for type validity in the modules that receive the AST or intermediate code
coming out of the parser. The second possible result of development efforts
is a near complete absence of type validation in the syntax analysis (that
is, the rules cannot tell what kinds of references are parsing into
reductions), and all type checking is done in the semantics (some in the
actions, some in downstream processes).
If the symbol table is pooh-poohed as not so essential to front end
processing, I know that you are relegating everything to the semantics. That
is not necessary (and it really does not save any development time; the type
checking requirement is not going to go away).
But what is more, we have good workers coming out of the phase of writing the
preprocessor, commenting on the tough parts, saying something like: the data
reference rules are especially hard. I accept that. So divide and conquer.
Which data references are hard? Which data references are errors
topologically? Write rules for these. The only way to permit that in the
grammar is to provide it with distinct tokens (terminal symbols).
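Concretely, that means a token enumeration along these lines in the grammar's
declarations; the particular list here is only a guess at the distinctions
that will matter:

    %token ELEM_DATA_NAME GROUP_DATA_NAME TABLE_DATA_NAME
    %token CONDITION_NAME INDEX_NAME FILE_NAME
    %token PARAGRAPH_NAME SECTION_NAME UNDECLARED_NAME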
All of the languages of interest are non-context-free. The declarations
present the available, suitable grammar tools with a basic challenge: valid
syntax is determined by portions of the program other than the part we are
currently scanning. (Context dependence is therefore present in C, FORTRAN
... ).
So, I don't think that too many would argue with a suggestion that we might
have a rule that says
diagnosed_error : PERFORM data_name
{ best to tell them not to do that }
rather than let the semantics detect it.
Likewise for
diagnosed_error2: MOVE section_name
{ best to tell them not to do that }
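Spelled out in yacc/bison form, with illustrative diagnostics (the token
names assume the type-distinguishing lexer sketched above):

    diagnosed_error
        : PERFORM DATA_NAME
            { yyerror("PERFORM of a data item; "
                      "a paragraph or section name is required"); }
        ;

    diagnosed_error2
        : MOVE SECTION_NAME
            { yyerror("MOVE of a section name; "
                      "a data item or literal is required"); }
        ;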
And the only difference is the type of the referent. Obviously you could try
to detect that in if statements in other rules, but the problem is that you
have already reduced something you would rather reduce in a different way
(because it is an error). Backtracking is a potential solution, but read
backtracking = slow.
Please, read COBOL = very large programs; not because the language is
verbose, deep, or wide, but because the real world has large programs, very
large programs, with major collections of very large programs. Read COBOL as
large programs.
Backtracking is a major issue if you are using it where you really don't need
to. And I think this category of tool is somewhat slower even when parsing
symbol sequences that do not actually need the feature in the parser
algorithm.
((Another way to see this is that backtracking is a kind of error recovery
modality. I am urging that we stay out of error recovery modality generally,
for the purpose of creating a more stable compiler. Those kind enough to
respond to some of my posts surely know the following, but for those not
familiar with compiler technology: an error production is not the same thing
as sending the parser into error recovery mode. To the parser an error
production just looks like another positive rule. The benefit of this
distinction is that there is no loss of position in the parse process. Some
tools will try to ignore symbols for a while when you get into error recovery
mode. That can be hard for the coder of the grammar to anticipate. For
example, if you have skipped code, then interpreting and validating an OF
clause might be pretty difficult. ))
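In bison terms the distinction looks like this. The first alternative uses
the special 'error' token and really does enter recovery mode; the second is
just another rule that happens to carry a diagnostic. The rule and token
names are illustrative:

    statement
        : error PERIOD
            /* genuine recovery mode: the parser discards tokens to
               resynchronize, and the skipped symbols can make later
               OF clauses hard to judge */
        | SUBTRACT SECTION_NAME FROM data_ref
            { yyerror("cannot SUBTRACT a section name"); }
            /* an error production: to the parser, just another positive
               rule; no tokens are skipped and no position is lost */
        ;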
COBOL from its inception has been a heavily typed language. The rigidity of
the standards committees has produced evolutionary enhancements that always
have the characteristic of isolating new features by type. (That is to
protect old code, so that new compiler functionality does not invade old
semantics. They do it, in effect, by requiring new features to have distinct
syntax.)
Stated otherwise, there is a great deal that we can do syntactically in
validating a program. More particularly, type attributes are already
known when we hit the procedure division, and it is good to use type
information as early as possible to simplify the task of code generation.
In COBOL it is very important to position yourself to unscramble data
references early in the compile. I really do accept the verdict that comes
from those of you who have been kind enough to work on the preprocessor:
untangling data references is a major undertaking. We do not engage that
issue by avoiding it in the parse rules, we simply delay it (and maybe
commence to replicate the problem, for surely MOVE will have to find invalid
references, as will ADD, as will COMPUTE).
You can't catch all type errors in the rules, but you can get many reference
problems in the rules. As near as possible, I recommend that we hand the
semantic code a doable set of AST nodes or intermediate code parts. That will
require substituting for nodes that are munged in the source code, or
suppressing the emission of the set to the semantic processor.
You really and truly do not want the semantic code to be a syntax checker.
However, that is also a dream not entirely achievable. There is a breakeven
point after which the parser is so heavily laden with type distinctions that
it falls out of the category of being acceptable. IMHO COBOL pushes you to
that brink faster than _any_ other language.
With a symbol table it is possible to encode certain type information in the
terminal symbols. That forestalls a need for backtracking functionality for
many rules.
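For example, if the lexer already knows which names are tables, a '(' after a
name never forces the parser to guess. A sketch with hypothetical tokens
(subscript_list elided, and reference modification ignored for brevity,
though it would need its own alternatives):

    data_ref
        : ELEM_DATA_NAME
        | TABLE_DATA_NAME '(' subscript_list ')'
        | ELEM_DATA_NAME '(' subscript_list ')'
              { yyerror("subscript applied to an item "
                        "that is not a table"); }
        ;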
There are quite a few folks associated with this project, as contributors or
occasional commentators, who have lots of expertise with compiler generation
technology and are steeped in their understanding of the fundamentals, so I
would certainly hesitate to be too philosophical, lest I look too
pretentious. But honestly, the difference between syntax and semantics is
actually entirely subjective. There is no such dividing line in the real
universe. We end up drawing the line where we want to.
In COBOL I believe we generally need to push as much type factoring as far up
toward the source code sensing as we possibly can. COBOL is a heavily typed,
indeed a rigidly typed, programming environment. We can detect _many_ type
errors syntactically. And we should. I think we cannot trap all of them
syntactically.
But, and here is the kicker, we are going to find that dealing effectively
with subscripting, reference modification, and intrinsic function invocation
will simply require considerable enumeration of types (in the set of terminal
symbols, tokens); and while we are at it we might as well be ruthlessly
thorough about it. I am very confident that you do not need backtracking for
that. Which is not to say that backtracking should be out, just don't set up
dependencies on it.
I do in fact believe that some of the things posted in some messages as
ambiguities are not ambiguous at all. And I do in fact believe that isolating
unbalanced parentheses is a crucial step toward keeping the parser in normal
reduction mode and out of error recovery mode.
With some protection from the acidic effect of unbalanced parens, I believe
there is a good chance that you can detect good references and detect
distinctly bad references in syntax, without backtracking, without ambiguity,
and leave dramatically less type checking work for the semantics (such as if
statements in the actions, or conditional code in downstream processes). But
you surely need a symbol table to do it.
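A minimal sketch of such protection in the token filter, assuming a
hypothetical buffered-statement interface (the token_t type and the token
codes are illustrative). The point is only that the imbalance is caught
before the parser ever sees it, so the filter can hand the parser a single
"bad statement" token rather than letting the imbalance drive it into
recovery mode:

    #include <stddef.h>

    typedef struct { int code; } token_t;  /* hypothetical filter token */
    enum { LPAREN = 1, RPAREN = 2 };       /* illustrative token codes  */

    /* Return 1 if the parentheses in one statement's worth of
       buffered tokens balance, 0 otherwise. */
    int parens_balanced(const token_t *toks, size_t n)
    {
        long depth = 0;
        for (size_t i = 0; i < n; i++) {
            if (toks[i].code == LPAREN)
                depth++;
            else if (toks[i].code == RPAREN && --depth < 0)
                return 0;                  /* closer with no opener */
        }
        return depth == 0;                 /* unclosed openers also fail */
    }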
When we get to the end, we will not have a lexical analyzer and a syntax
analyzer that happen to have a symbol table and an error handling mechanism
as add-ons. In the end we will have a symbol table and a vast error handling
mechanism that happen to be attached to lexical and syntactic processes.
A COBOL compiler is not a good-code handler. It is a bad-code handler first,
and a code generator second. Only a small subset of the possible programs
that the compiler will visit are valid. A COBOL compiler is an error handler.
A gigantic portion of the productions will have to deal with errors.
But really, we are going to be much better off if even the _positive_
productions cognize type distinctions.
My view on this is quite different from that of some other posters, who say
that COBOL doesn't really have types. Say what?
Or perhaps consider the suggestion that we should just make the whole data
division into a _single_ C structure. Life-saving memory management
functionality on chips just said ouch. And then there is that really
inconvenient SYNCHRONIZED dealybop. And the fact that many compilers
_implicitly_ synchronize 01s (gee, what a bother). And then there is
efficiency; we wouldn't want to worry about getting a table on a machine
boundary, or somehow permit the semantics to deal with cache alignment for
us. Oh gosh, that COBOL stuff, it doesn't have types, or any attributes that
are important. Let's just write the semantic process now. Surely the front
end will do what we need with this thing that does not have any types.
So many times I have been there. The ELSE clauses in my COBOL programs:
frequently I would rather not write them. But regrettably we need ELSEs and
OTHERs. And so it is with our parser. We need to deal with errors. That is
the principal work at hand. The lexer and the parser are not code generators.
I am certain that it will not work to write the positive productions, call it
a nearly complete parser, and predict that we can just add in the error traps
as we 'discover' the need. We already know that the error traps are
necessary, big time.
Another way of looking at the parser is that its job is to bring the task of
error management down to size, so that semantics become reasonably possible.
In COBOL, a parser that is blind to type cannot possibly bring the error
detection problem down to a manageable size.
This all fits my original post. The _tools_ are context free in the paradigm
that they presuppose. The language we are targeting is not. We need
compensations. Backtracking relates: it is expensive, and we become dependent
if we permit frequent reliance on it. The symbol table relates: I am
confident that type is a crucial aspect of COBOL. I urge you all to view
type as syntactic (it is semantic as well). I urge you to believe that if you
relegate type to semantics only, you will never finish the semantics.
I believe that there is a need for some design work. The symbol table and the
error processing should be designed.
It is a fact that the preprocessor can exploit opportunities to drive the
symbol table. Certainly the parse of the DATA DIVISION can drive major
portions of the symbol table. (I do not personally believe that we must
accept that the procedure division cannot start until after the data
division, but I am a wee bit alone on that.) Most references to program
components happen in the procedure division. I just can't see starting on
the procedure division with no symbol table. That cannot be a COBOL
procedure division syntax check. But that is because I draw the line with the
type information inside syntax as much as possible, to compensate for the
dual fact that the available tools are not context sensitive and COBOL, IMHO,
is the _most_ context sensitive computer language.
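As a starting point for that design, here is the flavor of entry the DATA
DIVISION parse might deposit and the procedure division token filter might
later consult, extending the earlier sketch. Every field here is an
assumption for discussion, not a proposal set in stone:

    /* Hypothetical entry; field names are for discussion only. */
    struct symtab_entry {
        const char          *name;
        enum sym_kind        kind;      /* group, elementary, paragraph... */
        int                  level;     /* 01-49, 66, 77, 88 */
        int                  occurs;    /* > 0 means the item is a table */
        int                  refmod_ok; /* usable with (offset:length)? */
        struct symtab_entry *parent;    /* for resolving OF/IN chains */
    };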
Certainly some design work would be nice at this stage. We should not just
code the grammars, presto. We ought to figure out the landscape a bit.
As a mental exercise, I would suggest that the procedure division parser for
COBOL be conceived as a type checker which incidentally checks to see if
symbols are in the right sequence. The procedure division parser is an error
handler that clearly communicates with the programmer about the types of
program components in references. A subset of its important task is to offer
the programmer code generation in the unlikely event that the syntax is
correct.
I would fold into the design process an effort to bring forth an agreement
from project participants that the data that COBOL programs reference has
extreme value, and that the way in which the reference is made is of the
essence. The data manipulations specified in a COBOL program are of great
importance, but that importance is secondary to the typing used to specify
the data.
COBOL is for data processing. The data is more important than the process. If
a program does not refer to data properly, the process should not be
associated with the data. COBOL is strongly typed, and heavily typed. We can
sense errors in COBOL syntax long before we bounce into inappropriate
production rules that we might backtrack out of, and long before strangeness
is handed to code generating code.
I honestly feel that this issue is very important, so I have written this
long expression. But truthfully, part of what needs to be said here is that
the DATA DIVISION is real. It is the designers of the language that put the
type into COBOL syntax (I honestly don't think it is me that draws the line
close to the front end). The great achievements of Backus and Naur postdate
the commitment of the COBOL language design to natural-language-like context
dependence.
The tools discussed are way cool, but when we are done we will be way into
type distinctions in the grammar rules.
Best Wishes
Bob Rayhawk
RKRayhawk@aol.com