[GNU-COBOL] GNU-COBOL design concepts: zPROCs and TLP
In a message dated 10/27/99 10:04:54 PM EST, root@pobox.com writes:
<< The lexer and parser are not things to be done in isolation, though.
They produce the symbol table and the first intermediate form of the
program. To be acceptable, a parser needs to be able to separate
the universe into those things that can be compiled and those things
that can be diagnosed. A _good_ parser will, as well, produce structures
that facilitate straightforward, efficient and minimally error prone
translation into efficient machine code. Let's get all this stuff out
in the open, where we can stare at it. >>
So, staring at this, ... there will be a lexer, a parser, a symbol table
(SYMT), and a first intermediate form (FIF). That much implies a
FIF-processor that inputs the FIF and outputs either further intermediates or
the C code we hand to GCC.
There is certainly also a preprocessor and an error processor.
If it is advisable that these things not be done in isolation, then it is
probably advisable that we not isolate ourselves from UNIX fundamentals. UNIX
is a multi-tasking operating system. Perhaps the basic spin of GNU COBOL
should be a spawn of Distinct Division Processors. I will call these zPROCs.
Conceptually, it may be useful to envision the preprocessor as generating the
zeroth intermediate form (ZIF), which would in fact be COBOL (just as the
big guys do on the mainframe when they 'preprocess' DB2 (SQL) and CICS
statements and emit a fully expanded COBOL program; actually it is emitted
before COPY statement expansion, an Achilles heel if there ever was one, so
not quite 'fully' expanded by preprocessors on mainframes).
The preprocessor could, for all intents and purposes, emit four files, one
for each DIVISION, or start/end points for just one file that the zPROCs
would share access to. It will probably be more efficient to emit multiple
files, since the preprocessor would need exclusive control of its emitted
output: with only one file, every zPROC would have to wait until the last
record is written, whereas with multiple files the DATA zPROC could be
spawned early, as soon as the DATA DIVISION file is closed.
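As a rough sketch of that spawn (in C, since that is where the compiler will
ultimately live), the preprocessor could hand each completed DIVISION file to
its zPROC with an ordinary fork/exec. The program and file names below
(zproc-data, prog1.data.zif) are invented for illustration only:

    /* minimal sketch: spawn a zPROC as soon as its DIVISION file is closed */
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    pid_t spawn_zproc(const char *zproc_path, const char *zif_file)
    {
        pid_t pid = fork();
        if (pid == 0) {                    /* child: become the zPROC       */
            execl(zproc_path, zproc_path, zif_file, (char *)NULL);
            perror("execl");               /* only reached if exec failed   */
            _exit(127);
        }
        return pid;                        /* parent: keep preprocessing    */
    }

    /* e.g. once the DATA DIVISION file closes, well before PROCEDURE does: */
    /*     pid_t data_pid = spawn_zproc("./zproc-data", "prog1.data.zif");  */

The registry discussed further down would record that pid so a later house
cleaning pass can find it.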
This landscape suggests a critical question: can we imagine the FIF (first
intermediate form) in a manner that permits the PROCEDURE zPROC to commence
before the DATA zPROC terminates? If we are that good, then we can exploit
the Task Level Parallelism (TLP) that will be in silicon about the time this
compiler arrives.
Abstractly, then, the parser uses the SYMT to validate syntax and emit FIF;
the FIF processor would get partially symbolic information from the parser
and resolve it by means of SYMT access.
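One way to picture 'partially symbolic', sketched in C with invented names
(fif_operand, symt_handle and the FIF_* opcodes are assumptions, not a
committed design): the parser emits operands by data-name, and the FIF
processor swaps the names for SYMT handles before code generation.

    /* sketch: an FIF operand carries either a name (from the parser) or a
       resolved SYMT handle (filled in later by the FIF processor)          */
    struct fif_operand {
        char symbolic[31];    /* COBOL data-name exactly as the parser saw  */
        int  symt_handle;     /* -1 until the FIF processor resolves it     */
    };

    struct fif_record {
        int op;               /* e.g. FIF_ADD, FIF_MOVE, ...                */
        struct fif_operand src, dst;
    };

Because resolution is deferred to the FIF processor, the PROCEDURE zPROC can
start emitting FIF records before the DATA zPROC has finished populating the
table, which is exactly the overlap TLP needs.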
Speaking of parallelism, I can image numerous FIF invocations per PROCEDURE
DIVISION. Perhaps for each paragraph. ((here too we would need a way to
eliminate file contention)).
If these ideas are relevant, then we need a kind of registry for on-going
tasks. We also need a house cleaning mechanism for orphaned components or
processes.
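The registry could be as plain as an append-only file of (kind, pid, path)
records that every component writes when it starts; a hypothetical sketch,
not a committed format:

    /* sketch: one registry line per task or work file, so a house cleaning
       pass can kill orphaned zPROCs and unlink leftover ZIF/FIF files      */
    #include <stdio.h>
    #include <sys/types.h>

    void register_task(FILE *registry, const char *kind,
                       pid_t pid, const char *path)
    {
        /* e.g. "ZPROC 4711 prog1.proc.zif" or "FILE 0 prog1.data.fif"      */
        fprintf(registry, "%s %ld %s\n", kind, (long)pid, path ? path : "-");
        fflush(registry);     /* survive a crash of the registering task    */
    }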
The first cut at that registry and house cleaning will also require a clear
understanding of the notion of embedded programs (nested programs). Also, ...
we may need to compile more than one compilation unit in a single invocation
of the compiler. The compiler needs parameter processes, and a convention for
applying parameters to enumerated compile units. If we are to go after the
large base of COBOL source code on the mainframes, we will need parameter
processing from within the program source (generally line one only).
Thus we compile with
1) a set of command line parameters (optional)
2) source code parameters (optional) (typically line one, or first few
lines)
3) source code (required)
So, very high level pseudo code might look like
GNU COBOL main init: (per invocation)
    set up parm processing
    set up symbol table generally
    do any preset of GNU COBOL task registry
    set up error processing
        <<register error processing component (probably a file)>>
    do GNU COBOL main once
    ((arrive back here with parallel processes in full swing))
    spawn general house cleaning (for at least the error processing component)

GNU COBOL main once:
    do GNU COBOL main loop until end of parms AND end of compile unit ids
    diagnose dangling command line parameters

GNU COBOL main loop:
    get parms until compile unit id encountered (or end of command input)
    get compile unit ids until further parms encountered (or end of
        command input)
    for each compile unit
        push parm context
        (magically extract source code compile parameter from line one,
            i.e. peek!)
        do GNU COBOL preprocessor within parameter context
        pop parm context

GNU COBOL preprocessor:
    read source file
    for the program and for each nested program
        expand into ZIF (zeroth intermediate form, expanded COBOL)
        <<register ZIF file(s)>>
        trigger zPROC at each DIVISION boundary
            <<register process>>
        (possibly) wait for completion of this compile unit (but better to
            default to truck on)
        (possibly) initiate house cleaning of ZIF files and processes (but
            better to do later)
Notes:
- a compile unit is basically a file identified in command input. It can
  have one program, or a program with one or more nested programs (also note
  that one or zero of them is 'main()').
- command input can be either the command line or a control file, or a
  combination of the two.
- we will need a code generation component naming convention (ZIFs, FIFs
  and .C/.CPP file names).
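Returning to the 'peek at line one' step above: a hedged sketch in C of how
the source code parameters might be extracted. The CBL/PROCESS spellings are
the usual mainframe convention, but which keywords we accept, and the column
they must start in, are still open questions (column 8 is assumed here):

    /* sketch: pull compile options from the first source line, e.g.
       "CBL APOST,LIB" or "PROCESS MAP", before pushing the parm context    */
    #include <stdio.h>
    #include <string.h>

    int peek_source_params(const char *src_name, char *opts, size_t n)
    {
        char line[256];
        FILE *f = fopen(src_name, "r");
        if (!f || !fgets(line, sizeof line, f)) {
            if (f) fclose(f);
            return 0;                          /* no source parameters      */
        }
        fclose(f);
        if (strlen(line) > 8 &&                /* assume options start in   */
            (strncmp(line + 7, "CBL ", 4) == 0 ||      /* column 8 (area A) */
             strncmp(line + 7, "PROCESS ", 8) == 0)) {
            strncpy(opts, line + 7, n - 1);
            opts[n - 1] = '\0';
            return 1;                          /* caller pushes parm context */
        }
        return 0;
    }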
Elaborations: the zPROC design concept should allow for other future
divisions and future sections.
The foregoing pseudo code is suggested not only to lay the groundwork for
Task Level Parallelism but to provide the basis for simplifying the grammar.
We could have _four_ grammars. This provides for the division of labor.
There is a slight possibility that we could win a small efficiency gain with
four grammars, as the locality of reference will be much tighter for the
smaller tables that would result. The PROCEDURE zPROC need not know the word
PIC or REDEFINES or OCCURS; it only needs access to a SYMT (symbol table)
that will support determination of proper reference to data items with any
number of attributes.
Thus in the background there is a list of reserved words, with an associated
element that determines to which DIVISION it is relevant.
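That background list might look like the following C sketch, with a
per-DIVISION bit mask; the names and the handful of entries are only
illustrative:

    /* sketch: each reserved word carries a mask of the DIVISIONs where it
       is legitimate; a zPROC filter blobs anything outside its own mask    */
    enum { DIV_ID = 1, DIV_ENV = 2, DIV_DATA = 4, DIV_PROC = 8 };

    struct reserved_word {
        const char *word;
        unsigned    divisions;    /* bitwise OR of the DIV_* flags above    */
    };

    static const struct reserved_word reserved[] = {
        { "PIC",       DIV_DATA },
        { "REDEFINES", DIV_DATA },
        { "OCCURS",    DIV_DATA },
        { "SELECT",    DIV_ENV  },
        { "ASSIGN",    DIV_ENV  },
        { "ADD",       DIV_PROC },
        { "SECTION",   DIV_ENV | DIV_DATA | DIV_PROC },  /* spans divisions */
    };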
We may want a filter (or filters) between the lexer(s) and parser(s). The
most important function of this gizmo would be to characterize a reference
before feeding it to the parser. Rather than push things back onto the stack
(which I think will destabilize the compiler), and rather than have the
parser bear the full responsibility for characterizing references, I would
enhance the use of multiple IDENTIFIER tokens. This would simply be an
evolution of current work accomplishments.
When the lexer can't see the text matter as a reserved word or punctuation,
it could return a VAGUE-IDENTIFIER to the filter. The filter could do the
look up in SYMT. If the item is dominated by an OCCURS clause, then the
filter could return ID-SUBSCRIPTABLE; if not, then it could return ID-FLAT
(and so on; for example, I think packed items can not be reference modified,
and POINTER types ought not be reference modified). The parser would view
only characterized identifiers, never VAGUE-IDENTIFIER.
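A minimal sketch of that filter step in C follows; the token names mirror
the ones in the text (spelled with underscores to be C-friendly), and
symt_find with its two fields is an assumed SYMT interface, not an existing
one:

    /* sketch: turn the lexer's VAGUE-IDENTIFIER into a characterized token
       before the parser ever sees it                                       */
    enum id_token { ID_FLAT, ID_SUBSCRIPTABLE, ID_POINTER, ID_UNDECLARED };

    struct symt_entry { int has_occurs; int is_pointer; /* ... */ };
    struct symt_entry *symt_find(const char *name);   /* assumed SYMT API   */

    enum id_token characterize(const char *name)
    {
        struct symt_entry *e = symt_find(name);
        if (!e)            return ID_UNDECLARED;   /* let a rule diagnose it */
        if (e->has_occurs) return ID_SUBSCRIPTABLE;
        if (e->is_pointer) return ID_POINTER;   /* no reference modification */
        return ID_FLAT;
    }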
Then rules might be much less ambiguous:
valid_ref : ID-SUBSCRIPTABLE '(' keep-example-simple-numeric ')'
          | ID-FLAT
            {blip XREF (cross reference) with line number};
erroneous_ref : ID-FLAT '(' keep-example-simple-numeric ')'
            {diagnose this illegal subscript};
any_reference : valid_ref
          | erroneous_ref ;
add_stmt : ADD any_reference TO any_reference
            {if $2.errflag or $4.errflag don't bother
             else emit FIF};
dangling_reference : any_reference
            {diagnose incomplete statement};
A side effect of the multiple grammar proposal is that each grammar only
needs to know its tokens. Tokens for the other divisions can be left out
entirely.
The filter(s) could catch invalid use of keywords. For example, SELECT and
ASSIGN, which do not belong in the PROCEDURE DIVISION, can be trapped in the
PROCEDURE zPROC filter; this will keep the grammar table small and simplify
the grammars. Each zPROC filter will need a blob token to defuse these bombs
when passing them up to the grammar. The availability of this token can
support detailed and exhaustive consideration of error-production rules.
logged_blob : RW-BLOB-WRONG-DIVISION
{ rw_blob_count++;
diagnose blob;
if rw_blob_count > rw_blob_max_this_division crash_it // else nothing
};
If we guess that treating a blob as a reference is the best recovery, then
we tweak the rule for any_reference to
any_reference : valid_ref
          | erroneous_ref
          | logged_blob ;
This is an example of how THE BLOB will support error productions (do not
take that as a recommendation that we treat blobs as references). The filters
might well keep track of blob sequences: every non-blob sets the blob
sequence count to zero and goes through either unchanged or (if an
identifier) gets characterized; a blob bumps the count up by one; perhaps the
second and subsequent blobs can be returned as DISTINCTLY_REPETITIVE_BLOBS,
which should short-circuit much higher up the grammar than any_reference
would. For example,
managed_stmt : valid_statement
{emit further FIF}
| blob_spool
{diagnose blob sequence as a whole};
blob_spool : blob_spool DISTINCTLY_REPETITIVE_BLOBS
| DISTINCTLY_REPETITIVE_BLOBS
{ // details diagnosed already, do nothing but spool them}
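The blob-run bookkeeping on the filter side could be as small as the sketch
below; RW_BLOB_WRONG_DIVISION and DISTINCTLY_REPETITIVE_BLOBS stand in for
whatever token values the real grammar ends up assigning:

    /* sketch: collapse runs of out-of-division words so the grammar sees
       one cheap token per repetition instead of deep error recovery        */
    enum { RW_BLOB_WRONG_DIVISION = 1000, DISTINCTLY_REPETITIVE_BLOBS };

    static int blob_run = 0;               /* consecutive blobs seen so far */

    int filter_blob(int token)
    {
        if (token != RW_BLOB_WRONG_DIVISION) {
            blob_run = 0;                  /* any real token resets the run */
            return token;
        }
        return (++blob_run == 1) ? RW_BLOB_WRONG_DIVISION
                                 : DISTINCTLY_REPETITIVE_BLOBS;
    }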
Error productions can remove ambiguities. But more generally, any push-back
code suggests that we are not getting enough specifics from the lexer (which
can certainly happen); a filter can do what is currently being done (in the
excellent preprocessor work) before the push back. None of this is intended
as criticism of the preprocessor work. But often it is too late in the
grammar to push things back.
One further note. We would not have to restrict the zPROCs to full divisions.
The WORKING-STORAGE SECTION and LINKAGE SECTION could be distinct zPROCs
(obviously these could share functions). This notion is perhaps quite useful,
as GNU could deploy an HTML SECTION, an SQL (declaratives) SECTION, an XML
SECTION ...: all just by plugging new hooks into the preprocessor and adding
new zPROCs. We could even deploy a CPP SECTION, which would add to COBOL the
syntax of C++. The code would pass through the zPROCs on its way to GCC,
much as asm{} code _conceptually_ passes to the 'assembler' in C (some
validation could occur in a CPP zPROC; perhaps header files only, so that we
_could_ do type checking).
The above notes are oriented to the notion of separating the parser tasks.
The lexical task has less of a need for this suggested split. But the
PICTURE and VALUE clauses are probably going to be easier to deal with if we
have a DATA DIVISION lexer that is separate from the PROCEDURE DIVISION lexer.
So in the interest of getting "... all this stuff out in the open, where we
can stare at it," I would say that we might want our task list to include
these work items:
parameter processes
preprocessor (much accomplished, but would experience a burst of revision
to structure it for spinning off multiple ZIFs and spawning zPROCs to
exploit TLP)
error processor
symbol table processes
zPROCs for each division (ZIF processors)
zFILTERs (to blob out-of-division reserved words, and to characterize
identifiers)
FIF processor (gets First Intermediate Form, emits compilable C code)
probably need a completer
GNU COBOL registry
GNU COBOL task and component house cleaning
Abbreviations:
SYMT - symbol table
ZIF - Zeroth Intermediate File (preprocessed text, still in COBOL)
FIF - First Intermediate Form (from the parser)
zPROC - Division Processor; a ZIF file processor unique to each division.
Best Wishes,