
[GNU-COBOL] GNU-COBOL design concepts: zPROCs and TLP



In a message dated 10/27/99 10:04:54 PM EST, root@pobox.com writes:

<< The lexer and parser are not things to be done in isolation, though.
 They produce the symbol table and the first intermediate form of the
 program.  To be acceptable, a parser needs to be able to separate
 the universe into those things that can be compiled and those things
 that can be diagnosed.  A _good_ parser will, as well,  produce structures
 that facilitate straightforward, efficient and minimally error prone
 translation into efficient machine code.  Let's get all this stuff out
 in the open, where we can stare at it.   >>

So, staring at this, ... there will be a lexer, a parser, a symbol table 
(SYMT), and a first intermediate form (FIF). That much implies a 
FIF-processor that inputs the FIF and outputs either other intermediates or 
the GCC code.

There is certainly also a preprocessor and an error processor.

If it is advisable that these things not be done in isolation, then it is 
probably advisable that we not isolate ourselves from UNIX fundamentals. UNIX 
is a multi-tasking operating system. Perhaps, the basic spin of GNU COBOL 
should be a spawn of Distinct Division Processors. I will call these zPROCs.

Conceptually, it may be useful to envision the preprocessor as generating the 
zeroth intermediate form (ZIF), which would in fact be COBOL. This is just 
what the big guys do on the mainframe when they 'preprocess' DB2 (SQL) and 
CICS statements and emit a fully expanded COBOL program (actually before COPY 
statement expansion, an Achilles heel if there ever was one; so not quite 
'fully' expanded by preprocessors on mainframes).

The preprocessor could, for all intents and purposes, emit four files, one 
for each DIVISION, or mark start/end points in a single file that the zPROCs 
would share. Multiple files will perhaps be more efficient, since the 
preprocessor needs exclusive control of whatever output it is still writing: 
with only one file, all zPROCs would need to wait until the last record is 
written. With multiple files, if the DATA DIVISION file is closed early, then 
the DATA zPROC can be spawned early.
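
To make the multiple-file idea concrete, here is a minimal sketch in Python (not project code). It assumes the ZIF is free-format text with each division header at the start of its own line; the names DIVISION_RE and split_by_division are made up for illustration. Each division is handed over as soon as the next header is seen, which is the point where its zPROC could be spawned.

```python
import re

# Assumed header shape: "<NAME> DIVISION" at the start of a line.
DIVISION_RE = re.compile(
    r"^\s*(IDENTIFICATION|ENVIRONMENT|DATA|PROCEDURE)\s+DIVISION\b",
    re.IGNORECASE,
)

def split_by_division(lines):
    """Yield (division_name, lines) as soon as each division is complete."""
    current, buf = None, []
    for line in lines:
        m = DIVISION_RE.match(line)
        if m:
            if current is not None:
                yield current, buf      # division closed: its zPROC could start
            current, buf = m.group(1).upper(), []
        if current is not None:
            buf.append(line)
    if current is not None:
        yield current, buf              # last division ends with the source
```

Because it is a generator, a caller can start work on the DATA DIVISION text while the PROCEDURE DIVISION is still being read.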

This landscape suggests a critical question: can we imagine the FIF (first 
intermediate form) in a manner that permits the PROCEDURE zPROC to commence 
before the DATA zPROC terminates? If we are that good, then we can exploit 
the Task Level Parallelism (TLP) that will be in silicon about the time this 
compiler arrives.

Abstractly, then, the parser uses the SYMT to validate syntax and emit FIF; 
the FIF processor gets partially symbolic information from the parser and 
resolves it by means of SYMT access. 

Speaking of parallelism, I can imagine numerous FIF invocations per PROCEDURE 
DIVISION, perhaps one for each paragraph (here too we would need a way to 
eliminate file contention).
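
One way to picture per-paragraph invocations without file contention is to give every task its own private output. The sketch below is only an illustration in Python: emit_fif is a stand-in that wraps statements in a marker rather than producing real FIF, and threads stand in for whatever task mechanism the compiler would actually use.

```python
from concurrent.futures import ThreadPoolExecutor

def emit_fif(paragraph):
    """Stand-in FIF processor: returns a private result for one paragraph."""
    name, statements = paragraph
    return name, [f"FIF<{s}>" for s in statements]

def compile_paragraphs(paragraphs, workers=4):
    # Each task only builds its own result, so no output is shared
    # between tasks and nothing needs to be locked.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(emit_fif, paragraphs))
```

The results come back keyed by paragraph name, in source order, so a later step could stitch them together.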

If these ideas are relevant, then we need a kind of registry, for on-going 
tasks. We also need a house cleaning mechanism for orphaned components or 
processes.
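
A registry of this kind can be very small. The following Python sketch is an assumption about its shape, not a design decision: the class name Registry and its methods are invented here, and it tracks only intermediate files (processes could be tracked the same way).

```python
import os
import tempfile

class Registry:
    """Hypothetical registry of intermediate files owned by one compile."""

    def __init__(self):
        self.files = {}                  # path -> owning component name

    def register_file(self, owner, suffix):
        """Create an intermediate file and remember who owns it."""
        fd, path = tempfile.mkstemp(suffix=suffix)
        os.close(fd)
        self.files[path] = owner
        return path

    def clean(self):
        """House cleaning: remove every registered file still on disk."""
        removed = 0
        for path in list(self.files):
            if os.path.exists(path):
                os.remove(path)
                removed += 1
            del self.files[path]
        return removed
```

House cleaning then becomes a single call at the end of the invocation, and it also picks up files orphaned by a component that died early.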

The first cut at that will also require a clear grasp of the notion of 
embedded programs (nested programs). Also, ... we may need to compile more 
than one compilation unit in a single invocation of the compiler. The 
compiler needs parameter processes, and a convention for applying parameters 
to enumerated compile units. If we are to go after the large base of COBOL 
source code on the mainframes, we will need parameter processing from within 
the program source (generally line one only).

Thus we compile with 
  1) a set of command line parameters (optional)
  2) source code parameters (optional) (typically line one, or first few 
     lines)
  3) source code (required)
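
The layering of these inputs might be sketched as follows, in Python. Two things here are assumptions of mine and not settled anywhere above: the 'PROCESS a,b=c' form used for line one, and the rule that source code parameters override the command line.

```python
def source_parms(first_line):
    """Parse line one, assuming a hypothetical 'PROCESS a,b=c' form."""
    words = first_line.strip().split(None, 1)
    if len(words) < 2 or words[0].upper() != "PROCESS":
        return {}                        # line one is ordinary source
    parms = {}
    for item in words[1].split(","):
        key, _, value = item.strip().partition("=")
        if key:
            parms[key.upper()] = value or True
    return parms

def effective_parms(command_line, first_line):
    # Assumption: parameters in the source win over the command line.
    merged = dict(command_line)
    merged.update(source_parms(first_line))
    return merged
```

If the opposite precedence is wanted, only the order of the two updates in effective_parms changes.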

So, very high level pseudo code might look like

  GNU COBOL main init: (per invocation)
     set up parm processing
     set up symbol table generally
     do any preset of GNU COBOL task registry
     set up error processing
        <<register error processing component (probably a file)>>
     do GNU COBOL main once
     ((arrive back here with parallel processes in full swing))
     spawn general house cleaning (for at least error processing component)

  GNU COBOL main once:
     do GNU COBOL main loop until end of parms AND end of compile unit ids
     diagnose dangling command line parameters

  GNU COBOL main loop:
     get parms until compile unit id encountered (or end of command input)
     get compile unit ids until further parms encountered (or end of
           command input)
     for each compile unit
        push parm context
        (magically extract source code compile parameter from line one,
              i.e. peek!)
        do GNU COBOL preprocessor within parameter context
        pop parm context

  GNU COBOL preprocessor:
     read source file
     for the program and for each nested program
        expand into ZIF (zeroth intermediate form, expanded COBOL)
           <<register ZIF file(s)>>
        trigger zPROC at each DIVISION boundary
           <<register process>>
     (possibly) wait for completion of this compile unit (but better to
           default to truck on)
     (possibly) initiate house cleaning of ZIF files and processes (but
           better to do later)
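
The 'main loop' step, pairing runs of parms with the compile unit ids that follow them, could look roughly like this in Python. The convention that a parm starts with '-' and anything else is a compile unit id is an assumption for the sake of the example.

```python
def group_command_input(args):
    """Pair each run of compile unit ids with the parms that precede it.

    Assumption: anything starting with '-' is a parm, anything else is a
    compile unit id. Returns (groups, dangling_parms).
    """
    groups, parms, units = [], [], []
    for arg in args:
        if arg.startswith("-"):
            if units:                    # parms after units open a new group
                groups.append((parms, units))
                parms, units = [], []
            parms.append(arg)
        else:
            units.append(arg)
    if units:
        groups.append((parms, units))
        parms = []
    return groups, parms                 # leftover parms are "dangling"
```

Parms that are never followed by a compile unit id come back separately, which is what 'main once' would diagnose as dangling.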


Notes:
  - a compile unit is basically a file identified in command input. It can 
    have one program or a program with one or more nested programs (also 
    note that one or zero of them is 'main()').
  - command input can be either the command line or a control file or a 
    combination of the two.
  - we will need a code generation component naming convention (ZIFs, FIFs 
    and .C/CPP file names).
   
Elaborations: the zPROC design concept should allow for other future 
divisions and future sections.

The foregoing pseudo code is suggested not only to lay the groundwork for 
Task Level Parallelism but to provide the basis for simplifying the grammar. 
We could have _four_ grammars. This provides for the division of labor. 
There is a slight possibility that we could win a small efficiency gain with 
four grammars, as the locality of reference will be much tighter for the 
smaller tables that would result. The PROCEDURE zPROC need not know the word 
PIC or REDEFINES or OCCURS; it only needs access to a SYMT (symbol table) 
that will support determination of proper reference to data items with any 
number of attributes.

Thus in the background there is a list of reserved words, with an associated 
element that determines to which DIVISION it is relevant.

We may want one or more filters between the lexer(s) and parser(s). The most 
important function of this gizmo would be to characterize a reference before 
feeding it to the parser. Rather than push things back onto the stack (which 
I think will destabilize the compiler), and rather than have the parser bear 
the full responsibility for characterizing references, I would enhance the 
use of multiple IDENTIFIER tokens.  This would simply be an evolution of 
current work accomplishments.

When the lexer can't see the text matter as a reserved word or punctuation, 
it could return a VAGUE-IDENTIFIER to the filter.  The filter could do the 
look up in SYMT. If the item is dominated by an OCCURS clause, then the 
filter could return ID-SUBSCRIPTABLE; if not, then it could return ID-FLAT 
(and so on: for example, I think packed items cannot be reference modified, 
and POINTER types ought not be reference modified). The parser would view 
only characterized identifiers, never VAGUE-IDENTIFIER.
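
A minimal sketch of such a filter, in Python. It assumes tokens arrive as (kind, text) pairs and that the SYMT can be looked up by name; the Entry class and the ID-UNDECLARED token are my own additions for the case where the look up fails.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """Hypothetical SYMT entry; only the attribute this filter needs."""
    under_occurs: bool = False

def characterize(tokens, symt):
    """Replace each VAGUE-IDENTIFIER with a characterized identifier."""
    for kind, text in tokens:
        if kind != "VAGUE-IDENTIFIER":
            yield kind, text             # reserved words etc. pass unchanged
            continue
        entry = symt.get(text)
        if entry is None:
            yield "ID-UNDECLARED", text  # assumed extra token, not in the post
        elif entry.under_occurs:
            yield "ID-SUBSCRIPTABLE", text
        else:
            yield "ID-FLAT", text
```

More attributes (packed, POINTER, and so on) would simply add more branches and more characterized token kinds.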

Then rules might be much less ambiguous

valid_ref  :  ID-SUBSCRIPTABLE '(' keep-example-simple-numeric ')'
  |  ID-FLAT
     {blip XREF (cross reference) with line number};

erroneous_ref :  ID-FLAT '(' keep-example-simple-numeric ')'
     {diagnose this illegal subscript};

any_reference : valid_ref
  | erroneous_ref;

add_stmt : ADD any_reference TO any_reference
     {if $2.errflag or $4.errflag don't bother
      else emit FIF};

dangling_reference : any_reference
     {diagnose incomplete statement};

A side effect of the multiple grammar proposal is that each grammar only 
needs to know its tokens. Tokens for the other divisions can be left out 
entirely.
 
The filter(s) could catch invalid use of keywords. For example, SELECT and 
ASSIGN, which do not belong in the PROCEDURE DIVISION, can be trapped in the 
PROCEDURE zPROC filter. This will keep the grammar table small and simplify 
the grammars. Each zPROC filter will need a blob token to defuse these bombs 
when passing them up to the grammar. The availability of this token can 
support detailed and exhaustive consideration of error-production rules.

logged_blob : RW-BLOB-WRONG-DIVISION
  { rw_blob_count++;
    diagnose blob;
    if rw_blob_count >  rw_blob_max_this_division crash_it // else nothing
  };
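
The reserved word list with its division element, and the blob token that a filter would hand up, can be sketched together. This Python fragment is illustrative only: the table is a tiny subset chosen for the example, and classify_word is a made-up name.

```python
# Tiny illustrative subset; a real table would cover every reserved word.
RESERVED = {
    "SELECT": {"ENVIRONMENT"},
    "ASSIGN": {"ENVIRONMENT"},
    "PIC": {"DATA"},
    "REDEFINES": {"DATA"},
    "OCCURS": {"DATA"},
    "ADD": {"PROCEDURE"},
    "TO": {"ENVIRONMENT", "PROCEDURE"},
}

def classify_word(word, division):
    """Return the token kind a zPROC filter would hand to its grammar."""
    homes = RESERVED.get(word.upper())
    if homes is None:
        return "VAGUE-IDENTIFIER"        # not reserved: characterize later
    if division in homes:
        return "RESERVED"
    return "RW-BLOB-WRONG-DIVISION"      # reserved, but not in this division
```

With this in the filter, the PROCEDURE grammar never sees SELECT or PIC as tokens of their own, only as blobs.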

If we guess that treating a blob as a reference is the best recovery, then we 
tweak the rule for any_reference to 

any_reference : valid_ref
  | erroneous_ref
  | logged_blob;

This is an example of how THE BLOB will support error productions (do not 
take that as a recommendation that we treat blobs as references). The filters 
might well keep track of blob sequences: every non-blob sets the blob 
sequence count to zero and goes through either unchanged or (if an 
identifier) gets characterized; a blob bumps the count up by one. Perhaps the 
second and subsequent blobs can be returned as DISTINCTLY_REPETITIVE_BLOBS, 
which should short-circuit much higher than any_reference. For example,

managed_stmt : valid_statement
   {emit further FIF}
 | blob_spool
   {diagnose blob sequence as a whole};

blob_spool : blob_spool  DISTINCTLY_REPETITIVE_BLOBS
 |  DISTINCTLY_REPETITIVE_BLOBS
 { // details diagnosed already, do nothing but spool them}
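
The blob sequence count described above is a few lines of filter code. A sketch in Python, assuming the same (kind, text) token pairs; the function name mark_blob_runs is invented here.

```python
def mark_blob_runs(tokens):
    """Rename the second and later blobs of a run of consecutive blobs."""
    run = 0
    for kind, text in tokens:
        if kind == "RW-BLOB-WRONG-DIVISION":
            run += 1                     # a blob bumps the count up by one
            if run > 1:
                kind = "DISTINCTLY_REPETITIVE_BLOBS"
        else:
            run = 0                      # any non-blob resets the count
        yield kind, text
```

Only the first blob of a run reaches a rule such as logged_blob; the rest are spooled by blob_spool.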

Error productions can remove ambiguities. But more generally, any push back 
code suggests that we are not getting enough specifics from the lexer (which 
can certainly happen); a filter can do what is currently being done (in 
the excellent preprocessor work) before the push back. None of this is 
intended as criticism of the preprocessor work. But often it is too late in 
the grammar to push things back.

One further note. We would not have to restrict the zPROCs to full divisions. 
The WORKING-STORAGE SECTION and LINKAGE SECTION could be distinct zPROCs 
(obviously these could share functions). This notion is perhaps quite useful, 
as GNU could deploy an HTML SECTION, an SQL (declaratives) SECTION, an XML 
SECTION ...: all just by plugging in new hooks in the preprocessor and new 
zPROCs. We could even deploy a CPP SECTION, which would add to COBOL the 
syntax of C++.  The code would pass through the zPROCs on its way to GCC, 
much as asm{} code _conceptually_ passes to the 'assembler' in C (some 
validation could occur in a CPP zPROC; perhaps header files only, so that we 
_could_ do type checking). 
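
The 'plugging in new hooks' idea amounts to a table from section or division name to a handler. A sketch in Python, with invented names (ZPROCS, zproc, dispatch) and toy handlers that only count lines:

```python
ZPROCS = {}                              # section or division name -> handler

def zproc(section):
    """Decorator that registers a handler under a section name."""
    def register(func):
        ZPROCS[section.upper()] = func
        return func
    return register

@zproc("WORKING-STORAGE")
def working_storage(lines):
    return f"data:{len(lines)}"

@zproc("SQL")                            # hypothetical future section
def sql_section(lines):
    return f"sql:{len(lines)}"

def dispatch(section, lines):
    """Hand the text of one section to whichever zPROC claims it."""
    handler = ZPROCS.get(section.upper())
    if handler is None:
        raise KeyError(f"no zPROC registered for {section}")
    return handler(lines)
```

Adding an XML or CPP section would then mean registering one more handler, with no change to the dispatcher.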

The above notes are oriented to the notion of separating the parser tasks. 
The lexical task has less of a need for this suggested split.  But the 
PICTURE and VALUE clauses are probably going to be easier to deal with if we 
have a DATA DIVISION lexer that is separate from the PROCEDURE DIVISION lexer.

 

So in the interest of getting "... all this stuff out in the open, where we 
can stare at it," I would say that we might want our task list to include 
these work items:

parameter processes
preprocessor (much accomplished, but would experience a burst of revision
  to structure it for spinning off multiple ZIFs and spawning zPROCs to 
  exploit TLP)
error processor
symbol table processes
zPROCs for each division (ZIF processors)
zFILTERs (to blob out-of-division reserved words, and to characterize 
  identifiers)
FIF processor (gets First Intermediate Form, emits compilable C code)
probably need a completer 
GNU COBOL registry
GNU COBOL task and component house cleaning

Abbreviations:
SYMT - symbol table
ZIF - Zeroth Intermediate Form (preprocessed text, still in COBOL)
FIF - First Intermediate Form (from the parser)
zPROCs - Division Processors, ZIF processors unique to each division.

Best Wishes,








--
This message was sent through the gnu-cobol mailing list.  To remove yourself
from this mailing list, send a message to majordomo@lusars.net with the
words "unsubscribe gnu-cobol" in the message body.  For more information on
the GNU COBOL project, send mail to gnu-cobol-owner@lusars.net.