
Re: gnubol: procedure types



Just a general comment, entirely subjective, concerning I/O and in-core 
compilation.

The supposed retained buffers are not going to be retained on a real shared 
system; rather than us doing the I/O, the OS will be paging. That means we 
will lose control of performance. Strategies that hold vast token lists in 
core exacerbate this. Again, the idea is that real systems are shared; in a 
development environment, each of the sharers might be running our tool, which 
assumes retained buffers and holds vast token lists. A real regen of a real 
COBOL application is going to thrash on the virtual memory device. We will 
not be able to get at the problem, and much of that thrashing would not be 
necessary.

The basic sentiment of extreme programming is entirely acceptable: 'let's get 
moving'. That is fine, and I don't see droves of available coders, so I am 
realistic as I offer a counterpoint. But I think it is at least worth 
considering that major phases may need three modalities: the first, such as 
the one depicted, tries to hold everything in core; the second holds material 
until it busts, and then uses the equivalent of a least-recently-used 
algorithm to store it temporarily, or some permanent retirement scheme; the 
third sends most internal resources to DASD from the get-go, retrieving them 
when needed.

An example of the need for the third is a program with a large procedure 
division that has many, many, many paragraphs. This implies lots of forward 
references. The SQL program described in these postings fits that category. 
We can detect that at preprocess time! Also detectable at preprocess time is 
nested program content (and, further out, class and method programs). These 
sneak-preview aspects could position the mainline of the parser(s) to apply 
certain reasonably applicable memory management constraints.
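
To make that concrete, here is a minimal sketch in C of how preprocess-time 
hints might drive the choice among the three modalities. Every name in it 
(phase_hints, choose_symt_mode) and every threshold is hypothetical, offered 
only to show the shape of the decision:

    #include <stddef.h>

    enum symt_mode {
        SYMT_IN_CORE,    /* modality one: hold everything in memory  */
        SYMT_SPILL_LRU,  /* modality two: spill least-recently-used  */
        SYMT_DISK_FIRST  /* modality three: DASD from the get-go     */
    };

    struct phase_hints {         /* gathered while preprocess scans      */
        size_t paragraph_count;  /* many paragraphs imply many forward
                                    references                           */
        size_t nested_programs;  /* nested program content seen up front */
        size_t source_bytes;     /* raw size of the propagating source   */
    };

    enum symt_mode choose_symt_mode(const struct phase_hints *h)
    {
        /* Placeholder thresholds; real values would come from tuning. */
        if (h->paragraph_count > 10000 || h->source_bytes > (64u << 20))
            return SYMT_DISK_FIRST;
        if (h->paragraph_count > 1000 || h->nested_programs > 0)
            return SYMT_SPILL_LRU;
        return SYMT_IN_CORE;
    }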

Generally, and subjectively, it is not reasonable to assume infinite 
resources. If there ever was a language that would prove us wrong in such an 
assumption, it is COBOL.

Surely we need a main concept and forward momentum in an open project, but if 
we at least layer certain internal pieces, like SYMT, behind function calls, 
we can intercept accesses to compiler components and mediate them to some 
store other than memory. I am not advocating object orientation; we would 
never finish. But SYMT might well need to be on disk at times, and we will be 
much more successful with that if we know it from the start (assuming 
agreement).
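
For instance, a layered SYMT might present an interface like this sketch; the 
names (symt_open, symt_intern, symt_lookup, symt_close) and the opaque handle 
are illustrative assumptions, not a proposed design. The point is only that 
callers never touch the backing store directly, so memory, a buffered file, 
or raw DASD can sit behind it interchangeably:

    #include <stddef.h>

    typedef struct symt symt;      /* opaque: callers never see the layout */
    typedef unsigned long symt_id; /* a small handle, never a raw pointer
                                      into the store                       */

    symt   *symt_open(const char *work_path);  /* may be file-backed */
    symt_id symt_intern(symt *t, const char *name, size_t len);
    int     symt_lookup(symt *t, symt_id id, char *buf, size_t buflen);
    void    symt_close(symt *t);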

Anyway, let me point something out here that relates mostly to the procedure 
division, and a little to the data division. This is not meant in any way to 
determine the choice of left-hand or right-hand reduction tools (LL/LR). 
These are just comments relating that consideration to postings that envision 
retaining long lists of tokens. If we reduce on the left, the tokens you are 
retaining become obsolete very quickly. I am not sure why you would hold 
them.

In any tool, once the tokens are reduced, about the only way you are going to 
get to them is by using the address bindings to the parser tool's data names. 
Obviously you can hang one of your retentions there, but if you do not hang 
it on that structure or some AST, what use is it? I mean, when you are into 
paragraph 17 of section 42, why do you still hold raw material from s1p1, 
especially the keyword minutiae? So as you get to certain stages in the 
iteration, hopefully it will be discernible what got hung on some surviving 
structure, and what is becoming obsolete. I would advocate developing with an 
intent to find those synch points, and doing _lots_ of free()s. Just my view 
though.
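
As a sketch of that discipline, suppose the tokens consumed by a paragraph 
are kept on a list until the paragraph reduces, and then released wholesale 
at the synch point. The struct token layout here is an assumption for 
illustration only:

    #include <stdlib.h>

    struct token {
        int code;            /* the integer token value            */
        char *text;          /* only non-keywords ever carry text  */
        struct token *next;
    };

    /* Called once a paragraph has reduced and nothing on this list
       got hung on a surviving structure. */
    void release_paragraph_tokens(struct token *head)
    {
        while (head) {
            struct token *next = head->next;
            free(head->text);   /* free(NULL) is a defined no-op */
            free(head);
            head = next;
        }
    }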


And one other note, an aside about the preprocessor. Again, strictly 
subjective. I don't see much point in any duplication or preservation of a 
keyword as text. Whatever we decide to do about standards permitting 
rewriting of keywords, or COBOL '61 DEFINE phantasmagoria, we should figure 
that out from the perspective of what, if anything, needs to be provided to 
error processing. I believe that real tokens can be condensed to an integer 
as early as lex. If our design of error processing is elegant enough, we just 
pass the integer. The underlying error text would be a printf-like string 
such as "Unexpected %s found", where the error processor translates the token 
with demodulate_token(integer_token_value), which returns a conventional 
string like PERFORM; or, harnessing whatever genius alternate table we 
evolve, pulls PERFORM out of position integer_token_value. I am radical on 
this. I don't think that we need to point any structure to lltext for a 
token, much less duplicate such bricks.
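
Something like the following sketch is all I have in mind; the token set and 
the body of demodulate_token are made up for illustration, and only the shape 
matters: integers everywhere, text recovered only inside error processing:

    #include <stdio.h>

    enum tok { TOK_MOVE, TOK_PERFORM, TOK_STOP, TOK_COUNT };

    static const char *const tok_name[TOK_COUNT] = {
        [TOK_MOVE]    = "MOVE",
        [TOK_PERFORM] = "PERFORM",
        [TOK_STOP]    = "STOP",
    };

    /* Integer in, conventional display string out. */
    const char *demodulate_token(int t)
    {
        return (t >= 0 && t < TOK_COUNT) ? tok_name[t] : "?";
    }

    void report_unexpected(int t)
    {
        /* Everything above the lexer passed only the integer around. */
        fprintf(stderr, "Unexpected %s found\n", demodulate_token(t));
    }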

Likewise for SYMT: the lexer, not necessarily the preprocessor, should 
definitely reduce references to SYMT pointers or SYMT IDs of some kind. It is 
not unthinkable that the preprocessor could participate in this. But really, 
toting around copious character data that is nothing more than the hundreds 
and thousands of references to the same data names, all in core, is not, I 
think, a good start.

So in the sense of design, I think that all lexers should reduce the quantity 
of characters. Per the idea discussed as inverting the preprocessor I/O 
routines, I encourage the notion of doing some reductions up front. I am 
strongly in favor of a mechanism that could snoop the outbound records from 
preprocess, generating a work list. Sensing nested programs right then is key 
to resource management (as well as eventually providing concurrency 
opportunities). But really, that is the exact first point at which the 
character count for PROG_NAME declarations and references can be reduced. The 
idea is simple, though I don't claim the contributors' workload is small. If 
a thing is a keyword, justify any survival of the text; failing that, encode 
it as an integer in any store other than the propagating source. If a thing 
is not a keyword, justify any duplication of the text; otherwise store the 
text only once. The preprocessor does not have to really do SYMT work, but 
there is no reason it cannot hand the base material to SYMT.
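
Put as code, that rule might look like the sketch below; lookup_keyword and 
symt_intern_name are hypothetical helpers standing in for whatever keyword 
table and SYMT interface we actually evolve:

    #include <stddef.h>

    typedef unsigned long symt_id;                 /* as in the SYMT sketch */

    int     lookup_keyword(const char *s, size_t n);   /* hypothetical: < 0
                                                          when not a keyword */
    symt_id symt_intern_name(const char *s, size_t n); /* hypothetical: stores
                                                          the text only once */

    struct lexeme {
        int     is_keyword;
        int     keyword_code;   /* meaningful when is_keyword  */
        symt_id name_id;        /* meaningful otherwise        */
    };

    struct lexeme encode_lexeme(const char *text, size_t len)
    {
        struct lexeme lx = {0, 0, 0};
        int kw = lookup_keyword(text, len);
        if (kw >= 0) {
            lx.is_keyword = 1;
            lx.keyword_code = kw;      /* the text need not survive */
        } else {
            lx.name_id = symt_intern_name(text, len); /* stored once */
        }
        return lx;
    }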

At this stage a global decision can be made about the UPPERCASE/lowercase 
aspect of interaction with the user in error messages. Whether, under 
whatever options, we fold data names together by case or keep them separate, 
the preprocessor does not need to know; it is just handing the stuff to the 
nascent SYMT, which will know, somehow, someday, how to keep things straight. 
DAtaNAMe may not be the same as datanamE under any given compilation option, 
but what does the preprocessor care? It is just a hand-off; let the fullback 
figure out how to get to the end zone. But distinctly, under whatever 
compilation options, the lexer (beneath the preprocessor and any parser(s)) 
is already blind to the difference between Move and mOVE and MOVE. Why store 
that kind of junk? It is integer=n, or we are lost anyway.
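
For example, a case-blind keyword comparison in the lexer might look like 
this sketch, assuming the keyword table is stored in upper case; 
kw_equal("Move", "MOVE") and kw_equal("mOVE", "MOVE") both succeed, and only 
the integer token survives the match:

    #include <ctype.h>

    /* lexeme is raw source text; keyword is an upper-case table entry. */
    static int kw_equal(const char *lexeme, const char *keyword)
    {
        while (*lexeme && *keyword) {
            if (toupper((unsigned char)*lexeme) != *keyword)
                return 0;
            lexeme++;
            keyword++;
        }
        return *lexeme == *keyword;   /* both exhausted: same length */
    }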

I would take another approach. I would start with SYMT as a physical file, 
point blank. If you think you have the resources, buffer that file until your 
gills burst. Layer SYMT with functions to make all of that transparent to 
every routine that needs SYMT.
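
A first cut of that file-backed layer might look like the sketch below. The 
record layout (a length byte followed by the name, with the file offset 
serving as the ID) and all the names are assumptions; a cache could later be 
added inside the layer without any caller noticing:

    #include <stdio.h>
    #include <stdlib.h>

    struct symt_store {
        FILE *fp;   /* the physical file, point blank */
        /* buffering, if resources permit, would hide in here */
    };

    struct symt_store *symt_store_open(const char *path)
    {
        struct symt_store *s = malloc(sizeof *s);
        if (!s)
            return NULL;
        s->fp = fopen(path, "w+b");
        if (!s->fp) {
            free(s);
            return NULL;
        }
        return s;
    }

    /* Append one record; the record's offset doubles as its ID. */
    long symt_store_put(struct symt_store *s, const char *name, size_t len)
    {
        if (len > 255 || fseek(s->fp, 0, SEEK_END) != 0)
            return -1;
        long off = ftell(s->fp);
        unsigned char n = (unsigned char)len;  /* COBOL names are short */
        if (fwrite(&n, 1, 1, s->fp) != 1 ||
            fwrite(name, 1, len, s->fp) != len)
            return -1;
        return off;
    }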

In preprocess we do not know what the program means. We are nowhere near 
semantics. But we know the difference between reserved words (user modifiable 
or not), data names, literals, and punctuation, because we are born that way. 
These distinctions are a priori.

Admittedly, some picture strings and arcane environment division matter might 
mimic references to data or procedures. But who cares! Jam it into the SYMT, 
jam jam jam, write write write. The lex in the parsers should be able to keep 
things straight.

The idea is to reduce, as in compress. A compiler compiles. That is kind of a 
funny statement, isn't it? Compile, then. Make tables. Make a SYMT post 
haste, and I mean do it now.

But even if you hate all that, please do not dup tokens or even make pointers 
to them as textual items. They are not text. They are tokens. They are 
integers. Stated differently, all programs above a lexer must see 'MOVE' (or 
whatever case manifestation of it) as a token. Stated differently, the 
high-level programs are in trouble if they can see 'M'-'O'-'V'-'E'; they have 
ceased to compile, and are instead disassembling COBOL, which ain't right.

If we insist upon having structures (C structs) bubbling out of lexers, then 
I question any pointer to text there. That is rad, I know, but if you manage 
resources, any such structure instead carries identifiers into the tables 
wherein the accumulating stuff has been compiled. Even if a compile has 
multiple phases, it puts stuff in tables; I just say do it right away. But 
never point to tokens as text; that belies a giganticism that will make it 
impossible to move major applications to this tool.

I know that the preprocess has a different lexical capability than any 
further stage, so many things appear to it as text; I am not after that. I am 
after the image in its output buffer at that instant, or when it bounces into 
any reader of the preprocess output file(s).

Best Wishes
Bob Rayhawk
RKRayhawk@aol.com
