Re: gnubol: procedure types
Just a general comment, entirely subjective, concerning I/O and in-core
compilation.
The supposed retained buffers are not going to be retained on a real shared
system; rather than us doing the I/O, the OS will be paging. That means we
will lose control of performance. Strategies that hold vast token lists in
core exacerbate this. Again, the idea is that real systems are shared; in
development environments each sharer might be running our tool, which
assumes retained buffers and holds vast token lists. A real regen of a real
COBOL application is going to thrash on the virtual memory device, and we
will not be able to get at the problem. Much of that would not be necessary.
The basic sentiment of extreme programming is entirely acceptable: 'let's
get moving'. That is fine, and I don't see droves of available coders, so I
am realistic as I offer a counterpoint. But I think it is at least worth
considering that major phases may need three modalities: the first, as
depicted, tries to hold everything in core; the second holds material until
it busts, and then uses the equivalent of a least-recently-used algorithm to
store it temporarily, or some permanent retirement scheme; the third sends
most internal resources to DASD from the get-go, retrieving them when
needed.
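The second modality can be sketched in a few lines of C. This is a toy
illustration only, with invented names (store, fetch, SLOTS, RECLEN) and a
deliberately tiny budget; a real compiler table would be far more careful:

```c
#include <stdio.h>
#include <string.h>

/* Toy sketch of modality two: hold records in core until a fixed budget
 * busts, then retire the least recently used one to a scratch file and
 * fetch it back on demand.  All sizes and names here are invented. */

#define SLOTS  4            /* in-core budget, tiny for illustration */
#define RECLEN 32           /* fixed record length */

struct slot { int id; long stamp; char data[RECLEN]; };

static struct slot cache[SLOTS];
static int  used;
static long clock_now;
static FILE *spill;         /* scratch file, one RECLEN record per id */

static void spill_write(int id, const char *data)
{
    if (!spill) spill = tmpfile();
    fseek(spill, (long)id * RECLEN, SEEK_SET);
    fwrite(data, 1, RECLEN, spill);
}

static int spill_read(int id, char *out)
{
    if (!spill) return 0;
    fseek(spill, (long)id * RECLEN, SEEK_SET);
    return fread(out, 1, RECLEN, spill) == RECLEN;
}

void store(int id, const char *text)
{
    int victim = 0;
    if (used < SLOTS) {
        victim = used++;
    } else {                /* budget busted: evict least recently used */
        for (int i = 1; i < SLOTS; i++)
            if (cache[i].stamp < cache[victim].stamp)
                victim = i;
        spill_write(cache[victim].id, cache[victim].data);
    }
    cache[victim].id = id;
    cache[victim].stamp = ++clock_now;
    strncpy(cache[victim].data, text, RECLEN - 1);
    cache[victim].data[RECLEN - 1] = 0;
}

int fetch(int id, char *out)  /* 1 if found in core or on disk */
{
    for (int i = 0; i < used; i++)
        if (cache[i].id == id) {
            cache[i].stamp = ++clock_now;
            memcpy(out, cache[i].data, RECLEN);
            return 1;
        }
    return spill_read(id, out);
}
```

The point is only that callers see store() and fetch(); whether a record
lives in core or on the scratch device is the layer's business, not theirs.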
An example of the need for the third is a program with a large procedure
division that has many, many, many paragraphs. This implies lots of forward
references. The SQL program described in these postings fits that category.
We can detect that at preprocess time! Also detectable at preprocess time is
nested program content (and, looking ahead, class and method programs).
These sneak-preview aspects could position the mainline of the parser(s) to
project certain reasonably applicable memory management constraints.
Generally, and subjectively, it is not reasonable to assume infinite
resources. If ever there was a language that would prove us wrong in such
assumptions, it is COBOL.
Surely we need a main concept and forward momentum in an open project, but
if we at least layer certain internal pieces, like SYMT, with function
calls, we can intercept accesses to compiler components and mediate them to
some store other than memory. I am not advocating object orientation; we
would never finish. But SYMT might well need to be on disk at times, and we
will be much more successful with that if we know it from the start
(assuming agreement).
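A minimal sketch of such a layer, with hypothetical names (symt_intern,
symt_text) that are not any agreed project API: callers see only integer
ids, so a disk-backed implementation can later be swapped in behind the same
two signatures without touching any caller.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical SYMT facade.  Today the store is an in-core array; a
 * disk-backed version keeps the same signatures, so no caller changes
 * when SYMT moves to a file.  Case folding, if any, would be decided
 * inside this layer, not by the callers. */
typedef int symt_id;

#define SYMT_MAX 4096

static char *names[SYMT_MAX];
static int   count;

static char *dup_str(const char *s)
{
    char *p = malloc(strlen(s) + 1);
    strcpy(p, s);
    return p;
}

symt_id symt_intern(const char *name)   /* add if unseen, return id */
{
    for (int i = 0; i < count; i++)
        if (strcmp(names[i], name) == 0)
            return i;
    names[count] = dup_str(name);
    return count++;
}

const char *symt_text(symt_id id)       /* id back to text, for messages */
{
    return (id >= 0 && id < count) ? names[id] : NULL;
}
```

Every routine that needs SYMT goes through these calls; nothing else in the
compiler holds a pointer into the backing store.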
Anyway, let me point something out here that relates mostly to the procedure
division, and a little to the data division. This is not meant in any way to
determine the choice of left-hand or right-hand reduction tools (LL/LR).
These are just comments relating that consideration to postings that
envision retaining long lists of tokens. If we reduce on the left, the
tokens you are retaining become obsolete very quickly. I am not sure why you
would hold them.
In any tool, once the tokens are reduced, about the only way you are going
to get to them is by using the address bindings to the parser tool's data
names. Obviously you can hang one of your retentions there, but if you do
not hang it on that structure or some AST, what use is it? I mean, when you
are into paragraph 17 of section 42, why do you still hold raw material from
s1p1, especially the keyword minutiae? So as you get to certain stages in
the iteration, hopefully it will be discernible what got hung on some
surviving structure and what is becoming obsolete. I would advocate
developing with an intent to find those synch points, and doing _lots_ of
free()s. Just my view though.
And one other note, an aside about the preprocessor. Again, strictly
subjective. I don't see much point in any duplication or preservation of a
keyword as text. Depending on what we decide to do about standards
permitting rewriting keywords, or COBOL '61 DEFINE phantasmagoria, we should
figure that out from the perspective of what, if anything, needs to be
provided to error processing. I believe that real tokens can be condensed to
an integer as early as lex. If our design of error processing is elegant
enough, we just pass the integer.
The underlying error text would be a printf-like string such as "Unexpected
%s found", where the error processor translates the token with
demodulate_token(integer_token_value), which returns a conventional string
like PERFORM; or any genius alternate table we evolve pulls out PERFORM from
position integer_token_value. I am radical on this. I don't think that we
need to point any structure to lltext for a token, much less duplicate such
bricks.
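That scheme fits in a few lines. The token names and the table below are
illustrative only, not a proposed token set; the one real function name from
the discussion is demodulate_token.

```c
#include <stdio.h>

/* Sketch of the scheme above: the parser carries only an integer, and
 * the error processor translates it back to conventional text at the
 * last moment.  The token set here is an invented subset. */
enum { TOK_MOVE, TOK_PERFORM, TOK_STOP, TOK_COUNT };

static const char *token_text[TOK_COUNT] = {
    "MOVE", "PERFORM", "STOP"
};

const char *demodulate_token(int tok)
{
    return (tok >= 0 && tok < TOK_COUNT) ? token_text[tok] : "?";
}

void report_unexpected(char *out, size_t n, int tok)
{
    /* the printf-like error text is filled in only at report time */
    snprintf(out, n, "Unexpected %s found", demodulate_token(tok));
}
```

Nothing upstream of the error processor ever holds the string "PERFORM";
only the integer travels.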
Likewise for SYMT: the lexer, not necessarily the preprocessor, should
definitely reduce references to SYMT pointers or SYMT IDs of some kind. It
is not unthinkable that the preprocessor could participate in this. But
toting around copious character data that is nothing more than the hundreds
and thousands of references to the same data names, all in core, is not a
good start.
So in the sense of design, I think that all lexers should reduce the
quantity of characters. Per the idea discussed as inverting the preprocessor
I/O routines, I encourage the notion of doing some reductions up front. I am
strongly in favor of a mechanism that could snoop the outbound records from
preprocess, generating a work list. Sensing nested programs right then is
key to resource management (as well as eventually providing for concurrency
opportunities). But really, that is the exact first point at which the
character count for PROG_NAME declarations and references can be reduced.
The idea is simple, though I don't claim the project contributors' workload
is small. If a thing is a keyword, justify any survival of the text; if you
cannot, encode it as an integer in any store other than the propagating
source. If a thing is not a keyword, justify any duplication of the text;
otherwise store the text only once. The preprocessor does not have to really
do SYMT work, but there is no reason it cannot hand the base material to
SYMT.
At this stage a global decision can be made about the UPPERCASE/LOWERCASE
aspect of interaction with the user in error messages. If, under whatever
options, we fold datanames together by case or keep them separate, the
preprocessor does not need to know; it is just handing the stuff to the
nascent SYMT, which will know somehow, someday, how to keep things straight.
DAtaNAMe may not be the same as datanamE under any given compilation option,
but what does the preprocessor care? It is just a hand-off; let the fullback
figure out how to get to the end zone. But distinctly, under whatever
compilation options, the lexer (beneath the preprocessor and any parser(s))
is already blind to the difference between
Move and mOVE and MOVE. Why store that kind of junk? It is integer = n, or
we are lost anyway.
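A case-blind keyword encoder is all it takes; this is a sketch with an
invented three-entry table and hypothetical names (encode_keyword, KW_*),
not any agreed reserved-word list:

```c
#include <ctype.h>

/* Move, mOVE and MOVE all collapse to the same integer, so nothing
 * above the lexer ever sees the spelling.  Tiny illustrative table. */
enum { KW_MOVE, KW_PERFORM, KW_ADD, KW_NONE };

static const char *keywords[] = { "MOVE", "PERFORM", "ADD" };

int encode_keyword(const char *word)
{
    for (int k = 0; k < KW_NONE; k++) {
        const char *p = word, *q = keywords[k];
        while (*p && toupper((unsigned char)*p) == *q) { p++; q++; }
        if (!*p && !*q)          /* both exhausted: case-blind match */
            return k;
    }
    return KW_NONE;              /* not a keyword */
}
```

A production lexer would use a perfect hash or a sorted table rather than a
linear scan, but the contract is the same: text in, integer out, once.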
I would take another approach. I would start with SYMT as a physical file,
point blank. If you think you have the resources, buffer that file until
your gills burst. Layer SYMT with functions to make all of that transparent
to every routine that needs SYMT.
In preprocess we do not know what the program means. We are nowhere near
semantics. But we know the difference between reserved words (user
modifiable or not), datanames, literals, and punctuation, because we are
born that way. These distinctions are a priori.
Admittedly, some picture strings and arcane environment division matter
might mimic references to data or procedures. But who cares! Jam it into the
SYMT, jam jam jam, write write write. The lex in the parsers should be able
to keep things straight.
The idea is to reduce, as in compress. A compiler compiles. That is kind of
a funny statement, isn't it? Compile, then. Make tables. Make a SYMT post
haste, and I mean do it now.
But even if you hate all that, please do not dup tokens or even make
pointers to them as textual items. They are not text. They are tokens. They
are integers.
Stated differently, all programs above a lexer must see 'MOVE' (or whatever
case manifestation of it) as a token.
Stated differently, the high-level programs are in trouble if they can see
'M'-'O'-'V'-'E'; they have ceased to compile, and they are disassembling
COBOL, which ain't right.
If we insist upon having structures (C structs) bubbling out of lexers, then
I question any pointer to text there. That is radical, I know, but if you
manage resources, any such structure instead has identifiers into the tables
wherein the accumulating stuff has been compiled. Even if a compile has
multiple phases, it puts stuff in tables; I just say do it right away. But
never point to tokens as text; that bespeaks a giganticism that will make it
impossible to move major applications to this tool.
I know that the preprocess has a different lexical capability than any
further stage, so many things appear to it as text; I am not after that. I
am after the image in its output buffer at that instant, or when it bounces
into any reader of the preprocess output file(s).
Best Wishes
Bob Rayhawk
RKRayhawk@aol.com
--
This message was sent through the gnu-cobol mailing list. To remove yourself
from this mailing list, send a message to majordomo@lusars.net with the
words "unsubscribe gnu-cobol" in the message body. For more information on
the GNU COBOL project, send mail to gnu-cobol-owner@lusars.net.