
gnubol: good work in the preprocessor




Tim,

I have had the privilege of looking at the material that you recently 
regathered as the preprocessor.  Gosh a lot of hard work has been done.

It takes time to absorb as much as is there so don't get too upset if you
find the following comments too far off base. But I figure the earlier the
feedback the better. It is really the idea of keeping programs fast that
I am after. Should you do more work on the preprocessor, I hope you will
find these comments useful.

For others who might code lexer logic for components of the compiler
that will get right to the source code, especially the copybook expanded
source code, perhaps these comments are relevant to you as well.


In cobgocmp.H among other places there is a structure ...

/* structure for token in compiler */

struct cmp_token_struct {
  int cmp_token_type; /* tk_... out of parser cobgoprs.y */
  int cmp_token_basic_type; /* tk_... out of lex  */
  struct cmp_token_struct * cmp_token_next;
  int cmp_token_lineno; 
  int cmp_token_charno;
  char * cmp_token_file_name;
  unsigned char * cmp_token_chars;
  unsigned char * cmp_token_chars_upper;
  int cmp_token_length;
  char * cmp_token_line_ptr;
};

In cobgolex.l
in a function named
update_lineno_charno () {

there is code well written and representing clear thought ...
...
   yylval.cmp_token_length=yyleng;
   yylval.cmp_token_chars=yytext;
...

but IMHO this is not entirely a good idea.
Aliasing yytext is merely self deception.

If the invoker of the lexer intends to use the matter at yytext,
time is of the essence. Basically that higher level module needs access to
the data name as a global. The data located there is generally not
guaranteed to survive the next call to the lexer. Attempts to extract from
or compare that location the _next_ time back from the lexer will be buggy,
and the aliasing of yytext will make the bug very hard to isolate.

Even if the invoker just intends to make a copy pronto, it still
should just use yytext itself to get at it. The cycles spent copying
the yytext pointer are potentially harmful, IMHO, but notably also a waste.
The space in the data structure is a waste too, but I expect that there are
not many occurrences of it at any one time.
If there are, that means all but the most recent have a stale pointer!
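
To make the stale-pointer hazard concrete, here is a minimal sketch in C. It
does not use flex at all: fake_yytext, fake_yyleng, fake_lex() and
dup_lexeme() are hypothetical stand-ins invented for illustration, on the
assumption that the real scanner reuses its text buffer from one call to the
next.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the scanner's yytext and yyleng (hypothetical names). */
static char fake_yytext[64];
static int  fake_yyleng;

/* Hypothetical stand-in for a call to the lexer: the buffer is reused. */
static void fake_lex(const char *word)
{
    size_t n = strlen(word);
    assert(n < sizeof fake_yytext);
    memcpy(fake_yytext, word, n + 1);
    fake_yyleng = (int)n;
}

/* Copy the current lexeme into storage the caller owns (caller frees). */
static char *dup_lexeme(void)
{
    char *copy = malloc((size_t)fake_yyleng + 1);
    if (copy != NULL)
        memcpy(copy, fake_yytext, (size_t)fake_yyleng + 1);
    return copy;
}

/* Returns 1 if the alias went stale while the duplicate stayed intact. */
int alias_goes_stale(void)
{
    fake_lex("MOVE");
    const char *alias = fake_yytext;  /* like cmp_token_chars = yytext */
    char *copy = dup_lexeme();        /* what a surviving token needs  */
    if (copy == NULL)
        return 0;

    fake_lex("TO");                   /* the next call to the lexer */

    int stale  = strcmp(alias, "MOVE") != 0;  /* alias now sees the new text */
    int intact = strcmp(copy, "MOVE") == 0;   /* the duplicate is unchanged  */
    free(copy);
    return stale && intact;
}
```

The alias costs nothing to set up, which is exactly why the bug is easy to
plant; the duplicate costs an allocation, which is why it should be made only
for text that really must outlive the current call.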


The same function is populating yylval.cmp_token_file_name with an assignment
such as

   yylval.cmp_token_file_name="xxx;";

Clearly that is just stub code, which is cool. But there is a program
design issue that I think can be identified there as well. It is probably 
not necessary to pass the equivalent of a file identifier _with_every_token_!

I can see how this consideration could put some pressure on the effort to
figure out the preprocessor which must expand copy statements. The expansion
process as an algorithm spans files. Error messages can only relate to one
file at a time, however, IMHO. If I am not mistaken, the objective of toting
the file identification is to support reportage to the tool user; such
reportage must be one line at a time (as far as diagnostics relating to
tokens). If it is one line at a time, then it is one file at a time.
(( I am not forcing this at all. The problem could be complicated, since
you could certainly encounter some reportable fact, the nature of
which actually spans the file, ... but ... you cannot talk to the
user that way.  Or rather we probably do not have time to create error
reporting that is that good. ))

I would not be surprised if it might be difficult to make the kind of change
implied by these comments.  It actually involves the structure of the
interface to the invoker of the preprocessor lexer.

I have seen source modules in excess of 50,000 lines, before copybooks.
I have never seen a COBOL shop that tended to make copybooks short! The
lexeme stream (which is just a list of all the parts discovered on each 
line) which must be cogitated 
for each pass of the source code within the compiler will be quite large.
Especially when multiplied times hundreds of programs in an application
regen context.

So as inconspicuous as

   yylval.cmp_token_file_name="xxx;";

may seem, if it happens for each lexeme, it is very big. I have commented at length,
because it is the general concept that matters, not the file id attribute
per se.

In the front-end, the lexeme flow is inside the innermost loop of the
innermost loop. Maybe we should not drag it down with unnecessary weight.
That is intended as criticism of the constructive kind. We will have at least
two of these inner inner loops if the preprocessor is distinct from the
parser(s).

At about the point that yyleng and yytext are current, the lexemes are
flowing. Two factors lead one to contemplate a tight coupling between the
lexer and the invoker of the lexer. First, the lifetime of the values is
short (actually especially yytext; if you preserve its contents then a mere
copy of yyleng can 'live on', as long as your code is a good juggler). Second,
the invoker to lexer interface is the core loop with a potentially large
iteration count. If an alternative is viable, then the pointer copy
action that aliases yytext is a waste.

I have not chased the interface in detail, but on the surface there appear
to be some luxurious assumptions made. Some variations of acceptable
programming methodologies tend to lead to extra structures. Usually that
provides insulation between code, as well as clarity. Copying text pointers is
dubious on its face. Copying yytext is a well known weakness. If a previous
yytext will be needed, its contents must be duplicated forthwith, before
exit from the lexer (or before return to the lexer).
That too, however, takes wind out of the sails.


I realize that the priority has been to get some kind of working version
of the very frontest piece, and obviously a linear interaction
between the lexer and the invoker of the lexer is an excellent first
approximation. But as we stabilize that and continue to enhance it, the value
of structuring the interface might be considered in a general review of the
code:

Other structure member manipulations raise similar considerations, like

   yylval.cmp_token_line_ptr="xxx;";

If that is in any buffer, or even the lexer's character store, can that
location possibly be relied upon to
survive another call to the lexer? If not, then it should be duplicated.

((This is a real aside.  I know that stuff is stub code.  But taken more
literally, and I do that simply to get to a concept, ... taken more 
literally, it's definitely worth mentioning that the lexer should not
provide anyone with access to its constants or its text-literals. 

Where necessary the lexer must copy a constant or a literal.

If you do provide a pointer to the lexer's data, 
inevitably someone will be off to the races with that data area doing
damage to either a general purpose data area in global space, or worse
to the stack.  This is a back door way of saying that invokers should
get direct access to yytext as a global and _they_ should do the 
allocating and copying  .... actually, obviously a subroutine of the lexer
or lexer itself can do this as long as they allocate and dup. Again I know
we have stub code here.  
But the urgent issue is that outsiders should not
get access to the lexer's stuff, except what they need and that as fast
as possible, with an awareness that it is ephemeral as all get out. ))

So an idea of the structural aspect of the interface might be vaguely
sketched as
 There are files, and within that
    There are lines, and within that
       There are lexemes
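
One way that nesting might be expressed in C, purely as a sketch: file and
line facts live in a single scanner-wide record that changes only when the
file or line changes, and the token carries only what differs per lexeme.
Every name here (scan_context, cur_ctx, small_token, enter_file, next_line,
format_diag) is hypothetical and does not exist in the current code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical scanner-wide context: one copy, not one per token. */
struct scan_context {
    const char *file_name;  /* set once per file */
    int         lineno;     /* bumped once per line */
};
struct scan_context cur_ctx;

/* Hypothetical slim token: only what varies from lexeme to lexeme. */
struct small_token {
    int type;
    int charno;
    int length;
};

/* Called when the scanner opens a source file or copybook. */
void enter_file(const char *name)
{
    cur_ctx.file_name = name;
    cur_ctx.lineno = 0;
}

/* Called when the scanner moves to the next line. */
void next_line(void)
{
    cur_ctx.lineno++;
}

/* Build a diagnostic for the token just scanned. File and line come from
   the context, so this is only valid while that line is still current.
   Returns what snprintf returns. */
int format_diag(const struct small_token *tk, const char *msg,
                char *out, size_t cap)
{
    assert(tk != NULL && msg != NULL && out != NULL);
    return snprintf(out, cap, "%s:%d:%d: %s",
                    cur_ctx.file_name, cur_ctx.lineno, tk->charno, msg);
}
```

The trade-off is the one already described: a diagnostic has to be issued
while the line is current, because nothing in the token remembers where it
came from.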

I do not assume that it will be easy to impose such a structure on the
interface. But I am sure that duplicating excessively is expensive, and
expecting survival of the text across lex iterations is not safe. So the
circumstance deserves continuous analysis. This is
a choke point in a compiler. The solution, more broadly, is well designed
globals (which is essentially hateful to some methodologies). As experience
with the scanner grows, you realize that you need certain twigs to decorate
the tree with all the time. Put them in global space to avoid alloc/free
overhead. For anything that really must survive crossing an iteration of
the calls to the lexer, it needs to be duplicated, which is not the approach
implied by the current datanames ending in _ptr.
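
As a rough illustration of "put them in global space to avoid alloc/free
overhead", here is a sketch of a fixed global pool that duplicates lexeme
text without calling malloc per token. The names pool, pool_dup, pool_reset
and the size POOL_BYTES are assumptions made for this example, not anything
in the existing preprocessor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_BYTES 4096  /* arbitrary size chosen for this sketch */

static char   pool[POOL_BYTES];
static size_t pool_used;

/* Copy len bytes of lexeme text into the global pool and NUL-terminate.
   Returns NULL when the pool cannot hold the copy. */
char *pool_dup(const char *text, size_t len)
{
    assert(text != NULL);
    if (len >= POOL_BYTES - pool_used)  /* need len + 1 bytes */
        return NULL;
    char *dst = pool + pool_used;
    memcpy(dst, text, len);
    dst[len] = '\0';
    pool_used += len + 1;
    return dst;
}

/* Release every copy at once, for example when a line is finished. */
void pool_reset(void)
{
    pool_used = 0;
}
```

A caller would pass yytext and yyleng to pool_dup() for the few lexemes that
must survive the next call to the lexer, and call pool_reset() at whatever
boundary (line, statement, file) makes those copies dead.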

Pieces of the preprocessor that use pointers in the lexeme flow, do not 
generally need to know much about how they got set-up. Only the alloc/free
logic needs to rig things to globals, and avoid zapping a global. That might
require some baggage in the token structure, yet maybe no more than a bit.

That having been said, let me emphasize that it is easy to admire the hard
work that has been done on the code to date.
Best Wishes
Bob Rayhawk
RKRayhawk@aol.com

--
This message was sent through the gnu-cobol mailing list.  To remove yourself
from this mailing list, send a message to majordomo@lusars.net with the
words "unsubscribe gnu-cobol" in the message body.  For more information on
the GNU COBOL project, send mail to gnu-cobol-owner@lusars.net.