
Re: gnubol: How do we parse this language, anyway?



Well, I'll welcome you back and I hope you'll stay for a while.  I
don't know that I can keep you entertained, but your effort deserves
some response.

>>>>> "Randall" == Randall Bart <Barticus@att.net>
>>>>> wrote the following on Sat, 27 Nov 1999 17:29:58 -0800

  Randall> I've gotten behind in my email, and if I wait until I
  Randall> catch up, I'll never catch up.  Here are my responses to
  Randall> several messages about parsing (mostly from Bob).

  Randall> At 10:53 AM 11/20/99 , RKRayhawk@aol.com wrote:

  >> The supposed retained buffers are not going to be retained in a
  >> real shared system; rather than us doing the I/O, the OS will be
  >> paging. That means that we will lose control of
  >> performance. Strategies that hold vast token lists in core
  >> exacerbate this.  Again the idea is that real systems are
  >> shared; in development environments the sharees might each be
  >> using our tool which is assuming retained buffers and holding
  >> vast token lists. A real regen of a real COBOL application is
  >> going to thrash on the virtual memory device. We will not be
  >> able to get to the problem. Much of that would not be necessary.

  Randall> I started programming in a world where we measured in
  Randall> kilohertz, and kilobytes.  Now that we measure in mega and
  Randall> giga and tera it's hard to worry about a few extra bytes
  Randall> per token.  But I agree, we should reduce the tokens as
  Randall> much as possible in the preprocessor.

Oh good!  War stories.

Me too.  My first one ran in 40K with no random access storage.  It's
kind of interesting that that compiler had fewer restrictions than
many modern compilers.  Imposing restrictions based on the resources
available to us would have made the compiler virtually worthless.

I'll agree with your conclusion, of course.

  Randall> While we work on designing the parser(s), Tim should
  Randall> continue enhancing the preprocessor, nibbling around the
  Randall> edges of the language.  Each token should be reduced to an
  Randall> index into a symbol table.  Verbs and keywords can be
  Randall> identified as such (perhaps as reserved symbol table
  Randall> indexes).  The preprocessor could do a lot of token
  Randall> manipulation.  A OF B could be reduced to a single token.

That could reduce the number of tokens, but it would require that the
text capacity of a token be sufficient to hold 50 maximum-size
data-names plus separators, unless you have a compression scheme in
mind.  I wouldn't want the preprocessor to have a symbol table, but the
main lexer might be able to reduce such things to symbol table references.
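
To make the symbol table idea concrete, here is a rough sketch in
Python (not project code).  The names SymbolTable and reduce_qualified
are invented for illustration, and the reserved word list is a tiny
stand-in.  It assumes the words arrive already split, and it treats
anything that is not reserved as a data-name.  A qualified reference
such as A OF B becomes one index whose table entry is a tuple of
indexes, so no token ever has to carry the full text.

```python
class SymbolTable:
    """Maps each distinct key (word or tuple of indexes) to an integer."""

    def __init__(self, reserved):
        self._index = {}
        self._entries = []
        # Reserved words are interned first, so they get the low indexes.
        for word in reserved:
            self.intern(word)
        self.reserved_count = len(self._entries)

    def intern(self, key):
        """Return the index for key, adding it on first sight."""
        idx = self._index.get(key)
        if idx is None:
            idx = len(self._entries)
            self._index[key] = idx
            self._entries.append(key)
        return idx

    def is_reserved(self, idx):
        return idx < self.reserved_count

    def entry(self, idx):
        return self._entries[idx]


def reduce_qualified(words, table):
    """Replace each word by its index; fold A OF B IN C into one index."""
    out = []
    i = 0
    while i < len(words):
        chain = [table.intern(words[i].upper())]
        if not table.is_reserved(chain[0]):
            # Absorb "OF name" / "IN name" pairs that follow a data-name.
            while i + 2 < len(words) and words[i + 1].upper() in ("OF", "IN"):
                nxt = table.intern(words[i + 2].upper())
                if table.is_reserved(nxt):
                    break
                chain.append(nxt)
                i += 2
        if len(chain) == 1:
            out.append(chain[0])
        else:
            out.append(table.intern(tuple(chain)))
        i += 1
    return out


table = SymbolTable(["ADD", "TO", "OF", "IN"])
tokens = reduce_qualified("ADD A OF B TO C".split(), table)
print(tokens, table.entry(tokens[1]))
```

The six input words come out as four indexes, and the entry for the
second one is just the pair of indexes for A and B.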

  Randall> There is a class of tokens I call pseudo-verbs: Period,
  Randall> ELSE, WHEN, SIZE ERROR, OVERFLOW, END-OF-PAGE, INVALID
  Randall> KEY, EXCEPTION, END, END-x, (that list may not be
  Randall> exhaustive).  These could be identified and matched to
  Randall> their antecedents in an early pass, or at least they could
  Randall> be marked for easier digestion by the parser.  Perhaps we
  Randall> could pass the parser just a single statement at a time.
  Randall> Actually, that's a statement fragment, since it would
  Randall> already be split at the pseudo-verbs, e.g.,

     IF  A = B OR C OR > D
         ADD 1 TO Z
             ON SIZE ERROR
                 PERFORM P1
             NOT ON SIZE ERROR
                 PERFORM P2
         END-ADD
     ELSE
         PERFORM P3 VARYING X FROM 1 BY 1 UNTIL X > 10
     END-IF
     .

  Randall> As I've shown this, each line is a statement fragment,
  Randall> beginning with a verb or pseudo-verb, except that NOT is
  Randall> found in front of the pseudo-verb, as are the optional
  Randall> words ON and AT.
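
For what it's worth, the splitting described here might look something
like the following Python sketch.  The verb list is a small made-up
subset, the function names are mine, and the period is assumed to
arrive as its own token.  Treating every SIZE or INVALID as a
pseudo-verb is a simplification (SIZE also turns up in DELIMITED BY
SIZE, for instance), so take it as an illustration only.

```python
# Hypothetical, incomplete word lists; a real table would be much larger.
VERBS = {"IF", "ADD", "SUBTRACT", "MOVE", "PERFORM"}
PSEUDO_VERBS = {".", "ELSE", "WHEN", "SIZE", "OVERFLOW", "END-OF-PAGE",
                "INVALID", "EXCEPTION", "END"}
PREFIXES = {"NOT", "ON", "AT"}  # may precede a pseudo-verb


def is_starter(word):
    """True if word begins a new statement fragment."""
    return word in VERBS or word in PSEUDO_VERBS or word.startswith("END-")


def split_fragments(tokens):
    """Split a flat token list into fragments at verbs and pseudo-verbs."""
    fragments = []
    current = []
    pending = []  # NOT/ON/AT seen but not yet assigned to a fragment
    for tok in tokens:
        word = tok.upper()
        if word in PREFIXES:
            pending.append(tok)
            continue
        if is_starter(word):
            if current:
                fragments.append(current)
            # Held prefix words travel with the new fragment.
            current = pending + [tok]
        else:
            # Not a starter: held words belong to the current fragment,
            # as with the NOT in "A NOT = B".
            current.extend(pending)
            current.append(tok)
        pending = []
    current.extend(pending)
    if current:
        fragments.append(current)
    return fragments


source = """IF A = B OR C OR > D ADD 1 TO Z ON SIZE ERROR PERFORM P1
            NOT ON SIZE ERROR PERFORM P2 END-ADD ELSE
            PERFORM P3 VARYING X FROM 1 BY 1 UNTIL X > 10 END-IF ."""
for fragment in split_fragments(source.split()):
    print(" ".join(fragment))
```

Run on the example above, this gives back the same eleven fragments,
one per line of the listing.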

  >> I mean when you are into paragraph 17 of section 42, why do you
  >> still hold raw material from s1p1, especially the keyword
  >> minutiae? So as you get to certain stages in the iteration,
  >> hopefully it will be discernable what got hung on some surviving
  >> structure, and what is getting obsolete.

  Randall> At each period everything becomes obsolete.  At each verb
  Randall> or pseudo-verb, everything becomes obsolete except
  Randall> matching pseudo-verbs to antecedents.

  Randall> Someone suggested multiple levels of parsing.  Imagine two
  Randall> different parsers, the statement parser and the sentence
  Randall> parser.  The statement parser would be called once for
  Randall> each statement or statement fragment (each line of my
  Randall> example).  The sentence parser would be called once for
  Randall> each sentence (once in my example), but the tokens to the
  Randall> sentence parser are the statement fragments.

Do you think this would help?  Has anyone ever attempted something
like this?  Most grammars (even COBOL grammars) comprise hierarchies
of rules.  Is there something to be gained by doing the inferior ones
first?  Well, perhaps for errors, but I'm not convinced.

	< stuff about segmentation >

  Randall> At 09:09 AM 11/21/99 , Michael McKernan wrote:

  >> ADD ... SIZE ... SUBTRACT ... SIZE ... PERIOD
  >> 
  >> which appears to be as invalid as the first, but trashes the law
  >> of least astonishment, since the PERIOD has traditionally,
  >> legitimately, terminated anything that's going on.

  Randall> Period terminates anything, but the statement above is
  Randall> invalid in COBOL-74, COBOL-85, and COBOL-20XX.  SUBTRACT
  Randall> with a SIZE ERROR phrase is a conditional statement, and
  Randall> the object of the ADD's SIZE ERROR phrase is required to
  Randall> be imperative.

That's the astonishing part.  

  >> I am maintaining that an unmatched END-x should imply an
  >> appropriate END-x for any unterminated conditional part that
  >> exists when it is encountered.  This isn't the letter of the
  >> law, but it does not affect correct programs, and is arguably
  >> less astonishing than the strict interpretation.

  Randall> I agree.

Now there are two of us.
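
In case it helps the discussion, the permissive rule can be sketched
with a simple stack, here in Python.  The function name close_scopes
and the input format are assumptions of mine: the input is the list of
fragment heads, where every plain verb in the list is taken to have
opened a conditional part, and the output marks which terminators were
implied rather than written.

```python
def close_scopes(heads):
    """Pair each END-x with its verb, implying missing inner END-x's.

    Returns a list of (head, implied) tuples, where implied is True
    for terminators that were not present in the source.
    """
    open_scopes = []  # verbs whose conditional part is still open
    out = []
    for head in heads:
        if head == ".":
            # The period terminates anything still open.
            while open_scopes:
                out.append(("END-" + open_scopes.pop(), True))
            out.append((head, False))
        elif head.startswith("END-"):
            verb = head[4:]
            if verb not in open_scopes:
                raise SyntaxError(head + " has no matching " + verb)
            # Anything opened after the matching verb is closed first.
            while open_scopes[-1] != verb:
                out.append(("END-" + open_scopes.pop(), True))
            open_scopes.pop()
            out.append((head, False))
        else:
            open_scopes.append(head)
            out.append((head, False))
    return out


# IF ... ADD ... ON SIZE ERROR ... END-IF .   (no END-ADD written)
for head, implied in close_scopes(["IF", "ADD", "END-IF", "."]):
    print(head, "(implied)" if implied else "")
```

A correct program never takes the implied path, so it is treated
exactly as under the strict reading; only the unterminated ADD here
picks up an END-ADD just ahead of the END-IF.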

  >> I'll say it again.  We do not have a floor and ceiling standard.
  >> No compiler has ever been denied certification for permissive
  >> interpretation.

  Randall> I disagree.  The only certification there ever was was
  Randall> FIPS certification.  Some of the FIPS tests verified that
  Randall> errors were issued for invalid statements like the
  Randall> foregoing.

A little too much hyperbole in the heat of battle.  Yes, I remember
some tests like that, but these cases escaped them, or else the '85
compiler that I worked on would not have been certified.  It's quite
possible that my statement is true, though, since the audits were an
open book, and no one invited the auditor before having run them a few
hundred times.

Speaking of such things, do you know who owns the audit tests?  Is
there any possibility that this group might be able to obtain them?
It's going to be a lot harder when we get closer to the end game if
we don't have something like that.  

  Randall> Which brings up an important point: No organization is
  Randall> currently providing certification for COBOL-85 (or any
  Randall> other COBOL) and there is no organization planning
  Randall> certification for COBOL-20XX.  Perhaps a certifying
  Randall> organization will arise, but I won't predict what they
  Randall> will or will not test for.

  < lots more on unterminated conditional parts >

I don't disagree with any of your comments subsequent to this point,
so I would not be able to add even as much value as I have in the
foregoing. 



