[egenix-users] RfC: mx.TextTools Non-recursive Implementation

Mon Jun 17 02:02:05 CEST 2002

I've been working on a non-recursive implementation of the mx.TextTools 
core loop this weekend, and am to the point where I could use some help 
in moving it forward.  (I'm posting to the users list because I figure 
people other than just Marc-André might be interested in hacking on the 
package).

As of now, I seem to have a functional, but still very rough, version of 
the engine.  It passes all of the unit tests I have (including 67 for 
the low-level mx.TextTools functionality (excluding Loop commands) and 
another 170 for SimpleParse functionality).

Why:

     The current loop uses a recursive call of the matching function for 
every table/subtable.  As a result, we tend to blow up the stack if we 
have large numbers of recursive loops.  The non-recursive rewrite 
attempts to eliminate that problem.  This allows a number of common EBNF 
grammar structures to be used which are otherwise almost impossible to 
support cleanly.

     The new loop structure consolidates all the return value and error 
reporting code, which substantially reduces the total amount of code in 
the module.

     The new loop structure should make it very easy to add "user error" 
classes, and to report the hierarchy of parents during user-error reporting.

     It should make it possible (with some thought-work) to add robust 
backtracking support as well (i.e. recording backtracks as frame-stacks 
with some extra information for re-starting)).  Similarly complex (and 
basically the same mechanism) would be adding re-start functionality on 
premature EOF cases.

     If I'm understanding correctly, it should eventually be possible to 
make the code process memory-mapped files (this is actually not specific 
to the new code, but it would be nice to have with either version).

     The new loop is fairly heavily commented by me, as I was figuring 
out what needed to get done and why.  It is also more rigid in it's 
structure, only allowing individual commands to access a defined set of 
variables to signal the result/error-handling clauses.  (This results in 
a slight slowdown, of course, compared to just setting the global 
variables).

Tradeoffs:

     The (unoptimised) first version of the code is ~ 75% of the speed 
of the 2.1.0b1 version from which it was derived (tested using the 
HTML.py script from mx.TextTools).  I'd guess that can be improved to 80 
or 90%, but I don't think there's any way this approach can be as fast 
as the recursive call + direct goto-jumping approach.  Why:

         It's doing all the work of copying variables into new (heap) 
memory manually (versus stack-pushing/call, which I'm assuming is very 
fast in C).
         Side-effect-encapsulation overhead (See last point above).
         It's using a loop, so winds up traversing the loop code on each 
iteration, whereas the goto just does a jump.

     The code is new (therefore largely untested), and is written by a 
first-time C coder (me).  I've tried to follow generally clean 
programming practices, but it's still going to need review by 
experienced C coders (both to check for errors and to optimise it).

     There are at least a few bugs in the module (I get a "no current 
thread" error, and there's probably one or two ref-count bugs in there). 
  There is a warning generated that "not all paths return a value" (I 
can't figure out which paths those are just from looking).

Needed work:
     1) Some sort of documentation and unit tests for the Loop api.
     2) Review of the code by experienced C coder(s), particularly for 
lost-references, improper memory/pointer usage, and that kind of stuff 
common to people who've never dealt with C code before.
     3) Optimisation.
     4) Expanded testing regiment.
     5) Add support for memory-mapped files.

How it Works:
     Following is rough pseudo-code for how the new engine works. It's a 
simple structure compared to the base code, which defined goto jumps for 
error handling, finish and next item, and had the result handling for 
each command in the definition of the command itself (with a call to 
another function to do the actual appending/calling).

while 1:
     while (index_in_table() and returnCode == NULL_CODE):
         decode the current table[index]
         if childReturnCode is NULL_CODE:
             #the current tag is not already processed
             reset tag variables
             switch( tag command ):
                 do what tag wants to do()
                 set tag-related variables
                 set childReturnCode (tag variable)
                 if table:
                     push_frame_stack()
                     set childReturnCode == PENDING
         switch(childReturnCode):
             # figure out what to do with child's results
             # possibly set table-wide returnValue
             childSuccess
                 append values
                 update table-wide values
                 set new index
             childFailure
                 rewind position
                 set new index
                 (possibly set table-wide return code)
             childError
                 signal error for whole table
             childPending
                 ignore/continue processing without updating list values 
(start a new table)
         reset childReturnCode

     #done table, figure out what to do now...
     if no explicit return value:
         figure out implicit
     if failure:
         truncate result list to previous length
         reset position
     if error:
         report error as exception
         exit
     else:
         if frame_stack():
             record table vars as child vars()
             pop_frame_stack()
             set childReturnCode
         else:
             return result

Where it is:
     http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/simpleparse/

What's there:
     The code is currently just a set of 5 files which replace the file 
mxte_impl.h in the 2.1.0 mx.TextTools source tree.  Dropping them in 
should allow you to compile the new version without changing anything 
else (use build --force, of course).

     mxte_impl.h --> the loop described above and the (trivial) 
implementation of the stack.  Quite a few comments.  I'm guessing the 
stack implementation could be much faster, I just don't know enough 
about C to guess how.

     *commands.h --> each of the command-families from the original code 
broken out into a file to make it easier to work with the core loop. 
Each command group has a description of the "contract" it works under in 
the system.  The commands tend to be much lighter (in terms of amount of 
code included) because of the new result-handling framework.

     In the SimpleParse package you'll find a tests directory.  There 
are only a few of those tests still up-to-date (sorry, this directory 
was a quick merge of 2 different versions of the project):
         mx_* --> low-level direct tests of the primary features of the 
  engine (these should eventually move to the Egenix package, likely 
(they aren't specific to the rewrite, so they should be able to test the 
original code as well)).  Run mx_test.py to run all of these.
         test_objectgenerator.py --> tests the individual components 
generated by SimpleParse for use in parsing
         test_simpleparsegrammar.py --> tests the SimpleParse 
parser-generator (i.e. goes from text definitions to parsers and tests 
those parsers).

If there are people interested in working on the code, I'd love a few 
eyes to pick it over and/or suggest better approaches.  People who have 
feedback regarding whether they might find the updated engine useful or 
not might want to share their thoughts as well.

With thanks for your time in listening, and for Marc-André's great work 
in producing TextTools,
Mike

_______________________________________
   Mike C. Fletcher
   http://members.rogers.com/mcfletch/