[egenix-users] Is this possible ?

Pekka Niiranen krissepu at vip.fi
Sat Aug 3 23:50:31 CEST 2002


OK,

I did as you suggested:

-- code starts --
from mx.TextTools import *
text = "Xaa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aaY"

tables = []

tab = ('start',
       (None, AllNotIn,'()', +1),
       (None, Is+LookAhead, '(', MatchOk, 'nesting'),
       'nesting',
       ('group',SubTable+AppendMatch,((None, Is, '(', +1),
                                      (None, SubTableInList, (tables,0)),
                                      (None, Is, ')', MatchFail, MatchOk))),
       (None,Jump,To,'start'))

tables.append(tab) # Add tab to tables

if __name__ == '__main__':

    result, taglist, nextindex = tag(text,tab)
    print taglist

-- code ends --

One quirk remains (see the code above):

The engine stops searching as soon as it meets an extra ")" in the middle of the text.
How can I make it return nothing (i.e. an empty match)
if there is an extra ")" AND it is not currently recursing?

Should there be a parameter such as "Fail if not currently recursing"?

Try adding a ")" after the letter X, and then after the letter Y, in the text above. In both cases
the result should be an empty match.
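
For comparison, the wanted behaviour can be written down in plain Python,
without mxTextTools. This is only a sketch of the expected result, not a tag
table: nested_groups is a made-up helper name, and it returns an empty list as
soon as a ")" shows up while no group is open, or when a "(" is never closed.

```python
def nested_groups(text, opener="(", closer=")"):
    """Return (start, end) slices of every balanced group, or [] if unbalanced.

    Hypothetical helper for illustration; not part of mxTextTools.
    """
    stack = []    # start indices of the groups that are currently open
    groups = []   # completed groups as (start, end) index pairs
    for pos, char in enumerate(text):
        if char == opener:
            stack.append(pos)
        elif char == closer:
            if not stack:
                # Extra closer while not recursing: reject the whole text.
                return []
            groups.append((stack.pop(), pos + 1))
    if stack:
        # An opener was never closed: reject the whole text as well.
        return []
    return sorted(groups)


text = "Xaa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aaY"
found = [text[start:end] for start, end in nested_groups(text)]
```

With the text above, found holds ten groups, starting with (AA), ((BB)) and
(BB). With a ")" added after X or after Y, nested_groups returns [].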

This is a matter of taste, I agree, but then one could always say:
"Nothing was found because the numbers of ( and ) signs
did not add up". That gives one Python error message that
I could print when MatchFail happens, so there is less analysing to do and more speed.
In the code above, only extra "(" signs make the engine fail.
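
The counting argument can be tried on its own, again in plain Python. The
helper name counts_add_up is an assumption made for this sketch. Comparing the
two counts is cheap, but it is only a first filter: the counts can agree while
the order of the signs is still wrong.

```python
def counts_add_up(text, opener="(", closer=")"):
    """Cheap pre-test: True when both signs occur equally often."""
    return text.count(opener) == text.count(closer)


# Rejected cheaply: one "(" too many.
unclosed = counts_add_up("aa(AA")
# Accepted by the count although the ")" comes before the "(".
wrong_order = counts_add_up("a)b(c")
```

So a text that passes this test still needs a nesting-aware check, or a table
that matches both the opening and the closing sign, before the result can be
trusted.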

-pekka-

"M.-A. Lemburg" wrote:

> Pekka Niiranen wrote:
> > Fine,
> >
> > but the line:
> >
> > (None,EOF,Here,MatchOk)
> >
> > will make text = "aa(AA" match too. If I analysed it correctly,
> > it is because EOF always matches. Would it be possible
> > to add an mxTextTools parameter that makes EOF fail when necessary?
> >
> > Something like: "if EOF is encountered here, fail the whole subgroup"?
>
> EOF matches only if the head position is beyond the right edge
> of the text slice being processed. If you need balanced parens,
> you should rewrite the tag tables so that the nesting table matches
> both the opening and the closing paren.
>
> > -pekka-
> >
> >
> > "M.-A. Lemburg" wrote:
> >
> >
> >>Pekka Niiranen wrote:
> >>
> >>>Hi,
> >>>
> >>>        I tried the latest beta 3 by:
> >>>
> >>>        a) compiling it myself from sources and
> >>>        b) installing from the precompiled package for python v2.2
> >>>
> >>>        Of the scripts below only the script that uses Simpleparse returns
> >>>anything.
> >>>        The others run without errors, but return [].
> >>>
> >>>        They all run OK with the beta 2 though.
> >>
> >>If they did, then you've hit a bug in beta2. Here are the corrected
> >>versions. Note that the problem was with the EOF handling. If AllNotIn
> >>doesn't match at least one char it'll fail and using 0 as jne offset
> >>causes the same effect as MatchFail.
> >>
> >>#--- solution 1 starts (with limiting letters)---
> >>
> >>from mx.TextTools import *
> >>
> >>def test1():
> >>
> >>     text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> >>
> >>     tables = [] # used for recursion only
> >>
> >>     tab = ('start',
> >>            (None,Is+LookAhead,'(',+1,'nesting'), # If next character is "(" then recurse
> >>            (None,Is,')',+1,MatchOk), # If current character is ")" then stop or return from recursion
> >>            (None,AllNotIn,'()',+1,'start'), # Search all characters except "(" and ")"
> >>            (None,EOF,Here,MatchOk),
> >>            'nesting',
> >>            ('group',SubTable+AppendMatch,
> >>             ((None,Is,'(',MatchFail,+1), # Since we have looked ahead, collect "(" -sign
> >>              (None,SubTableInList, (tables,0)),  # Recurse
> >>              )
> >>             ),
> >>            (None,Jump,To,'start')) # After recursion jump back to 'start'
> >>
> >>     tables.append(tab) # Add tab to tables
> >>
> >>     result, taglist, nextindex = tag(text,tab)
> >>     print result, nextindex
> >>     print taglist
> >>
> >>#--- solution 2 starts  (without limiting letters) ---
> >>
> >>def test2():
> >>
> >>     text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> >>
> >>     tab = ('start',
> >>            (None, Is+LookAhead, ')', +1, MatchOk), # When character ")" is seen stop recursion
> >>            (None, Is, '(', 'letters', +1),
> >>            ('group', SubTable+AppendMatch, ThisTable), # Recurse
> >>            (None, Skip, 1, MatchFail, 'start'), # Last character in recursion was ")" so jump over it back to 'start'
> >>            'letters',
> >>            (None, AllNotIn, '()', +1, 'start'),  # Collect all characters except "(" and ")"
> >>            (None, EOF, Here, MatchOk),
> >>            )
> >>
> >>     result,taglist,nextindex = tag(text, tab)
> >>     print result, nextindex
> >>     print taglist
> >>
> >>print 'Test 1:'
> >>test1()
> >>print
> >>
> >>print 'Test 2:'
> >>test2()
> >>print
> >>
> >>
> >>>        I am using Windows 2000 professional, Python 2.2.1 and Winpython
> >>>v148.
> >>>
> >>>-pekka-
> >>>
> >>>
> >>>Pekka Niiranen wrote:
> >>>
> >>>
> >>>
> >>>>Thank you all for your help and inspiration! It is payback time ;)
> >>>>
> >>>>For the past two months I have tried to create a parser that returns
> >>>>strings delimited by two different letters. The strings can be nested.
> >>>>I considered a recursive call of a regular expression to be too slow
> >>>>and decided to use mxTextTools 2.1 beta2 and the latest alpha of
> >>>>Simpleparse 2.0.
> >>>>
> >>>>Below are the three solutions I found.
> >>>>Note that Simpleparse creates a different tag table from the ones I wrote
> >>>>"manually".
> >>>>
> >>>>Further ideas to be implemented:
> >>>>
> >>>>1) Input of limiting letters as parameters (easy)
> >>>>2) Unicode support
> >>>>3) Test for equal amount of limiting letters before calling of parser
> >>>>(will this speed up the solution ?)
> >>>>4) Parsing one line at a time without looping thru lines of the text
> >>>>with "while" or "for"
> >>>>   (maybe "None, AllNotIn, '()\n'" )
> >>>>
> >>>>One development idea to mxTextTools:
> >>>>
> >>>>1) Instead of using list of tables to recurse, would it be possible to
> >>>>use "global jump" to outside of current table ?
> >>>>
> >>>>--- solution 1 starts (with limiting letters)---
> >>>>
> >>>
> >>>>from mx.TextTools import *
> >>>
> >>>>text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> >>>>tables = [] # used for recursion only
> >>>>
> >>>>tab = ('start',
> >>>>      (None,Is+LookAhead,'(',+1,'nesting'), # If next character is "(" then recurse
> >>>>      (None,Is,')',+1,MatchOk), # If current character is ")" then stop or return from recursion
> >>>>      (None,AllNotIn,'()',0,'start'), # Search all characters except "(" and ")"
> >>>>      'nesting',
> >>>>      ('group',SubTable+AppendMatch,((None,Is,'(',0,+1), # Since we have looked ahead, collect "(" -sign
> >>>>                                     (None,SubTableInList,(tables,0)))), # Recurse
> >>>>      (None,Jump,To,'start')) # After recursion jump back to 'start'
> >>>>
> >>>>tables.append(tab) # Add tab to tables
> >>>>
> >>>>if __name__ == '__main__':
> >>>>
> >>>>   result, taglist, nextindex = tag(text,tab)
> >>>>   print taglist
> >>>>
> >>>>--- solution 1 ends ---
> >>>>
> >>>>--- solution 2 starts  (without limiting letters) ---
> >>>>
> >>>
> >>>>from mx.TextTools import *
> >>>
> >>>>text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> >>>>
> >>>>tab = ('start',
> >>>>      (None, Is+LookAhead, ')', +1, MatchOk), # When character ")" is seen stop recursion
> >>>>      (None, Is, '(', 'letters', +1),
> >>>>      ('group', SubTable+AppendMatch, ThisTable), # Recurse
> >>>>      (None, Skip, 1, 0, 'start'), # Last character in recursion was ")" so jump over it back to 'start'
> >>>>      'letters',
> >>>>      (None, AllNotIn, '()', 0, 'start')) # Collect all characters except "(" and ")"
> >>>>
> >>>>result,taglist,next = tag(text, tab)
> >>>>print taglist
> >>>>
> >>>>--- solution 2 ends ---
> >>>>
> >>>>--- solution 3 starts (Simpleparse solution) ---
> >>>>
> >>>
> >>>>from simpleparse.parser import Parser
> >>>>from mx.TextTools import *
> >>>
> >>>>declaration = r'''
> >>>>
> >>>>
> >>>>>line<  := (a/match)+
> >>>>
> >>>>match   := '(', line, ')'
> >>>><a>     := -[()]
> >>>>'''
> >>>>text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> >>>>
> >>>>parser = Parser(declaration)
> >>>>success, children, nextcharacter = parser.parse(text, production="line")
> >>>>print_tags(text,children)
> >>>>
> >>>>--- solution 3 ends ---
> >>>>
> >>>>-pekka-
> >>>
> >>>
> >>>
> >>>_______________________________________________________________________
> >>>eGenix.com User Mailing List                     http://www.egenix.com/
> >>>http://lists.egenix.com/mailman/listinfo/egenix-users
> >>
> >>--
> >>Marc-Andre Lemburg
> >>CEO eGenix.com Software GmbH
> >>_______________________________________________________________________
> >>eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
> >>Python Consulting:                               http://www.egenix.com/
> >>Python Software:                    http://www.egenix.com/files/python/
> >>
> >
> >
> >
>



