Python - Our key to Efficiency

TextTools - Fast Text Manipulation Tools for Python


Engine : Objects : Functions : Constants : Examples : Structure : Support : Download : Copyright & License : History : Home Version 2.0.3

Introduction

    A while ago, in spring 1997, I started out to write some tools that were supposed to make string handling and parsing text faster than what the standard library has to offer. I had a need for this since I was (and still am) working on a WebService Framework that greatly simplifies building and maintaining interactive web sites. After some initial prototypes of what I call Tagging Engine written totally in Python I started rewriting the main parts in C and soon realized that I needed a little more sophisticated searching tools.

    I could walk through text pretty fast, but in many situations I just needed to replace some text with some other text.

    The next step was to create a new types for fast searching in text. I decided to code up an enhanced version of the well known Boyer-Moore search algorithm. This made me think a bit more about searching and how knowledge about the text and the search pattern could be better used to make it work even faster. The result was an algorithm that uses a suffix skip array, which I call Fast Search Algorithm.

    The two search types are built upon a small C lib I wrote for this. The implementations are optimized for gcc/Linux and from the tests I ran I can say that they out-perform every other technique I have tried. Even the very fast Boyer-Moore implementation of fgrep (1).

    Then I reintegrated those search utilities into the Tagging Engine and also added a fast variant for doing 'char out of a string'-kind of tests. These are done using 'sets', i.e. strings that contain one bit per character position (and thus 32 bytes long).

    All this got wrapped up in a nice Python package:

    1. a fast search mechanism,
    2. a state machine for doing fast tagging,
    3. a set of functions aiding in post-processing the output of the two and
    4. a set of functions handling sets of characters.

    One word about the word 'tagging'. This originated from what is done in HTML to mark some text with a certain extra information. I extended this notion to assigning Python objects to text substrings. Every substring marked in this way carries a 'tag' (the object) which can be used to do all kinds of nifty things.

Tagging Engine

Search Objects

    These objects are immutable and usable for one search string per object only. They can be applied to as many text strings as you like -- much like compiled regular expressions. Matching is done exact (doing the translations on-the-fly).

    The search objects can be pickled and implement the copy protocol as defined by the copy module. Comparisons and hashing are not implemented (the objects are stored by id in dictionaries -- may change in future releases though).

    Search Object Constructors

      There are two types of search objects. The Boyer-Moore type uses less memory, while the Fast Search type gives you enhanced speed with a little more memory overhead.

      Note: The Fast Search object is *not* included in the public release, since I wan't to write a paper about it and therefore can't make it available yet.

      BMS(match[,translate])
      Create a Boyer Moore substring search object for the string match; translate is an optional translate-string like the one used in the module 're', i.e. a 256 character string mapping the oridnals of the base character set to new characters.

      FS(match[,translate])
      Create a Fast substring Search object for the string match; translate is an optional translate-string like the one used in the module 're'.

    Search Object Instance Variables

      To provide some help for reflection and pickling the search types give (read-only) access to these attribute.

      match
      The string that the search object will look for in the search text.

      translate
      The translate string used by the object or None (if no translate string was passed to the constructor).

    Search Object Instance Methods

      The two search types have the same methods:

      search(text,[start=0,len_text=len(text)])
      Search for the substring match in text, looking only at the slice [start:len_text] and return the slice (l,r) where the substring was found, or (start,start) if it was not found.

      find(text,[start=0,len_text=len(text)])
      Search for the substring match in text, looking only at the slice [start:len_text] and return the index where the substring was found, or -1 if it was not found. This interface is compatible with string.find.

      findall(text,start=0,len_text=len(text))
      Same as search(), but return a list of all non-overlapping slices (l,r) where the match string can be found in text.

      Note that translating the text before doing the search often results in a better performance. Use string.translate() to do that efficiently.

Functions

Constants

    The package exports these constants. They are defined in Constants/Sets.

      a2z
      'abcdefghijklmnopqrstuvwxyz'

      A2Z
      'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

      a2z
      'abcdefghijklmnopqrstuvwxyz'

      umlaute
      'äöüß'

      Umlaute
      'ÄÖÜ'

      alpha
      A2Z + a2z

      a2z
      'abcdefghijklmnopqrstuvwxyz'

      german_alpha
      A2Z + a2z + umlaute + Umlaute

      number
      '0123456789'

      alphanumeric
      alpha + number

      white
      ' \t\v'

      newline
      '\n\r'

      formfeed
      '\f'

      whitespace
      white + newline + formfeed

      any
      All characters from \000-\377

      *_set
      All of the above as character sets.

Examples of Use

    The Examples/ subdirectory of the package contains a few examples of how tables can be written and used. Here is a non-trivial example for parsing HTML (well, most of it):

    
        from mx.TextTools import *
    
        error = '***syntax error'			# error tag obj
    
        tagname_set = set(alpha+'-'+number)
        tagattrname_set = set(alpha+'-'+number)
        tagvalue_set = set('"\'> ',0)
        white_set = set(' \r\n\t')
    
        tagattr = (
    	   # name
    	   ('name',AllInSet,tagattrname_set),
    	   # with value ?
    	   (None,Is,'=',MatchOk),
    	   # skip junk
    	   (None,AllInSet,white_set,+1),
    	   # unquoted value
    	   ('value',AllInSet,tagvalue_set,+1,MatchOk),
    	   # double quoted value
    	   (None,Is,'"',+5),
    	     ('value',AllNotIn,'"',+1,+2),
    	     ('value',Skip,0),
    	     (None,Is,'"'),
    	     (None,Jump,To,MatchOk),
    	   # single quoted value
    	   (None,Is,'\''),
    	     ('value',AllNotIn,'\'',+1,+2),
    	     ('value',Skip,0),
    	     (None,Is,'\'')
    	   )
    
        valuetable = (
    	# ignore whitespace + '='
    	(None,AllInSet,set(' \r\n\t='),+1),
    	# unquoted value
    	('value',AllInSet,tagvalue_set,+1,MatchOk),
    	# double quoted value
    	(None,Is,'"',+5),
    	 ('value',AllNotIn,'"',+1,+2),
    	 ('value',Skip,0),
    	 (None,Is,'"'),
    	 (None,Jump,To,MatchOk),
    	# single quoted value
    	(None,Is,'\''),
    	 ('value',AllNotIn,'\'',+1,+2),
    	 ('value',Skip,0),
    	 (None,Is,'\'')
    	)
    
        allattrs = (# look for attributes
    	       (None,AllInSet,white_set,+4),
    	        (None,Is,'>',+1,MatchOk),
    	        ('tagattr',Table,tagattr),
    	        (None,Jump,To,-3),
    	       (None,Is,'>',+1,MatchOk),
    	       # handle incorrect attributes
    	       (error,AllNotIn,'> \r\n\t'),
    	       (None,Jump,To,-6)
    	       )
    
        htmltag = ((None,Is,'<'),
    	       # is this a closing tag ?
    	       ('closetag',Is,'/',+1),
    	       # a coment ?
    	       ('comment',Is,'!',+8),
    		(None,Word,'--',+4),
    		('text',sWordStart,BMS('-->'),+1),
    		(None,Skip,3),
    		(None,Jump,To,MatchOk),
    		# a SGML-Tag ?
    		('other',AllNotIn,'>',+1),
    		(None,Is,'>'),
    		    (None,Jump,To,MatchOk),
    		   # XMP-Tag ?
    		   ('tagname',Word,'XMP',+5),
    		    (None,Is,'>'),
    		    ('text',WordStart,'</XMP>'),
    		    (None,Skip,len('</XMP>')),
    		    (None,Jump,To,MatchOk),
    		   # get the tag name
    		   ('tagname',AllInSet,tagname_set),
    		   # look for attributes
    		   (None,AllInSet,white_set,+4),
    		    (None,Is,'>',+1,MatchOk),
    		    ('tagattr',Table,tagattr),
    		    (None,Jump,To,-3),
    		   (None,Is,'>',+1,MatchOk),
    		   # handle incorrect attributes
    		   (error,AllNotIn,'> \n\r\t'),
    		   (None,Jump,To,-6)
    		  )
    
        htmltable = (# HTML-Tag
    		 ('htmltag',Table,htmltag,+1,+4),
    		 # not HTML, but still using this syntax: error or inside XMP-tag !
    		 (error,Is,'<',+3),
    		  (error,AllNotIn,'>',+1),
    		  (error,Is,'>'),
    		 # normal text
    		 ('text',AllNotIn,'<',+1),
    		 # end of file
    		 ('eof',EOF,Here,-5),
    		)
          
    	

    I hope this doesn't scare you away :-) ... it's fast as hell.

Package Structure

    [TextTools]
           [Constants]
                  Sets.py
                  TagTables.py
           Doc/
           [Examples]
                  HTML.py
                  Loop.py
                  Python.py
                  RTF.py
                  RegExp.py
                  Tim.py
                  Words.py
                  altRTF.py
                  pytag.py
           [mxTextTools]
                  test.py
           TextTools.py
        

    Entries enclosed in brackets are packages (i.e. they are directories that include a __init__.py file). Ones with slashes are just ordinary subdirectories that are not accessible via import.

    The package TextTools imports everything needed from the other components. It is sometimes also handy to do a from mx.TextTools.Constants.TagTables import *.

    Examples/ contains a few demos of what the Tag Tables can do.

Optional Add-Ons for mxTextTools

Mike C. Fletcher is working on a Tag Table generator called SimpleParse. It works as parser generating front end to the Tagging Engine and converts a EBNF style grammar into a Tag Table directly useable with the tag() function.

Tony J. Ibbs has started to work on a meta-language for mxTextTools. It aims at simplifying the task of writing Tag Table tuples using a Python style syntax. It also gets rid off the annoying jump offset calculations.

Andrew Dalke has started work on a parser generator called Martel built upon mxTextTools which takes a regular expression grammer for a format and turns the resultant parsed tree into a set of callback events emulating the XML/SAX API. The results look very promising !

Support

Copyright & License

History & Future


© 1997-2000, Copyright by Marc-André Lemburg; All Rights Reserved. mailto: mal@lemburg.com

© 2000-2001, Copyright by eGenix.com Software GmbH; All Rights Reserved. mailto: info@egenix.com