Parsing document structure

In many scenarios, the basic steps in parsing source documents are similar. If your source does not contain properly nested structures that accurately represent the structure of the document (as a well-authored XML document would), you will have to re-create the structure that the author intended. Usually, text formatting, section numbering and other clues contain just enough information to do that.

In many cases, your source document will naturally be split up into a large number of “chunks”. These chunks may be lines or paragraphs in a plaintext document, or tags of a certain type in a certain location in an HTML document. Regardless, it is often easy to generate a list of such chunks. See, in particular, Reading files in various formats.

Note

For those with a background in computer science and formal languages, a chunk is roughly the same thing as a token, but whereas a token is typically a few characters in length, a chunk is typically one to several sentences long. Splitting up a document into chunks is also typically much simpler than the process of tokenization.
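
For a plain-text source, chunking can be as simple as splitting on blank lines. The following is a minimal sketch (not part of the framework; the function name is made up for illustration):

```python
# A minimal sketch of chunking a plain-text document: each chunk is a
# paragraph separated by blank lines. Real sources (HTML, PDF) need
# format-specific readers, but the resulting list plays the same role.
def chunk_plaintext(text):
    # split on double newlines and drop empty chunks
    return [c for c in text.split("\n\n") if c.strip()]

doc = "A header\n========\n\nFirst paragraph.\n\nSecond paragraph."
chunks = chunk_plaintext(doc)
assert chunks == ["A header\n========", "First paragraph.", "Second paragraph."]
```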

These chunks can be fed to a finite state machine, which looks at each chunk, determines what kind of structural element it probably is (e.g. a headline, the start of a chapter, an item in a bulleted list…) by looking at the chunk in the context of previous chunks, and then explicitly re-creates the document structure that the author (presumably) intended.

FSMParser

The framework contains a class for creating such state machines, FSMParser. It is used together with the following kinds of objects:

Recognizers: Functions that look at a chunk and determine whether it is a particular structural element.
Constructors: Functions that create a document element from a chunk (or series of chunks).
States: Identifiers for the current state of the document being parsed, e.g. “in-preamble”, “in-ordered-list”.
Transitions: A mapping (current state(s), recognizer) -> (constructor, new state).

You initialize the parser with the transition table (which contains the other objects), set its initial state and initial constructor, and then call its parse() method with an iterable of chunks. The result of parse() is a nested document object tree.

A simple example

Consider a very simple document format that only has three kinds of structural elements: a normal paragraph, preformatted text, and sections. Each section has a title and may contain paragraphs or preformatted text, which in turn may not contain anything else. All chunks are separated by double newlines.

The section is identified by a header, which is any single-line string followed by a line of = characters of the same length. Any time a new header is encountered, this signals the end of the current section:

This is a header
================

A preformatted section is any chunk where each line starts with at least two spaces:

# some example of preformatted text
def world(name):
    return "Hello", name

A paragraph is anything else:

This is a simple paragraph.
It can contain short lines and longer lines.

(You might recognize this format as a very simple form of reStructuredText.)

Recognizers for these three elements are easy to build:

from ferenda import elements, FSMParser

def is_section(parser):
    chunk = parser.reader.peek()
    lines = chunk.split("\n")
    return (len(lines) == 2 and
            len(lines[0]) == len(lines[1]) and
            lines[1] == "=" * len(lines[0]))

def is_preformatted(parser):
    chunk = parser.reader.peek()
    lines = chunk.split("\n")
    # every line must be indented by at least two spaces
    return all(line.startswith("  ") for line in lines)

def is_paragraph(parser):
    return True

The elements module contains ready-built classes which we can use to build our constructors:

def make_body(parser):
    b = elements.Body()
    return parser.make_children(b)

def make_section(parser):
    chunk = parser.reader.next()
    title = chunk.split("\n")[0]
    s = elements.Section(title=title)
    return parser.make_children(s)
setattr(make_section, 'newstate', 'section')

def make_paragraph(parser):
    return elements.Paragraph([parser.reader.next()])

def make_preformatted(parser):
    return elements.Preformatted([parser.reader.next()])

Note that any constructor which may contain sub-elements must itself call the make_children() method of the parser. That method takes a parent object, and then repeatedly creates child objects which it attaches to that parent object, until an exit condition is met. Each call to create a child object may, in turn, call make_children() (though not in this very simple example).
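
Conceptually, the loop inside make_children() can be pictured roughly as follows. This is a simplified, hypothetical sketch based on the behaviour described in this document, not the actual ferenda implementation; the FakeParser stand-in drives it with pre-classified chunks:

```python
# Hypothetical sketch of the make_children() loop -- NOT ferenda's code.
# The fake parser holds chunks pre-tagged with which recognizer matched.
def make_children_sketch(parser, parent):
    while parser.chunks:
        kind = parser.chunks[0][0]   # which recognizer matched this chunk
        constructor, newstate = parser.transitions[(parser.state, kind)]
        if constructor is False:
            break                    # exit condition: close this element
        if newstate is not None:
            parser.state = newstate
        # a constructor may itself call make_children_sketch, nesting deeper
        parent.append(constructor(parser))
    return parent

class FakeParser:
    def __init__(self, chunks, transitions, state):
        self.chunks = chunks
        self.transitions = transitions
        self.state = state

def make_para(p):
    return ("para", p.chunks.pop(0)[1])

transitions = {("body", "para"): (make_para, None)}
p = FakeParser([("para", "hello"), ("para", "world")], transitions, "body")
body = make_children_sketch(p, [])
assert body == [("para", "hello"), ("para", "world")]
```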

The final step in putting this together is defining the transition table, and then creating, configuring and running the parser:

transitions = {("body", is_section): (make_section, "section"),
               ("section", is_paragraph): (make_paragraph, None),
               ("section", is_preformatted): (make_preformatted, None),
               ("section", is_section): (False, None)}

text = """First section
=============

This is a regular paragraph. It will not be matched by is_section
(unlike the above chunk) or is_preformatted (unlike the below chunk),
but by the catch-all is_paragraph. The recognizers are run in the
order specified by FSMParser.set_transitions().

    This is a preformatted section.
        It could be used for source code,
    +-------------------+
    |   line drawings   |
    +-------------------+
        or what have                 you.

Second section
==============

The above new section implicitly closed the first section which we
were in. This was made explicit by the last transition rule, which
stated that any time a section is encountered while in the "section"
state, we should not create any more children (False) but instead
return to our previous state (which in this case is "body", but for a
more complex language could be any number of states)."""

p = FSMParser()
p.set_recognizers(is_section, is_preformatted, is_paragraph)
p.set_transitions(transitions)
p.initial_constructor = make_body
p.initial_state = "body"
body = p.parse(text.split("\n\n"))
# print(elements.serialize(body))

The result of this parse is the following document object tree (passed through serialize()):

<Body>
  <Section title="First section">
    <Paragraph>
      <str>This is a regular paragraph. It will not be matched by is_section
(unlike the above chunk) or is_preformatted (unlike the below chunk),
but by the catch-all is_paragraph. The recognizers are run in the
order specified by FSMParser.set_transitions().</str>
    </Paragraph><Preformatted>
      <str>    This is a preformatted section.
        It could be used for source code,
    +-------------------+
    |   line drawings   |
    +-------------------+
        or what have                 you.</str>
    </Preformatted>
  </Section>
  <Section title="Second section">
    <Paragraph>
      <str>The above new section implicitly closed the first section which we
were in. This was made explicit by the last transition rule, which
stated that any time a section is encountered while in the "section"
state, we should not create any more children (False) but instead
return to our previous state (which in this case is "body", but for a
more complex language could be any number of states).</str>
    </Paragraph>
  </Section>
</Body>

Writing complex parsers

Recognizers

Recognizers are any callables that can be called with the parser object as their only parameter (so no class or instance methods). Objects that implement __call__ work, as do lambda functions.
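
For illustration, here are both forms, driven by a hypothetical FakeParser stand-in (the real parser object provides the same reader.peek() interface):

```python
# Illustration only: any callable taking the parser as its sole argument
# can serve as a recognizer. FakeReader/FakeParser are made-up stand-ins.
class FakeReader:
    def __init__(self, chunk):
        self.chunk = chunk
    def peek(self):
        return self.chunk

class FakeParser:
    def __init__(self, chunk):
        self.reader = FakeReader(chunk)

# a lambda recognizer
is_blank = lambda parser: parser.reader.peek().strip() == ""

# a configurable recognizer object implementing __call__
class StartsWith:
    def __init__(self, prefix):
        self.prefix = prefix
    def __call__(self, parser):
        return parser.reader.peek().startswith(self.prefix)

is_bullet = StartsWith("* ")
assert is_bullet(FakeParser("* an item"))
assert not is_bullet(FakeParser("plain text"))
```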

One useful pattern when creating parsers is to have a method on your docrepo class that defines a number of nested functions, creates a transition table using those functions, creates the parser with that transition table, and returns the initialized parser object. Your main parse method can then call this method, break the input document into suitable chunks, and call parse() on the returned parser object.

Constructors

Like recognizers, constructors may be any callable, and they are called with the parser object as the only parameter.

Constructors that return elements which in themselves do not contain sub-elements are simple to write – just return the created element (see e.g. make_paragraph or make_preformatted above).

Constructors that are to return elements that may contain sub-elements must first create the element, then call parser.make_children() with that element as a single argument. make_children() will treat that element as a list, and append any sub-elements created to that list, before returning it.

The parser object

The parser object is passed to every recognizer and constructor. The most common use is to read the next available chunk from its reader property – this is an instance of a simple wrapper around the stream of chunks. The reader has two methods, peek() and next(), which both return the next available chunk, but next() also consumes the chunk in question. A recognizer typically calls parser.reader.peek(), a constructor typically calls parser.reader.next().
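
The peek()/next() contract can be sketched with a minimal stand-in (not ferenda's actual reader class, just the same two-method interface):

```python
# A minimal stand-in illustrating peek() vs next() -- hypothetical, not
# ferenda's actual reader, but it honours the same two-method contract.
class ChunkReader:
    def __init__(self, chunks):
        self._chunks = list(chunks)
    def peek(self):
        return self._chunks[0]      # look at the next chunk, keep it
    def next(self):
        return self._chunks.pop(0)  # return the next chunk and consume it

r = ChunkReader(["first", "second"])
assert r.peek() == "first"   # peek() does not consume
assert r.peek() == "first"
assert r.next() == "first"   # next() consumes
assert r.peek() == "second"
```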

The parser object also has the following properties:

currentstate: The current state of the parser, using whatever value for state was defined in the transition table (typically a string).
debug: A boolean indicating whether to emit debug messages (False by default).

There is also a parser._debug() method that emits debug messages (indicating the current parser nesting level and current state) if parser.debug is True.

The transition table

The transition table is a mapping between (current state(s), successful recognizer) and (constructor-or-False, newstate-or-None).

The transition table is used in the following way: All recognizers that can be applicable in the current state are tried in the specified order until one of them returns True. Using this pair of (currentstate, recognizer), the corresponding value tuple is looked up in the transition table.

constructor-or-False: If a constructor (a callable), it is called to create the document element for the current chunk(s). If False, no further children are created; make_children() returns and the parser falls back to the previous state, as in the ("section", is_section) rule in the example above.

newstate-or-None: If a state identifier, the parser switches to that state before calling the constructor. If None, the current state is kept.
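
The lookup itself can be sketched as follows (a simplified illustration; for brevity the recognizers here inspect a chunk directly rather than a parser object):

```python
# Simplified sketch of the transition lookup: recognizers applicable in
# the current state are tried in order; the first returning True selects
# the (constructor, newstate) rule. Names below are made up.
def lookup(transitions, state, recognizers, chunk):
    for recognizer in recognizers:
        if (state, recognizer) in transitions and recognizer(chunk):
            return transitions[(state, recognizer)]
    raise ValueError("no recognizer matched in state %r" % state)

is_header = lambda chunk: chunk.endswith("====")
is_paragraph = lambda chunk: True
transitions = {("body", is_header): ("make_section", "section"),
               ("body", is_paragraph): ("make_paragraph", None)}
recognizers = [is_header, is_paragraph]   # order matters
assert lookup(transitions, "body", recognizers, "Title\n====") == ("make_section", "section")
assert lookup(transitions, "body", recognizers, "Some text") == ("make_paragraph", None)
```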

The key in the transition table can also be a callable, which is called with (currentstate, symbol, parser?) and is expected to return a (constructor-or-False, newstate-or-None) tuple.

Tips for debugging your parser

Two useful commands in the Devel module:

$ # sets debug, prints serialize(parser.parse(...))
$ ./ferenda-build.py devel fsmparse parser < chunks
$ # sets debug, returns name of matching function
$ ./ferenda-build.py devel fsmanalyze parser <currentstate> < chunk