The FSMParser class

class ferenda.FSMParser[source]

A configurable finite state machine (FSM) for parsing documents with nested structure. You provide a set of recognizers, a set of constructors, a transition table and a stream of document text chunks, and it returns a hierarchical document object structure.

See Parsing document structure.

set_recognizers(*args)[source]

Set the list of functions (or other callables) used in order to recognize symbols from the stream of text chunks. Recognizers are tried in the order specified here.
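
Because recognizers are tried in order, a specific recognizer should normally be listed before a more general one. The following is a minimal, self-contained sketch of that first-match rule, not ferenda's implementation. It assumes, purely for illustration, that a recognizer receives the chunk directly; the names is_heading, is_paragraph and first_match are hypothetical.

```python
# Illustrative sketch only: the recognizer signature and all names below are
# assumptions made for this example, not part of the FSMParser API.

def is_heading(chunk):
    # Treat short, fully upper-case chunks as headings.
    return chunk.isupper() and len(chunk) < 40


def is_paragraph(chunk):
    # Fallback: any chunk with non-whitespace content counts as a paragraph.
    return bool(chunk.strip())


def first_match(recognizers, chunk):
    """Return the first recognizer that accepts the chunk, or None."""
    for recognizer in recognizers:
        if recognizer(chunk):
            return recognizer
    return None


# Order matters: is_paragraph would also accept a heading, so it goes last.
recognizers = [is_heading, is_paragraph]
```

With the order reversed, is_paragraph would claim every heading, which is why the general fallback is placed last.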

remove_recognizer(recognizer)[source]
set_transitions(transitions)[source]

Set the transition table for the state machine.

Parameters:transitions – The transition table, in the form of a mapping from one tuple to another. The first tuple, (currentstate, recognizer), should contain the current state (or a list of possible current states) and the callable that determines whether a particular symbol is recognized. The second tuple should contain a constructor function (or False) and the new state to transition into.
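
As a rough illustration of the shape described above, the sketch below builds such a mapping as a plain dict and looks up one entry. It is self-contained and does not import ferenda. The state names, the recognizers, the constructors (including their signatures) and the lookup helper are all hypothetical.

```python
# Illustrative sketch only: states, recognizers, constructors and lookup()
# are hypothetical and exist only to show the shape of the table.

def is_heading(chunk):
    return chunk.isupper()


def is_paragraph(chunk):
    return bool(chunk.strip())


def make_section(chunk):
    return {"type": "section", "title": chunk, "children": []}


def make_paragraph(chunk):
    return {"type": "paragraph", "text": chunk}


# Key:   (current state, recognizer)
# Value: (constructor or False, new state)
transitions = {
    ("body", is_heading): (make_section, "section"),
    ("body", is_paragraph): (make_paragraph, "body"),
    ("section", is_paragraph): (make_paragraph, "section"),
    ("section", is_heading): (False, "body"),
}


def lookup(state, recognizer):
    """Return the (constructor, new state) pair registered for this key."""
    return transitions[(state, recognizer)]
```

Functions are hashable, so they can be used directly inside the key tuples.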
parse(chunks, initialstate, initialconstructor)[source]

Parse a document in the form of an iterable of suitable chunks, often lines or elements. Each chunk should be a string or a string-like object. Some examples:

from xml.etree import ElementTree
from bs4 import BeautifulSoup
from ferenda import FSMParser, TextReader

p = FSMParser()
reader = TextReader("foo.txt")
body = p.parse(reader.getiterator(reader.readparagraph), "body", make_body)
body = p.parse(BeautifulSoup(open("foo.html"), "html.parser").select("#main p"), "body", make_body)
body = p.parse(ElementTree.parse("foo.xml").findall(".//paragraph"), "body", make_body)
Parameters:
  • chunks – The document to be parsed, as a list or any other iterable of text-like objects.
  • initialstate – The initial state for the machine. The state must be present in the transition table. This could be any object, but strings are preferable as they make error messages easier to understand.
  • initialconstructor (callable) – A function that creates a document root object, and then fills it with child objects using .make_children()
Returns:

A document object tree.
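
The initialconstructor contract described above (create a root object, then fill it via .make_children()) can be sketched as follows. This is a self-contained illustration that does not import ferenda. Body and StubParser are hypothetical stand-ins, and the assumption that the constructor receives the parser as its only argument is not confirmed by this section.

```python
# Illustrative sketch only: Body and StubParser are hypothetical stand-ins,
# and the constructor signature (one parser argument) is an assumption.

class Body(list):
    """A list-like document root node."""


class StubParser:
    """Stands in for a parser; it only mimics make_children(parent)."""

    def __init__(self, children):
        self._children = list(children)

    def make_children(self, parent):
        # Fill the parent with child nodes and hand the same object back.
        parent.extend(self._children)
        return parent


def make_body(parser):
    # Create the root object, then let the parser fill it with children.
    return parser.make_children(Body())
```

Returning the result of make_children() works because it hands back the same parent object it was given.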

analyze_symbol()[source]

Internal function used by make_children()

transition(currentstate, symbol)[source]

Internal function used by make_children()

make_child(constructor, childstate)[source]

Internal function used by make_children(), which calls one of the constructors defined in the transition table.

make_children(parent)[source]

Creates child nodes for the current (parent) document node.

Parameters:parent – The parent document node, as any list-like object (preferably a subclass of ferenda.elements.CompoundElement)
Returns:The same parent object.
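
To show how recognizers, constructors and a transition table can combine into a nested result, here is a toy recursive loop in the spirit of make_children(). It is a simplified sketch, not ferenda's implementation: ToyParser, its methods, the constructor signature (parser, state) and the way a False constructor is handled are all assumptions made for this example.

```python
# Illustrative sketch only: a toy parser, not ferenda's code. Every name and
# signature here is an assumption made for the example.

def is_heading(chunk):
    return chunk.isupper()


def is_paragraph(chunk):
    return bool(chunk.strip())


RECOGNIZERS = [is_heading, is_paragraph]  # tried in this order


class ToyParser:
    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.pos = 0

    def peek(self):
        # Look at the next chunk without consuming it; None at end of input.
        return self.chunks[self.pos] if self.pos < len(self.chunks) else None

    def consume(self):
        chunk = self.chunks[self.pos]
        self.pos += 1
        return chunk

    def make_children(self, parent, state):
        """Append child nodes to parent until input ends or a rule says stop."""
        while self.peek() is not None:
            chunk = self.peek()
            symbol = next((r for r in RECOGNIZERS if r(chunk)), None)
            if symbol is None:
                raise ValueError("unrecognized chunk: %r" % chunk)
            constructor, childstate = TRANSITIONS[(state, symbol)]
            if constructor is False:
                break  # leave this chunk for an enclosing node to handle
            parent.append(constructor(self, childstate))
        return parent  # the same parent object


def make_section(parser, state):
    section = [parser.consume()]  # first item is the heading text
    return parser.make_children(section, state)


def make_paragraph(parser, state):
    return parser.consume()


TRANSITIONS = {
    ("body", is_heading): (make_section, "section"),
    ("body", is_paragraph): (make_paragraph, "body"),
    ("section", is_paragraph): (make_paragraph, "section"),
    ("section", is_heading): (False, "body"),
}


def parse(chunks):
    return ToyParser(chunks).make_children([], "body")
```

In this sketch, calling parse on a flat list of chunks yields nested lists, one per section, with each heading followed by its paragraphs.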