The WordReader class

class ferenda.WordReader

Reads .docx and .doc files (the latter with support from antiword) and converts them to an XML form that is slightly easier to deal with.

log = <logging.Logger object>
read(wordfile, intermediatefp, simplify=True)

Converts the Word file to a more easily parsed format.

Parameters:
  • wordfile – Path to the original Word file (.doc or .docx)
  • intermediatefp – An open filehandle to write the more parseable file to
Returns:

filetype (either “doc” or “docx”)

Return type:

str
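A minimal usage sketch for read(), based only on the signature documented above. The helper name convert_to_intermediate and the path argument are hypothetical, and opening the handle in binary mode is an assumption, since this page does not say whether a text or binary handle is expected.

```python
def convert_to_intermediate(reader, wordfile, intermediate_path):
    """Run reader.read() and return the file type it reports.

    `reader` is any object exposing read(wordfile, intermediatefp),
    for example an instance of ferenda.WordReader.
    """
    # Assumption: a binary handle is acceptable; switch to text mode
    # if the reader turns out to write str instead of bytes.
    with open(intermediate_path, "wb") as intermediatefp:
        filetype = reader.read(wordfile, intermediatefp)

    # The documentation promises one of exactly two return values.
    if filetype not in ("doc", "docx"):
        raise ValueError("unexpected file type: %r" % (filetype,))
    return filetype

# Possible use, assuming ferenda is installed and report.docx exists:
#   from ferenda import WordReader
#   kind = convert_to_intermediate(WordReader(), "report.docx", "report.xml")
```

The returned value can then be used to choose the right parser for the intermediate file, since the two input formats produce different XML vocabularies.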

word_to_docbook(indoc, outfp)

Converts an old Word document (.doc) to a pseudo-DocBook file through antiword.

word_to_ooxml(indoc, outfp, simplify)

Extracts the raw OOXML file from a modern Word document (.docx).
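To illustrate what extracting the raw OOXML involves: a .docx file is a ZIP archive, and the main document body is conventionally stored in the member word/document.xml. The sketch below uses only the standard library; it is not ferenda's implementation, the function name extract_document_xml is hypothetical, and it does nothing comparable to the simplify step.

```python
import zipfile

def extract_document_xml(docx_path, outfp):
    """Copy the main document part of a .docx archive to outfp.

    docx_path may be a filesystem path or a binary file-like object;
    outfp must be opened for binary writing.
    """
    with zipfile.ZipFile(docx_path) as archive:
        # Assumption: the main part lives at the conventional location.
        with archive.open("word/document.xml") as part:
            outfp.write(part.read())
```

A file that is not a valid ZIP archive makes zipfile.ZipFile raise zipfile.BadZipFile, and an archive without the expected member raises KeyError.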