The WordReader class

class ferenda.WordReader

Reads .docx and .doc files (the latter with the help of antiword) and converts them to an XML form that is slightly easier to work with.

log = <logging.Logger object>
read(wordfile, intermediatefile)

Converts the word file to a more easily parsed format.

Parameters:
  • wordfile – Path to the original Word file (.doc or .docx)
  • intermediatefile – Where to store the more parseable file
Returns:

name of parseable file, filetype (either “doc” or “docx”)

Return type:

tuple
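
As a rough illustration of how the returned tuple might be consumed, the sketch below unpacks it and branches on the file type. The helper name `convert_and_describe` is hypothetical and not part of ferenda. The `reader` argument is assumed to be a `ferenda.WordReader` instance, or any object with a compatible `read` method.

```python
def convert_and_describe(reader, wordfile, intermediatefile):
    """Hypothetical helper: run reader.read() and label the result.

    `reader` is assumed to behave like ferenda.WordReader, i.e. its
    read() method returns a (name, filetype) tuple as documented above.
    """
    parseable, filetype = reader.read(wordfile, intermediatefile)
    if filetype == "docx":
        # Modern documents are documented to yield the raw OOXML.
        fmt = "raw OOXML"
    elif filetype == "doc":
        # Old documents are documented to go through antiword.
        fmt = "pseudo-DocBook"
    else:
        raise ValueError("unexpected filetype: %r" % (filetype,))
    return parseable, fmt
```

With ferenda installed, a call would presumably look like `convert_and_describe(WordReader(), "report.docx", "report.xml")`, where both file names are placeholders.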

word_to_docbook(indoc, outdoc)

Convert an old Word document (.doc) to a pseudo-DocBook file using antiword.

word_to_ooxml(indoc, outdoc)

Extracts the raw OOXML file from a modern Word document (.docx).
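
For orientation, a .docx file is a ZIP archive whose main body is conventionally stored as `word/document.xml`. The following is a minimal standard-library sketch of that kind of extraction, not the actual ferenda implementation. The function name `extract_ooxml` and the fixed part name are assumptions made for illustration.

```python
import zipfile

# Assumption: the main document part lives at this conventional path.
MAIN_PART = "word/document.xml"


def extract_ooxml(indoc, outdoc):
    """Copy the main XML part of the .docx at `indoc` to `outdoc`.

    Hypothetical stand-in for word_to_ooxml; returns the output path.
    """
    with zipfile.ZipFile(indoc) as archive:
        # Raises KeyError if the archive has no such member.
        data = archive.read(MAIN_PART)
    with open(outdoc, "wb") as fp:
        fp.write(data)
    return outdoc
```

A stricter version would look up the main part through the package relationships instead of relying on the conventional path.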