Reading files in various formats

The first step of parsing a document is often getting actual text from a file. For plain text files, this is not a difficult process, but for eg. Word and PDF documents some sort of library support is useful.

Ferenda contains three different classes that all deal with this problem. They do not have a unified interface, but instead contain different methods depending on the structure and capabilities of the file format they’re reading.

Reading plain text files

The TextReader class works sort of like a regular file object, and can read a plain text file line by line, but contains extra methods for reading files paragraph by paragraph or page by page. It can also produce generators that yield the file contents divided into arbitrary chunks, which is suitable as input for FSMParser.

Microsoft Word documents

The WordReader class can read both old-style .doc files and newer, XML-based .docx files. The former requires that antiword is installed, but the latter has no additional dependencies.

This class does not present any interface for actually reading the word document – instead, it converts the document to a XML file which is either based on the docbook output of antiword, or the raw OOXML found inside of the .docx file.

PDF documents

PDFReader reads PDF documents and makes them available as a list of pages, where each page contains a list of Textbox objects, which in turn contains a list of Textelement objects.

Its textboxes() method is a flexible way of getting a generator of suitable text chunks. By passing a “glue” function to that method, you can specify exact rules on which rows of text should be combined to form larger suitable chunks (eg. paragraphs). This stream of chunks can be fed directly as input to FSMParser.

Handling non-PDFs and scanned documents

The class can also handle any other type of document (such as Word/OOXML/WordPerfect/RTF) that OpenOffice or LibreOffice handles by first converting it to PDF using the soffice command line tool. This is done by specifiying the convert_to_pdf parameter.

If the PDF contains only scanned pages (without any OCR information), the pages can be run through the tesseract command line tool. You need to provide the main language of the document as the ocr_lang parameter, and you need to have installed the tesseract language files for that language.

Analyzing PDF documents

When processing a PDF file, the information contained in eg a Textbox object (position, size, font) is useful to determine what kind of content it might be, eg. if it’s set in a header-like font, it probably signals the start of a section, and if it’s a digit-like text set in a small font outside of the main content area, it’s probably a page number.

Information about eg page margins, header styles etc can be hardcoded in your processing code, but it’s also possible to use the companion class PDFAnalyzer can be used to statistically analyze a complete document and then make educated guesses about these metrics. It can also output histogram plots and an annotated version of the original PDF file with lines marking the identified margins, styles and text chunks (given a provided “glue” function identical to the one provided to textboxes())

The class is designed to be overridden if your document has particular rules about eg. header styles or additional margin metrics.