The `PDFReader` class

class ferenda.PDFReader(*args, **kwargs)

Parses PDF files and makes the content available as an object hierarchy. After calling read(), the PDFReader itself is a list of ferenda.pdfreader.Page objects, each of which is a list of ferenda.pdfreader.Textbox objects, each of which is a list of ferenda.pdfreader.Textelement objects.
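The three-level nesting described above can be walked with ordinary loops. The following is a minimal sketch and not part of ferenda: `iter_text` is a hypothetical helper, and it assumes that calling `str()` on an innermost element yields its text. Plain nested lists stand in for a PDFReader on which read() has already been called.

```python
def iter_text(reader):
    """Yield (page_number, box_number, text) for every text element.

    `reader` is anything shaped like a read PDFReader: a list of pages,
    each a list of textboxes, each a list of text elements.
    """
    for pageno, page in enumerate(reader, start=1):
        for boxno, textbox in enumerate(page, start=1):
            for element in textbox:
                # Assumption: str() of an element gives its text content.
                yield pageno, boxno, str(element)


# Stand-in data with the same shape as the real object hierarchy.
fake_reader = [
    [["Chapter ", "1"], ["Introductory text"]],  # page 1: two textboxes
    [["More text"]],                             # page 2: one textbox
]
rows = list(iter_text(fake_reader))
```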

Note

This class depends on the command line tool pdftohtml from poppler.

The class can also handle any other type of document (such as Word/OOXML/WordPerfect/RTF) that OpenOffice or LibreOffice handles by first converting it to PDF using the soffice command line tool (which then must be in your $PATH).

If the PDF contains only scanned pages (without any OCR information), the pages can be run through the tesseract command line tool (which, again, needs to be in your $PATH). You need to provide the main language of the document as the ocr_lang parameter, and you need to have installed the tesseract language files for that language.
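Since all three external tools are located through $PATH, it can help to check for them before parsing. The sketch below uses only the standard library. `find_tools` and `REQUIRED_TOOLS` are illustrative names, not part of ferenda. Per the notes above, pdftohtml is always needed, while soffice and tesseract matter only for document conversion and OCR respectively.

```python
import shutil

# Command line tools named in the notes above.
REQUIRED_TOOLS = ("pdftohtml", "soffice", "tesseract")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if not on $PATH."""
    return {name: shutil.which(name) for name in names}


tools = find_tools()
missing = sorted(name for name, path in tools.items() if path is None)
if missing:
    print("Not found on $PATH:", ", ".join(missing))
```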

tagname = u'div'

classname = u'pdfreader'

read(pdffile, workdir, images=True, convert_to_pdf=False, keep_xml=True, ocr_lang=None)

Initializes a PDFReader object from an existing PDF file. After initialization, the PDFReader contains a list of Page objects.

Parameters:

pdffile – The full path to the PDF file (or, if convert_to_pdf is set, any other document file)
workdir – A directory where intermediate files (particularly background PNG files) are stored
convert_to_pdf (bool) – If pdffile is any type of document other than PDF, attempt to first convert it to PDF using the soffice command line tool (from OpenOffice/LibreOffice).
keep_xml (bool) – If False, remove the intermediate XML representation of the PDF that gets created in workdir. If True, keep it around to speed up subsequent parsing operations. If set to the special value "bz2", keep it but compress it with bz2.
ocr_lang (str) – If provided, PDFReader will extract scanned images from the PDF file and run an OCR program on them, using the ocr_lang language heuristics. (Note that this is not necessarily an IETF language tag like “sv” or “en-GB”, but rather whatever the underlying tesseract program uses.)
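The parameters above can be combined as in the following sketch. `read_options` and `load_document` are hypothetical helpers, not ferenda functions. The sketch assumes that a PDFReader can be constructed without arguments before read() is called, which the `(*args, **kwargs)` signature permits but does not confirm. The value "swe" used in the usage note is the tesseract name for Swedish.

```python
import os


def read_options(path, ocr_lang=None):
    """Build keyword arguments for PDFReader.read() from a file path."""
    is_pdf = os.path.splitext(path)[1].lower() == ".pdf"
    options = {
        "convert_to_pdf": not is_pdf,  # non-PDF input goes through soffice
        "keep_xml": "bz2",             # keep the intermediate XML, compressed
    }
    if ocr_lang is not None:
        options["ocr_lang"] = ocr_lang  # a tesseract language name
    return options


def load_document(path, workdir, ocr_lang=None):
    """Read a document with ferenda (imported lazily; must be installed)."""
    from ferenda import PDFReader
    reader = PDFReader()  # assumption: no constructor arguments are needed
    reader.read(path, workdir, **read_options(path, ocr_lang))
    return reader
```

A call such as `load_document("scan.pdf", "/tmp/work", ocr_lang="swe")` would then return the populated reader, provided ferenda and the external tools are installed.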

is_empty()

textboxes(gluefunc=None, pageobjects=False, keepempty=False)

Return an iterator of the textboxes available.

gluefunc should be a callable that is called with (textbox, nextbox, prevbox), and returns True iff nextbox should be appended to textbox.

If pageobjects, the iterator can return Page objects to signal that a page break has occurred (these Page objects may or may not have Textbox elements).

If keepempty, process and return textboxes that have no text content (these are filtered out by default).
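A gluefunc could look like the sketch below, which treats nextbox as a continuation of textbox when both use the same font and sit close together vertically. This is one possible heuristic, not ferenda's own. `Box` is a stand-in carrying only the documented top, left, width, height and font properties, and the threshold of half a line height is an arbitrary assumption.

```python
from collections import namedtuple

# Stand-in for ferenda.pdfreader.Textbox, with only the properties used here.
Box = namedtuple("Box", "top left width height font")


def glue_same_paragraph(textbox, nextbox, prevbox):
    """Return True if nextbox looks like a continuation of textbox.

    prevbox is accepted because gluefuncs are called with three
    arguments, but this heuristic does not use it.
    """
    same_font = textbox.font == nextbox.font
    # Vertical distance from the bottom of textbox to the top of nextbox.
    gap = nextbox.top - (textbox.top + textbox.height)
    # Assumption: less than half a line of space means the same paragraph.
    return same_font and nextbox.top > textbox.top and gap < nextbox.height / 2


line1 = Box(top=100, left=50, width=400, height=12, font="1")
line2 = Box(top=114, left=50, width=380, height=12, font="1")
heading = Box(top=116, left=50, width=200, height=18, font="2")
far_away = Box(top=200, left=50, width=400, height=12, font="1")
```

Such a function would be passed as `reader.textboxes(gluefunc=glue_same_paragraph)`.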

drawboxes(outfile, gluefunc=None)

Create a copy of the parsed PDF file, but with the textboxes created by gluefunc clearly marked. Returns the name of the created pdf file.

Note

This requires PyPDF2 and reportlab, which aren't installed by default (and at least reportlab is not py3 compatible).

static re_dimensions()

search(string[, pos[, endpos]]) -> match object or None. Scans through string looking for a match and returns the corresponding match object. Returns None if no position in the string matches.

median_box_width(threshold=0)

Returns the median box width of all pages.

class ferenda.pdfreader.Page(*args, **kwargs)

Represents a Page in a PDF file. Has width and height properties.

tagname = u'div'

classname = u'pdfpage'

id

boundingbox(top=0, left=0, bottom=None, right=None)

A generator of ferenda.pdfreader.Textbox objects that fit into the bounding box specified by the parameters.

crop(top=0, left=0, bottom=None, right=None)

Removes any ferenda.pdfreader.Textbox objects that do not fit within the bounding box specified by the parameters.
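What it means for a textbox to "fit" can be sketched as a containment test on its top, left, width and height properties. This is an illustration under stated assumptions, not ferenda's implementation: `fits` and `Box` are hypothetical names, and a bottom or right of None is assumed to mean "no limit on that side".

```python
from collections import namedtuple

# Stand-in for a Textbox, carrying only its position and size.
Box = namedtuple("Box", "top left width height")


def fits(box, top=0, left=0, bottom=None, right=None):
    """Return True if box lies entirely inside the given bounding box."""
    if box.top < top or box.left < left:
        return False
    if bottom is not None and box.top + box.height > bottom:
        return False
    if right is not None and box.left + box.width > right:
        return False
    return True


boxes = [Box(10, 10, 100, 20), Box(500, 10, 100, 20), Box(40, 350, 100, 20)]
# Only the first box is fully inside the region with bottom=300, right=400.
inside = [b for b in boxes if fits(b, top=0, left=0, bottom=300, right=400)]
```

Under this reading, boundingbox() yields the boxes for which the test holds, and crop() discards the rest.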

class ferenda.pdfreader.Textbox(*args, **kwargs)

A textbox is a block of text on a PDF page, with top, left, width and height properties that specify the bounding box of the text. The font property specifies the id of the font used (use getfont() to get a dict of all font properties). A textbox consists of a list of Textelements which may differ in basic formatting (bold and/or italics), but otherwise all text in a Textbox has the same font and size.

tagname = u'p'

classname = u'textbox'

as_xhtml(uri)

getfont()

Returns a fontspec dict of all properties of the font used.

class ferenda.pdfreader.Textelement(*args, **kwargs)

Represents a single piece of text where each letter has exactly the same formatting. The tag property specifies whether the text as a whole is bold ('b'), italic ('i'), bold + italic ('bi') or regular (None).
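The four tag values listed above map directly onto a pair of bold and italic flags. The helper below is a hypothetical illustration, not part of ferenda. It takes the tag value itself rather than a Textelement.

```python
def formatting(tag):
    """Translate a Textelement tag value into (bold, italic) flags."""
    flags = {
        None: (False, False),  # regular text
        "b": (True, False),    # bold
        "i": (False, True),    # italic
        "bi": (True, True),    # bold + italic
    }
    if tag not in flags:
        raise ValueError("unexpected tag: %r" % (tag,))
    return flags[tag]
```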

tagname
