The PDFReader class

class ferenda.PDFReader(pages=None, filename=None, workdir=None, images=True, convert_to_pdf=False, keep_xml=True, ocr_lang=None, fontspec=None, textdecoder=None)

Parses PDF files and makes the content available as an object hierarchy. Calling the read() method returns a ferenda.pdfreader.PDFFile object, which is a list of ferenda.pdfreader.Page objects. Each Page is a list of ferenda.pdfreader.Textbox objects, and each Textbox is a list of ferenda.pdfreader.Textelement objects.

Note

This class depends on the command line tool pdftohtml from poppler.

The class can also handle any other type of document (such as Word/OOXML/WordPerfect/RTF) that OpenOffice or LibreOffice handles by first converting it to PDF using the soffice command line tool (which then must be in your $PATH).

If the PDF contains only scanned pages (without any OCR information), the pages can be run through the tesseract command line tool (which, again, needs to be in your $PATH). You need to provide the main language of the document as the ocr_lang parameter, and you need to have installed the tesseract language files for that language.
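Because all three external tools must be reachable through $PATH, it can help to check for them before parsing. The following is a minimal sketch using only the standard library; the helper name find_tools is hypothetical and not part of ferenda.

```python
import shutil

# Tool names taken from the notes above. Which ones you actually need
# depends on whether you convert office documents or OCR scanned pages.
REQUIRED_TOOLS = ("pdftohtml", "soffice", "tesseract")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on $PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print("%-10s %s" % (name, path or "NOT FOUND"))
```

A None value means the corresponding feature (plain PDF parsing, office conversion, or OCR) is likely to fail on that machine.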

detect_footnotes = True
dims = 'bbox (?P<left>\\d+) (?P<top>\\d+) (?P<right>\\d+) (?P<bottom>\\d+)'
re_dimensions()

Scan through string looking for a match, and return a corresponding match object instance.

Return None if no position in the string matches.
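To illustrate what the dims pattern captures, here is a small sketch. It assumes that re_dimensions is the search method of the compiled dims pattern (which its docstring suggests), and the input string is a made-up example of the kind of "bbox" annotation found in OCR output, not taken from ferenda.

```python
import re

# Pattern copied from the `dims` attribute above.
dims = r"bbox (?P<left>\d+) (?P<top>\d+) (?P<right>\d+) (?P<bottom>\d+)"

# Assumption: re_dimensions behaves like the compiled pattern's search().
re_dimensions = re.compile(dims).search

# Hypothetical input string, for illustration only.
title = "bbox 120 85 940 132; x_wconf 91"

match = re_dimensions(title)
# Convert the four named groups to integers.
box = {name: int(value) for name, value in match.groupdict().items()}
width = box["right"] - box["left"]
height = box["bottom"] - box["top"]
```

When no bbox annotation is present, the call returns None, so callers should test the result before reading the groups.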

ws_trans = {9: ' ', 10: ' ', 160: ' '}
tagname = 'div'
classname = 'pdfreader'
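The ws_trans mapping has the shape expected by str.translate: its keys are the code points for tab (9), line feed (10) and no-break space (160), each mapped to a plain space. How PDFReader applies it internally is not stated here; the sketch below only shows the effect of such a table on a made-up string.

```python
# Mapping copied from the `ws_trans` attribute above.
ws_trans = {9: " ", 10: " ", 160: " "}

# Hypothetical raw text containing a tab, a line feed and a no-break space.
raw = "Section\t1\nGeneral\u00a0provisions"

# str.translate replaces each listed code point with an ordinary space.
clean = raw.translate(ws_trans)
```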
is_empty()
textboxes(gluefunc=None, pageobjects=False, keepempty=False, startpage=0, pagecount=None, cache=True)

Return an iterator of the textboxes available.

gluefunc should be a callable that is called with (textbox, nextbox, prevbox), and returns True iff nextbox should be appended to textbox.

If pageobjects, the iterator can return Page objects to signal that a page break has occurred (these Page objects may or may not have Textbox elements).

If keepempty, process and return textboxes that have no text content (these are filtered out by default).

If cache, store the resulting list of textboxes for each page and return it the next time.
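As an illustration of the gluefunc contract, the sketch below defines a callable with the (textbox, nextbox, prevbox) signature. The Box type is a stand-in carrying only the geometry properties documented for Textbox, and the merging rule (same left edge, small vertical gap) is a hypothetical heuristic, not ferenda's own.

```python
from collections import namedtuple

# Stand-in for ferenda.pdfreader.Textbox with only the documented geometry
# properties. This is not the real class.
Box = namedtuple("Box", "top left width height text")


def same_paragraph(textbox, nextbox, prevbox, max_gap=5):
    """Return True if nextbox should be appended to textbox.

    Hypothetical rule: both boxes share a left edge and the vertical gap
    between them is at most max_gap units. prevbox is accepted to match
    the expected signature but is not used.
    """
    gap = nextbox.top - (textbox.top + textbox.height)
    return nextbox.left == textbox.left and 0 <= gap <= max_gap


first = Box(top=100, left=50, width=400, height=12, text="first line")
second = Box(top=114, left=50, width=380, height=12, text="second line")
later = Box(top=160, left=50, width=400, height=12, text="new paragraph")

glue_second = same_paragraph(first, second, None)  # gap of 2 units
glue_later = same_paragraph(first, later, None)    # gap of 48 units
```

Given the signature above, such a function would presumably be passed as textboxes(gluefunc=same_paragraph).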

median_box_width(threshold=0)

Returns the median box width of all pages.
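The meaning of threshold is not documented here. One plausible reading, shown in the sketch below, is that boxes no wider than threshold are ignored before the median is taken. This is an assumption, and the function and data are stand-ins (plain dicts and lists) rather than ferenda objects.

```python
from statistics import median


def median_width(pages, threshold=0):
    """Median width over all boxes on all pages.

    Assumption: boxes whose width does not exceed threshold are skipped.
    Raises statistics.StatisticsError if no box qualifies.
    """
    widths = [
        box["width"]
        for page in pages
        for box in page
        if box["width"] > threshold
    ]
    return median(widths)


# Two hypothetical pages; the 20-unit box could be a page number.
pages = [
    [{"width": 400}, {"width": 20}],
    [{"width": 380}, {"width": 410}],
]

all_boxes = median_width(pages)
wide_only = median_width(pages, threshold=50)
```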

class ferenda.pdfreader.Page(*args, **kwargs)

Represents a Page in a PDF file. Has width and height properties.

tagname = 'div'
classname = 'pdfpage'
margins = None
id
boundingbox(top=0, left=0, bottom=None, right=None)

A generator of ferenda.pdfreader.Textbox objects that fit into the bounding box specified by the parameters.
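The filtering idea can be sketched with stand-in objects. The code below assumes that "fit" means the box lies entirely inside the rectangle and that a bottom or right value of None means no limit on that side; the real method may treat these differently (for example by falling back to the page's height and width). Box and boxes_within are hypothetical names.

```python
from collections import namedtuple

# Stand-in for ferenda.pdfreader.Textbox with only its geometry properties.
Box = namedtuple("Box", "top left width height")


def boxes_within(boxes, top=0, left=0, bottom=None, right=None):
    """Yield boxes that lie entirely inside the given rectangle.

    Assumption: bottom=None / right=None mean "unbounded" on that side.
    """
    for box in boxes:
        if box.top < top or box.left < left:
            continue
        if bottom is not None and box.top + box.height > bottom:
            continue
        if right is not None and box.left + box.width > right:
            continue
        yield box


header = Box(top=20, left=50, width=400, height=12)
body = Box(top=100, left=50, width=400, height=12)
footer = Box(top=780, left=50, width=400, height=12)

# Keep only what lies between a hypothetical header and footer zone.
main_text = list(boxes_within([header, body, footer], top=60, bottom=760))
```

Under the same assumptions, crop() would correspond to discarding every box that such a filter does not yield.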

crop(top=0, left=0, bottom=None, right=None)

Removes any ferenda.pdfreader.Textbox objects that do not fit within the bounding box specified by the parameters.

class ferenda.pdfreader.Textbox(*args, **kwargs)

A textbox is an amount of text on a PDF page, with top, left, width and height properties that specify the bounding box of the text. The fontid property specifies the id of the font used (use getfont() to get a dict of all font properties). A textbox consists of a list of Textelement objects, which may differ in basic formatting (bold and/or italic), but otherwise all text in a Textbox has the same font and size.

tagname = 'p'
classname = 'textbox'
linespacing
as_xhtml(uri, parent_uri=None)

Converts this object to an lxml.etree object (with children).

Parameters: uri (str) – If provided, gets converted to an @about attribute in the resulting XHTML.
font
class ferenda.pdfreader.Textelement(*args, **kwargs)

Represents a single part of text where each letter has the exact same formatting. The tag property specifies whether the text as a whole is bold ('b'), italic ('i'), bold + italic ('bi') or regular (None).

as_xhtml(uri, parent_uri=None)

Converts this object to an lxml.etree object (with children).

Parameters: uri (str) – If provided, gets converted to an @about attribute in the resulting XHTML.
tagname