The PDFReader class
-
class ferenda.PDFReader(pages=None, filename=None, workdir=None, images=True, convert_to_pdf=False, keep_xml=True, ocr_lang=None, fontspec=None, textdecoder=None)
Parses PDF files and makes the content available as an object hierarchy. Calling the read() method returns a ferenda.pdfreader.PDFFile object, which is a list of ferenda.pdfreader.Page objects, each of which is a list of ferenda.pdfreader.Textbox objects, each of which is a list of ferenda.pdfreader.Textelement objects.
Note: This class depends on the command line tool pdftohtml from poppler.
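The nesting described above can be pictured with plain Python lists. The sketch below uses hypothetical stand-in classes (not the real ferenda classes, whose constructors are not documented here) purely to show how the four levels are traversed.

```python
# Stand-in classes that only mimic the list-of-lists shape described above.
# They are NOT the ferenda classes; their constructors are assumptions.
class Textelement(str):
    """A run of text with uniform formatting."""


class Textbox(list):
    """A list of Textelement objects."""


class Page(list):
    """A list of Textbox objects."""


class PDFFile(list):
    """A list of Page objects."""


def all_text(pdffile):
    """Yield (page_index, text) for every textbox in the document."""
    for pageno, page in enumerate(pdffile):
        for textbox in page:
            # Joining the elements gives the text of the whole box.
            yield pageno, "".join(textbox)


doc = PDFFile([
    Page([Textbox([Textelement("Hello "), Textelement("world")])]),
    Page([Textbox([Textelement("Second page")])]),
])

for pageno, text in all_text(doc):
    print(pageno, text)
```

Because every level is a list, ordinary indexing, slicing and len() work at each level of the hierarchy.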
The class can also handle any other type of document (such as Word/OOXML/WordPerfect/RTF) that OpenOffice or LibreOffice handles by first converting it to PDF using the soffice command line tool (which then must be in your $PATH).
If the PDF contains only scanned pages (without any OCR information), the pages can be run through the tesseract command line tool (which, again, needs to be in your $PATH). You need to provide the main language of the document as the ocr_lang parameter, and you need to have installed the tesseract language files for that language.
-
detect_footnotes = True
-
dims = 'bbox (?P<left>\\d+) (?P<top>\\d+) (?P<right>\\d+) (?P<bottom>\\d+)'
-
re_dimensions()
Scan through string looking for a match, and return a corresponding match object instance.
Return None if no position in the string matches.
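The dims pattern resembles the bbox field that hOCR output carries in its title attributes, and the description of re_dimensions reads like the docstring of a compiled pattern's search method. Assuming, then, that re_dimensions behaves like re.compile(dims).search, here is a minimal sketch of what a match yields. The helper parse_bbox and the sample strings are made up for illustration.

```python
import re

# Pattern copied from the dims attribute (shown above in its escaped repr form).
dims = r"bbox (?P<left>\d+) (?P<top>\d+) (?P<right>\d+) (?P<bottom>\d+)"

# Assumption: re_dimensions is equivalent to the compiled pattern's search method.
re_dimensions = re.compile(dims).search


def parse_bbox(s):
    """Return the four coordinates as ints, or None when s contains no bbox."""
    match = re_dimensions(s)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


print(parse_bbox("bbox 10 20 110 220; x_wconf 93"))
print(parse_bbox("no coordinates here"))
```

Since search scans the whole string, the bbox field does not have to appear at the start of the input.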
-
ws_trans = {9: ' ', 10: ' ', 160: ' '}
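The keys of ws_trans are code points: 9 is tab, 10 is line feed and 160 is the no-break space. A mapping of this shape is what str.translate accepts, so it is presumably used to normalize whitespace. The sketch below assumes exactly that usage; the sample string is made up.

```python
# Same mapping as the ws_trans attribute: tab, line feed and no-break space
# are each replaced by an ordinary space.
ws_trans = {9: " ", 10: " ", 160: " "}

raw = "Section\t1\nfirst\u00a0paragraph"
clean = raw.translate(ws_trans)
print(clean)  # Section 1 first paragraph
```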
-
tagname = 'div'
-
classname = 'pdfreader'
-
textboxes(gluefunc=None, pageobjects=False, keepempty=False, startpage=0, pagecount=None, cache=True)
Return an iterator of the textboxes available.
gluefunc should be a callable that is called with (textbox, nextbox, prevbox), and returns True if and only if nextbox should be appended to textbox.
If pageobjects, the iterator can return Page objects to signal that a page break has occurred (these Page objects may or may not have Textbox elements).
If keepempty, process and return textboxes that have no text content (these are filtered out by default).
If cache, store the resulting list of textboxes for each page and return it the next time.
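A gluefunc only has to answer one question: should nextbox be merged into textbox? The sketch below shows one possible rule together with a simplified merging loop. Box, same_paragraph and glue are hypothetical stand-ins for illustration; the real Textbox class and the internals of textboxes() may differ. In particular, how prevbox is populated and the 5-unit gap threshold are assumptions.

```python
from collections import namedtuple

# Hypothetical stand-in for a textbox; only the fields the rule needs.
Box = namedtuple("Box", "top height fontid text")


def same_paragraph(textbox, nextbox, prevbox):
    """Glue when the font matches and nextbox starts close below textbox."""
    gap = nextbox.top - (textbox.top + textbox.height)
    return textbox.fontid == nextbox.fontid and gap <= 5


def glue(boxes, gluefunc):
    """Simplified model: merge consecutive boxes while gluefunc returns True."""
    result = []
    current = None
    prevbox = None
    for nextbox in boxes:
        if current is not None and gluefunc(current, nextbox, prevbox):
            # Extend the current box so that it covers the appended one.
            current = Box(current.top,
                          nextbox.top + nextbox.height - current.top,
                          current.fontid,
                          current.text + " " + nextbox.text)
        else:
            if current is not None:
                result.append(current)
            current = nextbox
        prevbox = nextbox
    if current is not None:
        result.append(current)
    return result


boxes = [
    Box(100, 12, "1", "First line"),
    Box(114, 12, "1", "continues here."),
    Box(160, 12, "1", "New paragraph"),
    Box(174, 12, "2", "Different font"),
]
for box in glue(boxes, same_paragraph):
    print(box.text)
```

Keeping the rule in a separate callable means the merging policy can be changed per document type without touching the iteration itself.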
-
string = <module 'string' from '/usr/lib/python3.5/string.py'>
-
-
class ferenda.pdfreader.Page(*args, **kwargs)
Represents a Page in a PDF file. Has width and height properties.
-
tagname = 'div'
-
classname = 'pdfpage'
-
margins = None
-
id
-
boundingbox(top=0, left=0, bottom=None, right=None)
A generator of ferenda.pdfreader.Textbox objects that fit into the bounding box specified by the parameters.
-
crop(top=0, left=0, bottom=None, right=None)
Removes any ferenda.pdfreader.Textbox objects that do not fit within the bounding box specified by the parameters.
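What "fit" means is not spelled out here. The sketch below assumes that a textbox fits when it lies entirely inside the rectangle, and that bottom=None and right=None mean "no limit". MiniPage and Box are hypothetical stand-ins, not the ferenda classes.

```python
from collections import namedtuple

# Hypothetical stand-in for a textbox with the four documented properties.
Box = namedtuple("Box", "top left width height text")


class MiniPage(list):
    """Hypothetical stand-in for Page: a list of boxes."""

    def boundingbox(self, top=0, left=0, bottom=None, right=None):
        """Yield the boxes that lie entirely inside the given rectangle."""
        for box in self:
            box_bottom = box.top + box.height
            box_right = box.left + box.width
            if (box.top >= top and box.left >= left
                    and (bottom is None or box_bottom <= bottom)
                    and (right is None or box_right <= right)):
                yield box

    def crop(self, top=0, left=0, bottom=None, right=None):
        """Remove, in place, every box that boundingbox() would not yield."""
        self[:] = list(self.boundingbox(top, left, bottom, right))


page = MiniPage([
    Box(top=20, left=50, width=300, height=12, text="Running header"),
    Box(top=100, left=50, width=300, height=12, text="Body text"),
    Box(top=780, left=50, width=300, height=12, text="Page 3"),
])

# Select without modifying the page, then crop the page itself.
body = [box.text for box in page.boundingbox(top=60, bottom=760)]
page.crop(top=60, bottom=760)
```

Under these assumptions the generator leaves the page untouched, while crop changes it permanently, so a caller would crop only when headers and footers are never needed again.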
-
-
class ferenda.pdfreader.Textbox(*args, **kwargs)
A textbox is a block of text on a PDF page, with top, left, width and height properties that specify the bounding box of the text. The fontid property specifies the id of the font used (use getfont() to get a dict of all font properties). A textbox consists of a list of Textelements, which may differ in basic formatting (bold and/or italic), but otherwise all text in a Textbox has the same font and size.
-
tagname = 'p'
-
classname = 'textbox'
-
linespacing
-
as_xhtml(uri, parent_uri=None)
Converts this object to an lxml.etree object (with children).
Parameters: uri (str) – If provided, gets converted to an @about attribute in the resulting XHTML.
-
font
-
-
class ferenda.pdfreader.Textelement(*args, **kwargs)
Represents a single piece of text where each letter has the exact same formatting. The tag property specifies whether the text as a whole is bold ('b'), italic ('i'), bold + italic ('bi') or regular (None).
-
as_xhtml(uri, parent_uri=None)
Converts this object to an lxml.etree object (with children).
Parameters: uri (str) – If provided, gets converted to an @about attribute in the resulting XHTML.
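The four tag values of a Textelement map naturally onto nested inline elements. The sketch below shows one way such a mapping could look, using the standard library's xml.etree.ElementTree in place of lxml. The function name render, the wrapping span and the choice of b and i elements are assumptions, not a description of what as_xhtml actually emits.

```python
import xml.etree.ElementTree as ET


def render(text, tag=None):
    """Wrap text in <b>/<i> elements according to tag ('b', 'i', 'bi' or None)."""
    if tag not in (None, "b", "i", "bi"):
        raise ValueError("unknown tag: %r" % (tag,))
    root = ET.Element("span")
    innermost = root
    # 'bi' yields <b><i>...</i></b>; None leaves the text directly in the span.
    for name in (tag or ""):
        innermost = ET.SubElement(innermost, name)
    innermost.text = text
    return ET.tostring(root, encoding="unicode")


print(render("plain"))
print(render("important", "bi"))
```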
-
tagname
-