The PDFReader class
-
class ferenda.PDFReader(pages=None, filename=None, workdir=None, images=True, convert_to_pdf=False, keep_xml=True, ocr_lang=None, fontspec=None)
Parses PDF files and makes the content available as an object hierarchy. Calling the read() method returns a ferenda.pdfreader.PDFFile object, which is a list of ferenda.pdfreader.Page objects, each of which is a list of ferenda.pdfreader.Textbox objects, each of which is a list of ferenda.pdfreader.Textelement objects.

Note
This class depends on the command line tool pdftohtml from poppler.

The class can also handle any other type of document (such as Word/OOXML/WordPerfect/RTF) that OpenOffice or LibreOffice handles by first converting it to PDF using the soffice command line tool (which then must be in your $PATH).

If the PDF contains only scanned pages (without any OCR information), the pages can be run through the tesseract command line tool (which, again, needs to be in your $PATH). You need to provide the main language of the document as the ocr_lang parameter, and you need to have installed the tesseract language files for that language.
dims = 'bbox (?P<left>\\d+) (?P<top>\\d+) (?P<right>\\d+) (?P<bottom>\\d+)'
-
re_dimensions()
Scan through string looking for a match, and return a corresponding match object instance. Return None if no position in the string matches.
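
The description of re_dimensions matches that of a compiled regular expression's search method, so the sketch below assumes it behaves like re.compile(dims).search. That is an assumption, not something this page confirms. The input strings are made up for illustration.

```python
import re

# Pattern copied from the dims attribute documented above.
dims = 'bbox (?P<left>\\d+) (?P<top>\\d+) (?P<right>\\d+) (?P<bottom>\\d+)'

# Assumption: re_dimensions is the search method of the compiled pattern.
re_dimensions = re.compile(dims).search

match = re_dimensions("title: bbox 10 20 110 220; extra")  # hypothetical input
box = {name: int(value) for name, value in match.groupdict().items()}
print(box)

# A string without a bbox specification yields None.
no_match = re_dimensions("no coordinates here")
```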
-
tagname = 'div'
-
classname = 'pdfreader'
-
textboxes(gluefunc=None, pageobjects=False, keepempty=False)
Return an iterator of the textboxes available.

gluefunc should be a callable that is called with (textbox, nextbox, prevbox) and returns True if and only if nextbox should be appended to textbox.

If pageobjects is true, the iterator can also return Page objects to signal that a page break has occurred (these Page objects may or may not have Textbox elements).

If keepempty is true, textboxes that have no text content are processed and returned (these are filtered out by default).
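
As an illustration of the gluefunc contract, here is a minimal sketch of a callable that glues two boxes when they share a font and sit close together vertically. The function name same_paragraph and the half-line-height threshold are hypothetical choices. The stand-in objects only mimic the top, height and fontid properties that Textbox is documented to have.

```python
from types import SimpleNamespace


def same_paragraph(textbox, nextbox, prevbox):
    """Return True if nextbox looks like a continuation of textbox."""
    # Different fonts usually mean a different kind of text (e.g. a heading).
    if textbox.fontid != nextbox.fontid:
        return False
    # Vertical gap between the bottom of textbox and the top of nextbox.
    gap = nextbox.top - (textbox.top + textbox.height)
    # Assumed threshold: glue only if the gap is under half a line height.
    return 0 <= gap < textbox.height * 0.5


# Stand-in boxes (not real ferenda Textbox objects).
line1 = SimpleNamespace(top=100, height=12, fontid=1)
line2 = SimpleNamespace(top=114, height=12, fontid=1)    # 2 units below line1
far_away = SimpleNamespace(top=200, height=12, fontid=1)
heading = SimpleNamespace(top=114, height=12, fontid=2)  # different font

print(same_paragraph(line1, line2, None))     # close together, same font
print(same_paragraph(line2, far_away, line1)) # large vertical gap
```

Such a function would be passed as the gluefunc argument of textboxes(); prevbox is unused here but is part of the call signature.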
-
-
class ferenda.pdfreader.Page(*args, **kwargs)
Represents a Page in a PDF file. Has width and height properties.
-
tagname = 'div'
-
classname = 'pdfpage'
-
margins = None
-
id
-
boundingbox(top=0, left=0, bottom=None, right=None)
A generator of ferenda.pdfreader.Textbox objects that fit into the bounding box specified by the parameters.
-
crop(top=0, left=0, bottom=None, right=None)
Removes any ferenda.pdfreader.Textbox objects that do not fit within the bounding box specified by the parameters.
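
The page does not spell out what "fit" means for boundingbox() and crop(). The sketch below assumes the strictest reading, namely that a box must lie entirely inside the rectangle, and that None for bottom or right means no limit on that side. The helper fits and the sample boxes are hypothetical and only illustrate the idea; the library's actual rule may differ.

```python
from types import SimpleNamespace


def fits(box, top=0, left=0, bottom=None, right=None):
    """Assumed rule: the whole box must lie inside the given rectangle."""
    box_bottom = box.top + box.height
    box_right = box.left + box.width
    return (
        box.top >= top
        and box.left >= left
        and (bottom is None or box_bottom <= bottom)
        and (right is None or box_right <= right)
    )


# Stand-in boxes (not real ferenda Textbox objects).
boxes = [
    SimpleNamespace(top=50, left=40, width=200, height=12, text="header"),
    SimpleNamespace(top=300, left=40, width=200, height=12, text="body"),
    SimpleNamespace(top=820, left=40, width=200, height=12, text="footer"),
]

# Keep only the boxes between the header and footer areas.
inside = [b.text for b in boxes if fits(b, top=100, bottom=800)]
print(inside)
```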
-
-
class ferenda.pdfreader.Textbox(*args, **kwargs)
A textbox is an amount of text on a PDF page, with top, left, width and height properties that specify the bounding box of the text. The fontid property specifies the id of the font used (use getfont() to get a dict of all font properties). A textbox consists of a list of Textelements, which may differ in basic formatting (bold and/or italics), but otherwise all text in a Textbox has the same font and size.
tagname = 'p'
-
classname = 'textbox'
-
as_xhtml(uri, parent_uri=None)
Converts this object to an lxml.etree object (with children).
Parameters: uri (str) – If provided, gets converted to an @about attribute in the resulting XHTML.
-
font
-
-
class ferenda.pdfreader.Textelement(*args, **kwargs)
Represents a single part of text where each letter has the exact same formatting. The tag property specifies whether the text as a whole is bold ('b'), italic ('i'), bold + italic ('bi') or regular (None).
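
To show how the four tag values might be consumed, here is a minimal sketch that renders a sequence of elements as inline markup. FakeTextelement is a hypothetical stand-in (a string carrying a tag attribute), and the render helper is not the library's as_xhtml() implementation.

```python
import html


class FakeTextelement(str):
    """Hypothetical stand-in: a string that carries a `tag` attribute."""

    def __new__(cls, text, tag=None):
        obj = super().__new__(cls, text)
        obj.tag = tag
        return obj


# One wrapper per documented tag value.
WRAPPERS = {
    None: "{}",
    "b": "<b>{}</b>",
    "i": "<i>{}</i>",
    "bi": "<b><i>{}</i></b>",
}


def render(elements):
    """Concatenate elements, wrapping each one according to its tag."""
    return "".join(WRAPPERS[e.tag].format(html.escape(e)) for e in elements)


textbox = [
    FakeTextelement("Note: ", "b"),
    FakeTextelement("see "),
    FakeTextelement("chapter 3", "i"),
]
rendered = render(textbox)
print(rendered)
```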
as_xhtml(uri, parent_uri=None)
Converts this object to an lxml.etree object (with children).
Parameters: uri (str) – If provided, gets converted to an @about attribute in the resulting XHTML.
-
tagname
-