The `PDFAnalyzer` class¶

class ferenda.PDFAnalyzer(pdf)[source]¶

Create an analyzer for the given pdf file.

The primary purpose of an analyzer is to determine margins and other spatial metrics of a document, and to identify common typographic styles for default text, title and headings. This is done by calling the metrics() method.

The analysis is done in several steps. The properties of all textboxes on each page are collected in several collections.Counter objects. These counters are then statistically analyzed in a series of functions to yield the metrics.
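The collect-then-analyze idea can be sketched without ferenda at all. The following is a minimal illustration, not the library's actual algorithm: the `pages` data and the choice of "most common value wins" are assumptions made for the example.

```python
from collections import Counter

# Hypothetical stand-in data: (left, right) x-coordinates of the
# textboxes on two pages. Real input would come from the parsed PDF.
pages = [
    [(72, 520), (72, 518), (90, 520)],
    [(72, 520), (72, 400), (300, 320)],
]

# Step 1: collect textbox properties in Counter objects.
left_edges = Counter()
right_edges = Counter()
for boxes in pages:
    for left, right in boxes:
        left_edges[left] += 1
        right_edges[right] += 1

# Step 2: analyze the counters. Here the most frequent edge position
# is simply taken as the margin (an assumption for this sketch).
leftmargin = left_edges.most_common(1)[0][0]
rightmargin = right_edges.most_common(1)[0][0]
```

Keeping collection and analysis separate means the same counters can be inspected or plotted before any metric is derived from them.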

If different analysis logic, or additional metrics, are desired, this class should be inherited and some methods/properties overridden.
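As a rough sketch of that pattern, the example below overrides two of the documented class attributes. `StandInAnalyzer` is a hypothetical placeholder for `ferenda.PDFAnalyzer`, used only so the snippet runs without ferenda installed. The subclass name and the new values are likewise assumptions.

```python
class StandInAnalyzer:
    """Hypothetical stand-in for ferenda.PDFAnalyzer (not the real class)."""
    twopage = True
    style_significance_threshold = 0.005

    def __init__(self, pdf):
        self.pdf = pdf


class SingleSidedAnalyzer(StandInAnalyzer):
    """Hypothetical subclass for documents with identical odd/even margins."""
    twopage = False                      # no separate *_even margins expected
    style_significance_threshold = 0.01  # ignore styles used for under 1% of the text


analyzer = SingleSidedAnalyzer(pdf=None)  # a real analyzer would get a PDFReader
```

With the real library, the base class would be `ferenda.PDFAnalyzer` and individual methods could be overridden in the same way.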

Parameters:	pdf (ferenda.PDFReader) – The pdf file to analyze.

twopage = True¶: Whether or not the document is expected to have different margins depending on whether it’s an even or odd page.

style_significance_threshold = 0.005¶: The amount of use (as compared to the rest of the document) that a style must have to be considered significant.
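To make the threshold concrete, here is a small sketch of filtering styles by their share of total use. What exactly is counted (characters or textboxes) and whether the comparison is inclusive are assumptions of this example. `significant_styles` is a hypothetical helper, not a ferenda function.

```python
from collections import Counter

STYLE_SIGNIFICANCE_THRESHOLD = 0.005  # same value as the documented default


def significant_styles(style_counts, threshold=STYLE_SIGNIFICANCE_THRESHOLD):
    """Keep styles whose share of the total count reaches the threshold."""
    total = sum(style_counts.values())
    if total == 0:
        return {}
    return {style: n for style, n in style_counts.items()
            if n / total >= threshold}


# Hypothetical usage counts keyed by (family, size).
style_counts = Counter({
    ("Times", 12): 9000,
    ("Times", 16): 400,
    ("Symbol", 9): 3,    # about 0.03% of the text: below the threshold
})
kept = significant_styles(style_counts)
```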

header_significance_threshold = 0.002¶: The maximum amount (expressed as part of the entire text amount) of text that can occur on the top of the page for it to be considered part of the header.

footer_significance_threshold = 0.002¶: The maximum amount (expressed as part of the entire text amount) of text that can occur on the bottom of the page for it to be considered part of the footer.
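One possible reading of the header and footer thresholds is sketched below: scan positions from the top of the page and stop once the accumulated share of text exceeds the threshold. This is an illustration under stated assumptions (coordinates grow downward, text amount is counted in characters), not necessarily how the class computes its margins. `header_boundary` is a hypothetical name.

```python
from collections import Counter

HEADER_SIGNIFICANCE_THRESHOLD = 0.002  # same value as the documented default


def header_boundary(text_by_top, threshold=HEADER_SIGNIFICANCE_THRESHOLD):
    """Return the first position, scanning downward, at which the
    accumulated share of all text exceeds the threshold."""
    total = sum(text_by_top.values())
    if total == 0:
        return None
    seen = 0
    for top in sorted(text_by_top):
        seen += text_by_top[top]
        if seen / total > threshold:
            return top
    return None


# Hypothetical data: characters of text keyed by the top coordinate
# of their textbox (0 = top edge of the page).
text_by_top = Counter({40: 15, 100: 5000, 130: 5000})
boundary = header_boundary(text_by_top)
```

In this data the 15 characters at position 40 make up roughly 0.15% of the text, so they fall within the header zone and the body starts at position 100. A footer could be found the same way by scanning upward from the bottom.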

pagination_min_size = 6¶: The minimum size (in points) that a page number can be. Used to distinguish page numbers from footnote numbers, which are typically set in minuscule sizes.
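The size cutoff can be illustrated with a short filter. The `(text, size)` tuples and the helper name `pagenumber_candidates` are assumptions for this sketch, and real textboxes carry more information than this.

```python
PAGINATION_MIN_SIZE = 6  # same value as the documented default


def pagenumber_candidates(boxes, min_size=PAGINATION_MIN_SIZE):
    """Return numeric textbox contents that are large enough to be page numbers."""
    return [int(text) for text, size in boxes
            if text.strip().isdecimal() and size >= min_size]


# Hypothetical (text, font size in points) pairs from one page.
boxes = [("12", 10), ("3", 5), ("Chapter 2", 14)]
candidates = pagenumber_candidates(boxes)
```

Here the "3" set in 5 pt is treated as a footnote marker and dropped, while the "12" in 10 pt remains as a page number candidate.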

documents¶

Attempts to distinguish different logical documents (e.g. parts with differing page sizes, margins, styles etc.) within this PDF.

You should override this method if you want to provide your own document segmentation logic.

Returns:	Tuples (startpage, pagecount, tag) for the different identified documents
Return type:	list
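A short sketch of consuming such a list follows. The tuples and tag values are made-up sample data in the documented (startpage, pagecount, tag) shape, and zero-based page numbering is an assumption.

```python
# Hypothetical return value; the tags here are illustrative only.
documents = [(0, 2, "frontmatter"), (2, 40, "main"), (42, 8, "appendix")]

ranges = {}
for startpage, pagecount, tag in documents:
    lastpage = startpage + pagecount - 1
    ranges[tag] = (startpage, lastpage)
    print(f"{tag}: pages {startpage}-{lastpage}")

total_pages = sum(pagecount for _, pagecount, _ in documents)
```

Since metrics() also accepts startpage and pagecount, each segment could plausibly be analyzed on its own by passing these values through.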

paginate(paginatepath=None, force=False)[source]¶: Attempt to identify the real page number from pagination numbers on the page

guess_pagenumber(page, probable_pagenumber=1)[source]¶

guess_pagenumber_candidates(page, probable_pagenumber)[source]¶

guess_pagenumber_boxes(page)[source]¶: Return a suitable number of textboxes to scan for a possible page number.

guess_pagenumber_select(candidates, probable_pagenumber)[source]¶

metrics(metricspath=None, plotpath=None, startpage=0, pagecount=None, force=False)[source]¶

Calculate and return the metrics for this analyzer.

metrics is a set of named properties in the form of a dict. The keys of the dict can represent margins or other measurements of the document (left/right margins, header/footer etc) or font styles used in the document (eg. default, title, h1 – h3). Style values are in turn dicts themselves, with the keys ‘family’ and ‘size’.

Parameters:	metricspath (str) – The path of a JSON file used as cache for the calculated metrics.
	plotpath (str) – The path to write a PNG file with histograms for different values (for debugging).
	startpage (int) – Starting page for the analysis.
	pagecount – Number of pages to analyze (default: all available).
	force (bool) – Perform analysis even if cached JSON metrics exist.
Returns:	calculated metrics
Return type:	dict

The default implementation will try to find out values for the following metrics:

key	description
leftmargin	position of left margin (for odd pages if twopage = True)
rightmargin	position of right margin (for odd pages if twopage = True)
leftmargin_even	position of left margin for even pages
rightmargin_even	position of right margin for even pages
topmargin	position of header zone
bottommargin	position of footer zone
default	style used for default text
title	style used for main document title (on front page)
h1	style used for level 1 headings
h2	style used for level 2 headings
h3	style used for level 3 headings

Subclasses might add to (or remove from) the above.
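For orientation, a metrics dict of the shape described above might look like the following. All values are invented for illustration, and `is_style` is a hypothetical helper that separates style entries from plain measurements.

```python
# Hypothetical metrics; real values depend on the analyzed PDF.
metrics = {
    "leftmargin": 72,
    "rightmargin": 523,
    "leftmargin_even": 85,
    "rightmargin_even": 536,
    "topmargin": 60,
    "bottommargin": 780,
    "default": {"family": "Times", "size": 12},
    "title": {"family": "Times", "size": 24},
    "h1": {"family": "Times", "size": 18},
}


def is_style(value):
    """Style values are dicts carrying at least 'family' and 'size'."""
    return isinstance(value, dict) and {"family", "size"} <= set(value)


styles = {key: value for key, value in metrics.items() if is_style(value)}
measurements = {key: value for key, value in metrics.items() if not is_style(value)}

# Width of the text column on odd pages.
text_width = measurements["rightmargin"] - measurements["leftmargin"]
```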

textboxes(startpage, pagecount)[source]¶: Generate a stream of (pagenumber, textbox) tuples consisting of all pages/textboxes from startpage to pagecount.
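The generator pattern can be sketched as below. This stand-alone function takes the pages as an explicit argument and treats `pagecount` as a number of pages starting at `startpage`, with zero-based page indexes. All of these are assumptions of the sketch rather than documented behaviour.

```python
def textboxes(pages, startpage, pagecount):
    """Yield (pagenumber, textbox) for pagecount pages beginning at startpage."""
    stop = min(startpage + pagecount, len(pages))
    for pagenumber in range(startpage, stop):
        for textbox in pages[pagenumber]:
            yield pagenumber, textbox


# Hypothetical pages, each a list of textbox placeholders.
pages = [["a", "b"], ["c"], ["d", "e"], ["f"]]
stream = list(textboxes(pages, startpage=1, pagecount=2))
```

Because it is a generator, the counting functions can consume the stream in a single pass without building an intermediate list.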

count_horizontal_margins(startpage, pagecount)[source]¶

Return a dict of Counter objects for all the horizontally oriented textbox properties (number of textboxes starting/ending at different positions).

The set of counters is determined by setup_horizontal_counters.

setup_horizontal_counters()[source]¶: Create initial set of horizontal counters.

count_horizontal_textbox(pagenumber, textbox, counters)[source]¶: Add a single textbox to the set of horizontal counters.
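The setup/count split can be sketched as follows. The counter keys reuse the documented metric names, which is an assumption about the real counters. The stand-alone functions, their signatures, and the one-based page numbers are likewise hypothetical.

```python
from collections import Counter


def setup_counters(twopage=True):
    """Create one empty Counter per horizontal property to be tracked."""
    keys = ["leftmargin", "rightmargin"]
    if twopage:
        keys += ["leftmargin_even", "rightmargin_even"]
    return {key: Counter() for key in keys}


def count_textbox(pagenumber, left, right, counters, twopage=True):
    """Record where a single textbox starts and ends."""
    suffix = "_even" if twopage and pagenumber % 2 == 0 else ""
    counters["leftmargin" + suffix][left] += 1
    counters["rightmargin" + suffix][right] += 1


counters = setup_counters()
# Hypothetical (pagenumber, left, right) data for four pages.
for pagenumber, left, right in [(1, 72, 520), (2, 90, 540),
                                (3, 72, 520), (4, 90, 540)]:
    count_textbox(pagenumber, left, right, counters)
```

Splitting setup from counting lets a subclass add a counter in one place and fill it in the other without touching the loop that walks the textboxes.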

count_vertical_margins(startpage, pagecount)[source]¶

setup_vertical_counters()[source]¶

count_vertical_textbox(pagenumber, textbox, counters)[source]¶

count_styles(startpage, pagecount)[source]¶

count_styles_textbox(pagenumber, textbox, counter)[source]¶

analyze_vertical_margins(vcounters)[source]¶

analyze_horizontal_margins(vcounters)[source]¶

filterdict(counter, filter_func=None)[source]¶

findmargin(counter, trunc_func=<built-in function round>, quantize=False)[source]¶

fontsize_key(fonttuple)[source]¶

fontdict(fonttuple)[source]¶

analyze_styles(styles)[source]¶

drawboxes(outfilename, gluefunc=None, startpage=0, pagecount=None, counters=None, metrics=None)[source]¶: Create a copy of the parsed PDF file, but with the textboxes created by gluefunc clearly marked, and metrics shown on the page.

Note

This requires PyPDF2 and reportlab, which aren’t installed by default. Reportlab (3.*) only works on py27+ and py33+

plot(filename, margincounters, stylecounters, metrics)[source]¶

plot_margins(subplots, margin_counters, metrics, pagewidth, pageheight)[source]¶

plot_styles(plot, stylecounter, metrics, plt)[source]¶
