The DocumentEntry class

class ferenda.DocumentEntry(path=None)

This class has two primary uses – it is used to represent and store aspects of the downloading of each document (when it was initially downloaded, optionally updated, and last checked, as well as the URL from which it was downloaded). It’s also used by the news_* methods to encapsulate various aspects of a document entry in an atom feed. Some properties and methods are used by both of these use cases, but not all.

Parameters: path (str) – If this file path is an existing JSON file, the object is initialized from that file.

orig_created = None

The first time we fetched the document from its original location.

id = None

The canonical URI for the document.

basefile = None

The basefile for the document.

orig_updated = None

The last time the content at the original location of the document was changed.

orig_checked = None

The last time we accessed the original location of this document, regardless of whether this led to an update.

orig_url = None

The main URL from which we fetched this document.
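
The relationship between orig_created, orig_updated, orig_checked and orig_url can be illustrated with a minimal sketch. It does not use ferenda itself: the record_check function and the SimpleNamespace stand-in are hypothetical, and the update rules are an assumption based on the attribute descriptions above.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def record_check(entry, url, content_changed, now):
    """Update the download timestamps on a stand-in entry (hypothetical helper)."""
    entry.orig_url = url
    if entry.orig_created is None:
        # First fetch of this document from its original location.
        entry.orig_created = now
    elif content_changed:
        # The content at the original location differs from the last fetch.
        entry.orig_updated = now
    # Recorded on every access, whether or not anything changed.
    entry.orig_checked = now
    return entry


entry = SimpleNamespace(orig_created=None, orig_updated=None,
                        orig_checked=None, orig_url=None)
t0 = datetime(2024, 1, 1, 12, 0)
record_check(entry, "http://example.org/doc/1", content_changed=True, now=t0)
record_check(entry, "http://example.org/doc/1", content_changed=False,
             now=t0 + timedelta(days=1))
```

After the second call, orig_checked has moved forward by a day while orig_created still holds the time of the first fetch and orig_updated remains unset.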

indexed_ts = None

The last time the metadata was indexed in a triplestore.

indexed_dep = None

The last time the dependent files of the document were indexed.

indexed_ft = None

The last time the document was indexed in a fulltext index.

published = None

The date our parsed/processed version of the document was published.

updated = None

The last time our parsed/processed version changed in any way (due to the original content being updated, or due to changes in our parsing functionality).

title = None

A title/label for the document, as used in an Atom feed.

summary = None

A summary of the document, as used in an Atom feed.

url = None

The URL to the browser-ready version of the page, equivalent to what generated_url() returns.

content = None

A dict that represents metadata about the document file.

link = None

A dict that represents metadata about the document RDF metadata (such as its URI, length, MIME-type and MD5 hash).

save(path=None)

Saves the state of the DocumentEntry object to a JSON file at path. If path is not provided, uses the path that the object was initialized with.
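
The load-on-init and save round trip can be sketched with the standard library alone. The EntrySketch class below is a hypothetical stand-in, not ferenda's implementation: it stores only two fields, and serializing timestamps as ISO 8601 strings is an assumption made for the example.

```python
import json
import os
import tempfile
from datetime import datetime


class EntrySketch:
    """Hypothetical stand-in for illustration; not ferenda's DocumentEntry."""

    def __init__(self, path=None):
        self._path = path
        self.title = None
        self.orig_checked = None
        # Initialize from the file only if it already exists.
        if path and os.path.exists(path):
            with open(path, encoding="utf-8") as fp:
                data = json.load(fp)
            self.title = data.get("title")
            checked = data.get("orig_checked")
            if checked is not None:
                self.orig_checked = datetime.fromisoformat(checked)

    def save(self, path=None):
        # Fall back to the path the object was initialized with.
        path = path or self._path
        if path is None:
            raise ValueError("no path given and none remembered from init")
        checked = self.orig_checked.isoformat() if self.orig_checked else None
        data = {"title": self.title, "orig_checked": checked}
        with open(path, "w", encoding="utf-8") as fp:
            json.dump(data, fp, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "entry.json")
    entry = EntrySketch(path)      # file does not exist yet: empty entry
    entry.title = "Example"
    entry.orig_checked = datetime(2024, 1, 1, 12, 0)
    entry.save()                   # no path given: reuses the init path
    reloaded = EntrySketch(path)   # file exists now: state is restored
```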

set_content(filename, url, mimetype=None, inline=False)

Sets the content property and calculates the MD5 hash for the file.

Parameters:
  • filename – The full path to the document file
  • url – The full external URL that will be used to get the same document file
  • mimetype – The MIME-type used in the atom feed. If not provided, guess from file extension.
  • inline – whether to inline the document content in the file or refer to url
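
As a rough illustration of what such a call might compute, the sketch below builds a metadata dict for a file. The describe_file function and the key names (type, hash, src, markup) are assumptions made for this example and are not confirmed by the documentation above.

```python
import hashlib
import mimetypes
import os
import tempfile


def describe_file(filename, url, mimetype=None, inline=False):
    """Build a metadata dict for a file (hypothetical helper and key names)."""
    if mimetype is None:
        # Fall back to guessing from the file extension.
        mimetype = mimetypes.guess_type(filename)[0]
    with open(filename, "rb") as fp:
        raw = fp.read()
    info = {"type": mimetype, "hash": "md5:" + hashlib.md5(raw).hexdigest()}
    if inline:
        # Embed the document content itself.
        info["markup"] = raw.decode("utf-8")
    else:
        # Refer to the external URL instead.
        info["src"] = url
    return info


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "123.html")
    with open(sample, "w", encoding="utf-8") as fp:
        fp.write("<p>Hello</p>")
    linked = describe_file(sample, "http://example.org/doc/123.html")
    inlined = describe_file(sample, "http://example.org/doc/123.html",
                            inline=True)
```

With inline left at its default, the dict refers to the URL; with inline=True, it carries the file content instead.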

set_link(filename, url, mimetype=None)

Sets the link property and calculates the MD5 hash for the RDF metadata.

Parameters:
  • filename – The full path to the RDF file for a document
  • url – The full external URL that will be used to get the same RDF file
  • mimetype – The MIME-type used in the atom feed. If not provided, guess from file extension.

calculate_md5(filename)

Given a filename, return the md5 value for the file’s content.
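
A common way to compute such a checksum with the standard library is shown below. This is a sketch of the general technique, not ferenda's own code; the md5_of_file name and the chunk_size parameter are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(filename, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(filename, "rb") as fp:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as fp:
        fp.write(b"hello")
    checksum = md5_of_file(sample)
```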

guess_type(filename)

Given a filename, return a MIME-type based on the file extension.
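
Extension-based guessing of this kind can be sketched with Python's mimetypes module. Whether ferenda uses that module or its own mapping is not stated here, so the guess_mime_type helper and its default fallback are assumptions.

```python
import mimetypes


def guess_mime_type(filename, default="application/octet-stream"):
    """Guess a MIME type from the file extension (hypothetical helper)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    # guess_type returns None for the type when the extension is unknown.
    return mime_type or default


html_type = guess_mime_type("doc/123.html")
pdf_type = guess_mime_type("doc/123.pdf")
unknown_type = guess_mime_type("doc/123.zzzunknown")
```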