The DocumentStore class

class ferenda.DocumentStore(datadir, storage_policy='file', compression=None)[source]

Unifies handling of reading and writing of various data files during the download, parse and generate stages.

Parameters:
  • datadir (str) – The root directory (including docrepo path segment) where files are stored.
  • storage_policy (str) – Some repositories have documents in several formats, documents split among several files, or documents with embedded resources. If storage_policy is set to dir, each document gets its own directory (the default filename being index + suffix); otherwise each document is stored as a file in a directory shared with other documents. Affects path() (and therefore all other *_path methods).
  • compression (str) – Which compression method to use when storing files. Can be None (no compression), "gz", "bz2", "xz" or True (select best compression method, currently xz). NB: This only affects intermediate_path() and open_intermediate().
downloaded_suffixes = ['.html']
intermediate_suffixes = ['.xml']
invalid_suffixes = ['.invalid']
compression = None
resourcepath(resourcename)[source]
open_resource(resourcename, mode='r')[source]
path(basefile, maindir, suffix, version=None, attachment=None, storage_policy=None)[source]

Calculate a full filesystem path for the given parameters.

Parameters:
  • basefile (str) – The basefile of the resource we’re calculating a filename for
  • maindir (str) – The stage of processing, e.g. downloaded or parsed
  • suffix – Appropriate file suffix, e.g. .txt or .pdf
  • version (str) – Optional. The archived version id
  • attachment (str) – Optional. Any associated file needed by the main file.
  • storage_policy – Optional. Used to override storage_policy if needed

Note

This is a generic method with many parameters. To keep your code tidy and loosely coupled to the actual storage policy, use methods like downloaded_path() or parsed_path() when possible.

Example:

>>> d = DocumentStore(datadir="/tmp/base")
>>> realsep = os.sep
>>> os.sep = "/"
>>> d.path('123/a', 'parsed', '.xhtml') == '/tmp/base/parsed/123/a.xhtml'
True
>>> d.storage_policy = "dir"
>>> d.path('123/a', 'parsed', '.xhtml') == '/tmp/base/parsed/123/a/index.xhtml'
True
>>> d.path('123/a', 'downloaded', None, 'r4711', 'appendix.txt') == '/tmp/base/archive/downloaded/123/a/r4711/appendix.txt'
True
>>> os.sep = realsep
Parameters:
  • basefile (str) – The basefile for which to calculate the path
  • maindir – The processing stage directory (normally downloaded, parsed, or generated)
  • suffix (str) – The file extension including period (i.e. .txt, not txt)
  • version (str) – Optional, the archived version id
  • attachment (str) – Optional. Any associated file needed by the main file. Requires that storage_policy is set to dir. suffix is ignored if this parameter is used.
Returns:

The full filesystem path

Return type:

str
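
To make the effect of storage_policy, version and attachment more concrete, here is a minimal sketch that reproduces the three documented results above. The function sketch_path is a hypothetical stand-in, not ferenda's implementation: it always joins with "/", skips the encoding done by basefile_to_pathfrag(), and its layout for archived versions under the file policy is an assumption.

```python
import posixpath


def sketch_path(datadir, basefile, maindir, suffix,
                version=None, attachment=None, storage_policy="file"):
    """Hypothetical stand-in for DocumentStore.path(); not ferenda's code."""
    if version:
        # Archived versions live under <datadir>/archive/<maindir>/...
        segments = [datadir, "archive", maindir, basefile, version]
    else:
        segments = [datadir, maindir, basefile]
    if storage_policy == "dir":
        # One directory per document: the main file is index + suffix,
        # and an attachment name replaces the file name entirely.
        segments.append(attachment if attachment else "index" + suffix)
    else:
        if attachment:
            raise ValueError("attachments require storage_policy='dir'")
        # One file per document (assumed layout for the versioned case).
        segments[-1] += suffix
    return posixpath.join(*segments)


print(sketch_path("/tmp/base", "123/a", "parsed", ".xhtml"))
print(sketch_path("/tmp/base", "123/a", "parsed", ".xhtml",
                  storage_policy="dir"))
print(sketch_path("/tmp/base", "123/a", "downloaded", None,
                  version="r4711", attachment="appendix.txt",
                  storage_policy="dir"))
```

The sketch raises ValueError for an attachment under the file policy because the parameter description states that attachments require storage_policy to be dir; how the real method reacts in that case is not documented here.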

open(basefile, maindir, suffix, mode='r', version=None, attachment=None, compression=None)[source]

Context manager that opens files for reading or writing. The parameters are the same as for path(), and the note on path() applies here as well: use open_downloaded(), open_parsed() and similar methods if possible.

Example:

>>> store = DocumentStore(datadir="/tmp/base")
>>> with store.open('123/a', 'parsed', '.xhtml', mode="w") as fp:
...     res = fp.write("hello world")
>>> os.path.exists("/tmp/base/parsed/123/a.xhtml")
True
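
The example above writes to a nested path without creating any directories first. A minimal sketch of a context manager with that behaviour follows; sketch_open is a hypothetical helper, and the assumption that missing parent directories are created on write is inferred from the example rather than stated by the documentation.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def sketch_open(path, mode="r"):
    """Hypothetical helper: open path, creating parent dirs when writing."""
    if "w" in mode or "a" in mode:
        # Assumption: missing parent directories are created on demand
        os.makedirs(os.path.dirname(path), exist_ok=True)
    fp = open(path, mode)
    try:
        yield fp
    finally:
        # Always close the file, even if the with-block raised
        fp.close()


root = tempfile.mkdtemp()
target = os.path.join(root, "parsed", "123", "a.xhtml")
with sketch_open(target, mode="w") as fp:
    fp.write("hello world")
with sketch_open(target) as fp:
    content = fp.read()
print(content)
```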
needed(basefile, action)[source]

Determine if we really need to perform action for the given basefile, or if the result of the action (in the form of the file that the action creates, or similar) is newer than all of the action's dependencies (in the form of source files for the action).
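
The check described above boils down to comparing modification times. The following sketch illustrates the idea with two hypothetical helpers, outfile_is_newer and sketch_needed; which files count as sources and results for each action is decided by the library and is not modelled here.

```python
import os
import tempfile


def outfile_is_newer(sources, outfile):
    """Return True if outfile exists and is newer than every source file."""
    if not os.path.exists(outfile):
        return False
    out_mtime = os.stat(outfile).st_mtime
    return all(os.stat(src).st_mtime < out_mtime
               for src in sources if os.path.exists(src))


def sketch_needed(sources, outfile):
    """Hypothetical stand-in for needed(): True if the action should run."""
    return not outfile_is_newer(sources, outfile)


root = tempfile.mkdtemp()
source = os.path.join(root, "a.html")
result = os.path.join(root, "a.xhtml")

open(source, "w").close()
missing_result = sketch_needed([source], result)  # no result file yet

open(result, "w").close()
os.utime(source, (1000, 1000))  # source last modified early
os.utime(result, (2000, 2000))  # result written afterwards
up_to_date = sketch_needed([source], result)

os.utime(source, (3000, 3000))  # source changed after the result
stale = sketch_needed([source], result)

print(missing_result, up_to_date, stale)
```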

list_basefiles_for(action, basedir=None, force=True)[source]

Get all available basefiles that can be used for the specified action.

Parameters:
  • action (str) – The action for which to get available basefiles (parse, relate, generate or news)
  • basedir (str) – The base directory in which to search for available files. If not provided, defaults to self.datadir.
Returns:

All available basefiles

Return type:

generator
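
As an illustration of how basefiles can be derived from stored files, the sketch below walks a stage directory and yields one basefile per matching file. sketch_list_basefiles is a hypothetical helper; it assumes the file storage policy, basefiles that need no decoding by pathfrag_to_basefile(), and that the relevant input for the action lives in a single directory such as downloaded.

```python
import os
import tempfile


def sketch_list_basefiles(stage_dir, suffix):
    """Yield a basefile for every file under stage_dir ending in suffix."""
    for dirpath, dirnames, filenames in os.walk(stage_dir):
        dirnames.sort()  # make the traversal order deterministic
        for filename in sorted(filenames):
            if filename.endswith(suffix):
                full = os.path.join(dirpath, filename)
                relative = os.path.relpath(full, stage_dir)
                # Drop the suffix and normalise separators to "/"
                yield relative[:-len(suffix)].replace(os.sep, "/")


root = tempfile.mkdtemp()
for name in ("123/a.html", "123/b.html", "124/a.html", "124/notes.txt"):
    path = os.path.join(root, "downloaded", *name.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

basefiles = list(
    sketch_list_basefiles(os.path.join(root, "downloaded"), ".html"))
print(basefiles)
```

Like the documented method, the helper is a generator, so callers can start processing before the whole tree has been scanned.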

list_versions(basefile, action=None)[source]

Get all archived versions of a given basefile.

Parameters:
  • basefile (str) – The basefile to list archived versions for
  • action (str) – The type of file to look for (either downloaded, parsed or generated). If None, look for all types.
Returns:

All available versions for that basefile

Return type:

generator

list_attachments(basefile, action, version=None)[source]

Get all attachments for a basefile in a specified state.

Parameters:
  • action (str) – The state (type of file) to look for (either downloaded, parsed or generated). If None, look for all types.
  • basefile (str) – The basefile to list attachments for
  • version (str) – The version of the basefile to list attachments for. If None, list attachments for the current version.
Returns:

All available attachments for the basefile

Return type:

generator

basefile_to_pathfrag(basefile)[source]

Given a basefile, returns a string that can safely be used as a fragment of the path for any representation of that file. The default implementation recognizes a number of characters that are unsafe to use in file names and replaces them with HTTP percent-style encoding.

Example:

>>> d = DocumentStore("/tmp")
>>> realsep = os.sep
>>> os.sep = "/"
>>> d.basefile_to_pathfrag('1998:204') == '1998/%3A204'
True
>>> os.sep = realsep

If you wish to override how document files are stored in directories, you can override this method, but you should make sure to also override pathfrag_to_basefile() to work as the inverse of this method.

Parameters: basefile (str) – The basefile to encode
Returns: The encoded path fragment
Return type: str
pathfrag_to_basefile(pathfrag)[source]

Does the inverse of basefile_to_pathfrag(), that is, converts a fragment of a file path into the corresponding basefile.

Parameters: pathfrag (str) – The path fragment to decode
Returns: The resulting basefile
Return type: str
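
If you override the encoding, the two methods must remain exact inverses. The sketch below shows one encoding that reproduces the documented result for '1998:204' and round-trips cleanly. Both functions are hypothetical stand-ins built on urllib.parse; the set of characters the real default implementation treats as unsafe is an assumption, and "/" is used instead of os.sep for brevity.

```python
from urllib.parse import quote, unquote


def sketch_basefile_to_pathfrag(basefile):
    """Percent-encode unsafe characters, starting a new segment at each."""
    # quote() leaves "/" untouched here; every escape gets a leading "/"
    return quote(basefile, safe="/").replace("%", "/%")


def sketch_pathfrag_to_basefile(pathfrag):
    """Inverse of sketch_basefile_to_pathfrag()."""
    # Undo the added "/" before each escape, then decode the escapes
    return unquote(pathfrag.replace("/%", "%"))


frag = sketch_basefile_to_pathfrag("1998:204")
print(frag)                               # 1998/%3A204
print(sketch_pathfrag_to_basefile(frag))  # 1998:204
```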
archive(basefile, version)[source]

Moves the current version of a document to an archive. All files related to the document are moved (downloaded, parsed, generated files and any existing attachment files).

Parameters:
  • basefile (str) – The basefile of the document to archive
  • version (str) – The version id to archive under
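
A rough sketch of the move, assuming storage_policy is dir so that each document is one directory per stage, might look like this. sketch_archive is a hypothetical helper; the target layout follows the archive path shown in the path() example, and the list of stages is taken from the description above.

```python
import os
import shutil
import tempfile


def sketch_archive(datadir, basefile, version,
                   stages=("downloaded", "parsed", "generated")):
    """Move <datadir>/<stage>/<basefile>/ into the archive for each stage."""
    moved = []
    parts = basefile.split("/")
    for stage in stages:
        current = os.path.join(datadir, stage, *parts)
        if not os.path.isdir(current):
            continue  # nothing stored for this stage
        target = os.path.join(datadir, "archive", stage, *parts, version)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # Moves the directory with the main file and all attachments
        shutil.move(current, target)
        moved.append(target)
    return moved


datadir = tempfile.mkdtemp()
docdir = os.path.join(datadir, "downloaded", "123", "a")
os.makedirs(docdir)
for name in ("index.html", "appendix.txt"):
    open(os.path.join(docdir, name), "w").close()

moved = sketch_archive(datadir, "123/a", "r4711")
print(moved)
```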
downloaded_path(basefile, version=None, attachment=None)[source]

Get the full path for the downloaded file for the given basefile (and optionally archived version and/or attachment filename).

Parameters:
  • basefile (str) – The basefile for which to calculate the path
  • version (str) – Optional. The archived version id
  • attachment (str) – Optional. Any associated file needed by the main file.
Returns:

The full filesystem path

Return type:

str

open_downloaded(basefile, mode='r', version=None, attachment=None)[source]

Opens files for reading or writing, cf. open(). The parameters are the same as for downloaded_path().

documententry_path(basefile, version=None)[source]

Get the full path for the documententry JSON file for the given basefile (and optionally archived version).

Parameters:
  • basefile (str) – The basefile for which to calculate the path
  • version (str) – Optional. The archived version id
Returns:

The full filesystem path

Return type:

str

intermediate_path(basefile, version=None, attachment=None, suffix=None)[source]

Get the full path for the main intermediate file for the given basefile (and optionally archived version).

Parameters:
  • basefile (str) – The basefile for which to calculate the path
  • version (str) – Optional. The archived version id
  • attachment – Optional. Any associated file created or retained in the intermediate step
Returns:

The full filesystem path

Return type:

str

open_intermediate(basefile, mode='r', version=None, attachment=None, suffix=None)[source]

Opens files for reading or writing, cf. open(). The parameters are the same as for intermediate_path().
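
Since the compression setting only affects intermediate files, this is where it becomes visible. The sketch below maps the documented compression values to the matching standard-library openers. sketch_open_intermediate is a hypothetical helper, and appending the method name to the file name (for example a.xml.xz) is an assumption about the naming scheme.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Maps the documented compression values to standard-library openers
OPENERS = {None: open, "gz": gzip.open, "bz2": bz2.open, "xz": lzma.open}


def sketch_open_intermediate(path, mode="r", compression=None):
    """Hypothetical helper: open path as text, optionally compressed."""
    if compression is True:
        compression = "xz"  # documented as the current "best" method
    if compression not in OPENERS:
        raise ValueError("unknown compression method: %r" % (compression,))
    if compression is not None:
        path += "." + compression  # assumed naming scheme
        if "b" not in mode and "t" not in mode:
            mode += "t"  # the compressed openers default to binary mode
    return OPENERS[compression](path, mode)


root = tempfile.mkdtemp()
path = os.path.join(root, "a.xml")
with sketch_open_intermediate(path, "w", compression=True) as fp:
    fp.write("<doc/>")
with sketch_open_intermediate(path, "r", compression="xz") as fp:
    roundtrip = fp.read()
print(roundtrip)
```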

parsed_path(basefile, version=None, attachment=None)[source]

Get the full path for the parsed XHTML file for the given basefile.

Parameters:
  • basefile (str) – The basefile for which to calculate the path
  • version (str) – Optional. The archived version id
  • attachment (str) – Optional. Any associated file needed by the main file (created by parse())
Returns:

The full filesystem path

Return type:

str

open_parsed(basefile, mode='r', version=None, attachment=None)[source]

Opens files for reading or writing, cf. open(). The parameters are the same as for parsed_path().

serialized_path(basefile, version=None, attachment=None)[source]

Get the full path for the serialized JSON file for the given basefile.

Parameters:
  • basefile (str) – The basefile for which to calculate the path
  • version (str) – Optional. The archived version id
Returns:

The full filesystem path

Return type:

str

open_serialized(basefile, mode='r', version=None)[source]

Opens files for reading or writing, cf. open(). The parameters are the same as for serialized_path().

distilled_path(basefile, version=None)[source]

Get the full path for the distilled RDF/XML file for the given basefile.

Parameters:
  • basefile (str) – The basefile for which to calculate the path
  • version (str) – Optional. The archived version id
Returns:

The full filesystem path

Return type:

str

open_distilled(basefile, mode='r', version=None)[source]

Opens files for reading or writing, cf. open(). The parameters are the same as for distilled_path().

generated_path(basefile, version=None, attachment=None)[source]

Get the full path for the generated file for the given basefile (and optionally archived version and/or attachment filename).

Parameters:
  • basefile (str) – The basefile for which to calculate the path
  • version (str) – Optional. The archived version id
  • attachment (str) – Optional. Any associated file needed by the main file.
Returns:

The full filesystem path

Return type:

str

open_generated(basefile, mode='r', version=None, attachment=None)[source]

Opens files for reading or writing, cf. open(). The parameters are the same as for generated_path().

annotation_path(basefile, version=None)[source]

Get the full path for the annotation file for the given basefile (and optionally archived version).

Parameters:
  • basefile (str) – The basefile for which to calculate the path
  • version (str) – Optional. The archived version id
Returns:

The full filesystem path

Return type:

str

open_annotation(basefile, mode='r', version=None)[source]

Opens files for reading or writing, cf. open(). The parameters are the same as for annotation_path().

dependencies_path(basefile)[source]

Get the full path for the dependency file for the given basefile.

Parameters: basefile (str) – The basefile for which to calculate the path
Returns: The full filesystem path
Return type: str
open_dependencies(basefile, mode='r')[source]

Opens files for reading or writing, cf. open(). The parameters are the same as for dependencies_path().

atom_path(basefile)[source]

Get the full path for the atom file for the given basefile.

Note

This is used by ferenda.DocumentRepository.news() and does not really operate on “real” basefiles. It might be removed. You probably shouldn’t use it unless you override news().

Parameters: basefile (str) – The basefile for which to calculate the path
Returns: The full filesystem path
Return type: str