The DocumentStore class
-
class ferenda.DocumentStore(datadir, storage_policy='file', compression=None)
Unifies handling of reading and writing of various data files during the download, parse and generate stages.
Parameters:
- datadir (str) – The root directory (including docrepo path segment) where files are stored.
- storage_policy (str) – Some repositories have documents in several formats, documents split amongst several files, or embedded resources. If storage_policy is set to dir, each document gets its own directory (the default filename being index + suffix); otherwise each document is stored as a file in a directory shared with other files. Affects path() (and therefore all other *_path methods).
- compression (str) – Which compression method to use when storing files. Can be None (no compression), "gz", "bz2", "xz" or True (select the best compression method, currently xz). NB: This only affects intermediate_path() and open_intermediate().
-
downloaded_suffixes = ['.html']
-
intermediate_suffixes = ['.xml']
-
invalid_suffixes = ['.invalid']
-
compression = None
-
path(basefile, maindir, suffix, version=None, attachment=None, storage_policy=None)
Calculate a full filesystem path for the given parameters.
Note
This is a generic method with many parameters. To keep your code tidy and loosely coupled to the actual storage policy, use methods like downloaded_path() or parsed_path() when possible.
Example:
>>> d = DocumentStore(datadir="/tmp/base")
>>> realsep = os.sep
>>> os.sep = "/"
>>> d.path('123/a', 'parsed', '.xhtml') == '/tmp/base/parsed/123/a.xhtml'
True
>>> d.storage_policy = "dir"
>>> d.path('123/a', 'parsed', '.xhtml') == '/tmp/base/parsed/123/a/index.xhtml'
True
>>> d.path('123/a', 'downloaded', None, 'r4711', 'appendix.txt') == '/tmp/base/archive/downloaded/123/a/r4711/appendix.txt'
True
>>> os.sep = realsep
Parameters:
- basefile (str) – The basefile for which to calculate the path
- maindir (str) – The processing stage directory (normally downloaded, parsed, or generated)
- suffix (str) – The file extension including period (i.e. .txt, not txt)
- version (str) – Optional. The archived version id
- attachment (str) – Optional. Any associated file needed by the main file. Requires that storage_policy is set to dir. suffix is ignored if this parameter is used.
- storage_policy – Optional. Used to override storage_policy if needed
Returns: The full filesystem path
Return type:
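The layout rules described for path() can be illustrated with a small standalone sketch. The function sketch_path below is hypothetical and is not ferenda's implementation; it only reproduces the three results shown in the doctest, assumes basefiles that need no escaping, and guesses at the file-policy layout for archived versions, which the text does not show.

```python
def sketch_path(datadir, basefile, maindir, suffix, version=None,
                attachment=None, storage_policy="file"):
    """Hypothetical stand-in that mimics the layout shown in the doctest."""
    if version:
        # Archived versions live under <datadir>/archive/<maindir>/<basefile>/<version>
        segments = [datadir, "archive", maindir, basefile, version]
    else:
        segments = [datadir, maindir, basefile]

    if storage_policy == "dir":
        # One directory per document: an attachment keeps its own name,
        # otherwise the main file is called index + suffix.
        segments.append(attachment if attachment else "index" + suffix)
    else:
        if attachment is not None:
            # The text states that attachments require the "dir" policy.
            raise ValueError("attachments require storage_policy='dir'")
        # Assumption: with the "file" policy the suffix is appended to the
        # last segment, also for archived versions.
        segments[-1] += suffix
    return "/".join(segments)


print(sketch_path("/tmp/base", "123/a", "parsed", ".xhtml"))
print(sketch_path("/tmp/base", "123/a", "parsed", ".xhtml", storage_policy="dir"))
print(sketch_path("/tmp/base", "123/a", "downloaded", None, "r4711",
                  "appendix.txt", storage_policy="dir"))
```

The separator is hard-coded to "/" here, which mirrors the doctest's temporary override of os.sep.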
-
open(basefile, maindir, suffix, mode='r', version=None, attachment=None, compression=None)
Context manager that opens files for reading or writing. The parameters are the same as for path(), and the note is applicable here as well: use open_downloaded(), open_parsed() et al. if possible.
Example:
>>> store = DocumentStore(datadir="/tmp/base")
>>> with store.open('123/a', 'parsed', '.xhtml', mode="w") as fp:
...     res = fp.write("hello world")
>>> os.path.exists("/tmp/base/parsed/123/a.xhtml")
True
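The doctest above writes to a path whose parent directories may not exist yet. A minimal sketch of a context manager with that behaviour, using only the standard library, might look as follows. The name sketch_open is hypothetical, and whether ferenda's open() does anything beyond this is not stated in the text.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def sketch_open(path, mode="r"):
    """Hypothetical helper: open path, creating parent directories when writing."""
    parent = os.path.dirname(path)
    if parent and ("w" in mode or "a" in mode):
        os.makedirs(parent, exist_ok=True)
    fp = open(path, mode)
    try:
        yield fp
    finally:
        # Always close the file, even if the with-body raises.
        fp.close()


# Usage: write a file two levels below a fresh directory, then read it back.
with tempfile.TemporaryDirectory() as datadir:
    target = os.path.join(datadir, "parsed", "123", "a.xhtml")
    with sketch_open(target, mode="w") as fp:
        fp.write("hello world")
    with sketch_open(target) as fp:
        content = fp.read()
```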
-
needed(basefile, action)
Determine whether action really needs to be performed for the given basefile. It does not if the result of the action (the file that the action creates, or similar) is newer than all of the action's dependencies (the source files for the action).
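The freshness rule described for needed() can be sketched as a comparison of file modification times. The helper action_needed below is hypothetical and takes explicit paths rather than a basefile and an action; it also assumes that every dependency exists on disk.

```python
import os
import tempfile


def action_needed(outfile, dependencies):
    """Return True if outfile is missing or older than any dependency."""
    if not os.path.exists(outfile):
        return True
    out_mtime = os.path.getmtime(outfile)
    # The action is needed as soon as one source file is newer than the result.
    return any(os.path.getmtime(dep) > out_mtime for dep in dependencies)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "a.html")
    out = os.path.join(tmp, "a.xhtml")
    open(src, "w").close()
    missing_result = action_needed(out, [src])  # no result file yet

    open(out, "w").close()
    os.utime(src, (1000, 1000))  # source older than result
    os.utime(out, (2000, 2000))
    fresh_result = action_needed(out, [src])

    os.utime(src, (3000, 3000))  # source touched after the result was made
    stale_result = action_needed(out, [src])
```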
-
list_basefiles_for(action, basedir=None, force=True)
Get all available basefiles that can be used for the specified action.
Parameters:
Returns: All available basefiles
Return type: generator
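As an illustration of how such a generator could work under the file storage policy, the sketch below walks a stage directory and yields one basefile per matching file. The function sketch_list_basefiles is hypothetical; it ignores the action, basedir and force parameters and does not decode escaped path fragments.

```python
import os
import tempfile


def sketch_list_basefiles(directory, suffix):
    """Yield a basefile for every file under directory ending in suffix."""
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # sort in place so the traversal order is deterministic
        for name in sorted(files):
            if name.endswith(suffix):
                full = os.path.join(root, name[:-len(suffix)])
                # Basefiles use "/" regardless of the platform separator.
                yield os.path.relpath(full, directory).replace(os.sep, "/")


with tempfile.TemporaryDirectory() as datadir:
    downloaded = os.path.join(datadir, "downloaded")
    for rel in ("123/a.html", "123/b.html", "124/c.txt"):
        target = os.path.join(downloaded, *rel.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "w").close()
    found = list(sketch_list_basefiles(downloaded, ".html"))
```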
-
list_versions(basefile, action=None)
Get all archived versions of a given basefile.
Parameters:
Returns: All available versions for that basefile
Return type: generator
-
list_attachments(basefile, action, version=None)
Get all attachments for a basefile in a specified state.
Parameters:
Returns: All available attachments for the basefile
Return type: generator
-
basefile_to_pathfrag(basefile)
Given a basefile, returns a string that can safely be used as a fragment of the path for any representation of that file. The default implementation recognizes a number of characters that are unsafe to use in file names and replaces them with HTTP percent-style encoding.
Example:
>>> d = DocumentStore("/tmp")
>>> realsep = os.sep
>>> os.sep = "/"
>>> d.basefile_to_pathfrag('1998:204') == '1998/%3A204'
True
>>> os.sep = realsep
If you wish to change how document files are stored in directories, you can override this method, but make sure to also override pathfrag_to_basefile() so that it works as the inverse of this method.
Parameters: basefile (str) – The basefile to encode
Returns: The encoded path fragment
Return type: str
-
pathfrag_to_basefile(pathfrag)
Does the inverse of basefile_to_pathfrag(), that is, converts a fragment of a file path into the corresponding basefile.
Parameters: pathfrag (str) – The path fragment to decode
Returns: The resulting basefile
Return type: str
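A pair of functions that round-trip in the way these two methods are required to can be sketched with urllib.parse. The names below are hypothetical, and the set of characters treated as safe is an assumption; the only behaviour taken from the text is the doctest result, where the escape for ":" starts a new path segment.

```python
from urllib.parse import quote, unquote

SEP = "/"  # assumption: POSIX-style separator, as in the doctest


def sketch_basefile_to_pathfrag(basefile):
    """Percent-encode unsafe characters and start a new segment at each escape."""
    encoded = quote(basefile, safe="/;@&=+,")  # the safe set is an assumption
    return encoded.replace("%", SEP + "%")


def sketch_pathfrag_to_basefile(pathfrag):
    """Inverse of sketch_basefile_to_pathfrag()."""
    return unquote(pathfrag.replace(SEP + "%", "%"))


frag = sketch_basefile_to_pathfrag("1998:204")
restored = sketch_pathfrag_to_basefile(frag)
```

Overriding one of the two without the other would break this round trip, which is why the text asks for both to be overridden together.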
-
archive(basefile, version)
Moves the current version of a document to an archive. All files related to the document are moved (downloaded, parsed and generated files, and any existing attachment files).
Parameters:
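The following sketch shows one way such a move could be done for a store that uses the dir storage policy. The function sketch_archive and the STAGES tuple are hypothetical; the target layout is taken from the archive path in the path() doctest, and everything else is an assumption rather than ferenda's actual behaviour.

```python
import os
import shutil
import tempfile

STAGES = ("downloaded", "parsed", "generated")  # stages named in the text


def sketch_archive(datadir, basefile, version, stages=STAGES):
    """Move each stage directory of basefile into the archive tree."""
    parts = basefile.split("/")
    moved = []
    for stage in stages:
        src = os.path.join(datadir, stage, *parts)
        if not os.path.isdir(src):
            continue  # nothing stored for this stage
        # Target: <datadir>/archive/<stage>/<basefile>/<version>
        dest = os.path.join(datadir, "archive", stage, *parts, version)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(src, dest)  # main file and attachments move together
        moved.append(dest)
    return moved


with tempfile.TemporaryDirectory() as datadir:
    docdir = os.path.join(datadir, "downloaded", "123", "a")
    os.makedirs(docdir)
    for name in ("index.html", "appendix.txt"):
        with open(os.path.join(docdir, name), "w") as fp:
            fp.write("content")
    moved = sketch_archive(datadir, "123/a", "r4711")
    archived = os.path.join(datadir, "archive", "downloaded",
                            "123", "a", "r4711", "appendix.txt")
    attachment_archived = os.path.exists(archived)
    current_removed = not os.path.exists(docdir)
    moved_count = len(moved)
```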
-
downloaded_path(basefile, version=None, attachment=None)
Get the full path for the downloaded file for the given basefile (and optionally archived version and/or attachment filename).
Parameters:
Returns: The full filesystem path
Return type:
-
open_downloaded(basefile, mode='r', version=None, attachment=None)
Opens files for reading and writing, cf. open(). The parameters are the same as for downloaded_path().
-
documententry_path(basefile, version=None)
Get the full path for the documententry JSON file for the given basefile (and optionally archived version).
Parameters:
Returns: The full filesystem path
Return type:
-
intermediate_path(basefile, version=None, attachment=None, suffix=None)
Get the full path for the main intermediate file for the given basefile (and optionally archived version).
Parameters:
Returns: The full filesystem path
Return type:
-
open_intermediate(basefile, mode='r', version=None, attachment=None, suffix=None)
Opens files for reading and writing, cf. open(). The parameters are the same as for intermediate_path().
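Since the compression setting only affects the intermediate files, a sketch of how the documented values (None, "gz", "bz2", "xz", True) could map onto the standard library may be useful here. The helper sketch_open_compressed is hypothetical, and appending the method name as an extra file suffix is an assumption, not something the text states.

```python
import bz2
import gzip
import lzma
import os
import tempfile

OPENERS = {None: open, "gz": gzip.open, "bz2": bz2.open, "xz": lzma.open}
SUFFIXES = {None: "", "gz": ".gz", "bz2": ".bz2", "xz": ".xz"}  # assumed naming


def sketch_open_compressed(path, mode="r", compression=None):
    """Open path in text mode using the requested compression method."""
    if compression is True:
        compression = "xz"  # "best" method according to the parameter description
    opener = OPENERS[compression]
    # Text mode ("rt"/"wt") lets callers read and write str for every method.
    return opener(path + SUFFIXES[compression], mode + "t", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    roundtrips = {}
    for method in (None, "gz", "bz2", "xz", True):
        base = os.path.join(tmp, "doc-%s.xml" % method)
        with sketch_open_compressed(base, "w", method) as fp:
            fp.write("<doc/>")
        with sketch_open_compressed(base, "r", method) as fp:
            roundtrips[method] = fp.read()
```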
-
parsed_path(basefile, version=None, attachment=None)
Get the full path for the parsed XHTML file for the given basefile.
Parameters:
Returns: The full filesystem path
Return type:
-
open_parsed(basefile, mode='r', version=None, attachment=None)
Opens files for reading and writing, cf. open(). The parameters are the same as for parsed_path().
-
serialized_path(basefile, version=None, attachment=None)
Get the full path for the serialized JSON file for the given basefile.
Parameters:
Returns: The full filesystem path
Return type:
-
open_serialized(basefile, mode='r', version=None)
Opens files for reading and writing, cf. open(). The parameters are the same as for serialized_path().
-
distilled_path(basefile, version=None)
Get the full path for the distilled RDF/XML file for the given basefile.
Parameters:
Returns: The full filesystem path
Return type:
-
open_distilled(basefile, mode='r', version=None)
Opens files for reading and writing, cf. open(). The parameters are the same as for distilled_path().
-
generated_path(basefile, version=None, attachment=None)
Get the full path for the generated file for the given basefile (and optionally archived version and/or attachment filename).
Parameters:
Returns: The full filesystem path
Return type:
-
open_generated(basefile, mode='r', version=None, attachment=None)
Opens files for reading and writing, cf. open(). The parameters are the same as for generated_path().
-
annotation_path(basefile, version=None)
Get the full path for the annotation file for the given basefile (and optionally archived version).
Parameters:
Returns: The full filesystem path
Return type:
-
open_annotation(basefile, mode='r', version=None)
Opens files for reading and writing, cf. open(). The parameters are the same as for annotation_path().
-
dependencies_path(basefile)
Get the full path for the dependency file for the given basefile.
Parameters: basefile (str) – The basefile for which to calculate the path
Returns: The full filesystem path
Return type: str
-
open_dependencies(basefile, mode='r')
Opens files for reading and writing, cf. open(). The parameters are the same as for dependencies_path().
-
atom_path(basefile)
Get the full path for the atom file for the given basefile.
Note
This is used by ferenda.DocumentRepository.news() and does not really operate on “real” basefiles. It might be removed. You probably shouldn’t use it unless you override news().
Parameters: basefile (str) – The basefile for which to calculate the path
Returns: The full filesystem path
Return type: str