The util module

General library of small utility functions.

class ferenda.util.gYearMonth[source]
class ferenda.util.gYear[source]
class ferenda.util.TopCounter(**kwds)[source]

A mapping of well-known prefixes and their corresponding namespaces. Includes dc, dcterms, rdfs, rdf, skos, xsd, foaf, owl, xhv, prov and bibo.


Like os.makedirs(), but doesn’t raise an exception if the directory already exists.


Given a filename (typically one that you wish to create), ensures that the directory the file is in actually exists.
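As a rough illustration of the two directory helpers described above, the following sketch uses only the standard library. The name `ensure_parent_dir` is hypothetical and is not part of ferenda; the real functions may differ in detail.

```python
import os
import tempfile


def ensure_parent_dir(filename):
    """Create the directory that filename lives in, if it is missing.

    Hypothetical helper, written only to illustrate the behaviour described.
    """
    dirname = os.path.dirname(filename)
    if dirname:
        # exist_ok=True makes this a no-op when the directory already exists
        os.makedirs(dirname, exist_ok=True)


base = tempfile.mkdtemp()
target = os.path.join(base, "a", "b", "report.txt")
ensure_parent_dir(target)
ensure_parent_dir(target)  # calling it a second time does not raise
```

Only the directory is created; the file itself is left for the caller to write.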

ferenda.util.robust_rename(old, new)[source]

Rename old to new no matter what: if new already exists, it is removed first, and if the target directory doesn’t exist, it is created.
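The behaviour described for robust_rename could be approximated as follows. This is a minimal sketch with a hypothetical function name, assuming that an existing target is a regular file; it is not the library's actual implementation.

```python
import os
import shutil
import tempfile


def robust_rename_sketch(old, new):
    """Move old to new, replacing new and creating its directory as needed."""
    os.makedirs(os.path.dirname(os.path.abspath(new)), exist_ok=True)
    if os.path.exists(new):
        os.unlink(new)  # assumption: an existing target is a file, not a directory
    shutil.move(old, new)


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "draft.txt")
dst = os.path.join(workdir, "archive", "final.txt")  # "archive" does not exist yet
with open(src, "w") as fp:
    fp.write("content")
robust_rename_sketch(src, dst)
```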


Removes the path no matter what (unlike os.unlink(), does not raise an error if the file does not exist). If the path is a directory, the entire directory is removed.
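A removal helper with these semantics might look like the sketch below. The function name is hypothetical, and the handling of symlinks is an assumption rather than something the description above specifies.

```python
import os
import shutil
import tempfile


def remove_quietly(path):
    """Remove a file or a whole directory tree; do nothing if path is missing."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)
    elif os.path.lexists(path):
        os.unlink(path)
    # a path that does not exist is silently ignored


scratch = tempfile.mkdtemp()
note = os.path.join(scratch, "note.txt")
with open(note, "w") as fp:
    fp.write("temporary")

remove_quietly(note)
remove_quietly(note)     # second call: the file is already gone, no error
remove_quietly(scratch)  # directories are removed recursively
```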


Returns the name of the opened file held by fp, which can be either a regular file or a BZ2File.

ferenda.util.relurl(url, starturl)[source]

Works like os.path.relpath(), but for URLs.

>>> relurl("", "") == '../other/index.html'
>>> relurl("", "") == ''
ferenda.util.numcmp(x, y)[source]

Works like cmp in Python 2, but compares two strings using a ‘natural sort’ order, i.e. “2” < “10”. Also handles strings that contain a mixture of numbers and letters, i.e. “2” < “2 a”.

Return negative if x<y, zero if x==y, positive if x>y.

>>> numcmp("10", "2")
>>> numcmp("2", "2 a")
>>> numcmp("3", "2 a")

Converts a string into a list of alternating strings and integers. This makes it possible to sort a list of strings numerically even though they might not be fully convertible to integers.

>>> split_numalpha('10 a §') == ['', 10, ' a §']
>>> split_numalpha("squared²") == ["squared²"]
>>> sorted(['2 §', '10 §', '1 §'], key=split_numalpha) == ['1 §', '2 §', '10 §']
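To make the natural-sort idea concrete, here is a small sketch of both the splitting step and a cmp-style comparison built on it. The `_sketch` names are hypothetical, the regular expression is an assumption, and the exact return values of the real functions may differ (for instance in how trailing empty strings are handled).

```python
import re


def split_numalpha_sketch(s):
    """Split s into alternating parts: str at even indexes, int at odd indexes."""
    # A capturing group makes re.split keep the digit runs in the result.
    parts = re.split(r"([0-9]+)", s)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def numcmp_sketch(x, y):
    """Return -1, 0 or 1 by comparing the split forms of x and y."""
    kx, ky = split_numalpha_sketch(x), split_numalpha_sketch(y)
    return (kx > ky) - (kx < ky)


ordered = sorted(["2 §", "10 §", "1 §"], key=split_numalpha_sketch)
```

Because strings always sit at even positions and integers at odd positions, two split lists can be compared element by element without mixing types.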
ferenda.util.runcmd(cmdline, require_success=False, cwd=None, cmdline_encoding=None, output_encoding='utf-8')[source]

Run a shell command, wait for it to finish and return the results.

  • cmdline (str) – The full command line (will be passed through a shell)
  • require_success (bool) – If the command fails (non-zero exit code), raise ExternalCommandError
  • cwd – The working directory for the process to run in

The returncode, all stdout output, all stderr output

Return type:



Normalize all whitespace in string so that only a single space between words is ever used, and that the string neither starts with nor ends with whitespace.

>>> normalize_space(" This is  a long \n string\n") == 'This is a long string'
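One common way to get this behaviour in plain Python is to split on whitespace and rejoin with single spaces. The sketch below uses a hypothetical name and is not necessarily how the library does it.

```python
def normalize_space_sketch(string):
    """Collapse every run of whitespace to one space and trim both ends."""
    # str.split() without arguments splits on any whitespace run and
    # discards leading and trailing whitespace.
    return " ".join(string.split())


cleaned = normalize_space_sketch(" This is  a long \n string\n")
```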
ferenda.util.list_dirs(d, suffix=None, reverse=False)[source]

A generator that works much like os.listdir(), only recursively (and only returns files, not directories).

  • d (str) – The directory to start in
  • suffix (str or list) – Only return files with the given suffix
  • reverse – Returns result sorted in reverse alphabetic order

the full path (starting from d) of each matching file

Return type:


ferenda.util.replace_if_different(src, dst, archivefile=None)[source]

Like shutil.move(), except the src file isn’t moved if the dst file already exists and is identical to src. Also doesn’t require that the directory of dst exists beforehand.

Note: regardless of whether it was moved or not, src is always deleted.

  • src (str) – The source file to move
  • dst (str) – The destination file

True if src was copied to dst, False otherwise

Return type:


ferenda.util.copy_if_different(src, dest)[source]

Like shutil.copyfile(), except the src file isn’t copied if the dest file already exists and is identical to src. Also doesn’t require that the directory of dest exists beforehand.

  • src (str) – The source file to copy
  • dest (str) – The destination file

True if src was copied to dest, False otherwise
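The copy-only-when-different pattern can be sketched with filecmp and shutil from the standard library. The function name below is hypothetical, and the byte-by-byte comparison is an assumption about what "identical" means here.

```python
import filecmp
import os
import shutil
import tempfile


def copy_if_different_sketch(src, dest):
    """Copy src to dest unless dest exists with identical content."""
    if os.path.exists(dest) and filecmp.cmp(src, dest, shallow=False):
        return False  # nothing to do, dest is left untouched
    dest_dir = os.path.dirname(dest)
    if dest_dir:
        os.makedirs(dest_dir, exist_ok=True)
    shutil.copyfile(src, dest)
    return True


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.txt")
dest = os.path.join(workdir, "out", "copy.txt")
with open(src, "w") as fp:
    fp.write("payload")

first_copy = copy_if_different_sketch(src, dest)   # dest missing, so it is copied
second_copy = copy_if_different_sketch(src, dest)  # identical, so it is skipped
```

Skipping identical copies keeps the modification time of dest unchanged, which matters for timestamp-based checks such as outfile_is_newer.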
class ferenda.util.OutfileIsNotNewer[source]
ferenda.util.outfile_is_newer(infiles, outfile)[source]

Check if a given outfile is newer than all of the given files in the infiles list.

Newer is defined as having more recent modification time. Returns True if so, a falsey value otherwise (including if outfile doesn’t exist).

If the outfile isn’t newer, the value returned will evaluate to False in a bool context, but will also have a reason attribute containing a text description of which file in infiles was newer than outfile.
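The "falsy value that carries a reason" idea can be illustrated as follows. The class `NotNewer` is a hypothetical stand-in for OutfileIsNotNewer, and treating equal modification times as "newer" is an assumption of this sketch.

```python
import os
import tempfile


class NotNewer:
    """Falsy result object that still explains why the check failed."""

    def __init__(self, reason):
        self.reason = reason

    def __bool__(self):
        return False


def outfile_is_newer_sketch(infiles, outfile):
    if not os.path.exists(outfile):
        return NotNewer("%s does not exist" % outfile)
    out_mtime = os.stat(outfile).st_mtime
    for infile in infiles:
        if os.stat(infile).st_mtime > out_mtime:
            return NotNewer("%s is newer than %s" % (infile, outfile))
    return True


workdir = tempfile.mkdtemp()
infile = os.path.join(workdir, "in.txt")
outfile = os.path.join(workdir, "out.txt")
for path in (infile, outfile):
    with open(path, "w") as fp:
        fp.write("x")

os.utime(infile, (1000, 1000))   # set explicit modification times
os.utime(outfile, (2000, 2000))
fresh = outfile_is_newer_sketch([infile], outfile)

os.utime(infile, (3000, 3000))   # now the input is more recent
stale = outfile_is_newer_sketch([infile], outfile)
```

A caller can write `if not result: log(result.reason)` and get both the boolean answer and the explanation from one return value.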

Create a symlink at dst pointing back to src on systems that support it. On other systems (i.e. Windows), copy src to dst (using copy_if_different())


Returns string with first character uppercased but otherwise unchanged.

>>> ucfirst("iPhone") == 'IPhone'
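The difference from the built-in str.capitalize() is that the rest of the string is left alone. A minimal sketch, with a hypothetical function name:

```python
def ucfirst_sketch(string):
    """Uppercase only the first character; leave the rest untouched."""
    # Slicing instead of indexing keeps this safe for the empty string.
    return string[:1].upper() + string[1:]


kept = ucfirst_sketch("iPhone")
lowered = "iPhone".capitalize()  # the built-in lowercases the remainder
```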

Converts a datetime object to an RFC 3339-style date.

>>> rfc_3339_timestamp(datetime.datetime(2013, 7, 2, 21, 20, 25)) == '2013-07-02T21:20:25-00:00'

Converts an RFC 822-type date string (more or less the same as an HTTP-date) to a naive datetime expressed in UTC.

>>> parse_rfc822_date("Mon, 4 Aug 1997 02:14:00 EST")
datetime.datetime(1997, 8, 4, 7, 14)
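A similar conversion can be done with the standard library's email.utils module. The sketch below uses a hypothetical function name and assumes the input string carries a time zone; it is not necessarily how the library implements the conversion.

```python
import datetime
from email.utils import parsedate_to_datetime


def parse_rfc822_date_sketch(httpdate):
    """Parse an RFC 822-style date and return a naive datetime in UTC."""
    # Assumption: the string includes a zone (e.g. "EST" or "+0200"), so the
    # parsed value is timezone-aware before it is converted.
    aware = parsedate_to_datetime(httpdate)
    return aware.astimezone(datetime.timezone.utc).replace(tzinfo=None)


parsed = parse_rfc822_date_sketch("Mon, 4 Aug 1997 02:14:00 EST")
```

EST is five hours behind UTC, which is why 02:14 local time becomes 07:14.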
ferenda.util.strptime(datestr, format)[source]

Like datetime.strptime, but guaranteed to not be affected by current system locale – all datetime parsing is done using the C locale.

>>> strptime("Mon, 4 Aug 1997 02:14:05", "%a, %d %b %Y %H:%M:%S")
datetime.datetime(1997, 8, 4, 2, 14, 5)
ferenda.util.readfile(filename, mode='r', encoding='utf-8')[source]

Opens filename, reads its contents and returns them as a string.

ferenda.util.writefile(filename, contents, encoding='utf-8')[source]

Create filename and write contents to it.

ferenda.util.extract_text(html, start, end, decode_entities=True, strip_tags=True)[source]

Given html, a string of HTML content, and two substrings (start and end) present in this string, return all text between the substrings, optionally decoding any HTML entities and removing HTML tags.

>>> extract_text("<body><div><b>Hello</b> <i>World</i>&trade;</div></body>",
...              "<div>", "</div>") == 'Hello World™'
>>> extract_text("<body><div><b>Hello</b> <i>World</i>&trade;</div></body>",
...              "<div>", "</div>", decode_entities=False) == 'Hello World&trade;'
>>> extract_text("<body><div><b>Hello</b> <i>World</i>&trade;</div></body>",
...              "<div>", "</div>", strip_tags=False) == '<b>Hello</b> <i>World</i>™'
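The three doctest cases above can be reproduced with a short standard-library sketch. The function name is hypothetical and the tag-stripping regular expression is a deliberate simplification that would not cope with every kind of HTML.

```python
import html as html_module
import re


def extract_text_sketch(html, start, end, decode_entities=True, strip_tags=True):
    """Return the text between the first start marker and the following end marker."""
    begin = html.index(start) + len(start)
    finish = html.index(end, begin)
    text = html[begin:finish]
    if strip_tags:
        # Simplification: anything between < and > is treated as a tag.
        text = re.sub(r"<[^>]*>", "", text)
    if decode_entities:
        text = html_module.unescape(text)
    return text


doc = "<body><div><b>Hello</b> <i>World</i>&trade;</div></body>"
plain = extract_text_sketch(doc, "<div>", "</div>")
raw_entities = extract_text_sketch(doc, "<div>", "</div>", decode_entities=False)
with_tags = extract_text_sketch(doc, "<div>", "</div>", strip_tags=False)
```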
ferenda.util.merge_dict_recursive(base, other)[source]

Merges the other dict into the base dict. If any value in other is itself a dict and the base also has a dict for the same key, merge these sub-dicts (and so on, recursively).

>>> base = {'a': 1, 'b': {'c': 3}}
>>> other = {'x': 4, 'b': {'y': 5}}
>>> want = {'a': 1, 'x': 4, 'b': {'c': 3, 'y': 5}}
>>> got = merge_dict_recursive(base, other)
>>> got == want
>>> base == want
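The doctest above implies that base is modified in place and also returned. A recursive merge with that behaviour could be sketched like this (hypothetical function name):

```python
def merge_dict_recursive_sketch(base, other):
    """Merge other into base in place, descending into nested dicts."""
    for key, value in other.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            # Both sides hold a dict for this key: merge them instead of replacing.
            merge_dict_recursive_sketch(base[key], value)
        else:
            base[key] = value
    return base


base = {"a": 1, "b": {"c": 3}}
other = {"x": 4, "b": {"y": 5}}
got = merge_dict_recursive_sketch(base, other)
```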
ferenda.util.resource_extract(resourceloader, name, outfile, params)[source]

Extract a resource from a configured ResourceLoader and perform variable substitutions on the contents of the resource.

  • resourceloader – A ResourceLoader instance
  • name – The named resource (eg ‘sparql/annotations.rq’)
  • outfile – Path to extract the resource to
  • params – A dict of parameters, to be used with regular string substitutions in the resource file.

Get the “leaf” - fragment id or last segment - of a URI. Useful e.g. for getting a term from a “namespace like” URI.

>>> uri_leaf("") == 'title'
>>> uri_leaf("") == 'Concept'
>>> uri_leaf("") # returns None
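As an illustration of the "leaf" notion, the sketch below prefers a fragment id and falls back to the last path segment. The function name is hypothetical, and the example URIs are chosen here for illustration only; they are not taken from the examples above.

```python
def uri_leaf_sketch(uri):
    """Return the fragment id or last path segment of uri, or None if empty."""
    for separator in ("#", "/"):
        if separator in uri:
            leaf = uri.rsplit(separator, 1)[1]
            return leaf or None  # an empty leaf is reported as None
    return None


# Illustrative URIs (assumptions, not from the original examples)
from_path = uri_leaf_sketch("http://purl.org/dc/terms/title")
from_fragment = uri_leaf_sketch("http://www.w3.org/2004/02/skos/core#Concept")
no_leaf = uri_leaf_sketch("http://www.w3.org/2004/02/skos/core#")
```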
ferenda.util.logtime(method, format='The operation took %(elapsed).3f sec', values={})[source]

A context manager that uses the supplied method and format string to log the elapsed time:

with util.logtime(log.debug,
                  "Basefile %(basefile)s took %(elapsed).3f s",
                  {'basefile': 'foo'}):
    ...  # the operation being timed

This results in a call like log.debug("Basefile foo took 1.324 s").
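A timing context manager of this kind can be sketched with contextlib. The name `logtime_sketch` is hypothetical, and logging even when the block raises is an assumption of this sketch rather than documented behaviour.

```python
import contextlib
import time


@contextlib.contextmanager
def logtime_sketch(method, format="The operation took %(elapsed).3f sec", values=None):
    """Call method(format % values) with the elapsed time once the block ends."""
    values = dict(values or {})  # copy, so the caller's dict is not modified
    start = time.perf_counter()
    try:
        yield
    finally:
        values["elapsed"] = time.perf_counter() - start
        method(format % values)


messages = []  # stands in for a logger method such as log.debug
with logtime_sketch(messages.append,
                    "Basefile %(basefile)s took %(elapsed).3f s",
                    {"basefile": "foo"}):
    sum(range(1000))
```

Passing the logging method as a callable lets the caller pick the log level without the helper knowing anything about loggers.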

ferenda.util.switch_locale(newlocale='C', category=2)[source]

Temporarily change the process locale to the C locale, for use when e.g. parsing English dates on a system that may have a non-English locale.

>>> with switch_locale():
...     datetime.datetime.strptime("August 2013", "%B %Y")
datetime.datetime(2013, 8, 1, 0, 0)
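The save, switch and restore pattern behind such a context manager might look like the following sketch. The function name is hypothetical; it relies on locale.setlocale() returning the current setting when called without a new value.

```python
import contextlib
import datetime
import locale


@contextlib.contextmanager
def switch_locale_sketch(newlocale="C", category=locale.LC_TIME):
    """Switch the given locale category for the duration of the with block."""
    previous = locale.setlocale(category)  # query only, nothing is changed
    locale.setlocale(category, newlocale)
    try:
        yield
    finally:
        locale.setlocale(category, previous)  # restore even if the block raised


before = locale.setlocale(locale.LC_TIME)
with switch_locale_sketch():
    parsed = datetime.datetime.strptime("August 2013", "%B %Y")
after = locale.setlocale(locale.LC_TIME)
```

Note that the locale is process-wide state, so a switch like this affects all threads while the block is running.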

Convert a Roman numeral to an integer.

>>> from_roman("MCMLXXXIV")
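Roman numerals can be converted by adding each symbol's value, except that a symbol standing before a larger one is subtracted. The sketch below uses a hypothetical name and does no validation of malformed numerals.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def from_roman_sketch(s):
    """Convert a Roman numeral string to an int (no validation of the input)."""
    values = [ROMAN_VALUES[ch] for ch in s.upper()]
    total = 0
    for i, value in enumerate(values):
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value  # subtractive pair such as CM or IV
        else:
            total += value
    return total


year = from_roman_sketch("MCMLXXXIV")
```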
ferenda.util.to_roman(i, lower=False)[source]
ferenda.util.increment(s, amount=1)[source]

Increment a number, regardless of whether it’s an Arabic number (int) or a Roman numeral (str).


Transform a document title into a key useful for sorting and partitioning documents.

>>> title_sortkey("The 'viewstate' property") == 'viewstateproperty'
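One way to arrive at the key shown in the example is to lowercase the title, drop a leading article and remove everything that is not a word character. These steps are assumptions inferred from the single example above, and the function name is hypothetical.

```python
import re


def title_sortkey_sketch(title):
    """Build a compact, case-insensitive sort key from a document title."""
    key = title.lower()
    if key.startswith("the "):
        key = key[len("the "):]  # assumption: a leading "the" is ignored
    return re.sub(r"\W+", "", key)  # drop spaces, quotes and punctuation


key = title_sortkey_sketch("The 'viewstate' property")
```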
ferenda.util.parseresults_as_xml(parseres, depth=0)[source]

Inspect the stack and return the location of the error (and, if that is in the standard library or third-party code, the ferenda or project code line that called into it).

ferenda.util.robust_fetch(method, url, logger, attempts=5, sleep=1, raise_for_status=True, *args, **kwargs)[source]
ferenda.util.cluster(iterable, maxgap=None, maxgap_ratio=10, remove_outliers=True)[source]