Citation parsing

In many cases, the text in the body of a document contains references (citations) to other documents in the same or related document collections. A good implementation of a document repository needs to find and express these references. In Ferenda, references are expressed as basic hyperlinks that use the rel attribute to specify the sort of relationship that the reference expresses. The process of citation parsing consists of analysing the raw text, finding references within that text, constructing sensible URIs for each reference, and formatting these as <a href="..." rel="...">[citation]</a> style links.
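The four steps can be illustrated with a minimal standard-library sketch. This is not how Ferenda implements citation parsing; the names RFC_PATTERN, RFC_URI and link_citations are made up for this illustration, and only one kind of citation ("RFC" followed by a number) is recognized.

```python
import re
from html import escape

# Hypothetical pattern and URI template, for illustration only.
RFC_PATTERN = re.compile(r"RFC (\d+)")
RFC_URI = "http://www.ietf.org/rfc/rfc%s.txt"


def link_citations(text, rel="dcterms:references"):
    """Return text with every 'RFC <number>' wrapped in an <a> element."""
    parts = []
    pos = 0
    # Steps 1 and 2: analyse the raw text and find the references in it
    for match in RFC_PATTERN.finditer(text):
        # Plain text between citations is escaped and kept unlinked
        parts.append(escape(text[pos:match.start()]))
        # Step 3: construct a URI from the properties of the citation
        uri = RFC_URI % match.group(1)
        # Step 4: format the citation as a link with a rel attribute
        parts.append('<a href="%s" rel="%s">%s</a>'
                     % (escape(uri), escape(rel), escape(match.group(0))))
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


print(link_citations("See RFC 2068 for details."))
```

A real document repository has to handle many citation styles at once, which is why Ferenda separates the patterns from the URI formatting functions.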

Since standards for expressing references / citations are very diverse, Ferenda requires that the docrepo programmer specifies the basic rules of how to recognize a reference, and how to put together the properties from a reference (such as year of publication, or page) into a URI.

The built-in solution

Ferenda uses the Pyparsing library to find and process citations. As an example, we’ll specify citation patterns and URI formats for references that occur in RFC documents. These are primarily of three different kinds (examples come from RFC 2616):

  1. URL references, eg “GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.1”
  2. IETF document references, eg “STD 3”, “BCP 14” and “RFC 2068”
  3. Internal endnote references, eg “[47]” and “[33]”

We’d like to make sure that any URL reference gets turned into a link to that same URL, that any IETF document reference gets turned into the canonical URI for that document, and that internal endnote references get turned into document-relative links, eg “#endnote-47” and “#endnote-33”. (This requires that other parts of the parse() process have created IDs for these in doc.body, which we assume has been done.)

Turning URL references in plain text into real links is so common that Ferenda has built-in support for this. The support comes in two parts: first, running a parser that detects URLs in the textual content, and second, for each match, running a URL formatter on the parse result.

At the end of your parse() method, do the following:

from ferenda import CitationParser
from ferenda import URIFormatter
import ferenda.citationpatterns
import ferenda.uriformats

# CitationParser is initialized with a list of pyparsing
# ParserElements (or any other object that has a scanString method
# that returns a generator of (tokens, start, end) tuples, where start
# and end are integer string indices and tokens are dict-like
# objects)
citparser = CitationParser(ferenda.citationpatterns.url)

# URIFormatter is initialized with a list of tuples, where each
# tuple is a string (identifying a named ParseResult) and a function
# (that takes as a single argument a dict-like object and returns a
# URI string, possibly relative)
citparser.set_formatter(URIFormatter(("url", ferenda.uriformats.url)))

citparser.parse_recursive(doc.body)

The parse_recursive() method takes any tree of document elements and modifies it in place, turning any references it finds into proper Link objects.
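The comments in the example above state that CitationParser accepts any object with a scanString method that yields (tokens, start, end) tuples. The following is a rough sketch of the shape of that protocol, built on the standard re module. RegexScanner is a hypothetical name and is not part of Ferenda or pyparsing. The sketch has not been verified against Ferenda, and a real replacement may need more, for example a tokens object that carries a result name so that URIFormatter can select the right formatting function.

```python
import re


class RegexScanner:
    """Hypothetical regex-backed stand-in for a pyparsing ParserElement."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern)

    def scanString(self, text):
        # Yield one (tokens, start, end) tuple per match in the text
        for match in self.regex.finditer(text):
            tokens = match.groupdict()  # named groups as a plain dict
            yield tokens, match.start(), match.end()


endnote = RegexScanner(r"\[(?P<EndnoteID>\d+)\]")
for tokens, start, end in endnote.scanString("cf. [47] and [33]"):
    print(tokens, start, end)
```

The start and end values are ordinary string indices, so the matched text can be recovered with text[start:end].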

Extending the built-in support

Building your own citation patterns and URI formats is fairly simple. First, specify your patterns in the form of pyparsing parse expressions, and make sure that both the expression as a whole and any individual significant properties are named by calling .setResultsName().

Then, create a set of formatting functions that take the named properties from the parse expressions above and use them to create a URI.

Finally, initialize a CitationParser object from your parse expressions and a URIFormatter object that maps named parse expressions to their corresponding URI formatting function, and call parse_recursive():

from pyparsing import Word, nums

from ferenda import CitationParser
from ferenda import URIFormatter
import ferenda.citationpatterns
import ferenda.uriformats

# Create two ParserElements for IETF document references and internal
# references
rfc_citation = "RFC" + Word(nums).setResultsName("RFCRef")
bcp_citation = "BCP" + Word(nums).setResultsName("BCPRef")
std_citation = "STD" + Word(nums).setResultsName("STDRef")
ietf_doc_citation = (rfc_citation | bcp_citation | std_citation).setResultsName("IETFRef")

endnote_citation = ("[" + Word(nums).setResultsName("EndnoteID") + "]").setResultsName("EndnoteRef")

# Create a URI formatter for IETF documents (the URI formatter for
# endnotes is so simple that we just use a lambda function below)
def rfc_uri_formatter(parts):
    # parts is a dict-like object created from the named result parts
    # of our grammar, eg those ParserElement for which we've called
    # .setResultsName(), in this case eg. {'RFCRef':'2068'}

    # NOTE: If your document collection contains documents of this
    # type and you're republishing them, feel free to change these
    # URIs to URIs under your control,
    # eg. "http://mynetstandards.org/rfc/%(RFCRef)s/" and so on
    if 'RFCRef' in parts:
        return "http://www.ietf.org/rfc/rfc%(RFCRef)s.txt" % parts
    elif 'BCPRef' in parts:
        return "http://tools.ietf.org/rfc/bcp/bcp%(BCPRef)s.txt" % parts
    elif 'STDRef' in parts:
        return "http://rfc-editor.org/std/std%(STDRef)s.txt" % parts
    else:
        return None

# CitationParser is initialized with a list of pyparsing
# ParserElements (or any other object that has a scanString method
# that returns a generator of (tokens, start, end) tuples, where start
# and end are integer string indices and tokens are dict-like
# objects)
citparser = CitationParser(ferenda.citationpatterns.url,
                           ietf_doc_citation,
                           endnote_citation)

# URIFormatter is initialized with a list of tuples, where each
# tuple is a string (identifying a named ParseResult) and a function
# (that takes as a single argument a dict-like object and returns a
# URI string, possibly relative)
citparser.set_formatter(URIFormatter(("url", ferenda.uriformats.url),
                                      ("IETFRef", rfc_uri_formatter),
                                      ("EndnoteRef", lambda d: "#endnote-%(EndnoteID)s" % d)))

citparser.parse_recursive(doc.body)

This turns this document:

<body xmlns="http://www.w3.org/1999/xhtml" about="http://example.org/doc/">
  <h1>Main document</h1>
  <p>A naked URL: http://www.w3.org/pub/WWW/TheProject.html.</p>
  <p>Some IETF document references: See STD 3, BCP 14 and RFC 2068.</p>
  <p>An internal endnote reference: ...relevance ranking, cf. [47]</p>
  <h2>References</h2>
  <p id="endnote-47">47: Malmgren, Towards a theory of jurisprudential
    ranking</p>
</body>

into this document:

<body xmlns="http://www.w3.org/1999/xhtml" about="http://example.org/doc/">
  <h1>Main document</h1>
  <p>
    A naked URL: <a href="http://www.w3.org/pub/WWW/TheProject.html"
       rel="dcterms:references">http://www.w3.org/pub/WWW/TheProject.html</a>.
  </p>
  <p>
    Some IETF document references: See 
    <a href="http://rfc-editor.org/std/std3.txt"
       rel="dcterms:references">STD 3</a>,
    <a href="http://tools.ietf.org/rfc/bcp/bcp14.txt"
       rel="dcterms:references">BCP 14</a> and
    <a href="http://www.ietf.org/rfc/rfc2068.txt"
       rel="dcterms:references">RFC 2068</a>.
  </p>
  <p>
    An internal endnote reference: ...relevance ranking, cf.
    <a href="#endnote-47"
       rel="dcterms:references">[47]</a>
  </p>
  <h2>References</h2>
  <p id="endnote-47">47: Malmgren, Towards a theory of jurisprudential
    ranking</p>
</body>
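Because the formatting functions only need a dict-like argument, they can be checked in isolation with plain dicts before being wired into a URIFormatter. The sketch below repeats rfc_uri_formatter from the example above and adds endnote_uri_formatter, a hypothetical named equivalent of the lambda. It assumes that the parse results behave like ordinary dicts for key lookup and % formatting.

```python
def rfc_uri_formatter(parts):
    # Same logic as in the example above: pick a URI template
    # depending on which named result is present
    if 'RFCRef' in parts:
        return "http://www.ietf.org/rfc/rfc%(RFCRef)s.txt" % parts
    elif 'BCPRef' in parts:
        return "http://tools.ietf.org/rfc/bcp/bcp%(BCPRef)s.txt" % parts
    elif 'STDRef' in parts:
        return "http://rfc-editor.org/std/std%(STDRef)s.txt" % parts
    else:
        return None


def endnote_uri_formatter(parts):
    # Hypothetical named version of the lambda used for EndnoteRef
    return "#endnote-%(EndnoteID)s" % parts


# Plain dicts stand in for the dict-like parse results
print(rfc_uri_formatter({'RFCRef': '2068'}))
print(rfc_uri_formatter({'STDRef': '3'}))
print(rfc_uri_formatter({'Unknown': '1'}))
print(endnote_uri_formatter({'EndnoteID': '47'}))
```

A formatter that returns None for input it does not recognize, as rfc_uri_formatter does, makes it possible for calling code to leave that text unlinked.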

Rolling your own

For more complicated situations you can skip calling parse_recursive() and instead do your own processing with the optional support of CitationParser.

This is needed in particular for complicated ParserElement objects which may contain several sub-ParserElement objects that need to be turned into individual links. As an example, the text “under Article 56 (2), Article 57 or Article 100a of the Treaty establishing the European Community” may be matched by a single top-level ParseResult (and probably must be, if “Article 56 (2)” is to actually reference article 56(2) in the Treaty), but should be turned into three separate links.

In those cases, iterate through your doc.body yourself, and for each text part do something like the following:

from ferenda import CitationParser, URIFormatter, citationpatterns, uriformats
from ferenda.elements import Link

citparser = CitationParser()
citparser.add_grammar(citationpatterns.url)
formatter = URIFormatter(("url", uriformats.url))

res = []
text = "An example: http://example.org/. That is all."

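for node in citparser.parse_string(text):
    if isinstance(node, str):
        # non-linked text, add and continue
        res.append(node)
    elif isinstance(node, tuple):
        # a (matched text, parse result) pair; use a separate name so
        # that the input variable text is not overwritten
        (linktext, match) = node
        uri = formatter.format(match)
        if uri:
            res.append(Link(uri, linktext, rel="dcterms:references"))
        else:
            # no URI could be constructed, keep the text unlinked
            res.append(linktext)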
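The Treaty example mentioned earlier, where one top-level match must become three links, can be sketched with the standard re module in place of pyparsing. This only illustrates the splitting step and is not Ferenda's approach. The URI template TREATY_URI, both patterns and the split_compound function are assumptions made up for this illustration.

```python
import re

# Hypothetical URI template; a real repository would use its own scheme.
TREATY_URI = "http://example.org/treaty/ec#article-%s"

# One top-level pattern for the whole phrase, so that the articles are
# only matched when they are tied to this particular Treaty ...
COMPOUND = re.compile(
    r"Article \d+[a-z]?(?: \(\d+\))?"
    r"(?:(?:, | or | and )Article \d+[a-z]?(?: \(\d+\))?)*"
    r" of the Treaty establishing the European Community")

# ... and one pattern for each individual article reference inside it.
ARTICLE = re.compile(
    r"Article (?P<article>\d+[a-z]?)(?: \((?P<paragraph>\d+)\))?")


def split_compound(text):
    """Return a (citation text, URI) pair for every article reference."""
    links = []
    for compound in COMPOUND.finditer(text):
        for sub in ARTICLE.finditer(compound.group(0)):
            fragment = sub.group("article")
            if sub.group("paragraph"):
                fragment += "-" + sub.group("paragraph")
            links.append((sub.group(0), TREATY_URI % fragment))
    return links


text = ("under Article 56 (2), Article 57 or Article 100a of the "
        "Treaty establishing the European Community")
for linktext, uri in split_compound(text):
    print(linktext, "->", uri)
```

Each pair can then be turned into its own Link object, in the same way as in the loop above.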