Citation parsing
In many cases, the text in the body of a document contains references (citations) to other documents in the same or related document collections. A good implementation of a document repository needs to find and express these references. In Ferenda, references are expressed as basic hyperlinks that use the rel attribute to specify the sort of relationship that the reference expresses. The process of citation parsing consists of analysing the raw text, finding references within that text, constructing sensible URIs for each reference, and formatting these as <a href="..." rel="...">[citation]</a> style links.
Since standards for expressing references / citations are very diverse, Ferenda requires that the docrepo programmer specifies the basic rules of how to recognize a reference, and how to put together the properties from a reference (such as year of publication, or page) into a URI.
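The four steps above (analyse the text, find references, construct URIs, format links) can be sketched without Ferenda or pyparsing. The following is a minimal standard-library illustration using re, not Ferenda's implementation. The citation style ("Report 2012:45"), the example.org URI layout and the names CITATION, citation_uri and link_citations are all assumptions made up for this sketch.

```python
import re
from html import escape

# Hypothetical citation style: "Report 2012:45" (year, then report number).
CITATION = re.compile(r"Report (?P<year>\d{4}):(?P<number>\d+)")


def citation_uri(parts):
    # Build a URI from the named properties of one reference.
    # The URI layout under example.org is an assumption for this sketch.
    return "http://example.org/report/%(year)s/%(number)s" % parts


def link_citations(text, rel="dcterms:references"):
    """Return text as an HTML fragment with each citation wrapped in a link."""
    out = []
    pos = 0
    for m in CITATION.finditer(text):
        # Plain text between the previous match and this one.
        out.append(escape(text[pos:m.start()]))
        uri = citation_uri(m.groupdict())
        out.append('<a href="%s" rel="%s">%s</a>'
                   % (escape(uri), rel, escape(m.group(0))))
        pos = m.end()
    # Trailing text after the last match.
    out.append(escape(text[pos:]))
    return "".join(out)
```

In Ferenda itself, the recognition rules are pyparsing expressions rather than regular expressions, and the result is a tree of elements rather than an HTML string, but the division of labour between "recognize" and "build a URI from the named properties" is the same.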
The built-in solution
Ferenda uses the Pyparsing library in order to find and process citations. As an example, we’ll specify citation patterns and URI formats for references that occur in RFC documents. These are primarily of three different kinds (examples come from RFC 2616):
- URL references, eg “GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.1”
- IETF document references, eg “STD 3”, “BCP 14” and “RFC 2068”
- Internal endnote references, eg “[47]” and “[33]”
We’d like to make sure that any URL reference gets turned into a link to that same URL, that any IETF document reference gets turned into a link to the canonical URI for that document, and that internal endnote references get turned into document-relative links, eg “#endnote-47” and “#endnote-33”. (This requires that other parts of the parse() process have created IDs for these in doc.body, which we assume has been done.)
Turning URL references in plain text into real links is so common that ferenda has built-in support for this. The support comes in two parts: First running a parser that detects URLs in the textual content, and secondly, for each match, running a URL formatter on the parse result.
At the end of your parse() method, do the following:
from ferenda import CitationParser
from ferenda import URIFormatter
import ferenda.citationpatterns
import ferenda.uriformats
# CitationParser is initialized with a list of pyparsing
# ParserElements (or any other object that has a scanString method
# that returns a generator of (tokens, start, end) tuples, where start
# and end are integer string indices and tokens are dict-like
# objects)
citparser = CitationParser(ferenda.citationpatterns.url)
# URIFormatter is initialized with a list of tuples, where each
# tuple is a string (identifying a named ParseResult) and a function
# (that takes as a single argument a dict-like object and returns a
# URI string, possibly relative)
citparser.set_formatter(URIFormatter(("url", ferenda.uriformats.url)))
citparser.parse_recursive(doc.body)
The parse_recursive() method takes any ferenda.elements document tree and modifies it in place, marking up any references as proper Link objects.
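To make the idea of an in-place, recursive markup pass concrete, here is a small standard-library sketch. It is not how Ferenda implements parse_recursive(). The document tree is modelled as nested Python lists of strings, Link is a stand-in namedtuple rather than ferenda.elements.Link, and scan, split_text and mark_up are hypothetical names. The scan function mimics the (tokens, start, end) shape mentioned in the code comments above, using a regular expression for endnote references.

```python
import re
from collections import namedtuple

# Stand-in for a link element; hypothetical, not ferenda.elements.Link.
Link = namedtuple("Link", "text uri rel")

ENDNOTE = re.compile(r"\[(?P<EndnoteID>\d+)\]")


def scan(text):
    """Yield (tokens, start, end) for each endnote reference in text."""
    for m in ENDNOTE.finditer(text):
        yield m.groupdict(), m.start(), m.end()


def split_text(text):
    """Split text into a list of plain strings and Link objects."""
    parts, pos = [], 0
    for tokens, start, end in scan(text):
        if start > pos:
            parts.append(text[pos:start])
        uri = "#endnote-%(EndnoteID)s" % tokens
        parts.append(Link(text[start:end], uri, "dcterms:references"))
        pos = end
    if pos < len(text):
        parts.append(text[pos:])
    return parts


def mark_up(node):
    """Walk a nested list in place, replacing strings with strings and Links."""
    i = 0
    while i < len(node):
        child = node[i]
        if isinstance(child, list):
            mark_up(child)
            i += 1
        elif isinstance(child, str):
            pieces = split_text(child)
            node[i:i + 1] = pieces  # splice the pieces in where the string was
            i += len(pieces)
        else:
            i += 1  # already a Link (or something else): leave untouched
```

Calling mark_up(body) on a nested list returns nothing; the list itself is changed, which mirrors the in-place behaviour described above.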
Extending the built-in support
Building your own citation patterns and URI formats is fairly simple. First, specify your patterns in the form of pyparsing parse expressions, and make sure that both each expression as a whole and any individual significant properties are named by calling .setResultsName().
Then, create a set of formatting functions that take the named properties from the parse expressions above and use them to create a URI.
Finally, initialize a CitationParser object from your parse expressions and a URIFormatter object that maps named parse expressions to their corresponding URI formatting functions, and call parse_recursive():
from pyparsing import Word, nums
from ferenda import CitationParser
from ferenda import URIFormatter
import ferenda.citationpatterns
import ferenda.uriformats
# Create two ParserElements for IETF document references and internal
# references
rfc_citation = "RFC" + Word(nums).setResultsName("RFCRef")
bcp_citation = "BCP" + Word(nums).setResultsName("BCPRef")
std_citation = "STD" + Word(nums).setResultsName("STDRef")
ietf_doc_citation = (rfc_citation | bcp_citation | std_citation).setResultsName("IETFRef")
endnote_citation = ("[" + Word(nums).setResultsName("EndnoteID") + "]").setResultsName("EndnoteRef")
# Create a URI formatter for IETF documents (the URI formatter for
# endnotes is so simple that we just use a lambda function below)
def rfc_uri_formatter(parts):
    # parts is a dict-like object created from the named result parts
    # of our grammar, eg those ParserElement for which we've called
    # .setResultsName(), in this case eg. {'RFCRef':'2068'}
    # NOTE: If your document collection contains documents of this
    # type and you're republishing them, feel free to change these
    # URIs to URIs under your control,
    # eg. "http://mynetstandards.org/rfc/%(RFCRef)s/" and so on
    if 'RFCRef' in parts:
        return "http://www.ietf.org/rfc/rfc%(RFCRef)s.txt" % parts
    elif 'BCPRef' in parts:
        return "http://tools.ietf.org/rfc/bcp/bcp%(BCPRef)s.txt" % parts
    elif 'STDRef' in parts:
        return "http://rfc-editor.org/std/std%(STDRef)s.txt" % parts
    else:
        return None
# CitationParser is initialized with a list of pyparsing
# ParserElements (or any other object that has a scanString method
# that returns a generator of (tokens, start, end) tuples, where start
# and end are integer string indices and tokens are dict-like
# objects)
citparser = CitationParser(ferenda.citationpatterns.url,
                           ietf_doc_citation,
                           endnote_citation)
# URIFormatter is initialized with a list of tuples, where each
# tuple is a string (identifying a named ParseResult) and a function
# (that takes as a single argument a dict-like object and returns a
# URI string, possibly relative)
citparser.set_formatter(URIFormatter(("url", ferenda.uriformats.url),
                                     ("IETFRef", rfc_uri_formatter),
                                     ("EndnoteRef", lambda d: "#endnote-%(EndnoteID)s" % d)))
citparser.parse_recursive(doc.body)
This turns the following document:
<body xmlns="http://www.w3.org/1999/xhtml" about="http://example.org/doc/">
<h1>Main document</h1>
<p>A naked URL: http://www.w3.org/pub/WWW/TheProject.html.</p>
<p>Some IETF document references: See STD 3, BCP 14 and RFC 2068.</p>
<p>An internal endnote reference: ...relevance ranking, cf. [47]</p>
<h2>References</h2>
<p id="endnote-47">47: Malmgren, Towards a theory of jurisprudential
ranking</p>
</body>
into this document:
<body xmlns="http://www.w3.org/1999/xhtml" about="http://example.org/doc/">
<h1>Main document</h1>
<p>
A naked URL: <a href="http://www.w3.org/pub/WWW/TheProject.html"
rel="dcterms:references"
>http://www.w3.org/pub/WWW/TheProject.html</a>.
</p>
<p>
Some IETF document references: See
<a href="http://rfc-editor.org/std/std3.txt"
rel="dcterms:references">STD 3</a>,
<a href="http://tools.ietf.org/rfc/bcp/bcp14.txt"
rel="dcterms:references">BCP 14</a> and
<a href="http://www.ietf.org/rfc/rfc2068.txt"
rel="dcterms:references">RFC 2068</a>.
</p>
<p>
An internal endnote reference: ...relevance ranking, cf.
<a href="#endnote-47"
rel="dcterms:references">[47]</a>
</p>
<h2>References</h2>
<p id="endnote-47">47: Malmgren, Towards a theory of jurisprudential
ranking</p>
</body>
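Because a URI formatting function only needs a dict-like object as input, it can be written and checked in isolation, before any parser is involved. As one possible variation on the if/elif chain in the example above, the sketch below drives the same three URI templates from a lookup table. The names URI_TEMPLATES and ietf_uri are hypothetical; the templates themselves are copied from the example.

```python
# (result name, URI template) pairs, tried in order. The result names
# match those given to .setResultsName() in the example grammar.
URI_TEMPLATES = [
    ("RFCRef", "http://www.ietf.org/rfc/rfc%(RFCRef)s.txt"),
    ("BCPRef", "http://tools.ietf.org/rfc/bcp/bcp%(BCPRef)s.txt"),
    ("STDRef", "http://rfc-editor.org/std/std%(STDRef)s.txt"),
]


def ietf_uri(parts):
    """Return the URI for the first known result name in parts, else None."""
    for name, template in URI_TEMPLATES:
        if name in parts:
            return template % parts
    return None


# Plain dicts are enough to exercise the formatter by hand.
print(ietf_uri({"RFCRef": "2068"}))
print(ietf_uri({"STDRef": "3"}))
```

A table like this keeps the URI layout in one place, which may be convenient if you republish the documents under URIs of your own.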
Rolling your own
For more complicated situations you can skip calling parse_recursive() and instead do your own processing with the optional support of CitationParser.
This is needed in particular for complicated ParserElement objects which may contain several sub-ParserElement objects that need to be turned into individual links. As an example, the text “under Article 56 (2), Article 57 or Article 100a of the Treaty establishing the European Community” may be matched by a single top-level ParseResult (and probably must be, if “Article 56 (2)” is to actually reference article 56(2) in the Treaty), but should be turned into three separate links.
In those cases, iterate through your doc.body yourself, and for each text part do something like the following:
from ferenda import CitationParser, URIFormatter, citationpatterns, uriformats
from ferenda.elements import Link
citparser = CitationParser()
citparser.add_grammar(citationpatterns.url)
formatter = URIFormatter(("url", uriformats.url))
res = []
text = "An example: http://example.org/. That is all."
for node in citparser.parse_string(text):
    if isinstance(node, str):
        # non-linked text, add and continue
        res.append(node)
    elif isinstance(node, tuple):
        # a (matched text, parse result) pair; use a separate name so
        # that the input variable text is not overwritten
        (linktext, match) = node
        uri = formatter.format(match)
        if uri:
            res.append(Link(uri, linktext, rel="dcterms:references"))
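The Treaty example mentioned earlier in this section (one top-level match, three links) can be illustrated with a standard-library sketch. It uses regular expressions instead of pyparsing and is only meant to show the two-level structure: first match the whole citation, so that the articles are known to belong to the Treaty, then walk the individual article references inside it. The URI layout under example.org and the names ARTICLE_RE, TREATY_RE, article_uri and treaty_links are assumptions for this sketch.

```python
import re

# One article reference, e.g. "Article 56 (2)" or "Article 100a".
ARTICLE = r"Article \d+[a-z]?(?: \(\d+\))?"
ARTICLE_RE = re.compile(
    r"Article (?P<article>\d+[a-z]?)(?: \((?P<paragraph>\d+)\))?")

# Top-level citation: a list of article references, then the treaty name.
TREATY_RE = re.compile(
    "(?P<articles>" + ARTICLE + "(?:(?:, | or | and )" + ARTICLE + ")*)"
    " of the Treaty establishing the European Community")


def article_uri(parts):
    # Hypothetical URI layout; a paragraph number becomes a fragment.
    uri = "http://example.org/treaty/ec/article/%s" % parts["article"]
    if parts["paragraph"]:
        uri += "#p%s" % parts["paragraph"]
    return uri


def treaty_links(text):
    """Return (link text, uri) pairs, one per article in each citation."""
    links = []
    for citation in TREATY_RE.finditer(text):
        # Only articles inside a complete treaty citation are linked.
        for ref in ARTICLE_RE.finditer(citation.group("articles")):
            links.append((ref.group(0), article_uri(ref.groupdict())))
    return links
```

An "Article 12" that is not followed by the treaty name is left alone by this sketch, which is the reason for matching the citation as a whole before splitting it up.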