Relative Link Resolution
Many feed elements and attributes are URIs. Universal Feed Parser resolves relative URIs according to the XML:Base specification. We’ll see how that works in a minute, but first let’s talk about which values are treated as URIs.
Which Values Are URIs
These feed elements are treated as URIs, and resolved if they are relative:
In addition, several feed elements may contain HTML or XHTML markup. Certain elements and attributes in HTML can be relative URIs, and Universal Feed Parser will resolve these URIs according to the same rules as the feed elements listed above.
These feed elements may contain HTML or XHTML markup. In Atom feeds, whether these elements are treated as HTML depends on the value of the type attribute. In RSS feeds, these values are always treated as HTML.
When any of these feed elements contains HTML or XHTML markup, the following HTML attributes are treated as URIs and are resolved if they are relative:
<a href="…">
<applet codebase="…">
<area href="…">
<audio src="…">
<blockquote cite="…">
<body background="…">
<del cite="…">
<form action="…">
<frame longdesc="…">
<frame src="…">
<head profile="…">
<iframe longdesc="…">
<iframe src="…">
<img longdesc="…">
<img src="…">
<img usemap="…">
<input src="…">
<input usemap="…">
<ins cite="…">
<link href="…">
<object classid="…">
<object codebase="…">
<object data="…">
<object usemap="…">
<q cite="…">
<script src="…">
<source src="…">
<video poster="…">
<video src="…">
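To make the idea concrete, here is a minimal sketch of resolving URI-valued attributes in embedded HTML against a base URI. It uses only the Python standard library and is not Universal Feed Parser's own implementation. The class name RelativeLinkCollector, the URI_ATTRIBUTES subset, and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# A small, illustrative subset of the (element, attribute) pairs listed above.
URI_ATTRIBUTES = {
    ("a", "href"),
    ("img", "src"),
    ("blockquote", "cite"),
    ("video", "poster"),
}


class RelativeLinkCollector(HTMLParser):
    """Collect URI-valued attributes and their resolved forms (hypothetical helper)."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.resolved = []  # list of (tag, attribute, absolute URI)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Only attributes known to hold URIs are resolved; others are ignored.
            if value is not None and (tag, name) in URI_ATTRIBUTES:
                self.resolved.append((tag, name, urljoin(self.base_uri, value)))


collector = RelativeLinkCollector("http://example.org/archives/")
collector.feed('<p><a href="000001.html">post</a> <img src="/images/logo.png" alt="logo"></p>')
collector.close()
for item in collector.resolved:
    print(item)
```

The alt attribute is left alone because it is not in the set of URI-valued attributes, while both the path-relative href and the root-relative src become absolute.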
How Relative URIs Are Resolved
Universal Feed Parser resolves relative URIs according to the XML:Base specification. This defines a hierarchical inheritance system, where one element can define the base URI for itself and all of its child elements, using an xml:base attribute. A child element can then override its parent’s base URI by redeclaring xml:base to a different value.
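The inheritance rule can be sketched with the standard library alone. This is an illustration of the xml:base scoping described above, not the parser's actual code. The sample feed, the function name resolve_links, and the starting URL http://feeds.example.net/atom.xml are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical feed: the root sets one base URI, the entry overrides it.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://example.org/">
  <link href="index.html"/>
  <entry xml:base="http://example.org/archives/">
    <link href="000001.html"/>
  </entry>
</feed>"""


def resolve_links(element, inherited_base, found):
    # An xml:base on this element is resolved against the inherited base,
    # then applies to the element's own attributes and to all of its children.
    base = urljoin(inherited_base, element.get(XML_BASE, ""))
    if element.tag.endswith("}link"):
        found.append(urljoin(base, element.get("href", "")))
    for child in element:
        resolve_links(child, base, found)
    return found


links = resolve_links(ET.fromstring(FEED), "http://feeds.example.net/atom.xml", [])
print(links)
```

The feed-level link resolves against the root's base URI, while the entry-level link resolves against the entry's own xml:base.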
If no xml:base is specified, the feed has a default base URI defined in the Content-Location HTTP header.
If no Content-Location HTTP header is present, the URL used to retrieve the feed itself is the default base URI for all relative links within the feed. If the feed was retrieved via an HTTP redirect (any HTTP 3xx status code), then the final URL of the feed is the default base URI.
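The fallback order for the document-level base URI can be summarized in a few lines. The following is a sketch under stated assumptions, not the library's API: default_base_uri is a hypothetical helper, headers are assumed to be a dict with lowercase keys, and a relative Content-Location value is assumed to be resolved against the final URL.

```python
from urllib.parse import urljoin


def default_base_uri(final_url, headers):
    """Pick the base URI used when no xml:base applies (hypothetical helper).

    final_url: the URL the feed was ultimately served from, after any redirects.
    headers: HTTP response headers as a dict with lowercase keys.
    """
    content_location = headers.get("content-location")
    if content_location:
        # Assumption: a relative Content-Location is resolved against the final URL.
        return urljoin(final_url, content_location)
    # No Content-Location header: fall back to the feed's own (final) URL.
    return final_url


print(default_base_uri("http://example.org/feeds/atom.xml", {}))
print(default_base_uri("http://example.org/feeds/atom.xml",
                       {"content-location": "http://example.org/"}))
```

Any xml:base attribute inside the feed would still take precedence over the value this helper returns.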
For example, an xml:base on the root-level element sets the base URI for all URIs in the feed.
xml:base on the root-level element
>>> import feedparser
>>> d = feedparser.parse("https://feedparser.readthedocs.io/en/latest/examples/base.xml")
>>> d.feed.link
'http://example.org/index.html'
>>> d.feed.generator_detail.href
'http://example.org/generator/'
An xml:base attribute on an <entry> overrides the xml:base on the parent <feed>.
Overriding xml:base on an <entry>
>>> import feedparser
>>> d = feedparser.parse("https://feedparser.readthedocs.io/en/latest/examples/base.xml")
>>> d.entries[0].link
'http://example.org/archives/000001.html'
>>> d.entries[0].author_detail.href
'http://example.org/about/'
An xml:base on <content> overrides the xml:base on the parent <entry>. In addition, whatever the base URI is for the <content> element (whether defined directly on the <content> element, or inherited from the parent element) is used as the base URI for the embedded HTML or XHTML markup within the content.
Relative links within embedded HTML
>>> import feedparser
>>> d = feedparser.parse("https://feedparser.readthedocs.io/en/latest/examples/base.xml")
>>> d.entries[0].content[0].value
'<p id="anchor1"><a href="http://example.org/archives/000001.html#anchor2">skip to anchor 2</a></p>
<p>Some content</p>
<p id="anchor2">This is anchor 2</p>'
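The resolution of a fragment-only link such as #anchor2 follows the usual relative-reference rules, which the standard library's urljoin also implements. As a small illustration, assuming the content's base URI is http://example.org/archives/000001.html (the other two relative references are made-up examples):

```python
from urllib.parse import urljoin

content_base = "http://example.org/archives/000001.html"

# A fragment-only reference keeps the whole base document and appends the fragment.
fragment_link = urljoin(content_base, "#anchor2")

# Other kinds of relative references resolved against the same base.
parent_link = urljoin(content_base, "../about/")
root_link = urljoin(content_base, "/generator/")

print(fragment_link)
print(parent_link)
print(root_link)
```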
The xml:base affects other attributes in the element in which it is declared.
xml:base and sibling attributes
>>> import feedparser
>>> d = feedparser.parse("https://feedparser.readthedocs.io/en/latest/examples/base.xml")
>>> d.entries[0].links[1].rel
'service.edit'
>>> d.entries[0].links[1].href
'http://example.com/api/client/37'
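A standalone sketch of the same rule, using the standard library rather than Universal Feed Parser: the xml:base declared on an element is applied to that element's own href. The markup below is hypothetical; in particular, the xml:base value is an assumption chosen to produce a similar result, not a copy of the example feed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical markup: xml:base and href are declared on the same element.
link = ET.fromstring(
    '<link rel="service.edit" xml:base="http://example.com/api/" href="client/37"/>'
)

# The element's own xml:base is the base URI for its sibling href attribute.
href = urljoin(link.get(XML_BASE), link.get("href"))
print(href)
```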
If no xml:base is specified on the root-level element, the default base URI is given in the Content-Location HTTP header. This can still be overridden by any child element that declares an xml:base attribute.
Content-Location HTTP header
>>> import feedparser
>>> d = feedparser.parse("https://feedparser.readthedocs.io/en/latest/examples/http_base.xml")
>>> d.feed.link
'http://example.org/index.html'
>>> d.entries[0].link
'http://example.org/archives/000001.html'
Finally, if no root-level xml:base is declared, and no Content-Location HTTP header is present, the URL of the feed itself is the default base URI. Again, this can still be overridden by any element that declares an xml:base attribute.
Feed URL as default base URI
>>> import feedparser
>>> d = feedparser.parse("https://feedparser.readthedocs.io/en/latest/examples/no_base.xml")
>>> d.feed.link
'http://feedparser.org/docs/examples/index.html'
>>> d.entries[0].link
'http://example.org/archives/000001.html'
Disabling Relative URI Resolution
Though not recommended, it is possible to disable Universal Feed Parser's relative URI resolution by passing resolve_relative_uris=False to feedparser.parse(). This disables resolution within HTML content, but not in other contexts such as entries[i].link.
How to disable relative URI resolution
>>> import feedparser
>>> d = feedparser.parse('https://feedparser.readthedocs.io/en/latest/examples/base.xml')
>>> d.entries[0].content[0].base
'http://example.org/archives/000001.html'
>>> print(d.entries[0].content[0].value)
<p id="anchor1"><a href="http://example.org/archives/000001.html#anchor2">skip to anchor 2</a></p>
<p>Some content</p>
<p id="anchor2">This is anchor 2</p>
>>> d2 = feedparser.parse('https://feedparser.readthedocs.io/en/latest/examples/base.xml', resolve_relative_uris=False)
>>> d2.entries[0].content[0].base
'http://example.org/archives/000001.html'
>>> print(d2.entries[0].content[0].value)
<p id="anchor1"><a href="#anchor2">skip to anchor 2</a></p>
<p>Some content</p>
<p id="anchor2">This is anchor 2</p>