Character Encoding Detection
RFC 3023 defines the interaction between XML and HTTP as it relates to character encoding. XML and HTTP have different ways of specifying character encoding and different defaults in case no encoding is specified, and determining which value takes precedence depends on a variety of factors.
Introduction to Character Encoding
In XML, the character encoding is optional and may be given in the XML declaration in the first line of the document, like this:
<?xml version="1.0" encoding="utf-8"?>
If no encoding is given, XML supports the use of a Byte Order Mark to identify the document as some flavor of UTF-32, UTF-16, or UTF-8. Section F of the XML specification outlines the process for determining the character encoding based on unique properties of the Byte Order Mark in the first two to four bytes of the document.
If no encoding is specified and no Byte Order Mark is present, XML defaults to UTF-8.
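The XML-side rules above can be sketched in a few lines of Python. This is only an illustration, not Universal Feed Parser's implementation: the names `sniff_bom` and `xml_encoding` are hypothetical, the declaration is only looked for in ASCII-compatible documents, and the BOM-less UTF-16 and UTF-32 cases described in Section F are not handled.

```python
import codecs
import re

# Check the 4-byte BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]

# Matches the encoding attribute of an XML declaration at the start of the data.
_XML_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading Byte Order Mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def xml_encoding(data: bytes) -> str:
    """Declared encoding, else BOM-implied encoding, else utf-8."""
    body = data[len(codecs.BOM_UTF8):] if data.startswith(codecs.BOM_UTF8) else data
    match = _XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return sniff_bom(data) or "utf-8"
```

For a document that starts with the declaration shown earlier, `xml_encoding` returns the declared name; for a document with neither a declaration nor a Byte Order Mark, it falls through to utf-8.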
HTTP uses MIME to define a method of specifying the character encoding, as part of the Content-Type HTTP header, which looks like this:
Content-Type: text/html; charset="utf-8"
If no charset is specified, HTTP defaults to iso-8859-1, but only for text/* media types. For other media types, the default encoding is undefined, which is where RFC 3023 comes in.
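As a rough illustration of the HTTP side, the header can be split into a media type and a charset using the MIME machinery in Python's standard library. The helper names `parse_content_type` and `http_default_charset` are made up for this sketch, and it assumes a simple header without RFC 2231 encoded parameters.

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Split a Content-Type header value into (media_type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()   # lowercased, e.g. "text/html"
    charset = msg.get_param("charset")    # surrounding quotes are removed
    if isinstance(charset, str) and charset:
        return media_type, charset.lower()
    return media_type, None


def http_default_charset(header_value: str):
    """Apply the HTTP default: iso-8859-1 for text/*, undefined (None) otherwise."""
    media_type, charset = parse_content_type(header_value)
    if charset:
        return charset
    if media_type.startswith("text/"):
        return "iso-8859-1"
    return None
```

Returning `None` for non-text media types mirrors the "undefined" default described above; it is a choice made for this sketch, not something HTTP prescribes.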
According to RFC 3023, if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the encoding is
the encoding given in the charset parameter of the Content-Type HTTP header, or
the encoding given in the encoding attribute of the XML declaration within the document, or
utf-8.
On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is
the encoding given in the charset parameter of the Content-Type HTTP header, or
us-ascii.
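The two precedence lists above can be expressed as a single function. This is a minimal sketch of the rules as this section describes them, not code taken from Universal Feed Parser; the function name `rfc3023_encoding` and its parameters are hypothetical, and the media type is assumed to have already been separated from its parameters.

```python
APPLICATION_XML_TYPES = {
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
}
TEXT_XML_TYPES = {"text/xml", "text/xml-external-parsed-entity"}


def rfc3023_encoding(media_type, http_charset=None, xml_encoding=None):
    """Return the encoding per the rules above, or None for non-XML media types.

    http_charset is the charset parameter of the Content-Type header (if any);
    xml_encoding is the encoding attribute of the XML declaration (if any).
    """
    media_type = media_type.strip().lower()
    plus_xml = media_type.endswith("+xml")

    if media_type in APPLICATION_XML_TYPES or (
        media_type.startswith("application/") and plus_xml
    ):
        # HTTP charset, then the XML declaration, then utf-8.
        return http_charset or xml_encoding or "utf-8"

    if media_type in TEXT_XML_TYPES or (media_type.startswith("text/") and plus_xml):
        # The XML declaration is ignored entirely for text/* XML types.
        return http_charset or "us-ascii"

    return None  # these rules do not cover this media type
```

Note how the same document can end up with different encodings depending only on the media type: with no charset parameter and a declaration of utf-8, `application/xml` yields utf-8 while `text/xml` yields us-ascii.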
Handling Incorrectly-Declared Encodings
Universal Feed Parser initially uses the rules specified in RFC 3023 to determine the character encoding of the feed. If parsing succeeds, then that’s that. If parsing fails, Universal Feed Parser sets the bozo bit to 1 and sets bozo_exception to feedparser.CharacterEncodingOverride. Then it tries to reparse the feed with the following character encodings:
the encoding specified in the XML declaration
the encoding sniffed from the first four bytes of the document (as per Section F)
the encoding auto-detected by the chardet library, if installed
utf-8
windows-1252
If the character encoding cannot be determined, Universal Feed Parser sets the bozo bit to 1 and sets bozo_exception to feedparser.CharacterEncodingUnknown. In this case, parsed values will be byte strings, not Unicode strings.
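The fallback order can be approximated as follows. This is a simplified sketch under stated assumptions: the real library retries XML parsing with each encoding, whereas this example only checks whether the bytes decode, and the names `candidate_encodings` and `decode_with_fallback` are hypothetical. The `detected` argument stands in for whatever chardet would report, so chardet itself is not needed to run it.

```python
def candidate_encodings(declared=None, sniffed=None, detected=None):
    """Build the fallback list in the order given above, skipping gaps and repeats."""
    ordered = [declared, sniffed, detected, "utf-8", "windows-1252"]
    seen = set()
    result = []
    for name in ordered:
        if not name:
            continue
        key = name.lower()
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


def decode_with_fallback(data: bytes, candidates):
    """Return (text, encoding) for the first candidate that decodes the data.

    Returns (None, None) when every candidate fails, which corresponds to the
    "character encoding cannot be determined" case.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # LookupError covers encoding names Python does not recognize.
            continue
    return None, None
```

Because windows-1252 leaves only a handful of byte values undefined, it accepts most input, which is presumably why it sits last in the list as the catch-all.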
Handling Incorrectly-Declared Media Types
RFC 3023 only applies when the feed is served over HTTP with a Content-Type that declares the feed to be some kind of XML. However, some web servers are severely misconfigured and serve feeds with a Content-Type of text/plain, application/octet-stream, or some completely bogus media type.
Universal Feed Parser will attempt to parse such feeds, but it will set the bozo bit to 1 and set bozo_exception to feedparser.NonXMLContentType.
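A check along these lines could decide whether a Content-Type declares the feed to be XML at all. This is a hedged sketch, not Universal Feed Parser's actual test; `is_xml_media_type` is a hypothetical helper that treats the media types named in this section, plus any `+xml` subtype of text/* or application/*, as XML.

```python
XML_MEDIA_TYPES = {
    "text/xml",
    "text/xml-external-parsed-entity",
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
}


def is_xml_media_type(content_type: str) -> bool:
    """Return True if the Content-Type value names some kind of XML."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in XML_MEDIA_TYPES:
        return True
    return media_type.startswith(("text/", "application/")) and media_type.endswith("+xml")
```

A feed served as `text/plain` or `application/octet-stream` fails this check, which is the situation in which the bozo bit described above would be set.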