Sanitization
Most feeds embed HTML markup within feed
elements. Some feeds even embed other types of markup, such as SVG or MathML.
Since many feed aggregators use a web browser (or browser component) to display
content, Universal Feed Parser sanitizes embedded markup to remove
things that could pose security risks.
These elements are sanitized by default:
Note
If the content is declared to be (or is determined to be)
text/plain, it will not be sanitized. This is to avoid data loss.
It is recommended that you check the content type in, e.g.,
entries[i].summary_detail.type. If it is text/plain, then
it has not been sanitized (and you should perform HTML escaping before
rendering the content).
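The check described in the note can be sketched roughly as follows. The helper name render_safe is hypothetical, and the dictionary is only a stand-in for a real entries[i].summary_detail returned by feedparser.parse(); escaping is done with the standard library's html.escape.

```python
import html


def render_safe(value, content_type):
    """Return markup that can be placed in an HTML page.

    text/plain content is not sanitized by the parser, so it is escaped here.
    Other types are assumed to have been sanitized already.
    """
    if content_type == "text/plain":
        return html.escape(value)
    return value


# Stand-in for entries[i].summary_detail (hypothetical sample data).
summary_detail = {"type": "text/plain", "value": "1 < 2 & <b>not bold</b>"}
print(render_safe(summary_detail["value"], summary_detail["type"]))
# prints: 1 &lt; 2 &amp; &lt;b&gt;not bold&lt;/b&gt;
```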
HTML Sanitization
The following HTML elements are allowed by
default (all others are stripped):
a
abbr
acronym
address
area
article
aside
audio
b
big
blockquote
br
button
canvas
caption
center
cite
code
col
colgroup
command
datagrid
datalist
dd
del
details
dfn
dialog
dir
div
dl
dt
em
event-source
fieldset
figure
font
footer
form
h1
h2
h3
h4
h5
h6
header
hr
i
img
input
ins
kbd
keygen
label
legend
li
m
map
menu
meter
multicol
nav
nextid
noscript
ol
optgroup
option
output
p
pre
progress
q
s
samp
section
select
small
sound
source
spacer
span
strike
strong
sub
sup
table
tbody
td
textarea
tfoot
th
thead
time
tr
tt
u
ul
var
video
The following HTML attributes are allowed
by default (all others are stripped):
abbr
accept
accept-charset
accesskey
action
align
alt
autocomplete
autofocus
autoplay
axis
background
balance
bgcolor
bgproperties
border
bordercolor
bordercolordark
bordercolorlight
bottompadding
cellpadding
cellspacing
ch
challenge
char
charoff
charset
checked
choff
cite
class
clear
color
cols
colspan
compact
contenteditable
coords
data
datafld
datapagesize
datasrc
datetime
default
delay
dir
disabled
draggable
dynsrc
enctype
end
face
for
form
frame
galleryimg
gutter
headers
height
hidden
hidefocus
high
href
hreflang
hspace
icon
id
inputmode
ismap
keytype
label
lang
leftspacing
list
longdesc
loop
loopcount
loopend
loopstart
low
lowsrc
max
maxlength
media
method
min
multiple
name
nohref
noshade
nowrap
open
optimum
pattern
ping
point-size
poster
pqg
preload
prompt
radiogroup
readonly
rel
repeat-max
repeat-min
replace
required
rev
rightspacing
rows
rowspan
rules
scope
selected
shape
size
span
src
start
step
summary
suppress
tabindex
target
template
title
toppadding
type
unselectable
urn
usemap
valign
value
variable
volume
vrml
vspace
width
wrap
xml:lang
SVG Sanitization
The following SVG elements are allowed by default (all others are stripped):
a
animate
animateColor
animateMotion
animateTransform
circle
defs
desc
ellipse
font-face
font-face-name
font-face-src
foreignObject
g
glyph
hkern
line
linearGradient
marker
metadata
missing-glyph
mpath
path
polygon
polyline
radialGradient
rect
set
stop
svg
switch
text
title
tspan
use
The following SVG attributes are allowed by
default (all others are stripped):
accent-height
accumulate
additive
alphabetic
arabic-form
ascent
attributeName
attributeType
baseProfile
bbox
begin
by
calcMode
cap-height
class
color
color-rendering
content
cx
cy
d
descent
display
dur
dx
dy
end
fill
fill-opacity
fill-rule
font-family
font-size
font-stretch
font-style
font-variant
font-weight
from
fx
fy
g1
g2
glyph-name
gradientUnits
hanging
height
horiz-adv-x
horiz-origin-x
id
ideographic
k
keyPoints
keySplines
keyTimes
lang
marker-end
marker-mid
marker-start
markerHeight
markerUnits
markerWidth
mathematical
max
min
name
offset
opacity
orient
origin
overline-position
overline-thickness
panose-1
path
pathLength
points
preserveAspectRatio
r
refX
refY
repeatCount
repeatDur
requiredExtensions
requiredFeatures
restart
rotate
rx
ry
slope
stemh
stemv
stop-color
stop-opacity
strikethrough-position
strikethrough-thickness
stroke
stroke-dasharray
stroke-dashoffset
stroke-linecap
stroke-linejoin
stroke-miterlimit
stroke-opacity
stroke-width
systemLanguage
target
text-anchor
to
transform
type
u1
u2
underline-position
underline-thickness
unicode
unicode-range
units-per-em
values
version
viewBox
visibility
width
widths
x
x-height
x1
x2
xlink:actuate
xlink:arcrole
xlink:href
xlink:role
xlink:show
xlink:title
xlink:type
xml:base
xml:lang
xml:space
xmlns
xmlns:xlink
y
y1
y2
zoomAndPan
MathML Sanitization
The following MathML elements are
allowed by default (all others are stripped):
annotation
annotation-xml
maction
maligngroup
malignmark
math
menclose
merror
mfenced
mfrac
mglyph
mi
mlabeledtr
mlongdiv
mmultiscripts
mn
mo
mover
mpadded
mphantom
mprescripts
mroot
mrow
ms
mscarries
mscarry
msgroup
msline
mspace
msqrt
msrow
mstack
mstyle
msub
msubsup
msup
mtable
mtd
mtext
mtr
munder
munderover
none
semantics
The following MathML attributes are
allowed by default (all others are stripped):
accent
accentunder
actiontype
align
alignmentscope
altimg
altimg-height
altimg-valign
altimg-width
alttext
bevelled
charalign
close
columnalign
columnlines
columnspacing
columnspan
columnwidth
crossout
decimalpoint
denomalign
depth
dir
display
displaystyle
edge
encoding
equalcolumns
equalrows
fence
fontstyle
fontweight
form
frame
framespacing
groupalign
height
href
id
indentalign
indentalignfirst
indentalignlast
indentshift
indentshiftfirst
indentshiftlast
indenttarget
infixlinebreakstyle
largeop
length
linebreak
linebreakmultchar
linebreakstyle
lineleading
linethickness
location
longdivstyle
lquote
lspace
mathbackground
mathcolor
mathsize
mathvariant
maxsize
minlabelspacing
minsize
movablelimits
notation
numalign
open
other
overflow
position
rowalign
rowlines
rowspacing
rowspan
rquote
rspace
scriptlevel
scriptminsize
scriptsizemultiplier
selection
separator
separators
shift
side
src
stackalign
stretchy
subscriptshift
superscriptshift
symmetric
voffset
width
xlink:href
xlink:show
xlink:type
xmlns
xmlns:xlink
CSS Sanitization
The following CSS properties are allowed by
default in style attributes (all others are stripped):
azimuth
background-color
border-bottom-color
border-collapse
border-color
border-left-color
border-right-color
border-top-color
clear
color
cursor
direction
display
elevation
float
font
font-family
font-size
font-style
font-variant
font-weight
height
letter-spacing
line-height
overflow
pause
pause-after
pause-before
pitch
pitch-range
richness
speak
speak-header
speak-numeral
speak-punctuation
speech-rate
stress
text-align
text-decoration
text-indent
unicode-bidi
vertical-align
voice-family
volume
white-space
width
Note
Not all possible CSS values are allowed for these properties. The
allowable values are restricted by a whitelist and a regular expression that
allows color values and lengths. URIs
are not allowed, to prevent platypus attacks.
See the _HTMLSanitizer class for more details.
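As a rough illustration of this kind of filtering, here is a minimal sketch that keeps only whitelisted properties whose values look like keywords, colors, or lengths. The function name filter_style, the small property subset, and the value pattern are all assumptions made for this example; they are not the list or the regular expression that _HTMLSanitizer actually uses.

```python
import re

# Small illustrative subset of the allowed properties listed above.
ALLOWED_PROPERTIES = {"color", "background-color", "font-size", "text-align", "width"}

# Assumed value pattern: hex colors, bare keywords, and simple lengths.
# This is NOT the expression used by feedparser.
SAFE_VALUE = re.compile(r"^(#[0-9a-fA-F]{3,6}|[a-zA-Z-]+|-?\d+(\.\d+)?(px|em|pt|%)?)$")


def filter_style(style):
    """Keep only whitelisted declarations whose values match SAFE_VALUE."""
    kept = []
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue
        prop, value = declaration.split(":", 1)
        prop, value = prop.strip().lower(), value.strip()
        # Values containing url(...) or expression(...) never match SAFE_VALUE.
        if prop in ALLOWED_PROPERTIES and SAFE_VALUE.match(value):
            kept.append(f"{prop}: {value}")
    return "; ".join(kept)


print(filter_style("color: red; background: url(javascript:alert(1)); font-size: 12px"))
# prints: color: red; font-size: 12px
```

Because every declaration must pass both checks, anything unrecognized is dropped rather than inspected for known-bad strings.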
Whitelist, Don’t Blacklist
I am often asked why Universal Feed Parser is so hard-assed about
HTML and CSS sanitizing. To illustrate the problem, here is an incomplete list of
potentially dangerous HTML tags and
attributes:
script, which can contain malicious script
applet, embed, and object, which can automatically download and execute malicious code
meta, which can contain malicious redirects
onload, onunload, and all other on* attributes, which can contain malicious script
style, link, and the style attribute, which can contain malicious script
style? Yes, style. CSS definitions can contain executable code.
Embedding JavaScript in CSS
This sample is taken from https://feedparser.readthedocs.io/en/latest/examples/rss20.xml:
<description>Watch out for
<span style="background: url(javascript:window.location='http://example.org/')">
nasty tricks</span></description>
This sample is more advanced, and does not contain the keyword javascript: that
many naive HTML sanitizers scan for:
<description>Watch out for
<span style="any: expression(window.location='http://example.org/')">
nasty tricks</span></description>
Internet Explorer for Windows will execute the JavaScript in both of these examples.
Now consider that in HTML, attribute values may be entity-encoded in several different ways.
Embedding encoded JavaScript in CSS
To a browser, this:
<span style="any: expression(window.location='http://example.org/')">
is the same as this (without the line breaks):
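<span style="&#97;&#110;&#121;&#58;&#32;&#101;&#120;&#112;&#114;&#101;
&#115;&#115;&#105;&#111;&#110;&#40;&#119;&#105;&#110;&#100;&#111;&#119;
&#46;&#108;&#111;&#99;&#97;&#116;&#105;&#111;&#110;&#61;&#39;&#104;
&#116;&#116;&#112;&#58;&#47;&#47;&#101;&#120;&#97;&#109;&#112;&#108;
&#101;&#46;&#111;&#114;&#103;&#47;&#39;&#41;">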
which is the same as this (without the line breaks):
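<span style="&#x61;&#x6e;&#x79;&#x3a;&#x20;&#x65;&#x78;&#x70;&#x72;
&#x65;&#x73;&#x73;&#x69;&#x6f;&#x6e;&#x28;&#x77;&#x69;&#x6e;
&#x64;&#x6f;&#x77;&#x2e;&#x6c;&#x6f;&#x63;&#x61;&#x74;&#x69;
&#x6f;&#x6e;&#x3d;&#x27;&#x68;&#x74;&#x74;&#x70;&#x3a;&#x2f;
&#x2f;&#x65;&#x78;&#x61;&#x6d;&#x70;&#x6c;&#x65;&#x2e;&#x6f;
&#x72;&#x67;&#x2f;&#x27;&#x29;">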
And so on, plus several other variations, plus every combination of every
variation.
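The equivalence of these encodings can be checked with a short sketch using the standard library's html.unescape, which decodes character references much as a browser does for attribute values. The naive_scan function is a hypothetical stand-in for a blacklist-style check, included only to show what such a check misses.

```python
import html

plain = "any: expression(window.location='http://example.org/')"

# Encode every character as a decimal or hexadecimal character reference.
decimal = "".join(f"&#{ord(ch)};" for ch in plain)
hexadecimal = "".join(f"&#x{ord(ch):x};" for ch in plain)


def naive_scan(raw):
    """Hypothetical blacklist check that looks at the raw attribute text."""
    return "expression(" in raw


# The scan catches the plain form but misses both encoded forms.
print(naive_scan(plain), naive_scan(decimal), naive_scan(hexadecimal))
# prints: True False False

# Once decoded, all three are the same string.
print(html.unescape(decimal) == plain, html.unescape(hexadecimal) == plain)
# prints: True True
```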
The more I investigate, the more cases I find where Internet Explorer for
Windows will treat seemingly innocuous markup as code and blithely execute it.
This is why Universal Feed Parser uses a whitelist and not a
blacklist. I am reasonably confident that none of the elements or attributes on
the whitelist are security risks. I am not at all confident about elements or
attributes that I have not explicitly investigated. And I have no confidence at
all in my ability to detect strings within attribute values that Internet
Explorer for Windows will treat as executable code.
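To make the whitelist idea concrete, here is a minimal sketch built on the standard library's html.parser. It is not Universal Feed Parser's implementation: the class name WhitelistFilter, the sanitize function, and the tiny element and attribute subsets are assumptions chosen for illustration. URL-bearing attributes such as href are deliberately left out, since they would need a separate scheme check.

```python
from html import escape
from html.parser import HTMLParser

# Small illustrative subsets of the lists above, not the full defaults.
ALLOWED_ELEMENTS = {"b", "em", "p", "span"}
ALLOWED_ATTRIBUTES = {"class", "title"}
# Elements whose text content is dropped along with the tags.
DROP_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_ELEMENTS:
            kept = ""
            for name, value in attrs:
                if name in ALLOWED_ATTRIBUTES:
                    kept += ' {}="{}"'.format(name, escape(value or ""))
            self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_ELEMENTS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))


def sanitize(markup):
    parser = WhitelistFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick="evil()" class="x">Hi <span style="any: expression(x)">'
         'there</span><script>alert(1)</script></p>')
print(sanitize(dirty))
# prints: <p class="x">Hi <span>there</span></p>
```

Nothing in this sketch tries to recognize dangerous strings. Anything not explicitly listed is simply never re-emitted.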
Disabling HTML Sanitization
Though not recommended, it is possible to disable Universal Feed Parser's
HTML sanitization by passing sanitize_html=False to feedparser.parse().
When passing this flag you are responsible for manually sanitizing HTML from the feed.