ENH: add escape parameter to to_html() #2919

gdraps · 2013-02-23T22:17:49Z

Treating DataFrame content as plain text rather than HTML markup, by escaping
everything (#2617), seems like the right default for to_html(). However, if a DataFrame contains HTML (example) or text that is already HTML-escaped, the result is either unwanted escaping or double-escaping.

Changes in this PR:

- Make HTML escaping configurable through a new to_html() parameter named
  escape (default True), allowing users to restore the old to_html() behavior (<=0.10.0) by setting escape=False.
- Add & to the list of HTML characters escaped, so that strings that happen to contain
  HTML escape sequences or reserved entities, such as "&lt;", are displayed
  properly.
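
As a rough illustration of the second change, the sketch below uses only the standard library, not pandas. It compares escaping just < and > with also escaping &, and uses html.unescape as a stand-in for what a browser would display. The variable names and the sample cell text are assumptions made for this example.

```python
import html

# A cell whose text happens to contain an HTML escape sequence.
cell = 'use "&lt;" for <'

# Escaping only < and > leaves the existing "&lt;" untouched in the markup.
only_tags = cell.replace("<", "&lt;").replace(">", "&gt;")

# Escaping & as well (html.escape escapes &, < and >; quote=False leaves quotes alone).
full = html.escape(cell, quote=False)

# html.unescape approximates the text a browser would show for each version.
shown_only_tags = html.unescape(only_tags)
shown_full = html.unescape(full)

print(shown_only_tags)  # the literal "&lt;" typed by the user has turned into "<"
print(shown_full)       # identical to the original cell text
```

With & escaped, the displayed text matches the cell content exactly; without it, the "&lt;" typed by the user is shown as "<".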

ghost · 2013-02-25T15:53:43Z

Good addition, but I think there are actually 3 cases to be handled, and
disagree with your choice of default behaviour.

Escaping <> by default is important to prevent XSS. Your PR
keeps that, and that's fine.

Running the following example in qtconsole/ipnb (or opening the HTML in a browser, I guess)

import pandas as pd
class AsHtml(object):
   def _repr_html_(self):
       return pd.DataFrame(["<b>str<ing1 &amp; &lt; &a</b>"]).to_html()
AsHtml()

currently produces

<b>str<ing1 & < &a</b>

and with your PR it produces:

<b>str<ing1 &amp; &lt; &a</b>

It seems to me that a user who has HTML in their frame and is using to_html()
would usually want those HTML entities to display properly (as & rather than &amp;, for example).

So there are 3 cases:

1. escape only <> but display html entities properly (the current default)
2. escape everything (your PR's suggested default behaviour)
3. don't escape anything (escape=False in your PR)

I think 1 should stay the default, and 2/3 should become possible by
specifying the escape argument.
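
The three cases can be sketched in plain Python as below. The function name escape_html and its mode values ("tags", "all", "none") are hypothetical and exist only for this illustration; they are not part of pandas or of this PR.

```python
def escape_html(text, mode="all"):
    """Hypothetical helper showing the three escaping behaviours discussed."""
    if mode == "none":
        # Case 3: leave the text untouched.
        return text
    if mode == "all":
        # Case 2: escape & as well, and do it first so that the "&" in
        # "&lt;" / "&gt;" added below is not escaped a second time.
        text = text.replace("&", "&amp;")
    elif mode != "tags":
        raise ValueError("mode must be 'tags', 'all' or 'none'")
    # Cases 1 and 2: escape the tag delimiters.
    return text.replace("<", "&lt;").replace(">", "&gt;")


s = "<b>str<ing1 &amp; &lt; &a</b>"
for mode in ("tags", "all", "none"):
    print(mode, "->", escape_html(s, mode))
```

In "tags" mode the existing &amp; and &lt; pass through to the markup and are interpreted as entities by the browser; in "all" mode they are shown literally.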

gdraps · 2013-02-25T19:15:45Z

Thanks for the feedback. It's interesting because I've come to embrace black-and-white escaping rules (i.e., text is either markup or it isn't) based on experience with web frameworks. Mixing < and & in the same string and expecting &amp; to render as simply & when escape=True is not what I would expect, but I may be biased.

In web frameworks, the general practice is to explicitly identify markup and treat all other text as suspect. Below are two examples of how you'd add markup to non-markup text with markupsafe, which was written by the author of Flask and is also used by Flask.

>>> from markupsafe import Markup
>>> bold = Markup("<b>%s</b>")
>>> print(bold % "str<ing1 & < &a")
<b>str&lt;ing1 &amp; &lt; &amp;a</b>
>>> print(Markup("<b>") + "str<ing1 & < &a" + Markup("</b>"))
<b>str&lt;ing1 &amp; &lt; &amp;a</b>

Furthermore, one must explicitly identify strings as Markup to prevent double-escaping.

>>> print(bold % Markup("str&lt;ing1 &amp; &lt; &amp;a"))
<b>str&lt;ing1 &amp; &lt; &amp;a</b>
>>> print(bold % "str&lt;ing1 &amp; &lt; &amp;a")
<b>str&amp;lt;ing1 &amp;amp; &amp;lt; &amp;amp;a</b>

I'd be interested to hear your feedback on the following patterns and whether there is a better API for pandas users to safely markup DataFrame content.

import pandas as pd
from markupsafe import Markup
from IPython.core.display import display_html

bold = Markup("<b>%s</b>")
display_html(pd.DataFrame(["str<ing1 & < &a"]).to_html(), raw=True)
display_html(pd.DataFrame([bold % "str<ing1 & < &a"]).to_html(escape=False), raw=True)
display_html(pd.DataFrame([bold % Markup("str&lt;ing1 &amp; &lt; &amp;a")]).to_html(escape=False), raw=True)

ghost · 2013-02-25T19:41:50Z

That's nice, but I'm not sure that addresses the problem I mentioned.
The case not covered is when we want tags to be escaped (for example
<script src=http://evil.com/diabolical.js></script>), but '&amp' to display
as &.

Let me put it this way: suppose a user of pandas is happy with the way
things currently work. Suppose your PR is merged, what would the user
have to do to keep things behaving as they are?

If you're aware of security issues that make merely <> escaping ineffective,
please speak up.

Also note that GH displays '&' when you write '&amp;' in a comment, and GH
is quite a popular web application :)

ghost · 2013-02-27T01:17:40Z

I'm going 180 on this.
http://wonko.com/post/html-escaping
https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet

In a nutshell, you can't tell what markup context the output might somehow end up embedded in
and so many characters are potentially dangerous, including quotes, ampersands,
percentage signs, braces and alarmingly, still others.

It seems an unlikely attack vector, but common experience shows that's bad reasoning.
So +1 for an escape smackdown, and thanks @gdraps for stirring this back up.
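
One way to see why the surrounding context matters is the attribute-value case, sketched below with the standard library only. The helper escape_text_only and the sample input are assumptions for illustration; html.escape with quote=True is the stdlib call that also escapes double and single quotes.

```python
import html

# Untrusted text that contains no tags at all, only a double quote.
user_input = '" onmouseover="alert(1)'


def escape_text_only(text):
    """Hypothetical helper: escapes &, < and > but leaves quotes alone."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


# Inside an attribute value, the unescaped quote ends the attribute early
# and the rest of the input becomes a new attribute.
unsafe_attr = '<input value="%s">' % escape_text_only(user_input)

# Escaping quotes too keeps the whole input inside the attribute value.
safe_attr = '<input value="%s">' % html.escape(user_input, quote=True)

print(unsafe_attr)
print(safe_attr)
```

Escaping that is sufficient for text between tags is therefore not necessarily sufficient once the same output is embedded somewhere else.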

wesm · 2013-04-09T00:33:13Z

Merge status?

ghost · 2013-04-09T00:50:32Z

it's the right thing to do but i'm afraid it'd really inconvenience some users.
defaults bikeshedding?
dunno.

gdraps · 2013-04-09T01:25:31Z

For strict compatibility with 0.10.1, either a third escape mode can be added (e.g., 'compat' in addition to True/False) in which & is not escaped, or & escaping can be removed from this PR. The main goal was to add an escape parameter to give users the ability to restore 0.10.0 behavior.

wesm · 2013-04-09T01:38:42Z

I'm comfortable with you merging as long as you put something in the What's New so we can point to it

gdraps · 2013-04-09T13:42:34Z

Rebased to master and added a few words to RELEASE.rst and v0.11.0.txt. Let me know if I only need to update one or the other.

wesm · 2013-04-10T07:34:45Z

merged, thanks!

jreback · 2013-04-10T16:38:26Z

Failing in master on py3.3

======================================================================
FAIL: test_to_html_escaped (pandas.tests.test_format.TestDataFrameFormatting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.3_with_system_site_packages/lib/python3.3/site-packages/pandas-0.11.0.dev_d749b91-py3.3-linux-x86_64.egg/pandas/tests/test_format.py", line 307, in test_to_html_escaped
    self.assertEqual(xp, rs)
AssertionError: '<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: rig [truncated]... != '<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: rig [truncated]...
Diff is 1049 characters long. Set self.maxDiff to None to see it.
----------------------------------------------------------------------
Ran 2960 tests in 122.311s

gdraps · 2013-04-10T17:26:33Z

Shucks, I can look into it later today.

jreback · 2013-04-10T18:24:15Z

FYI These are failing in 3.2

======================================================================
ERROR: test_rplot1 (pandas.tests.test_rplot.TestRPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tests/test_rplot.py", line 242, in test_rplot1
    self.plot.render(self.fig)
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 881, in render
    adjust_subplots(fig, axes_grid, last_trellis, new_layers[-1])
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 759, in adjust_subplots
    label1 = "%s = %s" % (trellis.by[0], trellis.group_grid[index / trellis.cols][index % trellis.cols][0])
TypeError: list indices must be integers, not float
======================================================================
ERROR: test_rplot2 (pandas.tests.test_rplot.TestRPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tests/test_rplot.py", line 252, in test_rplot2
    self.plot.render(self.fig)
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 881, in render
    adjust_subplots(fig, axes_grid, last_trellis, new_layers[-1])
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 753, in adjust_subplots
    label1 = "%s = %s" % (trellis.by[1], trellis.group_grid[index / trellis.cols][index % trellis.cols])
TypeError: list indices must be integers, not float
======================================================================
ERROR: test_rplot3 (pandas.tests.test_rplot.TestRPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tests/test_rplot.py", line 262, in test_rplot3
    self.plot.render(self.fig)
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 881, in render
    adjust_subplots(fig, axes_grid, last_trellis, new_layers[-1])
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 756, in adjust_subplots
    label1 = "%s = %s" % (trellis.by[0], trellis.group_grid[index / trellis.cols][index % trellis.cols])
TypeError: list indices must be integers, not float
----------------------------------------------------------------------
Ran 3093 tests in 300.638s

gdraps · 2013-04-11T07:27:14Z

Ok, test_to_html_escaped failed due to unsafe reliance on dict key ordering. The character escapes were stored in a dict, even though & must be escaped first to prevent double escaping. py3.3's hash randomization was wonderfully effective at uncovering this every other run. I pushed a new commit to this branch with the dict replaced by an ordered dict. Let me know if I should open a new PR instead.
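
The ordering problem can be reproduced with a small sketch like the one below. It is not the code from this PR; the names ESCAPES, WRONG_ORDER and apply_escapes are assumptions for illustration.

```python
from collections import OrderedDict

# "&" comes first: the replacements for "<" and ">" themselves contain "&",
# so escaping "&" afterwards would escape those entities a second time.
ESCAPES = OrderedDict([("&", "&amp;"), ("<", "&lt;"), (">", "&gt;")])

# The same mapping in an order that a randomly ordered dict could yield.
WRONG_ORDER = OrderedDict([("<", "&lt;"), (">", "&gt;"), ("&", "&amp;")])


def apply_escapes(text, table):
    """Apply each replacement in the table's iteration order."""
    for char, entity in table.items():
        text = text.replace(char, entity)
    return text


print(apply_escapes("a < b", ESCAPES))      # a &lt; b
print(apply_escapes("a < b", WRONG_ORDER))  # a &amp;lt; b
```

A list of (character, entity) pairs would work equally well; the point is only that the iteration order is fixed rather than dependent on hashing.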

ghost · 2013-04-11T07:57:50Z

Cherry-picked into master. Try to get travis-ci set up; it'll do the py3 testing for you
if you don't use tox.

ghost mentioned this pull request Mar 27, 2013

Provide a template-engine based way of rendering pandas data objects #3190

Closed

gdraps added 2 commits April 9, 2013 09:30

ENH: add escape parameter to to_html()

698d881

DOC: mention new to_html() escape argument and & escaping

f3d01cb

wesm merged commit f3d01cb into pandas-dev:master Apr 10, 2013
