Add EscapeHTML Function for ExtractLinks #266

Closed
ianmilligan1 opened this issue Aug 31, 2018 · 9 comments

Comments

@ianmilligan1
Member

Is your feature request related to a problem? Please describe.
When working with graph files in sigma.js, we sometimes have backslashes that trip up sigma's interpreter. This is apparently due to JavaScript's XML parsing, which is more finicky than the standard XML schema.

This is a relatively rare problem, affecting 2 of the 117 GEXF files tested. I think it is because it only crashes on backslashes, whereas most URLs use forward slashes. So when there's an unexpected slash direction in a link, something goes awry. /shrug

The issue shows up when loading the graph: it fails to appear due to invalid XML. However, the file still opens fine in Gephi, and the problem isn't caught by XML linters.

Example lines that break Sigma visualization:

    <node id="n10206" label="nbu.ca\french\agm7">
    <attvalues>
      <attvalue for="v_label" value="nbu.ca\french\agm7" />

In the GEXF file if the above is changed to:

    <node id="n10206" label="nbu.cafrenchagm7">
    <attvalues>
      <attvalue for="v_label" value="nbu.cafrenchagm7" />

The graph appears.

In the other case, it is a similar problem.

Describe the solution you'd like
The GEXF files should have the backslash escaped. One option, proposed by @greebie, is to implement a new function escapeHTML() that can fix this. Or, more narrowly, since the problem appears limited to backslashes, an escapeBackslashes() function.
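A minimal sketch of what such helpers could look like, in Python for illustration only (the names escapeHTML and escapeBackslashes come from this issue; AUT's actual UDFs would be written in Scala, and this implementation is an assumption, not the project's code). The idea is to rewrite a literal backslash as the XML numeric character reference `&#92;`, which a conformant parser reads back as `\` but which never reaches a JavaScript-style escape interpreter as a raw backslash:

```python
import xml.sax.saxutils as saxutils

def escape_backslashes(text: str) -> str:
    """Replace literal backslashes with the XML numeric character
    reference &#92; so downstream JavaScript-based parsers do not
    treat them as the start of an escape sequence."""
    return text.replace("\\", "&#92;")

def escape_html(text: str) -> str:
    """Broader variant: escape the XML special characters (&, <, >)
    first, then backslashes."""
    return saxutils.escape(text).replace("\\", "&#92;")

# The label from the failing node above:
print(escape_backslashes("nbu.ca\\french\\agm7"))  # nbu.ca&#92;french&#92;agm7
```

Whether sigma.js renders `&#92;` back as a visible backslash in labels would need to be verified; the narrower fix at least keeps the file loadable.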

Describe alternatives you've considered
It could also be handled by graphpass or sigma; that probably makes the most sense.

Additional context
I think this is a classic edge case - something very unexpected in probably the hundreds of millions of links that we've processed in the WALK collection.

@greebie
Contributor

greebie commented Aug 31, 2018

I'm going to see if I can trace why the Javascript parser is doing this with backslashes.

Backslashes are permitted in a URL, but the convention is to use forward slashes instead. We could simply create a UDF to convert them as desired.

I think we should also consider whether we create an implicit class for these edge cases, or a general all-purpose filtering tool (like our own version of boilerpipe). For instance, it seems like we get <script> tags as labels at times, which don't seem too interesting for network analysis. URLs with lots of whitespace also read to me as something we would want to lose as a rule.

@greebie
Contributor

greebie commented Aug 31, 2018

Chrome will escape backslashes in an attribute in the console. I think we are getting errors because I have told Ruby that the code is XML-safe. But technically, XML should support backslashes.

@greebie
Contributor

greebie commented Aug 31, 2018

Okay - the issue is that in JavaScript, \ signals that an escape sequence or Unicode character follows. Since it's possible that people will want to use the XML in a website, we should probably offer a script that converts things like newlines to escaped entities.
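The pitfall can be demonstrated directly, since Python string literals share most of JavaScript's backslash escape sequences. In the failing label, `\f` and `\a` are swallowed as control characters (form feed and bell) instead of being read as a backslash plus a letter:

```python
# Raw string: backslashes are preserved as literal characters.
raw = r"nbu.ca\french\agm7"

# Ordinary literal: \f becomes form feed, \a becomes bell,
# exactly the kind of misreading a JavaScript-style parser makes.
interpreted = "nbu.ca\french\agm7"

print(len(raw))          # 18
print(len(interpreted))  # 16 -- two characters consumed by escapes
```

This is why the file is still fine in Gephi and passes XML linters (XML itself attaches no meaning to `\`), but breaks once JavaScript-style escape handling gets involved.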

@ianmilligan1
Member Author

Ok - thanks @greebie.

I’m not totally following you here. By script, do you mean an AUT UDF? Or is this something that can be handled elsewhere?

@greebie
Contributor

greebie commented Aug 31, 2018

UDF = "User defined function" - just a function that is not part of Apache Spark.

@ianmilligan1
Member Author

Yes I know what a UDF is.

What are you proposing above? The original plan as outlined in the issue, or something else?

@greebie
Contributor

greebie commented Aug 31, 2018

The question is whether to 1) create an implicit class containing tools to remove things, now or in the future, that cause problems (faulty URLs, whitespace, JavaScript), or 2) create a general function that just removes all (what we think is) junk in one fell swoop.

example of #1: ExtractLinks(x).removeBackslashes().removeWhitespace() (etc.)
example of #2: RemoveKnownJunk(ExtractLinks(x))

I realize that we have this specific issue, but the overall issue is the ability to clean data containing wonky stuff. I am asking whether it's better to look at the bigger-picture problem while we are looking at the smaller one.
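The two options above can be sketched side by side. This is a hypothetical Python illustration of the shapes being debated, not AUT code (the real implicit class would wrap ExtractLinks output in Scala, and all names here are invented for the example):

```python
class LinkCleaner:
    """Option 1: an object exposing chainable, single-purpose
    cleanup methods, analogous to an implicit class on link output."""

    def __init__(self, links):
        self.links = list(links)

    def remove_backslashes(self):
        self.links = [l.replace("\\", "") for l in self.links]
        return self

    def remove_whitespace(self):
        self.links = ["".join(l.split()) for l in self.links]
        return self

    def remove_script_tags(self):
        self.links = [l for l in self.links if "<script" not in l.lower()]
        return self

def remove_known_junk(links):
    """Option 2: one function that applies every known cleanup at once."""
    return (LinkCleaner(links)
            .remove_backslashes()
            .remove_whitespace()
            .remove_script_tags()
            .links)

print(remove_known_junk(
    ["nbu.ca\\french\\agm7", "http://a b.com", "<script>x</script>"]))
# ['nbu.cafrenchagm7', 'http://ab.com']
```

The trade-off is the usual one: option 1 lets users pick exactly which cleanups apply, while option 2 is simpler to call but bakes in one definition of "junk".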

@ianmilligan1
Member Author

Ok thanks for clarifying. There’s a lot of thinking out loud above and I was trying to see what you are proposing.

My preference would be for option number one. But let’s circle around to this on Wednesday during our next team meeting.

@ianmilligan1
Member Author

In this case, we decided it would be best to focus on the domain extractor bug that is causing this problem.
