Add EscapeHTML Function for ExtractLinks #266

Closed
ianmilligan1 opened this issue Aug 31, 2018 · 9 comments

Comments

@ianmilligan1
Member

Is your feature request related to a problem? Please describe.
When working with graph files in sigma.js, we sometimes have backslashes that trip up sigma's interpreter. This is apparently due to JavaScript's XML parsing, which is more finicky than the standard XML schema.

This is a relatively rare problem, affecting 2 of the 117 GEXF files tested. I think it is because it only crashes on backslashes, whereas most URLs use forward slashes. So when there's an unexpected slash direction in a link, something goes awry. /shrug

The issue shows up when loading the graph: it fails to appear due to invalid XML. However, the file still opens fine in Gephi, and the problem isn't caught by XML linters.

Example lines that break Sigma visualization:

    <node id="n10206" label="nbu.ca\french\agm7">
    <attvalues>
      <attvalue for="v_label" value="nbu.ca\french\agm7" />

In the GEXF file if the above is changed to:

    <node id="n10206" label="nbu.cafrenchagm7">
    <attvalues>
      <attvalue for="v_label" value="nbu.cafrenchagm7" />

The graph appears.

In the other case, it is a similar problem.

Describe the solution you'd like
The GEXF files should have the backslash escaped. One option, proposed by @greebie, is to implement a new function escapeHTML() that can fix this. Or, more narrowly, since the problem appears limited to backslashes, an escapeBackslashes() function.
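A minimal sketch of what such helpers could look like, in Python for illustration only (the names escapeHTML and escapeBackslashes come from this issue; AUT's actual UDFs would be written in Scala, and this implementation is an assumption, not the project's code). The idea is to rewrite a literal backslash as the XML numeric character reference `&#92;`, which a conformant parser reads back as `\` but which never reaches a JavaScript-style escape interpreter as a raw backslash:

```python
import xml.sax.saxutils as saxutils

def escape_backslashes(text: str) -> str:
    """Replace literal backslashes with the XML numeric character
    reference &#92; so downstream JavaScript-based parsers do not
    treat them as the start of an escape sequence."""
    return text.replace("\\", "&#92;")

def escape_html(text: str) -> str:
    """Broader variant: escape the XML special characters (&, <, >)
    first, then backslashes."""
    return saxutils.escape(text).replace("\\", "&#92;")

# The label from the failing node above:
print(escape_backslashes("nbu.ca\\french\\agm7"))  # nbu.ca&#92;french&#92;agm7
```

Whether sigma.js renders `&#92;` back as a visible backslash in labels would need to be verified; the narrower fix at least keeps the file loadable.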

Describe alternatives you've considered
It could also be handled by graphpass or sigma; that probably makes the most sense.

Additional context
I think this is a classic edge case - something very unexpected in probably the hundreds of millions of links that we've processed in the WALK collection.

@greebie
Contributor

greebie commented Aug 31, 2018

I'm going to see if I can trace why the Javascript parser is doing this with backslashes.

Backslashes are permitted in a URL, but the convention is to use forward slashes instead. We could simply create a UDF to convert them as desired.

I think we should also consider whether we create an implicit class for these edge cases, or a general all-purpose filtering tool (like our own version of boilerpipe). For instance, it seems like we get <script> tags as labels at times, which don't seem too interesting for network analysis. URLs with lots of whitespace also read to me as something we would want to lose as a rule.

@greebie
Contributor

greebie commented Aug 31, 2018

Chrome will escape backslashes in an attribute in the console. I think we are getting errors because I have told Ruby that the code is XML-safe. But technically, XML should support backslashes.

@greebie
Contributor

greebie commented Aug 31, 2018

Okay - the issue is that in JavaScript, \ signals that an escape sequence or Unicode character follows. Since it's possible that people will want to use the XML in a website, we should probably offer a script that converts things like newlines to escaped entities.
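The pitfall can be demonstrated directly, since Python string literals share most of JavaScript's backslash escape sequences. In the failing label, `\f` and `\a` are swallowed as control characters (form feed and bell) instead of being read as a backslash plus a letter:

```python
# Raw string: backslashes are preserved as literal characters.
raw = r"nbu.ca\french\agm7"

# Ordinary literal: \f becomes form feed, \a becomes bell,
# exactly the kind of misreading a JavaScript-style parser makes.
interpreted = "nbu.ca\french\agm7"

print(len(raw))          # 18
print(len(interpreted))  # 16 -- two characters consumed by escapes
```

This is why the file is still fine in Gephi and passes XML linters (XML itself attaches no meaning to `\`), but breaks once JavaScript-style escape handling gets involved.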

@ianmilligan1
Member Author

Ok - thanks @greebie.

I’m not totally following you here. By script, do you mean an AUT UDF? Or is this something that can be handled elsewhere?

@greebie
Contributor

greebie commented Aug 31, 2018

UDF = "User defined function" - just a function that is not part of Apache Spark.

@ianmilligan1
Member Author

Yes I know what a UDF is.

What are you proposing above? The original plan as outlined in the issue, or something else?

@greebie
Contributor

greebie commented Aug 31, 2018

The question is whether to 1) create an implicit class containing tools to remove things, now or in the future, that cause problems (faulty URLs, whitespace, JavaScript), or 2) create a general function that just removes all (what we think is) junk in one fell swoop.

example of #1: ExtractLinks(x).removeBackslashes().removeWhitespace() (etc.)
example of #2: RemoveKnownJunk(ExtractLinks(x))

I realize that we have this specific issue, but the overall issue is the ability to clean data containing wonky stuff. I am asking whether it's better to look at the bigger-picture problem while we are looking at the smaller one.
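The two options above can be sketched side by side. This is a hypothetical Python illustration of the shapes being debated, not AUT code (the real implicit class would wrap ExtractLinks output in Scala, and all names here are invented for the example):

```python
class LinkCleaner:
    """Option 1: an object exposing chainable, single-purpose
    cleanup methods, analogous to an implicit class on link output."""

    def __init__(self, links):
        self.links = list(links)

    def remove_backslashes(self):
        self.links = [l.replace("\\", "") for l in self.links]
        return self

    def remove_whitespace(self):
        self.links = ["".join(l.split()) for l in self.links]
        return self

    def remove_script_tags(self):
        self.links = [l for l in self.links if "<script" not in l.lower()]
        return self

def remove_known_junk(links):
    """Option 2: one function that applies every known cleanup at once."""
    return (LinkCleaner(links)
            .remove_backslashes()
            .remove_whitespace()
            .remove_script_tags()
            .links)

print(remove_known_junk(
    ["nbu.ca\\french\\agm7", "http://a b.com", "<script>x</script>"]))
# ['nbu.cafrenchagm7', 'http://ab.com']
```

The trade-off is the usual one: option 1 lets users pick exactly which cleanups apply, while option 2 is simpler to call but bakes in one definition of "junk".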

@ianmilligan1
Member Author

Ok thanks for clarifying. There’s a lot of thinking out loud above and I was trying to see what you are proposing.

My preference would be for option number one. But let’s circle around to this on Wednesday during our next team meeting.

@ianmilligan1
Member Author

In this case, we decided it would be best to focus on the domain extractor bug that is causing this problem.
