Add EscapeHTML Function for ExtractLinks #266
Comments
I'm going to see if I can trace why the JavaScript parser is doing this with backslashes. Backslashes are permitted in a URL, but the standard is to use forward slashes instead. We could simply create a UDF to convert them as desired. I think we should also consider whether we create an implicit class for these edge cases, or a general all-purpose filtering tool (like our own version of boilerpipe). For instance, it seems like we get `<script>` tags as labels at times, which don't seem too interesting for network analysis. URLs with lots of whitespace also read to me as something we would want to lose as a rule.
Chrome will escape backslashes in an attribute in the console. I think we are getting errors because I have told Ruby that the code is XML-safe. But technically, XML should support backslashes.
Okay - the issue is that in JavaScript, `\` implies that an escape or Unicode character will follow. Since it's possible that people would want to use the XML in a website, we should probably offer a script that will convert things like newlines etc. to escaped entities.
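To make the idea concrete, here is a minimal sketch (not part of AUT) of the kind of conversion being proposed: replace characters that can confuse a JavaScript-side XML consumer with escaped forms before they land in a GEXF attribute. The exact replacement set here is an assumption based on this thread, not a spec.

```python
# Hypothetical sketch: escape characters that trip up JavaScript XML
# consumers before writing them into GEXF attribute values.
def escape_for_gexf(value: str) -> str:
    return (
        value.replace("&", "&amp;")   # must run first so later entities survive
             .replace("\\", "%5C")    # percent-encode backslashes in URLs
             .replace("<", "&lt;")
             .replace(">", "&gt;")
             .replace('"', "&quot;")
             .replace("\n", "&#10;")  # newlines as numeric character references
    )
```

For example, `escape_for_gexf("http://example.com\\path")` yields `"http://example.com%5Cpath"`, which is both a valid URL and safe inside an XML attribute.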
Ok - thanks @greebie. I’m not totally following you here. By script, do you mean an AUT UDF? Or is this something that can be handled elsewhere? |
UDF = "User defined function" - just a function that is not part of Apache Spark. |
Yes I know what a UDF is. What are you proposing above? The original plan as outlined in the issue, or something else? |
The question is:
1) Create an implicit class containing tools to remove things, now or in the future, that cause problems (faulty URLs, whitespace, JavaScript, etc.), or
2) create a general function that just removes all (what we think is) junk in one fell swoop.

Example of #1: `ExtractLinks(x).removeBackslashes().removeWhitespace()` (etc.)

I realize that we have this specific issue, but the overall issue is the ability to clean data with wonky stuff. I am asking whether it's better to look at the bigger-picture problem since we are looking at the smaller one.
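Option 1 can be sketched as a small chainable wrapper, so callers opt into individual cleanups. This is a hypothetical illustration of the proposed `removeBackslashes().removeWhitespace()` style, not AUT code; the class and method names are made up for this example.

```python
import re

# Hypothetical chainable cleaner illustrating option 1: each method
# performs one cleanup and returns self, so cleanups can be composed.
class LinkCleaner:
    def __init__(self, links):
        self.links = list(links)

    def remove_backslashes(self):
        # Normalize backslashes to forward slashes, per the thread.
        self.links = [l.replace("\\", "/") for l in self.links]
        return self

    def remove_whitespace(self):
        # Drop all whitespace runs inside each link.
        self.links = [re.sub(r"\s+", "", l) for l in self.links]
        return self

    def remove_script_tags(self):
        # Discard labels that are really <script> fragments.
        self.links = [l for l in self.links if "<script" not in l.lower()]
        return self
```

Usage would look like `LinkCleaner(links).remove_backslashes().remove_whitespace().links`, which keeps each cleanup independently testable while still allowing a "do everything" convenience method later.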
Ok thanks for clarifying. There’s a lot of thinking out loud above and I was trying to see what you are proposing. My preference would be for option number one. But let’s circle around to this on Wednesday during our next team meeting. |
In this case, we decided it would be best to focus on the domain extractor bug which is causing this problem. |
Is your feature request related to a problem? Please describe.
When working with graph files in sigma.js, we sometimes have backslashes that trip up sigma's interpreter. This is apparently due to JavaScript's XML parsing, which is more finicky than the standard XML schema.
This is a relatively rare problem, affecting 2 of the 117 GEXF files tested. I think that is because it only crashes on backslashes, and most URLs use forward slashes. So when there's an unexpected backslash in a link, something goes awry. /shrug
The issue shows up when loading the graph: it fails to appear due to invalid XML. However, the file still opens fine in Gephi, and the issue isn't caught by XML linters.
Example lines that break Sigma visualization:
In the GEXF file if the above is changed to:
The graph appears.
In the other case, it is a similar problem.
Describe the solution you'd like
The GEXF files should have the backslash escaped. One option, proposed by @greebie, is to implement a new function `escapeHTML()` that can fix this. Or, even more granularly, since it appears to just be backslashes, maybe an `escapeBackslashes` function.
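A narrowly scoped `escapeBackslashes` could be very small. The sketch below is a hypothetical illustration, assuming percent-encoding is the desired output; the thread also floats converting `\` to `/` instead.

```python
# Hypothetical escapeBackslashes sketch: percent-encode backslashes so
# the URL stays valid and no longer looks like a JavaScript escape.
def escape_backslashes(url: str) -> str:
    return url.replace("\\", "%5C")
```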
Describe alternatives you've considered
It could also be handled by graphpass or sigma. That probably makes the most sense.
Additional context
I think this is a classic edge case - something very unexpected in probably the hundreds of millions of links that we've processed in the WALK collection.