-
Notifications
You must be signed in to change notification settings - Fork 29
Add support for HTML file to extract-regexes.pl #77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/extract/extract-regexes.pl
Outdated
my $extractor = $language2extractor{$language}; | ||
if ($extractor and -x $extractor) { | ||
print STDERR "$extractor '$json->{file}'\n"; | ||
my $original_path = $json->{file}; | ||
|
||
if ($language eq "html") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed: I suggest you try an implementation for the html-extractor that recursively calls the meta-program to request JS extraction, and then parses its results.
… to extract-regexps-html.py
# hardcoded js extractor location | ||
output = subprocess.run(['./src/javascript/extract-regexps.js', './src/html/temp-js-content.js'], capture_output=True, text=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only concern that I still have is that the location of the js regex extractor is hardcoded in the html regex extractor. I thought about letting the meta-program pass the location to the html regex extractor, but that would require creating a special case for html file in the meta-program, which I think would create the modular design.
with open(file_path) as fp: | ||
soup = BeautifulSoup(fp, 'html.parser') | ||
|
||
js_from_html = '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it appropriate to merge all of the JS script tags like this, or would it be better to make one file per <script>? I am not sure the DOM JS semantics, could we get compilation failures as a result?
return js_from_html | ||
|
||
def extract_regexes(json_tempfile): | ||
output = subprocess.run(['./extract-regexes.pl', json_tempfile.name], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# create temp-js-content.js based on the location of extract-regexes.pl | ||
js_tempfile = tempfile.NamedTemporaryFile(suffix='.js', mode='w+t') | ||
js_tempfile.writelines(js_from_html) | ||
js_tempfile.seek(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why seek? You do not use this object again.
# create temp json file to pass to the meta-program | ||
json_tempfile = tempfile.NamedTemporaryFile(suffix='.json', mode='w+t') | ||
json_tempfile.writelines(json.dumps({"file": js_tempfile.name, "language": "javascript"})) | ||
json_tempfile.seek(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why seek? You do not use this object again.
js_tempfile.writelines(js_from_html) | ||
js_tempfile.close() | ||
|
||
# create temp json file to pass to the meta-program |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit -- composing the input file for the meta-tool -- should be part of the function extract_regexes
.
json_tempfile.close() | ||
|
||
# call the meta-program | ||
print(extract_regexes(json_tempfile, file_path), end = '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forget, does this output include the name of the file being scanned? (which you are about to delete)? What are the implications for users?
Should we (1) name the temp file more uniquely, and then (2) run a search-replace on the returned data to use the name of the original HTML file instead?
LGTM, thank you. |
Summary
extract-regexes.pl used to only support python and js files. I added support for html files so that now extract-regexes.pl can process three types of files
Implementation Details
The input html file is first processed by beautifulsoup to combine all of its script tags' content into a new, temporary js file. Then, that js file is fed into the already-existing javascript extractor. After it's done, the temporary js file is deleted. A new test for html file is also created in ./src/extract/test/html