Skip to content

Add support for HTML file to extract-regexes.pl #77

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jul 12, 2021

Conversation

du201
Copy link
Contributor

@du201 du201 commented Jun 26, 2021

Summary

extract-regexes.pl used to only support python and js files. I added support for html files so that now extract-regexes.pl can process three types of files

Implementation Details

The input html file is first processed by beautifulsoup to combine all of its script tags' content into a new, temporary js file. Then, that js file is fed into the already-existing javascript extractor. After it's done, the temporary js file is deleted. A new test for html file is also created in ./src/extract/test/html

my $extractor = $language2extractor{$language};
if ($extractor and -x $extractor) {
print STDERR "$extractor '$json->{file}'\n";
my $original_path = $json->{file};

if ($language eq "html") {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed: I suggest you try an implementation for the html-extractor that recursively calls the meta-program to request JS extraction, and then parses its results.

# hardcoded js extractor location
output = subprocess.run(['./src/javascript/extract-regexps.js', './src/html/temp-js-content.js'], capture_output=True, text=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only concern that I still have is that the location of the js regex extractor is hardcoded in the html regex extractor. I thought about letting the meta-program pass the location to the html regex extractor, but that would require creating a special case for html file in the meta-program, which I think would create the modular design.

with open(file_path) as fp:
soup = BeautifulSoup(fp, 'html.parser')

js_from_html = ''
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it appropriate to merge all of the JS script tags like this, or would it be better to make one file per <script>? I am not sure the DOM JS semantics, could we get compilation failures as a result?

return js_from_html

def extract_regexes(json_tempfile):
output = subprocess.run(['./extract-regexes.pl', json_tempfile.name],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# create temp-js-content.js based on the location of extract-regexes.pl
js_tempfile = tempfile.NamedTemporaryFile(suffix='.js', mode='w+t')
js_tempfile.writelines(js_from_html)
js_tempfile.seek(0)
Copy link
Owner

@davisjam davisjam Jun 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why seek? You do not use this object again.

# create temp json file to pass to the meta-program
json_tempfile = tempfile.NamedTemporaryFile(suffix='.json', mode='w+t')
json_tempfile.writelines(json.dumps({"file": js_tempfile.name, "language": "javascript"}))
json_tempfile.seek(0)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why seek? You do not use this object again.

js_tempfile.writelines(js_from_html)
js_tempfile.close()

# create temp json file to pass to the meta-program
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit -- composing the input file for the meta-tool -- should be part of the function extract_regexes.

json_tempfile.close()

# call the meta-program
print(extract_regexes(json_tempfile, file_path), end = '')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget, does this output include the name of the file being scanned? (which you are about to delete)? What are the implications for users?

Should we (1) name the temp file more uniquely, and then (2) run a search-replace on the returned data to use the name of the original HTML file instead?

@davisjam davisjam merged commit dd912d2 into davisjam:master Jul 12, 2021
@davisjam
Copy link
Owner

LGTM, thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants