Add support for HTML file to extract-regexes.pl #77

Merged

davisjam merged 9 commits into davisjam:master from du201:master

Jul 12, 2021

Contributor

du201 commented Jun 26, 2021

Summary

extract-regexes.pl previously supported only Python and JavaScript files. I added support for HTML files, so it can now process three types of files.

Implementation Details

The input HTML file is first processed by Beautiful Soup, which combines the content of all its script tags into a new, temporary JS file. That JS file is then fed into the existing JavaScript extractor, and the temporary file is deleted afterward. A new test for HTML files is also added in ./src/extract/test/html.
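A minimal sketch of the script-collection step described above. The PR uses Beautiful Soup; this version swaps in the standard library's `html.parser` so it runs without extra packages. `ScriptCollector` and `collect_script_source` are hypothetical names, not code from this PR.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self.in_script:
            self.scripts[-1] += data


def collect_script_source(html_text):
    """Return the non-empty script bodies joined into one JS source string."""
    collector = ScriptCollector()
    collector.feed(html_text)
    collector.close()
    return "\n".join(body for body in collector.scripts if body.strip())
```

Script tags that only carry a `src` attribute have no inline body and are skipped here.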

du201 added 3 commits

June 25, 2021 14:25


          Add html regex extractor

afaca28


          Install beautiful-soup when run configure

d0c344b


          prepare for pull request

b857cc6

davisjam reviewed

src/extract/extract-regexes.pl Outdated

               my $extractor = $language2extractor{$language};
               if ($extractor and -x $extractor) {
                 print STDERR "$extractor '$json->{file}'\n";
+                my $original_path = $json->{file};
+                if ($language eq "html") {

Owner

davisjam Jun 28, 2021

As discussed: I suggest you try an implementation for the html-extractor that recursively calls the meta-program to request JS extraction, and then parses its results.


          Improve on modularity: move html-dealing code from extract-regexes.pl to extract-regexps-html.py

f5f3826

du201 commented

src/extract/src/html/extract-regexps-html.py Outdated

		# hardcoded js extractor location
		output = subprocess.run(['./src/javascript/extract-regexps.js', './src/html/temp-js-content.js'], capture_output=True, text=True)

Contributor Author

du201 Jun 30, 2021

The only concern I still have is that the location of the JS regex extractor is hardcoded in the HTML regex extractor. I thought about letting the meta-program pass the location to the HTML regex extractor, but that would require a special case for HTML files in the meta-program, which I think would break the modular design.


          Improve modularity: let html regex extractor recursively call extract-regexes.pl

b18b7ea

davisjam requested changes

src/extract/src/html/extract-regexps-html.py Outdated

+              with open(file_path) as fp:
+                  soup = BeautifulSoup(fp, 'html.parser')
+              js_from_html = ''

Owner

davisjam Jun 30, 2021

Is it appropriate to merge all of the JS script tags like this, or would it be better to make one file per <script>? I am not sure of the DOM JS semantics; could we get compilation failures as a result?
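For context on the question above: browsers parse each classic `<script>` element as its own unit, so a syntax error in one block does not stop the others from parsing, whereas a concatenated file fails as a whole. A minimal sketch of the one-file-per-script alternative, assuming the script bodies are already collected in a list; `write_one_file_per_script` is a hypothetical helper, not code from this PR.

```python
import tempfile


def write_one_file_per_script(scripts, directory):
    """Write each script body to its own .js file and return the paths in order."""
    paths = []
    for body in scripts:
        # delete=False keeps the file on disk after close so that another
        # process (for example a JS extractor) can open it by name.
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".js", dir=directory, delete=False
        ) as handle:
            handle.write(body)
            paths.append(handle.name)
    return paths
```

Creating the files inside a `tempfile.TemporaryDirectory()` would clean them all up in one step once extraction is done.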


          Switch to tempfile and add some functions

8ac48c5

davisjam reviewed

src/extract/src/html/extract-regexps-html.py Outdated

+                  return js_from_html
+              def extract_regexes(json_tempfile):
+                  output = subprocess.run(['./extract-regexes.pl', json_tempfile.name],

Owner

davisjam Jun 30, 2021

PTAL https://stackoverflow.com/questions/918154/relative-paths-in-python
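The point of the link is that `./extract-regexes.pl` resolves against the caller's working directory, not against the script's own location. A minimal sketch of anchoring the path to the script instead; `resolve_from_script` is a hypothetical helper, and the `../../` hop in the comment is inferred from the file paths shown in this thread.

```python
from pathlib import Path


def resolve_from_script(script_file, relative_path):
    """Resolve relative_path against the directory that contains script_file."""
    script_dir = Path(script_file).resolve().parent
    return (script_dir / relative_path).resolve()


# Inside src/extract/src/html/extract-regexps-html.py this could be called as:
#   meta_program = resolve_from_script(__file__, "../../extract-regexes.pl")
```

With this, the extractor finds the meta-program regardless of the directory it is launched from.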

davisjam reviewed

src/extract/src/html/extract-regexps-html.py Outdated

+              # create temp-js-content.js based on the location of extract-regexes.pl
+              js_tempfile = tempfile.NamedTemporaryFile(suffix='.js', mode='w+t')
+              js_tempfile.writelines(js_from_html)
+              js_tempfile.seek(0)

Owner

davisjam Jun 30, 2021 (edited)

Why seek? You do not use this object again.
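A likely explanation is that `seek(0)` only mattered for its side effect of flushing buffered text to disk before another process opens the file by name. A minimal sketch that makes this explicit with `flush()`; `write_visible_tempfile` is a hypothetical helper, and reopening a still-open `NamedTemporaryFile` by name is assumed to work as on POSIX systems (it may not on Windows).

```python
import tempfile


def write_visible_tempfile(text, suffix):
    """Write text to a named temp file and flush it so readers of the path see it."""
    handle = tempfile.NamedTemporaryFile(mode="w+t", suffix=suffix)
    handle.write(text)
    # flush() pushes Python's buffer to the OS. No seek is needed unless this
    # same handle is going to be read again.
    handle.flush()
    return handle  # closing the handle deletes the file
```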

davisjam reviewed

src/extract/src/html/extract-regexps-html.py Outdated

+              # create temp json file to pass to the meta-program
+              json_tempfile = tempfile.NamedTemporaryFile(suffix='.json', mode='w+t')
+              json_tempfile.writelines(json.dumps({"file": js_tempfile.name, "language": "javascript"}))
+              json_tempfile.seek(0)

Owner

davisjam Jun 30, 2021

Why seek? You do not use this object again.

davisjam reviewed

du201 added 2 commits

July 1, 2021 10:17


          Switch from relative path to absolute path in html extractor

a5404d5


          Fix tempfile issue and absolute path issue (now html extractor can be run from any directory)

e8ed1b4

davisjam reviewed

src/extract/src/html/extract-regexps-html.py Outdated

+              js_tempfile.writelines(js_from_html)
+              js_tempfile.close()
+              # create temp json file to pass to the meta-program

Owner

davisjam Jul 12, 2021

This bit -- composing the input file for the meta-tool -- should be part of the function extract_regexes.
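A minimal sketch of what this could look like, with the JSON request composed inside `extract_regexes` itself. The signature is an assumption and differs from the one in the diff; `meta_command` is a hypothetical parameter holding the command as a list (for example the path to extract-regexes.pl). The request keys `file` and `language` are the ones shown in the earlier diff.

```python
import json
import subprocess
import tempfile


def extract_regexes(js_path, meta_command):
    """Compose the JSON request for js_path, run meta_command on it, return stdout."""
    request = {"file": js_path, "language": "javascript"}
    with tempfile.NamedTemporaryFile(mode="w", suffix=".json") as request_file:
        json.dump(request, request_file)
        request_file.flush()  # make the contents visible to the child process
        completed = subprocess.run(
            [*meta_command, request_file.name],
            capture_output=True,
            text=True,
            check=True,  # raise if the meta-program exits with an error
        )
    return completed.stdout
```

The caller then only supplies the JS file path, and the request file is removed automatically when the `with` block exits.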

davisjam reviewed

src/extract/src/html/extract-regexps-html.py Outdated

+              json_tempfile.close()
+              # call the meta-program
+              print(extract_regexes(json_tempfile, file_path), end = '')

Owner

davisjam Jul 12, 2021

I forget: does this output include the name of the file being scanned (which you are about to delete)? What are the implications for users?

Should we (1) name the temp file more uniquely, and then (2) run a search-replace on the returned data to use the name of the original HTML file instead?
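A minimal sketch of step (2). It assumes, without confirmation from this thread, that the meta-program prints a single JSON object whose `file` field echoes the scanned path; `restore_original_name` is a hypothetical helper. For step (1), `tempfile.NamedTemporaryFile` already generates a random, unique name.

```python
import json


def restore_original_name(meta_output, temp_path, original_path):
    """Point the "file" field of the meta-program's JSON output at original_path.

    Assumption: meta_output is one JSON object with a "file" field that holds
    the path that was scanned.
    """
    result = json.loads(meta_output)
    if result.get("file") == temp_path:
        result["file"] = original_path
    return json.dumps(result)
```

Parsing the JSON and replacing one field avoids accidentally rewriting a regex that happens to contain the temp file's name, which a plain search-replace on the raw text could do.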


          Reorganize extract_regexes function

dc8c5f0

davisjam merged commit dd912d2 into davisjam:master

Owner

davisjam commented Jul 12, 2021

LGTM, thank you.
