Add support for HTML file to extract-regexes.pl #77
Merged
9 commits
afaca28 Add html regex extractor (du201)
d0c344b Install beautiful-soup when run configure (du201)
b857cc6 prepare for pull request (du201)
f5f3826 Improve on modularity: move html-dealing code from extract-regexes.pl… (du201)
b18b7ea Improve modularity: let html regex extractor recursively call extract… (du201)
8ac48c5 Switch to tempfile and add some functions (du201)
a5404d5 Switch from relative path to absolute path in html extractor (du201)
e8ed1b4 Fix tempfile issue and absolute path issue (now html extractor can be… (du201)
dc8c5f0 Reorganize extract_regexes function (du201)
@@ -0,0 +1,54 @@
#!/usr/bin/env python3
# Description: This file takes in an HTML file from extract-regexes.pl, finds all the script
# tags, and combines the JS in them into a temporary JS file. It then calls extract-regexes.pl
# on that temporary file so the JS is pipelined to the JavaScript extractor. After the
# extraction finishes, the temporary files are deleted and the reported file path is set
# back to the original HTML file.

from bs4 import BeautifulSoup
import sys
import subprocess
import json
import tempfile
import os

def extract_js(file_path):
    with open(file_path) as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    js_from_html = ''
    for script in soup.find_all('script'):
        js_from_html += (script.string or '') + '\n'  # .string is None for src-only scripts

    return js_from_html

def extract_regexes(js_from_html, file_path):
    js_tempfile = tempfile.NamedTemporaryFile(suffix='.js', mode='w+t', delete=False)
    js_tempfile.write(js_from_html)
    js_tempfile.close()

    # create temp json file to pass to the meta-program
    json_tempfile = tempfile.NamedTemporaryFile(suffix='.json', mode='w+t', delete=False)
    json_tempfile.write(json.dumps({"file": js_tempfile.name, "language": "javascript"}))
    json_tempfile.close()

    output = subprocess.run(
        [os.path.join(os.environ['VULN_REGEX_DETECTOR_ROOT'], 'src/extract/extract-regexes.pl'),
         json_tempfile.name],
        capture_output=True, text=True)

    # delete the temp js and json file
    os.remove(js_tempfile.name)
    os.remove(json_tempfile.name)

    # report the original HTML file rather than the temp JS file
    output_json = json.loads(output.stdout)
    output_json['file'] = file_path
    return json.dumps(output_json)

file_path = sys.argv[1]
js_from_html = extract_js(file_path)

# call the meta-program
print(extract_regexes(js_from_html, file_path), end='')
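One gap worth noting in the script above: the two `os.remove` calls are only reached if `subprocess.run` returns, so the temporary files are left behind if `VULN_REGEX_DETECTOR_ROOT` is unset or the Perl script cannot be launched. A minimal sketch of one way to guarantee cleanup follows. The names `temp_text_file` and `run_with_js_query` are hypothetical, and `run` is an assumed stand-in for the `subprocess.run` call; none of this is part of the PR.

```python
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_text_file(contents, suffix):
    """Yield the path of a temp file holding `contents`; always delete it."""
    handle = tempfile.NamedTemporaryFile(suffix=suffix, mode="w", delete=False)
    try:
        handle.write(contents)
        handle.close()  # close before handing the path to another process
        yield handle.name
    finally:
        handle.close()  # no-op if the file is already closed
        os.remove(handle.name)


def run_with_js_query(js_source, run):
    """Write the JS and its JSON query to temp files, then call run(query_path).

    `run` is a hypothetical callable standing in for the call to
    extract-regexes.pl. Both files are removed even if `run` raises.
    """
    with temp_text_file(js_source, ".js") as js_path:
        query = json.dumps({"file": js_path, "language": "javascript"})
        with temp_text_file(query, ".json") as query_path:
            return run(query_path)
```

With this shape, `extract_regexes` would pass a small function wrapping `subprocess.run` as `run`, and the cleanup would no longer depend on the success path.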
@@ -0,0 +1 @@
{"file": "./test/html/t.html", "language": "html"}
@@ -0,0 +1,25 @@
<!DOCTYPE html>
<html>
<header>
<script>
  let script_var = 5;
  console.log('test');
  'abc'.match(/def/);
  new RegExp('aaa');
</script>

<script>
  var re = /abcsdxxx/;
</script>

</header>

<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<script>
  var re = /abcsdsdfdf/;
</script>
</body>

</html>
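For a fixture like this, the extractor should pass the contents of all three inline script blocks on to the JavaScript extractor. The sketch below shows that collection step using the standard library's `html.parser` rather than BeautifulSoup, so it runs without third-party packages. This is a swapped-in technique, not what the PR uses, and `ScriptCollector` and `collect_inline_js` are hypothetical names.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of inline <script> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the
        # block that is currently open.
        if self._in_script:
            self.scripts[-1] += data


def collect_inline_js(html_text):
    """Return the inline JavaScript found in html_text as one string."""
    parser = ScriptCollector()
    parser.feed(html_text)
    parser.close()
    # Skip empty blocks (such as <script src="..."></script>) and join the
    # rest with newlines so statements from different blocks stay separate.
    return "\n".join(s.strip() for s in parser.scripts if s.strip())
```

Joining with a newline matters when a block does not end in a semicolon or line break: plain concatenation could otherwise fuse the last statement of one block with the first statement of the next.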
@@ -1 +1 @@
-{"file":"test/js/t.js"}
+{"file": "./test/javascript/t.js", "language": "javascript"}
@@ -3,3 +3,6 @@
 var re = /abc/;
 'abc'.match(/def/);
 new RegExp('aaa');
+
+var re_string = '\\w+';
+new RegExp(re_string);