[jsinterp] Actual JS interpreter #11272

sulyi · 2016-11-23T01:47:53Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

I've started to implement an actual JavaScript syntax parser.
-- EDIT --
And moved on making an interpreter.

yan12125 · 2016-11-25T12:13:13Z

youtube_dl/jsinterp.py

+_STRING_RE = r'%s|%s' % (_SINGLE_QUOTED, _DOUBLE_QUOTED)
+
+_INTEGER_RE = r'%(hex)s|%(dec)s|%(oct)s' % {'hex': __HEXADECIMAL_RE, 'dec': __DECIMAL_RE, 'oct': __OCTAL_RE}
+_FLOAT_RE = r'%(dec)s\.%(dec)s' % {'dec': __DECIMAL_RE}


Oh, thx @yan12125!

yan12125 · 2016-11-25T12:16:29Z

youtube_dl/jsinterp.py

 _NAME_RE = r'[a-zA-Z_$][a-zA-Z_$0-9]*'

+_SINGLE_QUOTED = r"""'(?:[^'\\\\]*(?:\\\\\\\\|\\\\['"nurtbfx/\\n]))*[^'\\\\]*'"""
+_DOUBLE_QUOTED = r'''"(?:[^"\\\\]*(?:\\\\\\\\|\\\\['"nurtbfx/\\n]))*[^"\\\\]*"'''


I think you are misusing \ here. For example:

>>> repr(re.match(r"""'(?:[^'\\\\]*(?:\\\\\\\\|\\\\['"nurtbfx/\\n]))*[^'\\\\]*'""", r"""'\'"""))
'None'

I'll check it, though I borrowed that from utils.

You're right, it wasn't correct, but r"""'\'""" shouldn't match anyway (the string is not closed).
This kinda looks ok to me:

>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", r"""'\''"""))
'<_sre.SRE_Match object; span=(0, 4), match="\'\\\\\'\'">'
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", r"""'\'"""))
'None'
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", """'\''"""))
'<_sre.SRE_Match object; span=(0, 2), match="\'\'">'
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", """'\'"""))
'<_sre.SRE_Match object; span=(0, 2), match="\'\'">'
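The closed/unclosed distinction discussed above can be checked with a small sketch. The pattern and the helper name `match_single_quoted` are illustrative assumptions, not the code from this PR. As a simplification, the pattern accepts a backslash followed by any character except a newline, which is looser than the escape list used in the thread.

```python
import re

# Hypothetical pattern for a single-quoted JS string literal:
# either a character that is not a quote, backslash or newline,
# or a backslash followed by any character except a newline.
SINGLE_QUOTED = r"'(?:[^'\\\n]|\\[^\n])*'"


def match_single_quoted(src):
    """Return the string literal at the start of src, or None."""
    m = re.match(SINGLE_QUOTED, src)
    return m.group(0) if m else None


# A closed literal containing an escaped quote matches in full.
print(match_single_quoted(r"'\''"))
# An unclosed literal (the escaped quote is the last character) does not match.
print(match_single_quoted(r"'\'"))
```

Writing the inputs as raw strings keeps the backslash in the test data, which is the difference between the first two and the last two REPL calls above.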

yan12125 · 2016-11-25T12:23:30Z

youtube_dl/jsinterp.py

+_BOOL_RE = r'true|false'
+# XXX: it seams group cannot be refed this way
+# r'/(?=[^*])[^/\n]*/(?![gimy]*(?P<reflag>[gimy])[gimy]*\g<reflag>)[gimy]{0,4}'
+_REGEX_RE = r'/(?=[^*])[^/\n]*/[gimy]{0,4}'


>>> re.match(r'/(?=[^*])[^/\n]*/[gimy]{0,4}', r'''/\/\/\//''')
<_sre.SRE_Match object; span=(0, 3), match='/\\/'>

Hopefully I've managed to improve on it a little.
--- edit ---
They can't be multiline, can they? I'll need to check that.

They can't be multiline, can they?

Yep. According to ECMA 262 5.1, CR (U+000D), LF (U+000A), LS (U+2028) and PS (U+2029) are not allowed in RegExp literals

Thanks. I'll need to read that a couple more times.
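The two points raised in this thread (escaped slashes and forbidden line terminators) can be combined in one pattern. This is a minimal sketch for Python 3, not the `_REGEX_RE` from the PR: the names `LINE_TERMINATORS`, `REGEX_LITERAL_RE` and `match_regex_literal` are assumptions made for illustration. It does not decide whether a `/` starts a RegExp literal or a division, which depends on the surrounding tokens.

```python
import re

# Line terminators that ECMA-262 5.1 does not allow inside a RegExp literal.
LINE_TERMINATORS = "\n\r\u2028\u2029"

# Hypothetical pattern; {lt} is replaced with the characters above.
_PATTERN = (
    r"/(?![*/])"                          # opening slash, not a comment start
    r"(?:[^\\/\[{lt}]"                    # ordinary character
    r"|\\[^{lt}]"                         # backslash escape such as \/
    r"|\[(?:[^\]\\{lt}]|\\[^{lt}])*\]"    # character class, may contain /
    r")+"
    r"/[gimy]*"                           # closing slash and flags
)
REGEX_LITERAL_RE = re.compile(_PATTERN.format(lt=LINE_TERMINATORS))


def match_regex_literal(src):
    """Return the RegExp literal at the start of src, or None."""
    m = REGEX_LITERAL_RE.match(src)
    return m.group(0) if m else None


print(match_regex_literal(r"/\/\/\//"))   # the whole literal, not just /\/
print(match_regex_literal("/a\nb/"))      # None: contains a line terminator
```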

yan12125 · 2016-11-25T12:51:41Z

youtube_dl/jsinterp.py

+
+# _ARRAY_RE = r'\[(%(literal)s\s*,\s*)*(%(literal)s\s*)?\]' % {'literal': _LITERAL_RE}
+# _VALUE_RE = r'(?:%(literal)s)|(%(array)s)' % {'literal': _LITERAL_RE, 'array': _ARRAY_RE}
+_CALL_RE = r'\.?%(name)s\s*\(' % {'name': _NAME_RE}  # function or method!


Function calls are complex. For example:

from youtube_dl.jsinterp import JSInterpreter

jsi = JSInterpreter('''
function a(x) { return x; }
function b(x) { return x; }
function c() { return [a, b][0](0); }
''')
print(jsi.call_function('c'))

I've added a test.
Tokenizing seems to be fine, but I haven't migrated the interpreter yet and the old one does not support this.
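To see why a name-based call pattern misses this case, here is a small sketch using the `_NAME_RE` and `_CALL_RE` definitions quoted in the diff above. The helper `starts_with_named_call` is a hypothetical name added for illustration.

```python
import re

# Patterns as quoted in the reviewed diff.
_NAME_RE = r'[a-zA-Z_$][a-zA-Z_$0-9]*'
_CALL_RE = r'\.?%(name)s\s*\(' % {'name': _NAME_RE}


def starts_with_named_call(expr):
    """True if expr begins with `name(` or `.name(`."""
    return re.match(_CALL_RE, expr) is not None


print(starts_with_named_call('a(0)'))          # a plain call is recognised
print(starts_with_named_call('[a, b][0](0)'))  # the callee is an expression, not a name
```

The callee in `[a, b][0](0)` is the result of an element access, so recognising the call requires parsing the expression before the parenthesis rather than matching a name.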

yan12125 · 2016-11-25T13:23:50Z

youtube_dl/jsinterp.py

 ]
 _ASSIGN_OPERATORS = [(op + '=', opfunc) for op, opfunc in _OPERATORS]
 _ASSIGN_OPERATORS.append(('=', lambda cur, right: right))

+_RESERVED_RE = r'(?:function|var|(?P<ret>return))\s'


Sorry, but JavaScript is not context-free. For example:

code3 = '''
a = {'var': 3};
function c() { return a.var; }
'''
jsi = JSInterpreter(code3)
print(jsi.call_function('c'))

I've added a test, but I don't see any problem.
And what do you mean by "not context-free", @yan12125?
Didn't you mean to say not regular?
--- edit ---
Sorry, you're right.
Although, according to http://stackoverflow.com/questions/30697267/is-javascript-a-context-free-language:

That object literals must not contain duplicate property names and that function parameter lists must not contain duplicate identifiers are two rules that cannot be expressed using (finite) context-free grammars.

Duplicated keys/parameter names are another issue, which can be ignored during parsing and checked during semantic analysis. In youtube-dl it's safe to assume all inputs are valid JavaScript, so there's no need to handle it.
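The `a.var` example shows that whether a word is a keyword depends on its position. A minimal sketch of that idea follows; `RESERVED`, `TOKEN_RE` and `classify_names` are hypothetical names, the reserved list only covers the three words from `_RESERVED_RE`, and the tokenizer is deliberately naive (it does not handle string literals or comments).

```python
import re

# Only the words from _RESERVED_RE in the reviewed diff.
RESERVED = frozenset(['function', 'var', 'return'])

# Naive tokenizer: a name, or any other single non-space character.
TOKEN_RE = re.compile(
    r'\s*(?:(?P<name>[a-zA-Z_$][a-zA-Z_$0-9]*)|(?P<other>\S))')


def classify_names(src):
    """Classify every name in src by looking at the previous token."""
    result = []
    prev = None
    for m in TOKEN_RE.finditer(src):
        tok = m.group(m.lastgroup)
        if m.lastgroup == 'name':
            if prev == '.':
                kind = 'property'      # after a dot, any name is allowed
            elif tok in RESERVED:
                kind = 'keyword'
            else:
                kind = 'identifier'
            result.append((tok, kind))
        prev = tok
    return result


print(classify_names('return a.var;'))
```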

yan12125 · 2016-11-29T07:01:51Z

Note: OrderedDict is not available in Python 2.6. There was a proposal to drop 2.6, but there is no consensus yet (#5697).

sulyi · 2016-11-29T07:52:24Z

For some reason this had been missed by the code inspector of the IDE I'm working with. I'll try to come up with a workaround. Thanks.

- missing enumerate in op_ids and aop_ids
- order of relation and operator regex in input_element

sulyi

I've left the import line in.

Also a bunch of changes got in that shouldn't have.

sulyi · 2016-12-01T05:23:55Z

I've just realised that my original idea, that the _next_statement method would do the lexical analysis and interpret_statement the parsing, is flawed. To yield a statement, parsing has to have happened already, since Statement is one of the symbols (along with FunctionDeclaration) that replace SourceElement in the syntactic grammar.

- new class TokenStream with peek and pop methods
- _assign_expression handling precedence
- new logical, unary, equality and relation operators
- yet another try replacing OrderedDict
- minor change in lexical grammar allowing identifiers to match reserved words; _chk_id staticmethod has been added to handle it in syntactic grammar
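A token stream with `peek` and `pop` lets the parser look ahead without consuming input. The following is a minimal sketch of that interface, not the `TokenStream` from this PR: the token pattern, the `(kind, value)` tuples and the `('end', None)` sentinel are assumptions made for illustration.

```python
import re


class TokenStream(object):
    # Simplified lexical grammar: numbers, identifiers, single punctuators.
    _TOKEN_RE = re.compile(
        r'\s*(?:(?P<num>\d+)|(?P<id>[A-Za-z_$][\w$]*)|(?P<punct>[^\s\w]))')

    def __init__(self, code):
        self._tokens = self._lex(code)  # lazy generator of tokens
        self._peeked = []               # tokens read ahead but not consumed

    def _lex(self, code):
        pos = 0
        while True:
            m = self._TOKEN_RE.match(code, pos)
            if m is None:
                break
            pos = m.end()
            yield m.lastgroup, m.group(m.lastgroup)

    def peek(self, count=1):
        """Return the count-th upcoming token without consuming it."""
        while len(self._peeked) < count:
            self._peeked.append(next(self._tokens, ('end', None)))
        return self._peeked[count - 1]

    def pop(self):
        """Consume and return the next token."""
        token = self.peek()
        self._peeked.pop(0)
        return token


ts = TokenStream('a = 1')
print(ts.peek(), ts.pop(), ts.pop(), ts.pop(), ts.pop())
```

Because lexing is lazy, the parser drives it: a production can peek at the next token to choose an alternative and only then pop it.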

Supports:
- arrays
- expressions
- calls
- assignment
- variable declaration
- blocks
- return statement
- element and property access

Semantics not yet implemented, tho.

sulyi · 2018-06-09T10:34:36Z

youtube_dl/extractor/youtube.py

@@ -12,7 +12,7 @@
 import traceback

 from .common import InfoExtractor, SearchInfoExtractor
-from ..jsinterp import JSInterpreter
+from ..jsinterp2 import JSInterpreter


These changes were accidental, but oddly they passed test_youtube_signature. For me it fails because JSArrayPrototype._slice doesn't handle arguments correctly, since it is not implemented yet.

- Fixes TestCase class names

Tatsh · 2018-06-10T06:16:09Z

This is, of course, very neat. But much of the standard library in Chrome (and maybe other engines) is implemented in JavaScript. Instead of writing the built-ins (like String.prototype.match) in Python, why not write them in JavaScript (where possible)?

sulyi · 2018-06-10T06:15:44Z

youtube_dl/jsinterp2/jsinterp.py

+            try:
+                ref = (self.this[id] if id in self.this else
+                       self.global_vars[id])
+            except KeyError:


I think JSInterpreter#extract_function is useful for filtering code before execution, so I'd like to keep supporting it. But here it needs to get an object from the outer context, and that behaviour is not in the spec. I'd like to suggest a flag that disables this kind of feature.
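The suggested flag could look roughly like this. It is only a sketch: the class `Scope`, the method `resolve` and the flag name `allow_outer_lookup` are hypothetical and do not come from the PR; only the `this` then `global_vars` lookup order mirrors the quoted diff.

```python
class Scope(object):
    """Resolves names, optionally falling back to the outer context."""

    def __init__(self, this, global_vars, allow_outer_lookup=True):
        self.this = this
        self.global_vars = global_vars
        # Hypothetical flag: when False, the non-spec fallback is disabled.
        self.allow_outer_lookup = allow_outer_lookup

    def resolve(self, name):
        if name in self.this:
            return self.this[name]
        if self.allow_outer_lookup and name in self.global_vars:
            return self.global_vars[name]
        raise KeyError(name)


lenient = Scope({'x': 1}, {'helper': 2})
strict = Scope({'x': 1}, {'helper': 2}, allow_outer_lookup=False)
print(lenient.resolve('helper'))
```

With the flag off, resolving `helper` on `strict` raises `KeyError`, so an extracted function that depends on outer objects fails early instead of silently picking them up.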

sulyi · 2018-06-10T06:24:49Z

@Tatsh That sounds great. I remember seeing such an implementation somewhere, but I didn't understand it well enough to try to adopt it. Can you help?

Tatsh · 2018-06-10T15:29:29Z

Main thing is to get the basics of the interpreter in, which includes the built-in types, and then the JavaScript portions can be written very similarly to how polyfills are written today. So you would not need to implement Array.isArray() in Python if you have the === operator working for comparing function references and .constructor property working on all objects. Then code Array.isArray = function (x) { return x.constructor === Array }; and make this load at runtime before anything else.

I can take a look later since this does interest me. Mainly this was needed for me to get past CloudFlare anti-DDoS without needing cookies.

sulyi · 2018-06-10T16:27:47Z

Well, I've already implemented the === operator and most constructors, I believe, but neither has been tested properly.
Also, the spec of isArray states:

1. If Type(arg) is not Object, return false.
2. If the value of the [[Class]] internal property of arg is "Array", then return true.
3. Return false.

I'm not sure when step 1 would apply, or whether your solution takes care of it.
My solution for this particular function would look like this:

from .internals import jstype, object_type

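def _is_array(arg):
    # Step 1: anything that is not an Object is not an array
    if jstype(arg) is not object_type:
        return False
    # Step 2: check the [[Class]] internal property
    if arg.jsclass == 'Array':
        return True
    # Step 3
    return False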

I'm not wedded to it, nor do I claim that it's elegant, but this is how I'm able to solve the task at hand.
I have far less experience programming in JS than in Python. So if you could tell me what needs to be done for this to work, I can probably take care of it, but actually implementing it this way on my own would be much harder.
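On the question of when step 1 applies: it covers primitive arguments, such as numbers, strings, null and undefined. A self-contained sketch follows, using a stand-in `JSObject` class instead of the PR's `internals` module (both the class and the modelling of primitives as plain Python values are assumptions). For comparison, the `x.constructor === Array` polyfill returns false for number and string primitives but throws a TypeError for null and undefined, so it would need a guard to match the spec.

```python
class JSObject(object):
    """Stand-in for an interpreter object with a [[Class]] property."""

    def __init__(self, jsclass='Object'):
        self.jsclass = jsclass


def is_array(arg):
    # Step 1: primitives (modelled here as plain Python values,
    # with None standing in for null/undefined) are not Objects.
    if not isinstance(arg, JSObject):
        return False
    # Steps 2 and 3: compare the [[Class]] internal property.
    return arg.jsclass == 'Array'


print(is_array(JSObject('Array')), is_array(5), is_array(None))
```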

- adds `jsgrammar.LINETERMINATORSEQ_RE`
- lexer `tstream.TokenStream` checks for lineterminators in tokens
- adds `tstream.Token`
- refactors `tstream.TokenStream` and `jsparser.Parser` to use it

sulyi · 2018-06-10T21:07:50Z

I've added the ability to handle line terminators to the lexer (tstream.TokenStream). This is the first step towards correct line reporting.
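Line reporting can build on this by counting the line terminator sequences inside each token. The sketch below is for Python 3 and rests on assumptions: the pattern is a guess at what `jsgrammar.LINETERMINATORSEQ_RE` matches (CRLF counted as one terminator, plus LF, CR, LS and PS), and `advance` is a hypothetical helper, not part of the PR.

```python
import re

# Assumed definition: CRLF first, so it counts as a single terminator.
LINETERMINATORSEQ_RE = re.compile('\r\n|[\n\r\u2028\u2029]')


def advance(line, column, text):
    """Return the (line, column) position after consuming text."""
    terminators = list(LINETERMINATORSEQ_RE.finditer(text))
    if not terminators:
        return line, column + len(text)
    # The column restarts after the last terminator in the token.
    return line + len(terminators), len(text) - terminators[-1].end()


print(advance(1, 0, 'ab'))
print(advance(1, 2, '/* a\r\nb */'))
```

Multi-line comments are the main case where a single token spans several lines, which is why the count is taken per token rather than per statement.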

This is also useful for my other plan: changing the test suite to generate test cases from JSON files instead of Python. My reasoning is that if jsparser.Parser supported converting the AST to ESTree or a similar format, it would be easy to compare its output against that of acorn or some other parser.
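Such a comparison could be sketched as follows, assuming the parser's output has already been converted to ESTree-like dicts (that conversion is not shown and does not exist in the PR). The helper names and the set of ignored keys are assumptions; position fields are dropped so that trees from different parsers can be compared on structure alone.

```python
import json

# Position fields that a reference parser may attach to each node.
IGNORED_KEYS = frozenset(['start', 'end', 'loc', 'range'])


def strip_positions(node):
    """Recursively drop position information from an ESTree-like tree."""
    if isinstance(node, dict):
        return dict((key, strip_positions(value))
                    for key, value in node.items()
                    if key not in IGNORED_KEYS)
    if isinstance(node, list):
        return [strip_positions(child) for child in node]
    return node


def same_tree(ours, reference_json):
    """Compare our tree with a reference tree stored as JSON text."""
    return strip_positions(ours) == strip_positions(json.loads(reference_json))


ours = {'type': 'Identifier', 'name': 'a'}
reference = '{"type": "Identifier", "start": 0, "end": 1, "name": "a"}'
print(same_tree(ours, reference))
```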

- Adds `jsbuilt_ins.nan` and `jsbuilt_ins.infinity`
- Adds arithmetic operator overload to `jsbuilt_ins.jsnumber.JSNumberPrototype`
- Adds equality operator overload to `jsinterp.Reference`
- Adds better strict equality and typeof operator in `tstream`

- Refactors `Context` and `Reference` classes into their own module named `environment` (saves local import in `tstream`)

sulyi added 4 commits November 23, 2016 02:34

[jsinterp] Actual parsing

d328b8c

[jsinterp] Handling comments

2c85715

[jsinterp] Parsing expr (cleanup needed)

cc895cd

[jsinterp] Calling field and test

8c87a18

sulyi mentioned this pull request Nov 24, 2016

New JSInterpreter Features #11292

Closed

8 tasks

yan12125 reviewed Nov 25, 2016

sulyi added 6 commits November 25, 2016 21:54

[jsinterp] Clean up

2076b0b

[jsinterp] Quick regex fixes (thx to yan12125)

da73cd9

[jsinterp] Complex call test (thx to yan12125)

71a485f

[jsinterp] String literal regex change

8842f08

[jsinterp] Reject method call when name is empty (+reminder TOTOs)

c485fe7

[jsinterp] Simpler regex regex (+more TOTO)

ba5a400

yan12125 added the pending-fixes label Nov 26, 2016

yan12125 self-assigned this Nov 26, 2016

sulyi added 2 commits November 28, 2016 06:53

[jsinterp] Lexer overhaul

b089388

[jsinterp] Value parsing

9bd5dee

sulyi added 3 commits November 30, 2016 07:37

[jsinterp] No OrderedDict

aa7eb3d

[jsinterp] Parser mock up

a0fa6bf

[jsinterp] Minor quick fixes

67d5653

sulyi commented Nov 30, 2016

sulyi added 3 commits December 3, 2016 06:32

[jsinterp] Adding _operator_expression using reversed polish notation

f6005dc

[jsinterp] Parser - take one (untested)

f605783

sulyi commented Jun 9, 2018

sulyi added 6 commits June 10, 2018 03:09

[jsinterp] Rename js2test to jstests

b8a1742

[jsinterp] Fixing incomplete refactor

848aa79

[jsinterp] revert youtube_dl/extractor/youtube.py (yet again)

bbea188

[jsinterp] Adding JSArrayPrototype#_slice

37d6306

[jsinterp] TODOs in JSStringPrototype#_split

8060889

[jsinterp] Fixing broken Assignment Expression

a8c640e

sulyi changed the title ~~[jsinterp] Actual parsing~~ [jsinterp] Actual JS interpreter Jun 10, 2018

sulyi commented Jun 10, 2018

[jsinterp] Adding handling lineterminator

a33b47e

sulyi added 2 commits June 11, 2018 07:47

[jsinterp] Adding delete and void operators

c0ef911

dstftw force-pushed the master branch from c486aa9 to 5ee7ae5 Compare December 9, 2018 15:38

dstftw force-pushed the master branch from d99bab0 to e118a87 Compare January 23, 2019 18:40

mengmo mentioned this pull request Mar 5, 2019

New API spacemeowx2/DouyuHTML5Player#28

Closed

brouxco mentioned this pull request Apr 4, 2020

Add dependency for JS parser streamlink/streamlink#2534

Closed

dstftw force-pushed the master branch from 5e26784 to da2069f Compare September 13, 2020 13:52

yan12125 removed their assignment Dec 27, 2020

fstirlitz mentioned this pull request Jul 14, 2021

RTP: fix extraction yt-dlp/yt-dlp#497

Merged

sulyi mentioned this pull request Sep 8, 2021

[Feature Request] Implement JavaScript interpreter (or find a better dependency than PhantomJS) yt-dlp/yt-dlp#923

Closed

2 tasks

sulyi closed this Sep 8, 2021

dirkf mentioned this pull request Oct 15, 2021

Regarding Youtube-dl download speed is limited by the YouTube website #30097

Closed

dirkf mentioned this pull request Nov 1, 2021

[YouTube] Unthrottle downloads by responding to the "n" parameter challenge #30184

Closed

11 tasks

yyms3275 mentioned this pull request Nov 29, 2021

YouTube access restriction issue nhCoder/YouTubeExtractor#52

Open

dirkf mentioned this pull request Sep 4, 2022

[youtube] nsig extraction failed: You may experience throttling for some formats yt-dlp/yt-dlp#4635

Closed

9 tasks
