Syntax for removing elements from JSON responses #1447

Closed
ameshkov opened this issue May 8, 2021 · 19 comments

Comments

@ameshkov
Member

ameshkov commented May 8, 2021

We often face issues when we need to modify JSON strings. Usually, we use $replace for that, but it is FAR from ideal.

What we really need is a new modifier similar to $replace that works with JSON specifically.

Now, the question is what syntax we should use so that it can be understood, tested, and debugged as easily as a regular expression.

I suggest using jq syntax for that: https://stedolan.github.io/jq/

Here's why:

  1. jq is written in C under a permissive license and we can use it as a library: https://github.com/stedolan/jq/blob/master/COPYING
  2. jq can be used in conjunction with curl to test the jq expressions: https://stedolan.github.io/jq/tutorial/
  3. jq expressions can also be tested online: https://jqplay.org/

@sfionov @sxgunchenko any ideas on what syntax of this modifier could be?
@AdguardTeam/filters-maintainers what do you think about this idea?

edit: relevant: AdguardTeam/Scriptlets#183 (comment)

Maybe using jq is not that sane; we'd better find something that has implementations in both JS and C/C++.

@ngorskikh
Member

ngorskikh commented May 16, 2022

@ameshkov How about this for a rule spec:

$jq modifier

$jq rules modify the response of a matching request by applying a JQ program to it. They are intended for JSON manipulation.

Syntax

  • ||example.org^$jq=program – replace the response with the result of running a JQ program program with the original response as the input.

Due to the way rule parsing works, the characters $ and , must be escaped with \ inside program.
Refer to the JQ 1.6 manual for the syntax and semantics of JQ programs.

Exceptions

Basic URL exceptions shall not disable $jq rules. They can be disabled as described below:

  • @@||example.org^$jq – disable all $jq rules for responses from URLs matching ||example.org^.
  • @@||example.org^$jq=text – disable all $jq rules with the value of the jq modifier equal to text for responses from URLs matching ||example.org^.
  • $jq rules can also be disabled by $document, $content and $urlblock exception rules.

Restrictions

  • $jq rules are only allowed in trusted filters.
  • $jq rules are not compatible with any other modifiers except $domain, $third-party, $app, $important, $match-case, and $xmlhttprequest.
  • $jq rules do not apply if the size of the original response is more than 3 MB.

Notes

  • When multiple $jq rules match the same request, they are sorted in lexicographical order,
    the first rule is applied to the original response, and each of the remaining rules is applied
    to the result of applying the previous one.
  • The result of applying a $jq rule to a response is all of the outputs of the JQ program
    specified by the rule, separated by a line feed ('\n') character.
  • If a JQ program fails to compile, or an error occurs during execution, the modified
    response contains the error message.
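The ordering and chaining described in the Notes can be sketched roughly as below. This is only an illustration, not the CoreLibs implementation: the rule shape `{ text, apply }` and the function names are hypothetical, and comparing rule texts by plain string comparison is an assumption about what "lexicographical order" means here.

```javascript
// Hypothetical rule shape: { text: string, apply: (body: string) => string }.
function applyMatchingRules(rules, originalBody) {
  // Sort a copy of the rules by their text (assumed meaning of
  // "lexicographical order"); the input array is left untouched.
  const ordered = [...rules].sort((a, b) => {
    if (a.text < b.text) return -1;
    if (a.text > b.text) return 1;
    return 0;
  });
  // The first rule sees the original response, each following rule sees
  // the result produced by the previous one.
  return ordered.reduce((body, rule) => rule.apply(body), originalBody);
}

// All outputs of a single program run are joined with a line feed.
function joinOutputs(outputs) {
  return outputs.map((value) => JSON.stringify(value)).join("\n");
}
```

With this shape, two matching rules whose texts sort as "a" then "b" are always applied in that order, regardless of the order in which they were loaded.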

Examples

  • ||example.org^$jq=del(.foo) — remove the key "foo" from all responses from example.org.
  • ||example.org^$jq={"answer": [2 as \$two|\$two * 20\,2]|add} — replace all responses from example.org with {"answer":42}. Note that , and $ need to be escaped.
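The escaping requirement can be illustrated with a small helper. This is a sketch under assumptions: the function names are made up for the example, and only $ and , are escaped, since the spec above does not say how a literal backslash should be handled.

```javascript
// Escape the two characters that the rule parser treats specially.
// A replacer function is used so that "$" is never interpreted as a
// replacement pattern.
function escapeModifierValue(value) {
  return value.replace(/[$,]/g, (ch) => "\\" + ch);
}

// Reverse operation: turn "\$" and "\," back into "$" and ",".
function unescapeModifierValue(value) {
  return value.replace(/\\([$,])/g, (match, ch) => ch);
}

// Build a complete rule from a URL pattern and an unescaped JQ program.
function buildJqRule(pattern, program) {
  return pattern + "$jq=" + escapeModifierValue(program);
}
```

Calling `buildJqRule('||example.org^', '{"answer": [2 as $two|$two * 20,2]|add}')` reproduces the rule text from the second example above.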

@ngorskikh
Member

@ameshkov For the JS side of things, there are at least two possibilities:

@sfionov
Member

sfionov commented May 16, 2022

I think that we should provide a full list of supported functions, if it is possible. For example, we definitely will not support halt and halt_error :)

@ngorskikh
Member

@sfionov I didn't want to commit to a specific list until we have the JavaScript implementation. Also, we might remove something during further testing :)

By the way, halt and halt_error are not a problem: they halt the JQ program, not the whole AdGuard :)

@ameshkov
Member Author

@ngorskikh

Here's what bothers me about jq, there is no viable javascript implementation, the existing one is heavy as shit.

We need to find a query language that is:

  1. Implemented in both C++ and Javascript
  2. The implementation is REALLY lightweight (up to a couple KB of code).

Note that on the extension side most likely we'll need to run this code in the content script (i.e. hook JSON.parse) and the code size & performance is really important there.

Maybe instead of that we may agree on a limited subset of syntax and implement it ourselves in JS?

Here's what we need:

  1. Selecting an element by its path (similar to xpath, jq, jsonpath).
  2. Filtering elements by the attribute value (=,<,>,contains).
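For the extension side, hooking JSON.parse in a content script could look roughly like the sketch below. The function name and the `prune` callback are hypothetical; a real scriptlet would also need to deal with page scripts that captured the native function earlier, which is not covered here.

```javascript
// Wrap the native JSON.parse so that every parsed value passes through
// a prune callback before the page sees it. Returns a function that
// restores the original JSON.parse.
function installJsonParseHook(prune) {
  const nativeParse = JSON.parse;
  JSON.parse = function (...args) {
    // Keep the original behaviour (including the reviver argument),
    // then post-process the result.
    const value = nativeParse.apply(this, args);
    return prune(value);
  };
  return function restore() {
    JSON.parse = nativeParse;
  };
}
```

Because the wrapper runs on every JSON.parse call made by the page, the `prune` callback has to be cheap, which is why the code size and performance of the selection logic matter so much here.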

@sfionov
Copy link
Member

sfionov commented May 16, 2022

btw, jq-web wasm code is 860KB (286KB compressed).

Not great, not terrible.

But wasm may have problems with the review process.

@ngorskikh
Member

This native JS implementation is 33K minified:

➜  Downloads curl -s 'https://raw.githubusercontent.com/mwh/jqjs/master/jq.js' | uglifyjs > jq.min.js
➜  Downloads ls -lha jq.min.js 
-rw-r--r--  1 ngorskikh  staff    33K May 16 16:23 jq.min.js

@ameshkov
Member Author

This is still a lot. We may use it as a "base" implementation, but to strip it we need to figure out what subset is okay.

@ngorskikh
Member

JMESPath is 18K minified: https://github.com/jmespath/jmespath.js/blob/master/artifacts/jmespath.min.js

Probably still too fat :)

@ngorskikh
Member

Btw, it looks like WebAssembly should be allowed in manifest v3 extensions: https://bugs.chromium.org/p/chromium/issues/detail?id=1173354#c60
there are reports that it works in Chrome Canary

@ngorskikh
Member

@ameshkov I've been thinking: if we really want an extremely small implementation, then isn't it gonna be simpler to just extend the existing json-prune scriptlet by addressing the issues described in AdguardTeam/Scriptlets#183?

I presume, we'll still want to implement the same functionality as a native CoreLibs rule?
If not, then let the Scriptlet guys figure it out :)

If yes, then how about extending the json-prune syntax so that each component of the path can contain a conditional, something like this:

recs_group.[] some-conditional(args).tiles.[].[] has-key("ad_origin")

or even a conditional that itself accepts a path expression (this one will be a pain to parse):

recs_group.[] some-conditional(args).tiles.[].[] has(type value-equals("ad"))

This should solve the two issues from AdguardTeam/Scriptlets#183 and shouldn't be too hard to implement in both the scriptlet and CoreLibs.

The problem I see with this solution is that we'll most probably want more and more features down the line, which will be increasingly more difficult to implement. While with jq we can already do pretty much anything we want with a json document, without any additional work on our side :)

@ameshkov
Member Author

ameshkov commented May 18, 2022

This is also a solution, but in this case we'll need to provide a web-based debug tool. In theory, this is doable.

The cons:

  • json-prune allows specifying multiple selectors which is IMO redundant, it's much easier to add multiple rules instead.
  • the syntax does not look like any other existing JSON selection syntax.

What if we use a subset of JSONPath instead, strip it of unnecessary functions and extend it if needed with necessary functions (has-key, value-equals, etc)?

Here's the original JSONPath implementation in Javascript (taken from here: https://code.google.com/archive/p/jsonpath/)

/* JSONPath 0.8.5 - XPath for JSON
 *
 * Copyright (c) 2007 Stefan Goessner (goessner.net)
 * Licensed under the MIT (MIT-LICENSE.txt) licence.
 *
 * Proposal of Chris Zyp goes into version 0.9.x
 * Issue 7 resolved
 */
function jsonPath(obj, expr, arg) {
   var P = {
      resultType: arg && arg.resultType || "VALUE",
      result: [],
      normalize: function(expr) {
         var subx = [];
         return expr.replace(/[\['](\??\(.*?\))[\]']|\['(.*?)'\]/g, function($0,$1,$2){return "[#"+(subx.push($1||$2)-1)+"]";})  /* http://code.google.com/p/jsonpath/issues/detail?id=4 */
                    .replace(/'?\.'?|\['?/g, ";")
                    .replace(/;;;|;;/g, ";..;")
                    .replace(/;$|'?\]|'$/g, "")
                    .replace(/#([0-9]+)/g, function($0,$1){return subx[$1];});
      },
      asPath: function(path) {
         var x = path.split(";"), p = "$";
         for (var i=1,n=x.length; i<n; i++)
            p += /^[0-9*]+$/.test(x[i]) ? ("["+x[i]+"]") : ("['"+x[i]+"']");
         return p;
      },
      store: function(p, v) {
         if (p) P.result[P.result.length] = P.resultType == "PATH" ? P.asPath(p) : v;
         return !!p;
      },
      trace: function(expr, val, path) {
         if (expr !== "") {
            var x = expr.split(";"), loc = x.shift();
            x = x.join(";");
            if (val && val.hasOwnProperty(loc))
               P.trace(x, val[loc], path + ";" + loc);
            else if (loc === "*")
               P.walk(loc, x, val, path, function(m,l,x,v,p) { P.trace(m+";"+x,v,p); });
            else if (loc === "..") {
               P.trace(x, val, path);
               P.walk(loc, x, val, path, function(m,l,x,v,p) { typeof v[m] === "object" && P.trace("..;"+x,v[m],p+";"+m); });
            }
            else if (/^\(.*?\)$/.test(loc)) // [(expr)]
               P.trace(P.eval(loc, val, path.substr(path.lastIndexOf(";")+1))+";"+x, val, path);
            else if (/^\?\(.*?\)$/.test(loc)) // [?(expr)]
               P.walk(loc, x, val, path, function(m,l,x,v,p) { if (P.eval(l.replace(/^\?\((.*?)\)$/,"$1"), v instanceof Array ? v[m] : v, m)) P.trace(m+";"+x,v,p); }); // issue 5 resolved
            else if (/^(-?[0-9]*):(-?[0-9]*):?([0-9]*)$/.test(loc)) // [start:end:step]  Python slice syntax
               P.slice(loc, x, val, path);
            else if (/,/.test(loc)) { // [name1,name2,...]
               for (var s=loc.split(/'?,'?/),i=0,n=s.length; i<n; i++)
                  P.trace(s[i]+";"+x, val, path);
            }
         }
         else
            P.store(path, val);
      },
      walk: function(loc, expr, val, path, f) {
         if (val instanceof Array) {
            for (var i=0,n=val.length; i<n; i++)
               if (i in val)
                  f(i,loc,expr,val,path);
         }
         else if (typeof val === "object") {
            for (var m in val)
               if (val.hasOwnProperty(m))
                  f(m,loc,expr,val,path);
         }
      },
      slice: function(loc, expr, val, path) {
         if (val instanceof Array) {
            var len=val.length, start=0, end=len, step=1;
            loc.replace(/^(-?[0-9]*):(-?[0-9]*):?(-?[0-9]*)$/g, function($0,$1,$2,$3){start=parseInt($1||start);end=parseInt($2||end);step=parseInt($3||step);});
            start = (start < 0) ? Math.max(0,start+len) : Math.min(len,start);
            end   = (end < 0)   ? Math.max(0,end+len)   : Math.min(len,end);
            for (var i=start; i<end; i+=step)
               P.trace(i+";"+expr, val, path);
         }
      },
      eval: function(x, _v, _vname) {
         try { return $ && _v && eval(x.replace(/(^|[^\\])@/g, "$1_v").replace(/\\@/g, "@")); }  // issue 7 : resolved ..
         catch(e) { throw new SyntaxError("jsonPath: " + e.message + ": " + x.replace(/(^|[^\\])@/g, "$1_v").replace(/\\@/g, "@")); }  // issue 7 : resolved ..
      }
   };

   var $ = obj;
   if (expr && obj && (P.resultType == "VALUE" || P.resultType == "PATH")) {
      P.trace(P.normalize(expr).replace(/^\$;?/,""), obj, "$");  // issue 6 resolved
      return P.result.length ? P.result : false;
   }
} 

Here's an example of using it:
https://jsfiddle.net/28h9v6ko/2/

@ngorskikh
Member

ngorskikh commented Jun 20, 2022

@ameshkov How about this for a rule spec:

$jsonprune modifier

$jsonprune rules modify the JSON response of a matching request by removing JSON items that match a modified (see below)
JSONPath expression. They do not modify responses which are not valid JSON.

Syntax

  • ||example.org^$jsonprune=expression – remove items that match the modified JSONPath expression expression from the response.

Due to the way rule parsing works, the characters $ and , must be escaped with \ inside expression.

The modified JSONPath syntax has the following differences from the original:

  1. Script expressions are not supported.
  2. The supported filter expressions are:
    2.1. ?(has <key>) -- true if the current object has the specified key.
    2.2. ?(key-eq <key> <value>) -- true if the current object has the specified key,
    and its value is equal to the specified value.
    2.3. ?(key-substr <key> <value>) -- true if the specified value is a substring
    of the value of the specified key of the current object.
  3. Whitespace outside of double- or single-quoted strings has no meaning.
  4. Both double- and single-quoted strings can be used.
  5. Expressions ending with .. are not supported.
  6. Multiple array slices can be specified in square brackets.
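The three filter expressions could be evaluated along these lines. This is a sketch, not the actual implementation: the `filters` table is a hypothetical structure, and comparing with strict equality (so that only string values can match a value written in a rule) is an assumption, since the spec does not define how non-string values are compared.

```javascript
const isObject = (v) => v !== null && typeof v === "object";

// One predicate per supported filter expression. Each receives the
// current item plus the arguments written in the rule.
const filters = {
  // ?(has <key>)
  "has": (item, key) =>
    isObject(item) && Object.prototype.hasOwnProperty.call(item, key),
  // ?(key-eq <key> <value>), assuming strict equality
  "key-eq": (item, key, value) =>
    filters["has"](item, key) && item[key] === value,
  // ?(key-substr <key> <value>), assuming only string values can match
  "key-substr": (item, key, value) =>
    filters["has"](item, key) &&
    typeof item[key] === "string" &&
    item[key].includes(value),
};
```

For instance, `filters["key-substr"]({ type: "banner-ad" }, "type", "ad")` is true, while the same call with key "kind" is false because the key is missing.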

There are various online tools for testing JSONPath expressions; here are a couple of examples:
https://jsonpath.herokuapp.com/
https://jsonpath.com/

Keep in mind, though, that all JSONPath implementations on this planet have unique features/quirks and are subtly incompatible with each other.

Exceptions

Basic URL exceptions shall not disable $jsonprune rules. They can be disabled as described below:

  • @@||example.org^$jsonprune – disable all $jsonprune rules for responses from URLs matching ||example.org^.
  • @@||example.org^$jsonprune=text – disable all $jsonprune rules with the value of the jsonprune modifier equal to text for responses from URLs matching ||example.org^.
  • $jsonprune rules can also be disabled by $document, $content and $urlblock exception rules.
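The first two exception forms boil down to a simple check, sketched below. The object shapes are hypothetical, and URL matching is assumed to have already happened, so only the modifier values are compared.

```javascript
// Hypothetical shapes:
//   rule:      { value: string }         e.g. $jsonprune=<value>
//   exception: { value: string | null }  null for a bare @@...$jsonprune
function isDisabledByException(rule, exceptions) {
  return exceptions.some(
    // A bare exception disables every rule; an exception with a value
    // disables only rules whose modifier value is exactly equal to it.
    (exception) => exception.value === null || exception.value === rule.value
  );
}
```

Exceptions coming from $document, $content and $urlblock rules are not modeled in this sketch.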

Restrictions

  • $jsonprune rules are not compatible with any other modifiers except $domain, $third-party, $app, $important, $match-case, and $xmlhttprequest.
  • $jsonprune rules do not apply if the size of the original response is more than 3 MB.
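The size limit, together with the rule that responses which are not valid JSON stay untouched, might be enforced like this. It is a sketch under assumptions: `pruneResponse` and the `prune` callback are hypothetical names, and treating 3 MB as 3 * 1024 * 1024 bytes of UTF-8 is a guess, as the spec does not say how the size is measured.

```javascript
// Assumption: "3 MB" means 3 MiB measured on the UTF-8 encoded body.
const MAX_BODY_BYTES = 3 * 1024 * 1024;

function pruneResponse(body, prune) {
  // Oversized responses are passed through unchanged.
  if (new TextEncoder().encode(body).length > MAX_BODY_BYTES) {
    return body;
  }
  let doc;
  try {
    doc = JSON.parse(body);
  } catch {
    // Not valid JSON: leave the response as it is.
    return body;
  }
  return JSON.stringify(prune(doc));
}
```

Returning the original string in both early exits means that a rule matching a non-JSON or oversized response has no observable effect on it.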

Notes

  • When multiple $jsonprune rules match the same request, they are sorted in lexicographical order,
    the first rule is applied to the original response, and each of the remaining rules is applied
    to the result of applying the previous one.

Examples

  • ||example.org^$jsonprune=\$..[one\, "two three"] — remove all occurrences of the keys "one" and "two three" anywhere in the JSON document.
  • ||example.org^$jsonprune=\$.a[?(has ad_origin)] – remove all children of a that have an ad_origin key.
  • ||example.org^$jsonprune=\$.*.*[?(key-eq 'Some key' 'Some value')] – remove all items that are at nesting level 3 and have a property "Some key" equal to "Some value".
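To make the first two examples concrete, here is a hand-written sketch of what they are expected to do to a document. It does not parse JSONPath at all; the function names are hypothetical, and removing matching array elements by compacting the array (rather than leaving holes) is an assumption.

```javascript
const isObject = (v) => v !== null && typeof v === "object";

// Effect of $..[one, "two three"]: drop the listed keys at every depth.
function removeKeysEverywhere(value, keys) {
  if (Array.isArray(value)) {
    value.forEach((item) => removeKeysEverywhere(item, keys));
  } else if (isObject(value)) {
    for (const key of Object.keys(value)) {
      if (keys.includes(key)) delete value[key];
      else removeKeysEverywhere(value[key], keys);
    }
  }
  return value;
}

// Effect of $.a[?(has ad_origin)]: drop the children of doc[parentKey]
// that are objects containing the given key.
function removeChildrenWithKey(doc, parentKey, key) {
  if (!isObject(doc)) return doc;
  const hasKey = (child) =>
    isObject(child) && Object.prototype.hasOwnProperty.call(child, key);
  const parent = doc[parentKey];
  if (Array.isArray(parent)) {
    // Assumption: matching elements are removed and the array is compacted.
    doc[parentKey] = parent.filter((child) => !hasKey(child));
  } else if (isObject(parent)) {
    for (const name of Object.keys(parent)) {
      if (hasKey(parent[name])) delete parent[name];
    }
  }
  return doc;
}
```

For example, applying `removeChildrenWithKey` with "a" and "ad_origin" to `{ a: [{ ad_origin: "x" }, { id: 1 }] }` leaves only the `{ id: 1 }` child in `a`.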

@ameshkov
Member Author

ameshkov commented Jun 20, 2022

Looks good. One more thing to add to Notes is a link to a web app to debug JSONPath expressions. Just mention that it does not support our custom functions.

@ameshkov
Member Author

$jsonprune rules are only allowed in trusted filters.

I actually don't think we need this limitation for $jsonprune.

@ngorskikh
Member

$jsonprune rules are only allowed in trusted filters.

I actually don't think we need this limitation for $jsonprune.

Well, we do have it for $replace rules. It would be easy to e.g. direct a browser to a malicious or tracking resource by substituting some URL in a JSON response. I think we shouldn't allow untrusted filters to manipulate any responses.

@ameshkov
Member Author

$replace is much more powerful though, $jsonprune only allows removing elements, not replacing them.

@ngorskikh
Member

Hypothetically, a malicious filter could remove some kind of hash/signature/checksum and then the client would not check it, and could potentially load and execute a malicious resource. Seems far-fetched, ofc, but one cannot be too careful.
Then there's a possibility for just pure vandalism, breaking sites in subtle ways by removing certain JSON data. Although, something like this could also be done with basic URL filters.
Alright, I'll remove the requirement, but I have voiced my concerns for the record :)

@ameshkov
Member Author

It's just not consistent with the scriptlets: the json-prune scriptlet is allowed.
