-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML search: represent indices using JavaScript Map instead of object literals #13097
base: master
Are you sure you want to change the base?
HTML search: represent indices using JavaScript Map instead of object literals #13097
Conversation
[skip ci]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how this would be a problem but I'm a little unclear on how the solution actually works? See comments
@@ -514,8 +514,8 @@ const Search = { | |||
searchTerms.forEach((word) => { | |||
const files = []; | |||
const arr = [ | |||
{ files: terms[word], score: Scorer.term }, | |||
{ files: titleTerms[word], score: Scorer.title }, | |||
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually a little confused by how this might work, as it doesn't appear that JavaScript supports setting the __proto__
property in an object. Am I missing something? Should we be using a map to store these terms instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When parsing JSON, an object key named "__proto__"
is accepted, and converted to the JavaScript object key ["__proto__"]
-- distinct from the __proto__
property (a reference to a class prototype).
During read-access, introducing hasOwnProperty
allows us to filter out the latter, so that we don't mistakenly retrieve the prototype class reference from terms
.
I'm not certain whether the below example will help to clarify, but I find experimentation helpful to understand some of the behaviours:
> x = JSON.parse('{}')
{}
> y = JSON.parse('{"__proto__": "testing"}')
{ ['__proto__']: 'testing' }
> '__proto__' in x
true
> x.hasOwnProperty('__proto__')
false
> '__proto__' in y
true
> y.hasOwnProperty('__proto__')
true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to respond to your Map
question: that is certainly possible, although I feel that doing so might add some runtime overhead (since we'd need to load the native JS object contents from the index into respective Map
objects).
As I understand it, Map
objects are also more difficult to make immutable in JavaScript than Object
and Array
instances. Freezing a Map
object doesn't prevent its private contents from being updated:
> x = {'foo': 'bar'}
{ foo: 'bar' }
> Object.freeze(x)
{ foo: 'bar' }
> x['foo'] = 'baz'
'baz'
> x['foo']
'bar'
> y = new Map([['foo', 'bar']])
Map(1) { 'foo' => 'bar' }
> Object.freeze(y)
Map(1) { 'foo' => 'bar' }
> y.set('foo', 'baz')
Map(1) { 'foo' => 'baz' }
> y.get('foo')
'baz'
So another factor is that, to the extent that I'd like to see the search index contents become immutable at load-time in future (#13098), I'd prefer not to migrate to the Map
type unless there is a compelling reason to (maybe avoiding prototype pollution is that reason -- but I don't think we're exposed to that currently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok got it, this makes sense to me re: the property. TIL!
I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, one extra piece of optional feedback: since this is kind of counter-intuitive, you might want to add a comment on this line explaining what you're doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, OK - I'll try to add some phrasing to explain this succinctly. Thanks for the code review!
The JS prototype property should _not_ be found in unrelated documents
@@ -514,8 +514,8 @@ const Search = { | |||
searchTerms.forEach((word) => { | |||
const files = []; | |||
const arr = [ | |||
{ files: terms[word], score: Scorer.term }, | |||
{ files: titleTerms[word], score: Scorer.title }, | |||
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok got it, this makes sense to me re: the property. TIL!
I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.
Would this work as an alternative? // in Search.setIndex()
const frozen_index = JSON.parse(index, (key, value) => {
typeof value === "object" && value !== null
? Object.freeze(value)
: value;
}); If we change A |
@AA-Turner although I'd recommend some performance testing, that could be useul for #13098, yep. |
In fact, no, my mistake: |
… index contents Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
I've pushed the change to use |
Personally, I'd be inclined to merge the previous version of this and address that (if at all) in a followup. Trying to "freeze" the index here feels like scope creep to me and it would be nice to get this fix in. |
That's reasonable, yep, and I've no strong opinions on whether we include or omit the |
The `JSON.dumps` call escapes single-quotes within strings; all JSON strings are enclosed by double-quotes
Without the reviver function, I recall the |
Sorry, I missed this part. Do you know why this is the case? The solution looked correct to me? |
The test case in 0fff170 was inaccurate -- it demonstrated that an object However, there was still a bug: a document text term |
const reviver = (k, v) => (typeof v === "object" && v !== null) ? Object.freeze(v) : v; | ||
Search._index = JSON.parse(index, reviver); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @wlach is right that immutability is orthoganal here. I think 'simply' using JSON.parse
rather than an object literal would solve the __proto__
case. Would we then need to keep the hasOwnProperty
checks?
An alternative would be to use a reviver function that constructs Map
objects/serialise the index with Map
literals, and use .get()
, but that may also be expensive.
const reviver = (k, v) => (typeof v === "object" && v !== null) ? Object.freeze(v) : v; | |
Search._index = JSON.parse(index, reviver); | |
Search._index = JSON.parse(index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @wlach is right that immutability is orthoganal here. I think 'simply' using
JSON.parse
rather than an object literal would solve the__proto__
case.
Ok, yep - I can revert 443fd21 to remove the use of Object.freeze
/ the reviver function.
Would we then need to keep the
hasOwnProperty
checks?
We should keep those, I think. Without them, queries for __proto__
may continue to inadvertently retrieve object prototypes instead of getting an undefined
result.
An alternative would be to use a reviver function that constructs
Map
objects/serialise the index withMap
literals, and use.get()
, but that may also be expensive.
What do you think about constructing those Map
objects directly in the searchindex.js
file format, instead of using JSON.parse
and a reviver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about constructing those Map objects directly in the searchindex.js file format, instead of using JSON.parse and a reviver?
IMO this is definitely worth a shot, I suspect it'll be quite fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm giving it a try - it doesn't seem too tricky; the Python serialization of the search index becomes a little more complex. One challenge that I have yet to work out is how the Python unit tests should parse/load the resulting index (ref:
Line 70 in 15b46dc
return json.loads(searchindex[16:-1]) |
indices = ast.literal_eval( | ||
searchindex[16:-1] | ||
.replace('new Map', '') | ||
.replace('null', 'None') | ||
.replace('false', 'False') | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fragile and inelegant; it's a compromise to avoid introducing the complexity of a Python-implemented JavaScript parser dependency given the use of new Map(...)
within searchindex.js
files.
[skip ci]
I haven't gotten around to this comprehensive performance testing yet, but want to share some notes: Initial resultsInitially the native The page-load searchindex-evaluation time on my machine for a query for There is a small amount of index size bloat -- not so much from the Observations so farThere seem to be multiple factors to balance: Python complexity, JavaScript complexity, code safety (both datastructure immutability, and also prototype access), runtime performance and index filesize. I'm not sure how much to explore and how to present my thoughts/findings. Generally my feeling is that the |
Feature or Bugfix
Purpose
basic
theme.Detail
Ensure that the user input query term is found as a named property of the relevant index objects before accessing them, usingObject.hasOwnProperty
.Map
instances instead ofObject
literals.Relates
May resolve # 13098 (conditional on inclusion ofJSON.parse
reviver /Object.freeze
logic)Edit: update description to reflect JavaScript Map-based approach.