Skip to content

Conversation

@AArnott
Copy link
Collaborator

@AArnott AArnott commented Jul 12, 2021

This isn't on by default, and while this first trivial implementation allocates every string and passes it through string.Intern(string), many optimizations could be made to it either in this library directly or by apps through overridding the virtual method.

Points to decide on:

  • Should the collection be built into the formatter, leaving MessagePackSerializationOptions unchanged?
  • Should our initial, default implementation be smarter about which strings to intern (e.g. strings not longer than x characters)?
  • Should we avoid the initial string allocation by creating our own Dictionary<ReadOnlyMemory<byte>, string> interned string collection? I'm thinking "yes" but we may want to micro-benchmark to make sure it's really a net win.
  • As written, is this a behavior breaking change because the StandardResolver now includes a string resolver, such that if any app adds a string resolver, it used to work if their resolver worked after the StandardResolver but now must test before it?
  • Perf concerns from always using a formatter for string, even when the option to intern hasn't been turned on?
  • Some very well-developed string interning functionality was developed for MSBuild and exposed for everyone via the Microsoft.NET.StringTools package. Can we take a dependency on that? An alternative would be a source copy, but I would loathe to own something like that.
  • Measure and report on perf impact of string interning on deserialization.

Closes #634

@AArnott AArnott added this to the v2.3 milestone Jul 12, 2021
@AArnott AArnott requested a review from neuecc July 12, 2021 16:31
@AArnott AArnott self-assigned this Jul 12, 2021
This isn't on by default, and while this first trivial implementation allocates every string and passes it through `string.Intern(string)`, many optimizations could be made to it either in this library directly or by apps through overridding the virtual method.

Closes MessagePack-CSharp#634
@AArnott AArnott force-pushed the fix634 branch 2 times, most recently from 57a513d to e422d78 Compare July 12, 2021 20:31
Copy link
Member

@neuecc neuecc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing me to Microsoft.NET.StringTools.
I now understand the concept and implementation of WeakInternedString.
Very interesting and inspiring!
Implementing this into the serializer will get a huge positive impact.
(Certainly, I'm curious the additional dictionary lookup/weakreference lookup cost)

In any case, I am very positive that this implementation will be included.

//

It would be preferable if it were not included in the options, but it is going to be difficult to use.
If the default is on, it's fine, but it's not good to turn it off (it's difficult to have a Resolver with a standard string formatter).

Add new reference, if you are going to include it in the standard, I would be concerned about Unity's support for it, but since it is a simple lib, you may as well include it.

This gets us a few things:
1. Intern strings without wastefully allocating them first in the common fast paths.
1. Switch to a string collection that retains only weak references.
@AArnott
Copy link
Collaborator Author

AArnott commented Jul 13, 2021

(Certainly, I'm curious the additional dictionary lookup/weakreference lookup cost)

I'll measure it and report back.

Add new reference, if you are going to include it in the standard, I would be concerned about Unity's support for it, but since it is a simple lib, you may as well include it.

I'm pushing an update that should fix Unity support. We already have a dependency on System.Memory and one other. This adds exactly one more (which depends on the same two that we already depended on).

@AArnott
Copy link
Collaborator Author

AArnott commented Jul 13, 2021

It would be preferable if it were not included in the options, but it is going to be difficult to use.
If the default is on, it's fine, but it's not good to turn it off (it's difficult to have a Resolver with a standard string formatter).

Well, I can remove all changes from the Options class and if our default interning behavior isn't what people want, they can override our string resolver. Is that good?

Also revert the Options changes. If the app needs to customize string interning, they can do so by changing the resolver to their own formatter.
This caused a test failure on `net472`.
@AArnott AArnott marked this pull request as ready for review July 14, 2021 14:05
@neuecc
Copy link
Member

neuecc commented Jul 15, 2021

Changes confirmed.
Yes, in this case, I think the person who wants to make the change has a strong will, so the change by Resolver seems to be sufficient.

@AArnott
Copy link
Collaborator Author

AArnott commented Jul 25, 2021

The perf impact is significant:
Using PerfNetFramework, looking at the 10000 iteration of Person[] deserialize test:

Test Before After
MessagePack for C# 162.78 ms 376.28 ms
MessagePack for C# (LZ4) 173.57 ms 385.75 ms

@AArnott
Copy link
Collaborator Author

AArnott commented Jul 25, 2021

It looks to me like this may be inescapable perf cost if you want interning. So the question is, should we have it off by default? Should activating it be by applying a resolver that will pick this as a string formatter, or should this be a bool property on MessagePackSerializationOptions?

Name                                                                                                                                                        Inc %   IncExc % Exc
messagepack!StringInterningFormatter.Deserialize                                                                                                            11.2   569  2.9 146
+ microsoft.net.stringtools!WeakStringCacheInterner.InternableToString                                                                                        6.1   311  0.8  42
|+ microsoft.net.stringtools!WeakStringCacheInterner.Intern                                                                                                   5.3   268  0.3  14
||+ microsoft.net.stringtools!WeakStringCache.GetOrCreateEntry                                                                                                5.0   254  0.4  18
|| + microsoft.net.stringtools!Microsoft.NET.StringTools.WeakStringCache+StringWeakHandle.GetString(value class Microsoft.NET.StringTools.InternableString&)  1.5    76  0.2   8
|| + microsoft.net.stringtools!InternableString.GetHashCode                                                                                                   1.4    71  1.4  70
|| + system.collections.concurrent.il!System.Collections.Concurrent.ConcurrentDictionary`2[System.Int32,System.__Canon].TryGetValueInternal(!0,int32,!1&)     0.9    46  0.8  41
|| + coreclr!?                                                                                                                                                0.7    37  0.7  33
|| + system.collections.concurrent.il!System.Collections.Concurrent.ConcurrentDictionary`2[System.Int32,System.__Canon].TryGetValue(!0,!1&)                   0.1     3  0.1   3
|| + system.collections.concurrent.il!System.Collections.Concurrent.ConcurrentDictionary`2[System.Int32,System.__Canon].TryAdd(!0,!1)                         0.0     1  0.0   0
|| + system.collections.concurrent.il!System.Collections.Concurrent.ConcurrentDictionary`2[System.Int32,System.__Canon].get_Count()                           0.0     1  0.0   0

@neuecc
Copy link
Member

neuecc commented Jul 26, 2021

Oh, sure, it can be hard to make it the default.
How about adding WithOption to Resolver as well?
WithInternedString, WithImmutableCollections, etc...
(or make this an extension method to MessagePackSerializerOptions)

@AArnott
Copy link
Collaborator Author

AArnott commented Jul 27, 2021

I think the easiest opt-in would be based on a MessagePackSerializerOptions property. That would have the added benefit of any formatter dealing with strings knowing whether string interning is desirable, although I don't know of any other formatter that would care.
So for an option that really is formatter-specific as this arguably is, putting the option on the formatter (while keeping it immutable, so it would be copy-on-write style) may make more sense -- it just makes it both harder to discover and harder to activate since a resolver will also have to change to point at the new formatter.

@AArnott AArnott modified the milestones: v2.3, v2.4 Jul 29, 2021
@cd21h
Copy link
Contributor

cd21h commented Jul 30, 2021

I'd love to have this, but with option to opt-in particular property.
For example, having having Address class with State, City and Street, I'd like to intern State and City only.

Maybe we should define an attribute to opt-in for string interning.

@AArnott
Copy link
Collaborator Author

AArnott commented Aug 22, 2021

@Shatl That's an interesting idea. I'll look into it.

It is still available via the `StringInterningFormatter`, which can be turned on by default by aggregating a custom resolver,
or by specifying this on a `[Key]` or `[DataMember]` member:

```cs
[MessagePackFormatter(typeof(StringInterningFormatter))]
```
@AArnott AArnott requested a review from neuecc August 22, 2021 03:05
@AArnott
Copy link
Collaborator Author

AArnott commented Aug 22, 2021

The latest iteration has it off by default, and documents a few ways by which users can turn it on.

Copy link
Member

@neuecc neuecc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By being optional, there is no problem in provide it.
However, I think that Resolver should also be provided as a counterpart to Formatter.
That is, a StringInterningResolver.

before any of the standard ones, like this:

```cs
var options = MessagePackSerializerOptions.Standard.WithResolver(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By being optional, there is no problem in provide it.
However, I think that Resolver should also be provided as a counterpart to Formatter.
That is, a StringInterningResolver.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this resolver only point to the one formatter, such that it would still need to be combined with other resolvers for every case?

I'm going to complete this PR for now since we can always add a resolver in a subsequent PR based on your answer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants