Skip to content

Improve performance on json to psobject conversion #7482

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Aug 10, 2018

Conversation

powercode
Copy link
Collaborator

@powercode powercode commented Aug 8, 2018

PR Summary

Changes to the JsonObject class to make conversions from json to PSObjects faster.

Three main changes:
Convert from JArray directly to object instead of creating a list and do list.ToArray().
Do not use PSObject to check for existing members, since that is a very slow and allocation heavy (sic!) code path.
PreAllocation the members in PSObject by using the newly added constructor accepting an initial member count.

This amounts to a speed-up in the order of 7x.

Cmdlets that benefit are Convertfrom-Json and Invoke-RestMethod.

PR Checklist

Using pattern matching to simplify code.
Removing some redundant initializations.
nameof(input) instead of "input" in exception.
Simplification of assignment
Removing code redundancies
@markekraus
Copy link
Contributor

$json = @'
{
    "id": "string",
    "data": [
        {                       
        "accountId": 0,
        "productId": 0,
        "resourceLocationId": 0,
        "consumedServiceId": 0,
        "departmentId": 0,
        "accountOwnerEmail": "string",
        "accountName": "string",
        "serviceAdministratorId": "string",
        "subscriptionId": 0,
        "subscriptionGuid": "string",
        "subscriptionName": "string",
        "date": "2017-04-27T23:01:43.799Z",
        "product": "string",
        "meterId": "string",
        "meterCategory": "string",
        "meterSubCategory": "string",
        "meterRegion": "string",
        "meterName": "string",
        "consumedQuantity": 0,
        "resourceRate": 0,
        "Cost": 0,
        "resourceLocation": "string",
        "consumedService": "string",
        "instanceId": "string",
        "serviceInfo1": "string",
        "serviceInfo2": "string",
        "additionalInfo": "string",
        "tags": "string",
        "storeServiceIdentifier": "string",
        "departmentName": "string",
        "costCenter": "string",
        "unitOfMeasure": "string",
        "resourceGroup": "string"
        }
    ],
    "nextLink": "string"
}
'@
$json = 1..1000 | % { $json }
$json = '[' + ($json -join ',') + ']'
Measure-Command {$Json | ConvertFrom-Json } | % TotalMilliseconds

Runs in 497.8867 miliseconds on the most recent build in this PR vs 1104.9008 ms in preview.4 and 318.1642ms in 5.1. Much better!! Thanks @powercode

Copy link
Contributor

@markekraus markekraus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@powercode
Copy link
Collaborator Author

powercode commented Aug 8, 2018

Did you test that with a crossgen build?

I know numbers! I have a lot of numbers. A lot of people don't know that, but I have the best numbers!

Same code but with 20000 iterations - I get 27091 ms on 6.0.3 and 3635 ms on the PR.

Curious about the difference.
My crossgen:ed build has 177 ms for 1000 iterations vs 1312 ms on 6.0.3.

Start-PSBuild -Crossgen -Configuration release

{
string errorMsg = string.Format(CultureInfo.InvariantCulture,
WebCmdletStrings.DuplicateKeysInJsonString, entry.Key);
var errorMsg = string.Format(CultureInfo.InvariantCulture, WebCmdletStrings.DuplicateKeysInJsonString, entry.Key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: CurrentCulture should be used when constructing error message. A flaw in the original code.

}
obj = returnHashTable ?
PopulateHashTableFromJDictionary(dictionary, out error) :
PopulateFromJDictionary(dictionary, memberTracker, out error);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reaching here, we already know the size of the JObject, so maybe we can replace memberTracker with new DuplicateMemberSet(dictionary.Count)? (just like what you did in other places)
Then DefaultMemberSetCapacity won't be needed.

foreach (var entry in entries)
{
if (string.IsNullOrEmpty(entry.Key))
{
string errorMsg = string.Format(CultureInfo.InvariantCulture,
WebCmdletStrings.EmptyKeyInJsonString);
var errorMsg = string.Format(CultureInfo.InvariantCulture, WebCmdletStrings.EmptyKeyInJsonString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: CurrentCulture should be used when constructing error strings. It's a flaw in the original code.

{
string errorMsg = string.Format(CultureInfo.InvariantCulture,
WebCmdletStrings.KeysWithDifferentCasingInJsonString, property.Name, entry.Key);
var errorMsg = string.Format(CultureInfo.InvariantCulture, WebCmdletStrings.KeysWithDifferentCasingInJsonString, propertyName, entry.Key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: CurrentCulture should be used when constructing error message.

@@ -255,8 +246,7 @@ private static Hashtable PopulateHashTableFromJDictionary(JObject entries, out E
// does not throw when encountering duplicates and just uses the last entry.
if (result.ContainsKey(entry.Key))
{
string errorMsg = string.Format(CultureInfo.InvariantCulture,
WebCmdletStrings.DuplicateKeysInJsonString, entry.Key);
string errorMsg = string.Format(CultureInfo.InvariantCulture, WebCmdletStrings.DuplicateKeysInJsonString, entry.Key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: same here

@daxian-dbw
Copy link
Member

Good perf work @powercode! The changes look good to me in general. Left a few minor comments.

@daxian-dbw daxian-dbw self-assigned this Aug 8, 2018
Copy link
Member

@daxian-dbw daxian-dbw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it! 🎉

@powercode
Copy link
Collaborator Author

I looked a bit at the difference in perf to Windows PowerShell.

Newtonsoft.Json.JsonConvert.DeserializeObject is slower than System.Web.Script.Serialization.JavaScriptObjectDeserializer. Looking at the code to see if I can figure out what the root cause is.

Copy link
Contributor

@dantraMSFT dantraMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the minor suggestion, LGTM.

}
obj = returnHashTable ?
PopulateHashTableFromJDictionary(dictionary, out error) :
PopulateFromJDictionary(dictionary, new DuplicateMemberSet(dictionary.Count), out error);
}
else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion: the way the if/else block is structured, it implies a path that is not covered. I suggest using if/else if/else and use the else block to provide an explicit comment.

if (object isJObject dictionary)
{}
else if (obj is JArray list)
{}
else
{ # nothing to do; return the obj as is.
}

Copy link
Collaborator Author

@powercode powercode Aug 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm moving the return into the try statement so I get

if (object is JObject dictionary)
{...}
else if (obj is JArray list)
{///}
// if obj is a anything but JArray or JObject, we just return as is
return obj;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it's clear anyway that we return obj as is for the other cases.

Or how about

switch (obj)
{
    case JObject dictionary:
        return returnHashTable
                   ? PopulateHashTableFromJDictionary(dictionary, out error)
                   : PopulateFromJDictionary(dictionary, new DuplicateMemberSet(dictionary.Count), out error);
    case JArray list:
        return returnHashTable
                   ? PopulateHashTableFromJArray(list, out error)
                   : PopulateFromJArray(list, out error);
    default: return obj;
}

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes the alternative really clear.

@iSazonov iSazonov changed the title Perf: Improve performance on json to psobject conversion Improve performance on json to psobject conversion Aug 9, 2018
}

/// <summary>
/// Convert a Json string back to an object of type PSObject or Hashtable depending on parameter <paramref name="returnHashTable"/>.
/// Convert a Json string back to an object of type PSObject or HashTable depending on parameter <paramref name="returnHashTable"/>.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class name is Hashtable not HashTable.

foreach (var entry in entries)
{
if (string.IsNullOrEmpty(entry.Key))
{
string errorMsg = string.Format(CultureInfo.InvariantCulture,
WebCmdletStrings.EmptyKeyInJsonString);
var errorMsg = string.Format(CultureInfo.CurrentCulture, WebCmdletStrings.EmptyKeyInJsonString);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use our help method StringUtil.Format. Below too.

{
error = null;
PSObject result = new PSObject();
var result = new PSObject(entries.Count);
foreach (var entry in entries)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to replace the var with explicit type. Please.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type is clearly stated by the ctor. Why repeat it?

Our coding guidelines has

var example = obj as Example;

as an example. I don't see how this is any different. There can be no confusion as to what type result has.

In general, I think it would be better to consider types as something the compiler uses to ensure we don't use something in an unintended way.

Personally, I find all the type annotations reducing readability, since they get in the way of the essence.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, as background, I really recommend Herb Sutters https://herbsutter.com/2013/08/12/gotw-94-solution-aaa-style-almost-always-auto/

Copy link
Collaborator Author

@powercode powercode Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@iSazonov iSazonov Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we both say about foreach (var entry in entries) ? Seems you comment var result = new PSObject(entries.Count);.

The foreach (var entry in entries) is not your code so it is my minor comment with "Please". Feel free to ignore.

Prevously we discussed this - not all code readers use power IDEs with IntelliSense and it is better for readability to use explicit type until this type is not obvious from the name of a variable or local context. Compare: foreach (var entry in intEntries) and foreach (var entry in entries) .
Above the entries variable is defined as JObject entries - at first glance it is not at all clear what type of the entry is.

Copy link
Collaborator Author

@powercode powercode Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the type of entry is of interest here. It is some entity that has a Key and a Value. Should that type change, nothing in the code would need to change, as long as it has a Key and a Value.
That is captured nicely with the use of var.

I think we should only specify the type explicitly when it is important for some reason that the value is of a certain type. That way, it calls attention to it - something interesting is happening here since we explicitly specified the type. It may be that a variable is declared as a base type instead of the actual type returned by a function.

The C# coding conventions on msdn has actually captured this:

Use implicit typing for local variables when the type of the variable is obvious from the right side of the assignment, or when the precise type is not important.

Use implicit typing to determine the type of the loop variable in for and foreach loops.

The use of var and an explicit type is not just a matter of taste. They express different things. The explicit type should be used then the type matters, not as some security blanket.

var should be used when we want to express that the type is not important here, as long as it walks like a duck and quacks like a duck.

It is a semantics change. Specifying a type may for example trigger implicit type conversion operators, or throw exceptions (in the foreach case).

{
JValue theValue = entry.Value as JValue;
// Value
JValue theValue = (JValue)entry.Value;
result.Properties.Add(new PSNoteProperty(entry.Key, theValue.Value));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I remember, we discussed this earlier and concluded that it is safer to use as in the place.

/cc @markekraus

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with using casting when we are for sure expecting the cast-to type.
If the value does happen to be a different type, then it's a bug that needs to be fixed, and casting operator will make the bug to surface easily (while as in this case might make the bug harder to be noticed).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a bug that needs to be fixed

@daxian-dbw If I remember right in previous discussion we concluded that we don't want an exception here. So I ping @markekraus to confirm or reject this.

Also we could consider the change as out of the PR and revert and open a tracking issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iSazonov I don't recall any reason why we need as over casting here.

@@ -142,10 +124,10 @@ private static PSObject PopulateFromJDictionary(JObject entries, out ErrorRecord

// Case sensitive duplicates should normally not occur since JsonConvert.DeserializeObject
// does not throw when encountering duplicates and just uses the last entry.
if (result.Properties.Any(psPropertyInfo => psPropertyInfo.Name.Equals(entry.Key, StringComparison.InvariantCulture)))
if (memberTracker.TryGetValue(entry.Key, out var maybePropertyName)
&& string.Compare(entry.Key, maybePropertyName, StringComparison.CurrentCulture) == 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case if we have a duplicate key we return "null". So we should remove the string.Compare at all.

/cc @markekraus What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We throw different errors in different cases. A duplicate key with same casing is a hard "NO", while a duplicate key with different casing is a workaround-able.

{
result.Add(((JValue)element).Value);
// Value
result[index] = ((JValue)element).Value;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same about as.

/cc @markekraus

{
// Value
JValue theValue = entry.Value as JValue;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same about as. Here we don't use cast - the pattern we approved previously.

/cc @markekraus

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a bit inconsistent in this file.

{
result.Add(((JValue)element).Value);
// Value
result[index] = ((JValue)element).Value;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same about as.

/cc @markekraus

@@ -24,45 +19,49 @@ namespace Microsoft.PowerShell.Commands
[SuppressMessage("Microsoft.Naming", "CA1704:IdentifiersShouldBeSpelledCorrectly")]
public static class JsonObject
{
private const int maxDepthAllowed = 100;
private class DuplicateMemberSet : HashSet<string>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DuplicateMemberHashSet looks more readable.

private const int maxDepthAllowed = 100;
private class DuplicateMemberSet : HashSet<string>
{
public DuplicateMemberSet(int capacity) : base(capacity, StringComparer.CurrentCultureIgnoreCase)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect StringComparer.InvariantCultureIgnoreCase. Why would we want to do culture sensetive compares (if we get Json string from Internet)?
This is consistent with the comment in the line 125.

/cc @markekraus

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iSazonov Regarding HashTable: StyleCop complains that the word Hashtable is spelled incorrectly, hence the change. I'm not religious about it, and can go either way. Also, the parameter is named returnHashTable. Confusing. @dantraMSFT, @markekraus Any native English speaker care to weigh in here? Maybe change all references to Hashtable casing? And maybe add hashtable as a dictionary work.

HashTable looks ugly to me :)

Copy link
Collaborator Author

@powercode powercode Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are correct that CurrentCultureIgnoreCase is wrong. The previous code used the PSObject.Properties, and that uses OrdinalIgnoreCase, for it's member dictionary.

Not sure if we should use InvariantCultureIgnoreCase or OrdinalIgnoreCase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for OrdinalIgnoreCase.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about OrdinalIgnoreCase

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@powercode I think if we reference Hashtable class in comments we should use quotes like 'Hashtable' to bypass StypeCop warnings. As for variable names we should follow Camel style so returnHashTable is ok.

@iSazonov iSazonov requested a review from markekraus August 9, 2018 09:11
Using OrdinalIgnoreCase for DuplicateMemberHashSet.

Renaming DuplicateMemberSet -> DuplicateMemberHashSet
Renaming returnHashTable -> returnHashtable

Changing help to not use the HashTable casing.
@powercode
Copy link
Collaborator Author

powercode commented Aug 10, 2018

Regarding the as controversy: how about using a type switch on the JTokens?

                switch (entry.Value)
                {
                    case JArray list:
                    {
                        // Array
                        var listResult = PopulateHashTableFromJArray(list, out error);
                        if (error != null)
                        {
                            return null;
                        }

                        result.Add(entry.Key, listResult);
                        break;
                    }
                    case JObject dic:
                    {
                        // Dictionary
                        var dicResult = PopulateHashTableFromJDictionary(dic, out error);
                        if (error != null)
                        {
                            return null;
                        }

                        result.Add(entry.Key, dicResult);
                        break;
                    }
                    case JValue value:
                    {
                        result.Add(entry.Key, value.Value);
                        break;
                    }
                }

Doesn't that clearly capture what we want to express?

@iSazonov
Copy link
Collaborator

Good catch! I agree with using a type switch.

@powercode
Copy link
Collaborator Author

powercode commented Aug 10, 2018

@iSazonov I'm getting code factor errors on

switch (x)
{
    case 1:
    {
        break;
    }
    case 2:
    {
        break;
    }
}

It seems to want it like

switch (x)
{
    case 1:
    {
        break;
    }

    case 2:
    {
        break;
    }
}

That feels strange. Do you know if it is intentional or a side effect?

Copy link
Collaborator

@iSazonov iSazonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@markekraus
Copy link
Contributor

@powercode I believe that spacing rule is intentional. It is easier to read code when there is some white space separating the end of one code block and the beginning of another.

@iSazonov
Copy link
Collaborator

@markekraus Thanks for review!

@daxian-dbw daxian-dbw merged commit 1243891 into PowerShell:master Aug 10, 2018
@TravisEz13 TravisEz13 added the CL-General Indicates that a PR should be marked as a general cmdlet change in the Change Log label Mar 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CL-General Indicates that a PR should be marked as a general cmdlet change in the Change Log
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants