Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] Enhance ParseXML function with intuitive parse format and add MarshalXML function to convert parsed output back to XML #35075

Closed
wants to merge 3 commits into from

Conversation

shazlehu
Copy link
Contributor

@shazlehu shazlehu commented Sep 9, 2024

Description:

Current implementation of ParseXML

The current implementation of the ParseXML produces output that is difficult for users to manipulate, particularly when the XML is complicated. The current design loses the sense of key:value pairs which are helpful for understanding the data, and, more importantly, always parses the XML into arrays, which are difficult to manipulate in OTTL, and don't provide a stable path that can be used to access the data.

This change proposes an additional parsing version for the ParseXML function that would parse XML into a nested map, where the keys are the paths to the data, and the values are the data itself, or in the case of nodes with both values and attributes, using the special keys xml_attributes and xml_value. It still maintains the order of nodes using the special key xml_ordering. Additionally, the proposed implementation can optionally "flatten" arrays, such as the EventData node in the following XML, into a single map. The arrays must have a single, common attribute, and no other children (in this case, EventData has only Data children). This would allow users to access the data more intuitively and would allow for easier manipulation of the data in OTTL.

This proposed change also includes a new function, MarshalXML, to reconstruct the XML from the parsed map. This would allow users to manipulate the data in OTTL, and then convert it back to XML, typically for ingestion into another system.

The proposed implementation would be backward compatible with the current implementation, allowing the user to specify the new implementation as an optional version parameter to the ParseXML function.

Consider the following XML from a Windows event log, a common telemetry source:

<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
    <System>
        <Provider Name='Microsoft-Windows-Security-Auditing'
            Guid='{54849625-5478-4994-a5ba-3e3b0328c30d}' />
        <EventID>4625</EventID>
        <Version>0</Version>
        <Level>0</Level>
        <Task>12544</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8010000000000000</Keywords>
        <TimeCreated SystemTime='2024-09-04T08:38:09.7477579Z' />
        <EventRecordID>1361885</EventRecordID>
        <Correlation ActivityID='{b67ee0c2-a671-0001-5f6b-82e8c1eeda01}' />
        <Execution ProcessID='656' ThreadID='2276' />
        <Channel>Security</Channel>
        <Computer>samuel-vahala</Computer>
        <Security />
    </System>
    <EventData>
        <Data Name='SubjectUserSid'>S-1-0-0</Data>
        <Data Name='SubjectUserName'>-</Data>
        <Data Name='SubjectDomainName'>-</Data>
        <Data Name='SubjectLogonId'>0x0</Data>
        <Data Name='TargetUserSid'>S-1-0-0</Data>
        <Data Name='TargetUserName'>mr_rogers</Data>
        <Data Name='TargetDomainName'>-</Data>
        <Data Name='Status'>0xc000006d</Data>
        <Data Name='FailureReason'>%%2313</Data>
        <Data Name='SubStatus'>0xc0000064</Data>
        <Data Name='LogonType'>3</Data>
        <Data Name='LogonProcessName'>NtLmSsp </Data>
        <Data Name='AuthenticationPackageName'>NTLM</Data>
        <Data Name='WorkstationName'>D-508</Data>
        <Data Name='TransmittedServices'>-</Data>
        <Data Name='LmPackageName'>-</Data>
        <Data Name='KeyLength'>0</Data>
        <Data Name='ProcessId'>0x0</Data>
        <Data Name='ProcessName'>-</Data>
        <Data Name='IpAddress'>194.169.175.45</Data>
        <Data Name='IpPort'>0</Data>
    </EventData>
</Event>

The current ParseXML function renders this output, in a flattened view, using dot-notation:

Path Value
children.0.children.0["attributes"]["Guid"] {54849625-5478-4994-a5ba-3e3b0328c30d}
children.0.children.0["attributes"]["Name"] Microsoft-Windows-Security-Auditing
children.0.children.0["tag"] Provider
children.0.children.1["content"] 4625
children.0.children.1["tag"] EventID
children.0.children.10["attributes"]["ProcessID"] 656
children.0.children.10["attributes"]["ThreadID"] 4652
children.0.children.10["tag"] Execution
children.0.children.11["content"] Security
children.0.children.11["tag"] Channel
children.0.children.12["content"] samuel-windows
children.0.children.12["tag"] Computer
children.0.children.13["tag"] Security
children.0.children.2["content"] 0
children.0.children.2["tag"] Version
children.0.children.3["content"] 0
children.0.children.3["tag"] Level
children.0.children.4["content"] 12544
children.0.children.4["tag"] Task
children.0.children.5["content"] 0
children.0.children.5["tag"] Opcode
children.0.children.6["content"] 0x8010000000000000
children.0.children.6["tag"] Keywords
children.0.children.7["attributes"]["SystemTime"] 2024-08-15T20:03:15.9454501Z
children.0.children.7["tag"] TimeCreated
children.0.children.8["content"] 57220
children.0.children.8["tag"] EventRecordID
children.0.children.9["attributes"]["ActivityID"] {b67ee0c2-a671-0001-5f6b-82e8c1eeda01}
children.0.children.9["tag"] Correlation
children.0.tag System
children.1.children.0["attributes"]["Name"] SubjectUserSid
children.1.children.0["content"] S-1-0-0
children.1.children.0["tag"] Data
children.1.children.1["attributes"]["Name"] SubjectUserName
children.1.children.1["content"] -
children.1.children.1["tag"] Data
children.1.children.10["attributes"]["Name"] LogonType
children.1.children.10["content"] 3
children.1.children.10["tag"] Data
children.1.children.11["attributes"]["Name"] LogonProcessName
children.1.children.11["content"] NtLmSsp
children.1.children.11["tag"] Data
children.1.children.12["attributes"]["Name"] AuthenticationPackageName
children.1.children.12["content"] NTLM
children.1.children.12["tag"] Data
children.1.children.13["attributes"]["Name"] WorkstationName
children.1.children.13["content"] -
children.1.children.13["tag"] Data
children.1.children.14["attributes"]["Name"] TransmittedServices
children.1.children.14["content"] -
children.1.children.14["tag"] Data
children.1.children.15["attributes"]["Name"] LmPackageName
children.1.children.15["content"] -
children.1.children.15["tag"] Data
children.1.children.16["attributes"]["Name"] KeyLength
children.1.children.16["content"] 0
children.1.children.16["tag"] Data
children.1.children.17["attributes"]["Name"] ProcessId
children.1.children.17["content"] 0x0
children.1.children.17["tag"] Data
children.1.children.18["attributes"]["Name"] ProcessName
children.1.children.18["content"] -
children.1.children.18["tag"] Data
children.1.children.19["attributes"]["Name"] IpAddress
children.1.children.19["content"] 103.151.140.135
children.1.children.19["tag"] Data
children.1.children.2["attributes"]["Name"] SubjectDomainName
children.1.children.2["content"] -
children.1.children.2["tag"] Data
children.1.children.20["attributes"]["Name"] IpPort
children.1.children.20["content"] 0
children.1.children.20["tag"] Data
children.1.children.3["attributes"]["Name"] SubjectLogonId
children.1.children.3["content"] 0x0
children.1.children.3["tag"] Data
children.1.children.4["attributes"]["Name"] TargetUserSid
children.1.children.4["content"] S-1-0-0
children.1.children.4["tag"] Data
children.1.children.5["attributes"]["Name"] TargetUserName
children.1.children.5["content"] Administrator
children.1.children.5["tag"] Data
children.1.children.6["attributes"]["Name"] TargetDomainName
children.1.children.6["content"] domain
children.1.children.6["tag"] Data
children.1.children.7["attributes"]["Name"] Status
children.1.children.7["content"] 0xc000006d
children.1.children.7["tag"] Data
children.1.children.8["attributes"]["Name"] FailureReason
children.1.children.8["content"] %%2313
children.1.children.8["tag"] Data
children.1.children.9["attributes"]["Name"] SubStatus
children.1.children.9["content"] 0xc000006a
children.1.children.9["tag"] Data
children.1.tag EventData
tag Event

This view is extremely opaque, and difficult to manipulate in OTTL. The current design also doesn't provide a stable path that can be used to access the data. For example, the path to the EventID is children.0.children.1["content"], which is not intuitive.

Proposed implementation:

The proposed implementation would parse the XML into nested maps. In the case where a node has multiple children with the same name, the children would be stored in an array. If a node has no attributes or children, its value would be stored as a string. Otherwise, it would be stored in the special key xml_value. The xml_attributes key would store the attributes of the node, and the special key xml_ordering would store the order of the children of the node, which is important for marshaling the nodes in the same order back into XML. For example, the schema definition for <EventData> in a Windows Event log specifies that the <Data> nodes are a sequence (<xs:sequence>), so the order of the nodes should be preserved:

<xs:complexType name="EventDataType">
    <xs:sequence>
        <xs:choice
            minOccurs="0"
            maxOccurs="unbounded"
        >
            <xs:element name="Data"
                type="DataType"
             />
            <xs:element name="ComplexData"
                type="ComplexDataType"
             />
        </xs:choice>
        <xs:element name="Binary"
            type="hexBinary"
            minOccurs="0"
         />
    </xs:sequence>
    <xs:attribute name="Name"
        type="string"
        use="optional"
     />
</xs:complexType>

We prefix the special keys with xml_ to avoid conflicts with the actual data. It is a limitation that if the original XML uses tags xml_value, xml_attributes, xml_ordering, or xml_flattened_array, it will not parse correctly. A future enhancement could be to add an optional parameter to the function to allow users to specify a different prefix.

Path Value
Event.EventData.Data.0["xml_attributes"]["Name"] SubjectUserSid
Event.EventData.Data.0["xml_value"] S-1-0-0
Event.EventData.Data.1["xml_attributes"]["Name"] SubjectUserName
Event.EventData.Data.10["xml_attributes"]["Name"] LogonType
Event.EventData.Data.10["xml_value"] 3
Event.EventData.Data.11["xml_attributes"]["Name"] LogonProcessName
Event.EventData.Data.11["xml_value"] NtLmSsp
Event.EventData.Data.12["xml_attributes"]["Name"] AuthenticationPackageName
Event.EventData.Data.12["xml_value"] NTLM
Event.EventData.Data.13["xml_attributes"]["Name"] WorkstationName
Event.EventData.Data.14["xml_attributes"]["Name"] TransmittedServices
Event.EventData.Data.15["xml_attributes"]["Name"] LmPackageName
Event.EventData.Data.16["xml_attributes"]["Name"] KeyLength
Event.EventData.Data.16["xml_value"] 0
Event.EventData.Data.17["xml_attributes"]["Name"] ProcessId
Event.EventData.Data.17["xml_value"] 0x0
Event.EventData.Data.18["xml_attributes"]["Name"] ProcessName
Event.EventData.Data.19["xml_attributes"]["Name"] IpAddress
Event.EventData.Data.19["xml_value"] 103.151.140.135
Event.EventData.Data.2["xml_attributes"]["Name"] SubjectDomainName
Event.EventData.Data.20["xml_attributes"]["Name"] IpPort
Event.EventData.Data.20["xml_value"] 0
Event.EventData.Data.3["xml_attributes"]["Name"] SubjectLogonId
Event.EventData.Data.3["xml_value"] 0x0
Event.EventData.Data.4["xml_attributes"]["Name"] TargetUserSid
Event.EventData.Data.4["xml_value"] S-1-0-0
Event.EventData.Data.5["xml_attributes"]["Name"] TargetUserName
Event.EventData.Data.5["xml_value"] Administrator
Event.EventData.Data.6["xml_attributes"]["Name"] TargetDomainName
Event.EventData.Data.6["xml_value"] domain
Event.EventData.Data.7["xml_attributes"]["Name"] Status
Event.EventData.Data.7["xml_value"] 0xc000006d
Event.EventData.Data.8["xml_attributes"]["Name"] FailureReason
Event.EventData.Data.8["xml_value"] %%2313
Event.EventData.Data.9["xml_attributes"]["Name"] SubStatus
Event.EventData.Data.9["xml_value"] 0xc000006a
Event.EventData.xml_ordering Data.0,Data.1,Data.2,Data.3,Data.4,Data.5,Data.6,Data.7,Data.8,Data.9,Data.10,Data.11,Data.12,Data.13,Data.14,Data.15,Data.16,Data.17,Data.18,Data.19,Data.20
Event.System.Channel Security
Event.System.Computer samuel-windows
Event.System.Correlation.xml_attributes.ActivityID {b67ee0c2-a671-0001-5f6b-82e8c1eeda01}
Event.System.EventID 4625
Event.System.EventRecordID 57220
Event.System.Execution.xml_attributes.ProcessID 656
Event.System.Execution.xml_attributes.ThreadID 4652
Event.System.Keywords 0x8010000000000000
Event.System.Level 0
Event.System.Opcode 0
Event.System.Provider.xml_attributes.Guid {54849625-5478-4994-a5ba-3e3b0328c30d}
Event.System.Provider.xml_attributes.Name Microsoft-Windows-Security-Auditing
Event.System.Task 12544
Event.System.TimeCreated.xml_attributes.SystemTime 2024-08-15T20:03:15.9454501Z
Event.System.Version 0
Event.System.xml_ordering Provider,EventID,Version,Level,Task,Opcode,Keywords,TimeCreated,EventRecordID,Correlation,Execution,Channel,Computer,Security
Event.xml_attributes.xmlns http://schemas.microsoft.com/win/2004/08/events/event
Event.xml_ordering System,EventData

Additionally, the proposed implementation can optionally "flatten" arrays, such as the EventData node in the XML, into a single map. The arrays must have a single, common attribute, and no other children (in this case, EventData has only Data children). This would allow users to access the data in a more intuitive way, and would allow for easier manipulation of the data in OTTL.

Flattened Parse

Path Value
Event.EventData.Data.AuthenticationPackageName NTLM
Event.EventData.Data.FailureReason %%2313
Event.EventData.Data.IpAddress 103.151.140.135
Event.EventData.Data.IpPort 0
Event.EventData.Data.KeyLength 0
Event.EventData.Data.LogonProcessName NtLmSsp
Event.EventData.Data.LogonType 3
Event.EventData.Data.ProcessId 0x0
Event.EventData.Data.Status 0xc000006d
Event.EventData.Data.SubStatus 0xc000006a
Event.EventData.Data.SubjectLogonId 0x0
Event.EventData.Data.SubjectUserSid S-1-0-0
Event.EventData.Data.TargetDomainName domain
Event.EventData.Data.TargetUserName Administrator
Event.EventData.Data.TargetUserSid S-1-0-0
Event.EventData.Data.xml_flattened_array Name
Event.EventData.Data.xml_ordering SubjectUserSid,SubjectUserName,SubjectDomainName,SubjectLogonId,TargetUserSid,TargetUserName,TargetDomainName,Status,FailureReason,SubStatus,LogonType,LogonProcessName,AuthenticationPackageName,WorkstationName,TransmittedServices,LmPackageName,KeyLength,ProcessId,ProcessName,IpAddress,IpPort
Event.System.Channel Security
Event.System.Computer samuel-windows
Event.System.Correlation.xml_attributes.ActivityID {b67ee0c2-a671-0001-5f6b-82e8c1eeda01}
Event.System.EventID 4625
Event.System.EventRecordID 57220
Event.System.Execution.xml_attributes.ProcessID 656
Event.System.Execution.xml_attributes.ThreadID 4652
Event.System.Keywords 0x8010000000000000
Event.System.Level 0
Event.System.Opcode 0
Event.System.Provider.xml_attributes.Guid {54849625-5478-4994-a5ba-3e3b0328c30d}
Event.System.Provider.xml_attributes.Name Microsoft-Windows-Security-Auditing
Event.System.Task 12544
Event.System.TimeCreated.xml_attributes.SystemTime 2024-08-15T20:03:15.9454501Z
Event.System.Version 0
Event.System.xml_ordering Provider,EventID,Version,Level,Task,Opcode,Keywords,TimeCreated,EventRecordID,Correlation,Execution,Channel,Computer,Security
Event.xml_attributes.xmlns http://schemas.microsoft.com/win/2004/08/events/event
Event.xml_ordering System,EventData

MarshalXML

The MarshalXML function is able to reliably reconstruct the XML, with the limitation imposed by the GO XML library, which doesn't support self closing tags (In this example, tags like <Security />.) The Unit tests for the MarshalXML function use the ParseXML function to parse the XML, and then compare the output of the MarshalXML function to the original XML. The test XML purposely omits self-closing tags.

Link to tracking Issue: #35076

Testing:

Added Unit tests for the new version, previous tests still pass for version 1.

Documentation:

Added documentation for the new parse version to pkg/ottl/ottlfuncs/README.md

Copy link

linux-foundation-easycla bot commented Sep 9, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@mwear
Copy link
Member

mwear commented Sep 12, 2024

Copy link
Contributor

@kuiperda kuiperda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple nits

#### Parameters
| Name | Type | Description |
|------|------|-------------|
| `target` | Getter | A string that should be in XML format.If `target` is not a string, nil, or cannot be parsed as XML, `ParseXML` will return an error. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `target` | Getter | A string that should be in XML format.If `target` is not a string, nil, or cannot be parsed as XML, `ParseXML` will return an error. |
| `target` | Getter | A string that should be in XML format. If `target` is not a string, nil, or cannot be parsed as XML, `ParseXML` will return an error. |

#### Version 2 Parsing

Unmarshalling XML is done using the following rules:
1. All character data for an XML element is trimmed, joined, and placed into the `xml_value` field, or as the value of the map if the element has attributes or children.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 'placed into the XML field' and 'as the value of the map' should be swapped here

"00001": "Sam",
"00002": "Bob",
"00003": "Alice",
"xml_ordering": "00001,00003,00003",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"xml_ordering": "00001,00003,00003",
"xml_ordering": "00001,00002,00003",

@@ -304,6 +306,698 @@ func Test_ParseXML(t *testing.T) {
require.True(t, ok)

require.Equal(t, tt.want, resultMap.AsRaw())
// json, err := jsoniter.MarshalIndent(resultMap.AsRaw(), "", " ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

Comment on lines +697 to +714
{
xml: `<Log>This record has a collision<User id="0001"/><User id="0002"/></Log>`,
want: map[string]any{
"Log": map[string]any{
"xml_value": "This record has a collision",
"User": []any{map[string]any{
"xml_attributes": map[string]any{
"id": "0001",
},
}, map[string]any{
"xml_attributes": map[string]any{
"id": "0002",
},
}},
"xml_ordering": "User.0,User.1",
},
},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a duplicate of the test below but with no name

},
},
},
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add name here

// require.Equal(t, tt.want, rawMap)
t.Logf("Result: %s", formatMap(rawMap))

// re-marshal the result to compare the output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove but good to test this for sure.

Also re-add the require.Equal above

return true
}

func (a *xmlElement) MarshalXML(e *xml.Encoder, start xml.StartElement) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove but good for testing

@shazlehu shazlehu closed this Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants