Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing example for Xml insert #1474

Closed
wants to merge 7 commits into from

Conversation

jmortlock
Copy link
Contributor

Example of trying to insert an XmlEntry into doctrine.

See #1472

@norberttech
Copy link
Member

thank you @jmortlock I'm gonna take a look a bit later today!

Copy link
Contributor

github-actions bot commented Feb 12, 2025

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+------------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev           |
+-----------------------+-------------------+------+-----+-----------------+------------------+------------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.800mb +0.01%  | 572.668ms +3.75% | ±0.15% -84.55%   |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.873mb +0.01%  | 1.053s -1.66%    | ±0.37% -55.82%   |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 86.318mb +0.00% | 928.925ms +3.61% | ±1.96% +1061.58% |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.530mb +0.01%  | 35.737ms +0.17%  | ±1.04% -26.32%   |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.504mb +0.01%  | 607.502ms +0.56% | ±0.09% -92.73%   |
+-----------------------+-------------------+------+-----+-----------------+------------------+------------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 127.326mb +0.00% | 75.672ms +6.82% | ±1.75% -14.48% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 63.999mb +0.00%  | 108.481ms +3.45% | ±1.56% +64.71%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 84.346mb +0.00%  | 104.624ms +5.96% | ±2.10% +298.82% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 161.188mb +0.00% | 20.776s +0.84%   | ±0.36% +8.57%   |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.996mb +0.00%  | 32.385ms +2.35%  | ±1.23% +143.53% |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark         | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 105.969mb +0.00% | 466.642ms +2.53% | ±1.93% +67.17%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 55.160mb +0.00%  | 233.062ms +1.59% | ±0.62% -12.42%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.682mb +0.00%  | 52.352ms +2.21%  | ±0.79% -76.72%  |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 43.804mb +0.00%  | 363.017ms -1.89% | ±0.25% -75.49%  |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 11.614mb +0.00%  | 73.422ms -0.61%  | ±0.17% -76.91%  |
| RowsBench         | bench_chunk_10_on_10k      | 2    | 3   | 97.013mb +0.00%  | 3.969ms +16.42%  | ±2.79% -2.69%   |
| RowsBench         | bench_diff_left_1k_on_10k  | 2    | 3   | 114.298mb +0.00% | 188.076ms +1.61% | ±0.30% -52.50%  |
| RowsBench         | bench_diff_right_1k_on_10k | 2    | 3   | 97.018mb +0.00%  | 19.300ms +2.00%  | ±0.15% -76.24%  |
| RowsBench         | bench_drop_1k_on_10k       | 2    | 3   | 97.887mb +0.00%  | 2.030ms +13.85%  | ±1.87% +274.93% |
| RowsBench         | bench_drop_right_1k_on_10k | 2    | 3   | 97.887mb +0.00%  | 2.022ms +4.75%   | ±0.08% -96.00%  |
| RowsBench         | bench_entries_on_10k       | 2    | 3   | 96.048mb +0.00%  | 5.932ms +12.44%  | ±2.60% -5.00%   |
| RowsBench         | bench_filter_on_10k        | 2    | 3   | 96.577mb +0.00%  | 18.029ms +8.33%  | ±0.38% -50.07%  |
| RowsBench         | bench_find_on_10k          | 2    | 3   | 96.577mb +0.00%  | 18.778ms +12.58% | ±2.35% +775.53% |
| RowsBench         | bench_find_one_on_10k      | 10   | 3   | 95.268mb +0.00%  | 2.094μs +9.86%   | ±2.28% -6.45%   |
| RowsBench         | bench_first_on_10k         | 10   | 3   | 95.268mb +0.00%  | 0.500μs 0.00%    | ±0.00% 0.00%    |
| RowsBench         | bench_flat_map_on_1k       | 2    | 3   | 104.486mb +0.00% | 16.735ms +12.18% | ±3.55% +137.84% |
| RowsBench         | bench_map_on_10k           | 2    | 3   | 134.554mb +0.00% | 83.265ms +11.48% | ±0.79% -49.40%  |
| RowsBench         | bench_merge_1k_on_10k      | 2    | 3   | 97.097mb +0.00%  | 2.382ms +62.95%  | ±3.18% +292.03% |
| RowsBench         | bench_partition_by_on_10k  | 2    | 3   | 100.397mb +0.00% | 69.860ms +5.90%  | ±1.78% +72.27%  |
| RowsBench         | bench_remove_on_10k        | 2    | 3   | 98.150mb +0.00%  | 4.817ms +20.35%  | ±3.42% +76.16%  |
| RowsBench         | bench_sort_asc_on_1k       | 2    | 3   | 95.559mb +0.00%  | 44.750ms +8.02%  | ±0.93% -16.23%  |
| RowsBench         | bench_sort_by_on_1k        | 2    | 3   | 95.560mb +0.00%  | 44.237ms +5.12%  | ±2.21% +206.57% |
| RowsBench         | bench_sort_desc_on_1k      | 2    | 3   | 95.559mb +0.00%  | 43.604ms +4.40%  | ±2.48% +246.81% |
| RowsBench         | bench_sort_entries_on_1k   | 2    | 3   | 97.709mb +0.00%  | 8.588ms +4.07%   | ±1.59% +98.13%  |
| RowsBench         | bench_sort_on_1k           | 2    | 3   | 95.459mb +0.00%  | 30.933ms +4.99%  | ±1.06% -42.58%  |
| RowsBench         | bench_take_1k_on_10k       | 10   | 3   | 95.268mb +0.00%  | 18.466μs +19.75% | ±3.52% -4.37%   |
| RowsBench         | bench_take_right_1k_on_10k | 10   | 3   | 95.268mb +0.00%  | 21.249μs +32.80% | ±3.67% +259.22% |
| RowsBench         | bench_unique_on_1k         | 2    | 3   | 114.298mb +0.00% | 192.521ms +0.69% | ±1.22% +53.13%  |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+

Copy link

codecov bot commented Feb 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 39.59%. Comparing base (74c2a81) to head (e2c6d7f).
Report is 4 commits behind head on 1.x.

Additional details and impacted files
@@             Coverage Diff             @@
##              1.x    #1474       +/-   ##
===========================================
- Coverage   83.04%   39.59%   -43.45%     
===========================================
  Files         664      665        +1     
  Lines       17858    17867        +9     
===========================================
- Hits        14830     7075     -7755     
- Misses       3028    10792     +7764     
Components Coverage Δ
etl 84.88% <ø> (-0.90%) ⬇️
cli 1.53% <ø> (-85.21%) ⬇️
lib-array-dot 46.99% <ø> (-47.55%) ⬇️
lib-azure-sdk 0.00% <ø> (-62.57%) ⬇️
lib-doctrine-dbal-bulk 0.00% <ø> (-97.37%) ⬇️
lib-filesystem 45.06% <ø> (-31.69%) ⬇️
lib-parquet 0.00% <ø> (-84.34%) ⬇️
lib-parquet-viewer 0.00% <ø> (-82.03%) ⬇️
lib-rdsl 73.27% <ø> (-13.83%) ⬇️
lib-snappy 0.00% <ø> (-90.70%) ⬇️
bridge-filesystem-async-aws 0.00% <ø> (-90.39%) ⬇️
bridge-filesystem-azure 0.00% <ø> (-89.93%) ⬇️
bridge-monolog-http 0.00% <ø> (-96.39%) ⬇️
symfony-http-foundation 0.00% <ø> (-77.11%) ⬇️
adapter-chartjs 0.00% <ø> (-86.46%) ⬇️
adapter-csv 68.33% <ø> (-21.24%) ⬇️
adapter-doctrine 0.61% <ø> (-88.08%) ⬇️
adapter-elasticsearch 0.00% <ø> (-97.20%) ⬇️
adapter-google-sheet 0.00% <ø> (-78.05%) ⬇️
adapter-http 0.00% <ø> (-59.16%) ⬇️
adapter-json 0.00% <ø> (-90.63%) ⬇️
adapter-logger 0.00% <ø> (-53.85%) ⬇️
adapter-meilisearch 0.00% <ø> (-97.76%) ⬇️
adapter-parquet 0.00% <ø> (-80.86%) ⬇️
adapter-text 68.88% <ø> (-15.56%) ⬇️
adapter-xml 0.00% <ø> (-83.16%) ⬇️

@github-actions github-actions bot added size: S and removed size: XS labels Feb 12, 2025
@norberttech
Copy link
Member

hey @jmortlock I played with your tests, and the reason why they are failing is because you are using DOMDocument::loadHTML() which adds HTML tags in order to make it a valid html document.
If you use DOMDocument::loadXML() the html part shouldn't be there anymore

@jmortlock
Copy link
Contributor Author

I've updated the tests to use loadXml there all still failing.

I guess the question that comes to my mind, if we are using JSON for example do we actually want to auto infer DOMDocument / Xml ? I know for my use cases which is embded html in a json document I really just expect a string, if i wanted to purify/sanitize/strip_tags those apis are also expecting string inputs.

@norberttech
Copy link
Member

I guess the question that comes to my mind, if we are using JSON for example do we actually want to auto infer DOMDocument / Xml ? I know for my use cases which is embded html in a json document I really just expect a string, if i wanted to purify/sanitize/strip_tags those apis are also expecting string inputs.

I would keep this behavior as it's a generic one, unrelated to source/destination (it comes from EntryFactory) and in your case ->withSchema() and tell the EntryFactory that given column is just a string

@norberttech
Copy link
Member

@jmortlock, but based on your work, I was able to find inconsistent behavior in casting XMLs to strings. I'm closing this PR and will open another one with fixes (that will include your commits and keep your contribution).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants