
Implementation of ATLAS_Z0J_8TEV PT-Y and PT-M in the new format #2169

Merged
merged 1 commit into master from new_ATLAS_Z0J_8TEV on Nov 13, 2024

Conversation

achiefa
Contributor

@achiefa achiefa commented Oct 10, 2024

Report showing $(x,Q^2)$ map and theory-data comparisons for legacy implementation: legacy report (for reference)
The same report, but for the new implementation: new report

@achiefa achiefa self-assigned this Oct 10, 2024
@achiefa achiefa requested a review from enocera October 10, 2024 14:45
@achiefa achiefa marked this pull request as draft October 10, 2024 14:45
@Radonirinaunimi
Member

Btw @achiefa, how are you (re-)generating the YAML files? Do you have something like the filter.py? If so, could you also upload that file. We want to make sure that we'll be able to re-generate these data in the future.

@achiefa
Contributor Author

achiefa commented Oct 11, 2024

@Radonirinaunimi I'm not regenerating anything at the moment. I simply changed the name of the old kinematics to legacy. I still need to write the filter.py.

@@ -102,7 +107,7 @@ implemented_observables:
       description: Variable k3
       label: k3
       units: ''
-      file: kinematics_PT-M.yaml
+      file: kinematics_legacy_PT-M.yaml
Member

Just for reference, we don't need to keep the legacy kinematics; you can just remove these files.
The information should be the same as in the new ones, and the new ones have all the info we need (mainly the names of the variables).

Contributor Author

Thank you. However, I'd say that the information is not the same. The "old" kinematics used the mid value of the bin (squared), while the new one that I'm going to implement will contain the min and max values of the bin (non-squared). If you say that it doesn't make any difference and we can get rid of the old kinematics, I will remove them and implement the new kinematics as described above.

Member

Yes, you can remove the old ones. You can fill in min and max, and the midpoint will be filled in automatically (it should be the same as before).

The reason only the midpoint was used before is that validphys could not deal with the extra information; having the min-max is much better because then we can use these datasets to autogenerate runcards (for instance, the NNLOJET runcards need this information).
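As a sanity check on the point above, the midpoint computed from the new min/max edges should reproduce the legacy mid-of-bin values. A minimal sketch (the bin edges and legacy midpoints below are illustrative, not values from this dataset):

```python
# Illustrative check, not the actual filter.py:
# midpoints derived from (min, max) bin edges should match the legacy mids.
pt_bins = [(0.0, 2.0), (2.0, 4.0), (4.0, 8.0)]  # hypothetical (min, max) in GeV
legacy_mids = [1.0, 3.0, 6.0]                   # hypothetical legacy mid values

mids = [0.5 * (lo + hi) for lo, hi in pt_bins]
assert mids == legacy_mids
```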

Contributor Author

Ok, thank you so much. This is indeed what I was trying to understand. I'll go on.

@achiefa
Copy link
Contributor Author

achiefa commented Oct 11, 2024

Apparently, the tables in hepdata do not provide a full breakdown of the systematic uncertainties. Thus, I don't know how to reproduce the breakdown of systematic uncertainties reported in the legacy file. Should I only use the diagonal entries in the rawdata tables?

@Radonirinaunimi
Member

Apparently, the tables in hepdata do not provide a full breakdown of the systematic uncertainties. Thus, I don't know how to reproduce the breakdown of systematic uncertainties reported in the legacy file. Should I only use the diagonal entries in the rawdata tables?

I fear this will be a recurring theme as more datasets are re-implemented, so decisions will have to be taken with @enocera on a dataset-by-dataset basis (?).

I would point out, however, that usually (as is the case for this particular dataset) HepData includes the raw data in which the breakdown of the systematics is given, if one downloads the full record with the resource files. (So even the distinction between HepData and non-HepData is ambiguous.)

@achiefa
Contributor Author

achiefa commented Oct 11, 2024

Yes, I see the break-down in the resource files. But it's not really clear to me how to use this information.

@achiefa
Contributor Author

achiefa commented Oct 11, 2024

Ok, I think I sort of understood how to use the resource files to retrieve the old break-down of the uncertainties. So, what do you want me to do?

@enocera
Contributor

enocera commented Oct 14, 2024

@achiefa My suggestion is to implement the full breakdown of uncertainties whenever it can be retrieved from HepData. This is the case here, even if it amounts to looking into the "Resources".

@achiefa
Contributor Author

achiefa commented Oct 16, 2024

I've implemented the dataset with the full break-down of the systematic uncertainties. However, there are a few things that are worth mentioning:

  • The new implementation of PT-Y matches the legacy one. You can check that using the filter file, where I've implemented an automatic new-vs-legacy comparison for the central data and uncertainties.
  • For PT-M, the number of points provided in HepData is higher than in the legacy implementation. Thus, I could not reproduce the legacy version from HepData.
  • Within the new implementation of PT-M, I ran into a binning problem for table 39 in HepData. Specifically, there is a mismatch between the mid values of the pT bins in table 39 and those reported in ZcombPt_born_m66116_y0024/tab.dat, which is the file containing the breakdown of the sys uncertainties for the same table. It might be that the breakdown was worked out with a different bin size. I attach the log of the filter script, which gives more information (log.log). As you can see, the differences are really small, and only 3 of the bins are affected.

One last thing - is there an $(x,Q^2)$ map for PT-M?
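A bin-midpoint mismatch of the kind described above can be flagged with a small comparison like this (the midpoint values and the tolerance below are illustrative, not taken from the actual tables or from the filter script):

```python
import numpy as np

# Illustrative check, not the actual filter.py: compare pT bin midpoints
# from the HepData table against those in the rawdata breakdown file.
hepdata_mids = np.array([12.5, 17.5, 24.0, 33.0])   # hypothetical values, GeV
rawdata_mids = np.array([12.5, 17.5, 24.05, 33.1])  # hypothetical values, GeV

# Flag bins whose midpoints disagree beyond a relative tolerance
mismatch = ~np.isclose(hepdata_mids, rawdata_mids, rtol=1e-3)
for i in np.where(mismatch)[0]:
    print(f"bin {i}: table mid = {hepdata_mids[i]}, rawdata mid = {rawdata_mids[i]}")
```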

Member

@Radonirinaunimi Radonirinaunimi left a comment

Thanks for this @achiefa! I will need to look in detail into the mismatch in PT-M, however, here are some quick comments.

Also, whenever possible (in case the datapoints and everything correspond to the old implementation, as is the case for PT-Y), could you make sure that the covmats are the same?

from validphys.api import API
import numpy as np

# new_implementation / old_implementation are placeholders for the dataset names
inp1 = {"dataset_input": {"dataset": f"{new_implementation}"}, "theoryid": 40_000_000, "use_cuts": "internal"}
inp2 = {"dataset_input": {"dataset": f"{old_implementation}", "variant": "legacy"}, "theoryid": 40_000_000, "use_cuts": "internal"}
covmat1 = API.covmat_from_systematics(**inp1)
covmat2 = API.covmat_from_systematics(**inp2)

np.isclose(covmat1, covmat2).all()

This not only makes sure that the implementations are consistent but, most importantly, that the datasets can be loaded properly.

@achiefa
Contributor Author

achiefa commented Oct 17, 2024

For PT-Y the two covmats are the same.
Screenshot 2024-10-17 at 12 30 28

@Radonirinaunimi
Member

For PT-M, the number of points provided in HepData is higher than the legacy implementation. Thus, I could not reproduce the legacy version from HepData.

If the number of datapoints is different, then the re-implementation in the new format becomes a new dataset. Please refer to this comment for more context.

Within the new implementation of PT-M, I ran into a binning problem for table 39 in HepData. Specifically, there is a mismatch between the mid values of the pT bins in table 39 and those reported in ZcombPt_born_m66116_y0024/tab.dat, which is the file containing the breakdown of the sys uncertainties for the same table. It might be that the breakdown was worked out with a different bin size. I attach the log of the filter script, which gives more information (log.log). As you can see, the differences are really small, and only 3 of the bins are affected.

For this, let's simply use the rawdata, even for the bins/central values.

@scarlehoff
Member

For PT-Y the two covmats are the same.

Nice! Could you also check that the t0 covmat is the same, to be completely sure?

For PT-M, the number of points provided in HepData is higher than in the legacy implementation. Thus, I could not reproduce the legacy version from HepData.
If the number of datapoints is different, then the re-implementation in the new format becomes a new dataset. Please refer to #2170 (comment) for more context.

This is tricky indeed. Which points have changed? If they are points that we don't want to include, maybe it makes sense to remove them directly in the filter.py file (for instance, if they are points for pT < 30 GeV we can safely discard them, and then later on, if we need them, add an atlas_zoj_lowpt dataset)

@achiefa
Contributor Author

achiefa commented Oct 18, 2024

Nice! Could you also check that the t0 covmat is the same, to be completely sure?

I tried to use t0_covmat_from_systematics from the API, but I got this AttributeError

AttributeError: 'NoneType' object has no attribute 'is_polarized'

@scarlehoff
Member

The t0 also needs a pdf=NNPDF40_nnlo_as_01180 or t0set= or something like that
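A sketch of what the extra keys might look like; the exact key names (use_t0, t0pdfset) are assumptions based on typical validphys runcards, not confirmed in this thread, so they should be checked against the validphys documentation:

```python
# Hypothetical input for the t0 covmat; key names (use_t0, t0pdfset) are
# assumptions and may differ in the actual validphys API.
inp = {
    "dataset_input": {"dataset": "ATLAS_Z0J_8TEV_PT-Y"},
    "theoryid": 40_000_000,
    "use_cuts": "internal",
    "use_t0": True,
    "t0pdfset": "NNPDF40_nnlo_as_01180",
}
# from validphys.api import API
# t0_covmat = API.t0_covmat_from_systematics(**inp)
```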

@achiefa
Contributor Author

achiefa commented Oct 18, 2024

Ok, the t0 matrices are the same as well.

This is tricky indeed :___ Which points have changed? If they are points that we don't want to include maybe it makes sense to remove them directly in the filter.py file (for instance, if they are points for pT < 30 GeV we can safely discard them and then later on if we need it add a atlas_zoj_lowpt dataset)

That I didn't check. But I have a feeling that the new implementation adds a range in the mass of the lepton pair while using the same bins in pT. I'll check that.

@achiefa
Contributor Author

achiefa commented Nov 1, 2024

I have investigated the discrepancy in the PT-M distribution between the legacy and new versions. To summarise, the new implementation has six different kinematic regions for the lepton pair mass (in GeV), each of them with bins in rapidity:

 12.0  <  m_ll  <  20.0 
 20.0  <  m_ll  <  30.0
 30.0  <  m_ll  <  46.0
 46.0  <  m_ll  <  66.0
 66.0  <  m_ll  <  116.0
 116.0 <  m_ll  <  150.0

The legacy version doesn't have the second-to-last kin region in $M_{ll}$. On the other hand, the bins in rapidity and the respective central data agree between the two versions up to the fourth kin region ($46.0 < M_{ll} < 66.0$). The last kin region also matches in rapidity and central data. However, a different binning was used for this last kin region: in the legacy version $\textrm{mid}[M_{ll}] = 138$ GeV, whereas in the new implementation $\textrm{mid}[M_{ll}] = 133$ GeV. Note that for the legacy version I only have the mid value of the bins, but not the extremes.

I don't know if this tiny difference in the last kin region is a problem for the grids. If not, we could drop the second-to-last bin and use the same grids as in the legacy implementation.

@scarlehoff
Member

@enocera do you know why we are avoiding that bin?

RE the binning, @achiefa: if the central data is the same, it might be a mistake in the kinematic value here. Could you check whether the grid agrees with the values here or with the values in HepData?

@enocera
Contributor

enocera commented Nov 1, 2024

Dear @achiefa (cc @scarlehoff), I am a little confused about the description of this data set. There must indeed be 6 bins in the invariant mass of the final state, as @achiefa correctly mentions above. All of these are then single-differential in pT (and not in rapidity, as @achiefa seems to suggest above). The bin corresponding to the Z-mass peak (66.0 < m_ll < 116.0) also comes double differential in pT and in y. So what we did in the legacy implementation is the following: we took all the invariant mass bins, except the one on the Z-mass peak, that are single-differential in pT and we put them into the ATLAS_Z0J_8TEV_PT-M data set. We then took the double differential (pT-y) Z-mass peak bin and put it in the ATLAS_Z0J_8TEV_PT-Y data set. This is because the more differential we are the better for PDF determination. If we had put the Z-mass peak single-differential bin in ATLAS_Z0J_8TEV_PT-M we would have had double counting with ATLAS_Z0J_8TEV_PT-Y. I hope that this clarifies the situation.

@achiefa
Contributor Author

achiefa commented Nov 1, 2024

Hi @enocera, I apologise. I meant to say pT and not rapidity, as you pointed out.

Note that the invariant mass bins are in the kinematic region $0 < y_{ll} < 2.4$, including the one at the Z-mass peak. So I'm confused, because $0 < y_{ll} < 2.4$ covers all the bins in rapidity for the PT-Y distribution. How did you include the invariant mass bin at the Z-mass peak in the PT-Y distribution, then?

@enocera
Contributor

enocera commented Nov 1, 2024

@achiefa as you can see from Hepdata
https://www.hepdata.net/record/ins1408516
Tables 29-34 refer to the cross section on the Z-mass peak (66 GeV < m_ll < 116 GeV), which is double-differential in pT and y (each table corresponds to a different y bin). These tables form the ATLAS_Z0J_8TEV_PT-Y data set. Tables 35-40 refer to the cross section in the various mass bins, including the Z-mass peak bin (Table 39), which is single-differential in pT (each table corresponds to a different invariant mass bin). These tables, except Table 39, form the ATLAS_Z0J_8TEV_PT-M data set. We could have put Table 39 in this data set and forgotten about the other data set (no double counting), but, as I said, the more differential we are the better. Is this clearer?

@achiefa
Contributor Author

achiefa commented Nov 1, 2024

Ok, now I understand. Thank you so much.

@achiefa
Contributor Author

achiefa commented Nov 4, 2024

Could you check whether the grid agrees with the values here or with the values in hepdata?

The FK table for the last range in the invariant mass is ATLASZPT8TEVMDIST-ATLASZPT8TEV-MLLBIN6_ptZ. Is there a way to check that using the pineappl CLI, @scarlehoff?

@scarlehoff
Member

Yes, please, rebase on top of master so that you can use the nice bot that @Radonirinaunimi added to regenerate the data.

And ensure that you are using the yaml prettifier so that things like 6.000000000001 are rounded

@achiefa
Contributor Author

achiefa commented Nov 5, 2024

Yes, please, rebase on top of master so that you can use the nice bot that @Radonirinaunimi added to regenerate the data.

I did, but I think there's a problem with the EICC rawdata (see the bot report).

And ensure that you are using the yaml prettifier so that things like 6.000000000001 are rounded

How can I do that?

@scarlehoff
Member

scarlehoff commented Nov 5, 2024

If the EIC data is breaking the tests (and it was not broken before), it must be something that has changed in this branch.

The normal tests are also broken.

RE the prettifier, look e.g. at the filter.py of the jet data; basically, it means adding a function for the yaml parser.
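The rounding idea behind such a prettifier can be sketched as follows. This is a stdlib-only illustration, not the repo's actual implementation (which, as described above, hooks a function into the yaml parser as in the jet-data filter.py):

```python
# Illustrative sketch of the prettifier's rounding step: trim floating-point
# noise such as 6.000000000001 by rounding to a number of significant digits.
def prettify_float(x: float, sig: int = 10) -> float:
    """Round a float to `sig` significant digits to drop FP noise."""
    return float(f"{x:.{sig}g}")

print(prettify_float(6.000000000001))  # -> 6.0
```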

@achiefa
Contributor Author

achiefa commented Nov 5, 2024

I fixed the test for the commondata, but test_overfit_chi2 keeps failing. I don't know what is causing that, though.

Member

@scarlehoff scarlehoff left a comment

Try to run the test locally to find out what broke the test.

Btw, could you also remove the jupyter notebook for the filter?

@achiefa
Contributor Author

achiefa commented Nov 8, 2024

It seems to be test_overfit_chi2 and in particular:

assert False
E        +  where False = <built-in method all of numpy.ndarray object at 0x16c07cdb0>()
E        +    where <built-in method all of numpy.ndarray object at 0x16c07cdb0> = array([0.00040202, 0.00034751, 0.00020966]) < 0.0001.all
E        +      where array([0.00040202, 0.00034751, 0.00020966]) = abs((array([1.86702616, 1.97033108, 1.84250167]) - [1.867428183555603, 1.9699835777282715, 1.842711329460144]))

The order of magnitude for the tolerance is fine though.

@achiefa
Contributor Author

achiefa commented Nov 8, 2024

Ok, I figured out the problem. It is due to the rounding of the digits in the HepData table:

I think that is because the central data points in HepData are rounded to 3 decimal places, while the legacy version kept digits up to the fifth decimal place (or something similar).

Indeed, I tried using the legacy variant for the central data, and the tests succeeded. The legacy data points can be gathered from the source files. What do you think I should do? @scarlehoff

@scarlehoff
Member

For the chi2, that's good.

For the tests, just regenerate the regression test with the new data if that's the only difference.

Btw, could you remove from the rawdata all tables that are not being used by the filter (if any)?

@achiefa achiefa marked this pull request as ready for review November 8, 2024 18:08
@achiefa
Contributor Author

achiefa commented Nov 11, 2024

Hi @scarlehoff, the PR is ready to be reviewed now.

Note that all tests passed in the previous commit, but now they fail because of a Gateway error:

Error: : Failed to fetch remote theories index https://nnpdf.web.cern.ch/nnpdf/tables/theorydata.json: 504 Server Error: Gateway Time-out for url: https://nnpdf.web.cern.ch/nnpdf/tables/theorydata.json
Error: : Failed to fetch remote theories index https://nnpdf.web.cern.ch/nnpdf/tables_box/theorydata.json: 504 Server Error: Gateway Time-out for url: https://nnpdf.web.cern.ch/nnpdf/tables_box/theorydata.json
Error: : Resource not in the remote repository: Theory 41100010 not available.
Error: : Failed processing key theoryid.
Error: : Bad configuration encountered:
Could not find theory 41100010. Folder '/usr/share/miniconda/envs/test/share/NNPDF/theories/theory_41100010' not found
Instead of '41100010', did you mean one of the following?
 - 704
 - 162
 - 398
 - 399
 - 40000000
 - 708
 - 712

I suspect that this is not due to me because, as I said, all tests passed last week, and the last commit just corrected a nan value in one of the uncertainty variants.

@scarlehoff
Member

When there is only one test failing, it might be due to an internet problem. Just resubmit the test (I just did).

@achiefa
Contributor Author

achiefa commented Nov 11, 2024

Ok, all tests passed!

Member

@scarlehoff scarlehoff left a comment

Thanks for this. Just a final comment: please remove all the files from hepdata that are not explicitly used in the filter, otherwise the size of the repository will grow a lot without need.

@scarlehoff scarlehoff added the Done PRs that are done but waiting on something else to merge/approve label Nov 12, 2024
@achiefa
Contributor Author

achiefa commented Nov 12, 2024

Ok, green light from me.

Member

@scarlehoff scarlehoff left a comment

Great! Thanks

Member

@scarlehoff scarlehoff left a comment

Sorry, one last minor comment before merging: could you remove the kinematics_legacy files, since those are not needed anymore?

Thanks!

@achiefa
Contributor Author

achiefa commented Nov 13, 2024

Yes, I can. Do you also want me to remove the legacy data?

@scarlehoff
Member

Yes, please, since it is not being loaded anymore.

@RoyStegeman
Member

Should these commits be squashed into a single one? In particular, I'm looking at f9bcdc9, and it would be nice not to have it in the history on master.

@achiefa
Contributor Author

achiefa commented Nov 13, 2024

Good to go!

First commit

Kin and central data

Dataset implemented

Correct uncertainty definitions + pre-commit

Add mid value of the bins

Change name kinematics

Minor adjustments

Remove Table 39 from PT-M distribution

Remove sqrts from kinematic_coverage

Include m_ll in process options

Add DY_PTRAP process

Correct nnpdf31 process

Add docstring, correct process_option for PT-Y, use m_ll2

Restoring EICC files

Add yaml prettifier

Remove

Remove jupyter nb

Add units to PT-M dist

Add tolerance to test against legacy

Remove unused tables

Regenerate fits for tests in validphys

Remove  from config card for test

Add variant with Monte Carlo uncertainties

Correct naming error

Remove last unused files

Correct nan value for mc uncertainties

Add eta to process

Add module docstring

Remove unused files

Remove kinematics legacy

Remove legacy data

Remove tests against legacy data
@scarlehoff
Member

I've recommitted the squash to remove the merge commits; let's wait for the tests, then merge. Thanks for this!

@scarlehoff scarlehoff merged commit 6297f4e into master Nov 13, 2024
7 checks passed
@scarlehoff scarlehoff deleted the new_ATLAS_Z0J_8TEV branch November 13, 2024 15:33
Labels: data toolchain, Done