
Implementation of ATLAS_Z0J_8TEV PT-Y and PT-M in the new format #2169

Merged
merged 1 commit into master from new_ATLAS_Z0J_8TEV on Nov 13, 2024

Conversation

achiefa
Contributor

@achiefa achiefa commented Oct 10, 2024

Report showing $(x,Q^2)$ map and theory-data comparisons for legacy implementation: legacy report (for reference)
The same report, but for the new implementation: new report

@achiefa achiefa self-assigned this Oct 10, 2024
@achiefa achiefa requested a review from enocera October 10, 2024 14:45
@achiefa achiefa marked this pull request as draft October 10, 2024 14:45
@Radonirinaunimi
Member

Btw @achiefa, how are you (re-)generating the YAML files? Do you have something like the filter.py? If so, could you also upload that file. We want to make sure that we'll be able to re-generate these data in the future.

@achiefa
Contributor Author

achiefa commented Oct 11, 2024

@Radonirinaunimi I'm not regenerating anything at the moment. I simply changed the name of the old kinematics to legacy. I still need to write the filter.py.

@@ -102,7 +107,7 @@ implemented_observables:
       description: Variable k3
       label: k3
       units: ''
-      file: kinematics_PT-M.yaml
+      file: kinematics_legacy_PT-M.yaml
Member

Just for reference, we don't need to keep the legacy kinematics; you can just remove these files.
The information should be the same as in the new ones, and the new ones have all the info we need (mainly the names of the variables).

Contributor Author

Thank you. However, I'd say that the information is not the same. The "old" kinematics used the mid value of the bin (squared), while the new one that I'm going to implement will contain the min and max values of the bin (non-squared). If you say that it doesn't make any difference and we can get rid of the old kinematics, I will remove them and implement the new kinematics as described above.

Member

Yes, you can remove the old ones. You can fill in min and max, and the midpoint will be filled in automatically (it should be the same as before).

The reason only the midpoint was used before is that validphys could not deal with the extra information; having the min-max is much better because then we can use these datasets to autogenerate runcards (for instance, the NNLOJET runcards need this information).
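As a sanity check on the point above, the midpoint computed from the new min/max edges should reproduce the legacy mid-of-bin values. A minimal sketch (the bin edges and legacy midpoints below are illustrative, not values from this dataset):

```python
# Illustrative check, not the actual filter.py:
# midpoints derived from (min, max) bin edges should match the legacy mids.
pt_bins = [(0.0, 2.0), (2.0, 4.0), (4.0, 8.0)]  # hypothetical (min, max) in GeV
legacy_mids = [1.0, 3.0, 6.0]                   # hypothetical legacy mid values

mids = [0.5 * (lo + hi) for lo, hi in pt_bins]
assert mids == legacy_mids
```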

Contributor Author

Ok, thank you so much. This is indeed what I was trying to understand. I'll go on.

@achiefa
Copy link
Contributor Author

achiefa commented Oct 11, 2024

Apparently, the tables in hepdata do not provide a full breakdown of the systematic uncertainties. Thus, I don't know how to reproduce the breakdown of systematic uncertainties reported in the legacy file. Should I only use the diagonal entries in the rawdata tables?

@Radonirinaunimi
Member

Apparently, the tables in hepdata do not provide a full breakdown of the systematic uncertainties. Thus, I don't know how to reproduce the breakdown of systematic uncertainties reported in the legacy file. Should I only use the diagonal entries in the rawdata tables?

I fear this will be a recurring theme as more datasets are re-implemented, so decisions will have to be taken with @enocera on a dataset-by-dataset basis (?).

I would point out, however, that usually (as is the case for this particular dataset) HepData includes the raw data in which the breakdown of the systematics is given, if one downloads the full record with the resource files. (So even the distinction between HepData and non-HepData is ambiguous.)

@achiefa
Contributor Author

achiefa commented Oct 11, 2024

Yes, I see the break-down in the resource files. But it's not really clear to me how to use this information.

@achiefa
Contributor Author

achiefa commented Oct 11, 2024

Ok, I think I sort of understood how to use the resource files to retrieve the old break-down of the uncertainties. So, what do you want me to do?

@enocera
Contributor

enocera commented Oct 14, 2024

@achiefa My suggestion is to implement the full breakdown of uncertainties whenever it can be retrieved from HepData. This is the case here, even if it amounts to looking into the "Resources".

@achiefa
Contributor Author

achiefa commented Oct 16, 2024

I've implemented the dataset with the full break-down of the systematic uncertainties. However, there are a few things that are worth mentioning:

  • The new implementation of PT-Y matches the legacy one. You can check that using the filter file, where I've implemented an automatic new-vs-legacy comparison for the central data and uncertainties.
  • For PT-M, the number of points provided in HepData is higher than in the legacy implementation. Thus, I could not reproduce the legacy version from HepData.
  • Within the new implementation of PT-M, I ran into a binning problem for table 39 in HepData. Specifically, there is a mismatch between the mid values of the pT bins in table 39 and those reported in ZcombPt_born_m66116_y0024/tab.dat, which is the file containing the breakdown of the sys uncertainties for the same table. It might be that the breakdown was worked out with a different bin size. I attach the log of the filter script, which gives more information (log.log). As you can see, the differences are really small, and only 3 of the bins are affected.

One last thing - is there an $(x,Q^2)$ map for PT-M?
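A bin-midpoint mismatch of the kind described above can be flagged with a small comparison like this (the midpoint values and the tolerance below are illustrative, not taken from the actual tables or from the filter script):

```python
import numpy as np

# Illustrative check, not the actual filter.py: compare pT bin midpoints
# from the HepData table against those in the rawdata breakdown file.
hepdata_mids = np.array([12.5, 17.5, 24.0, 33.0])   # hypothetical values, GeV
rawdata_mids = np.array([12.5, 17.5, 24.05, 33.1])  # hypothetical values, GeV

# Flag bins whose midpoints disagree beyond a relative tolerance
mismatch = ~np.isclose(hepdata_mids, rawdata_mids, rtol=1e-3)
for i in np.where(mismatch)[0]:
    print(f"bin {i}: table mid = {hepdata_mids[i]}, rawdata mid = {rawdata_mids[i]}")
```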

Member

@Radonirinaunimi Radonirinaunimi left a comment

Thanks for this @achiefa! I will need to look in detail into the mismatch in PT-M, however, here are some quick comments.

Also, whenever possible (in case the datapoints and everything correspond to the old implementation, as is the case for PT-Y), could you make sure that the covmats are the same?

from validphys.api import API
import numpy as np

# new_implementation / old_implementation are placeholders for the dataset names
inp1 = {"dataset_input": {"dataset": f"{new_implementation}"}, "theoryid": 40_000_000, "use_cuts": "internal"}
inp2 = {"dataset_input": {"dataset": f"{old_implementation}", "variant": "legacy"}, "theoryid": 40_000_000, "use_cuts": "internal"}
covmat1 = API.covmat_from_systematics(**inp1)
covmat2 = API.covmat_from_systematics(**inp2)

np.isclose(covmat1, covmat2).all()

This not only makes sure that the implementations are consistent but, most importantly, that the datasets can be loaded properly.

@achiefa
Contributor Author

achiefa commented Oct 17, 2024

For PT-Y the two covmats are the same.
Screenshot 2024-10-17 at 12 30 28

@Radonirinaunimi
Member

For PT-M, the number of points provided in HepData is higher than the legacy implementation. Thus, I could not reproduce the legacy version from HepData.

If the number of datapoints is different, then the re-implementation in the new format becomes a new dataset. Please refer to this comment for more context.

Within the new implementation of PT-M, I ran into a binning problem for table 39 in HepData. Specifically, there is a mismatch between the mid values of the pT bins in table 39 and those reported in ZcombPt_born_m66116_y0024/tab.dat, which is the file containing the breakdown of the sys uncertainties for the same table. It might be that the breakdown was worked out with a different bin size. I attach the log of the filter script, which gives more information (log.log). As you can see, the differences are really small, and only 3 of the bins are affected.

For this, let's simply use the rawdata, even for the bins/central values.

@scarlehoff
Member

For PT-Y the two covmats are the same.

Nice! Could you also check that the t0 covmat is the same, to be completely sure?

For PT-M, the number of points provided in HepData is higher than in the legacy implementation. Thus, I could not reproduce the legacy version from HepData.
If the number of datapoints is different, then the re-implementation in the new format becomes a new dataset. Please refer to #2170 (comment) for more context.

This is tricky indeed. Which points have changed? If they are points that we don't want to include, maybe it makes sense to remove them directly in the filter.py file (for instance, if they are points for pT < 30 GeV we can safely discard them, and then later on, if we need them, add an atlas_zoj_lowpt dataset)

@achiefa
Contributor Author

achiefa commented Oct 18, 2024

Nice! Could you also check that the t0 covmat is the same, to be completely sure?

I tried to use t0_covmat_from_systematics from the API, but I got this AttributeError

AttributeError: 'NoneType' object has no attribute 'is_polarized'

@scarlehoff
Member

The t0 also needs a pdf=NNPDF40_nnlo_as_01180 or t0set= or something like that
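A sketch of what the extra keys might look like; the exact key names (use_t0, t0pdfset) are assumptions based on typical validphys runcards, not confirmed in this thread, so they should be checked against the validphys documentation:

```python
# Hypothetical input for the t0 covmat; key names (use_t0, t0pdfset) are
# assumptions and may differ in the actual validphys API.
inp = {
    "dataset_input": {"dataset": "ATLAS_Z0J_8TEV_PT-Y"},
    "theoryid": 40_000_000,
    "use_cuts": "internal",
    "use_t0": True,
    "t0pdfset": "NNPDF40_nnlo_as_01180",
}
# from validphys.api import API
# t0_covmat = API.t0_covmat_from_systematics(**inp)
```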

@achiefa
Contributor Author

achiefa commented Oct 18, 2024

Ok, the t0 matrices are the same as well.

This is tricky indeed :___ Which points have changed? If they are points that we don't want to include maybe it makes sense to remove them directly in the filter.py file (for instance, if they are points for pT < 30 GeV we can safely discard them and then later on if we need it add a atlas_zoj_lowpt dataset)

That I didn't check. But I have a feeling that the new implementation adds a range in the mass of the lepton pair while using the same bins in pT. I'll check that.

@achiefa
Contributor Author

achiefa commented Nov 1, 2024

I have investigated the discrepancy in the PT-M distribution between the legacy and new versions. To summarise, the new implementation has six different kinematic regions for the lepton pair mass (in GeV), each of them with bins in rapidity:

 12.0  <  m_ll  <  20.0 
 20.0  <  m_ll  <  30.0
 30.0  <  m_ll  <  46.0
 46.0  <  m_ll  <  66.0
 66.0  <  m_ll  <  116.0
 116.0 <  m_ll  <  150.0

The legacy version doesn't have the second-to-last kin region in $M_{ll}$. On the other hand, the bins in rapidity and the respective central data agree between the two versions up to the fourth kin region ($46.0 < M_{ll} < 66.0$). The last kin region also matches in rapidity and central data. However, a different binning was used for this last kin region: in the legacy version $\textrm{mid}[M_{ll}] = 138$ GeV, whereas in the new implementation $\textrm{mid}[M_{ll}] = 133$ GeV. Note that for the legacy version I only have the mid value of the bins, but not the extremes.

I don't know if this tiny difference in the last kin region is a problem for the grids. If not, we could drop the second-to-last bin and use the same grids as in the legacy implementation.

@scarlehoff
Member

@enocera do you know why we are avoiding that bin?

RE the binning, @achiefa: if the central data is the same, it might be a mistake in the kinematic value here. Could you check whether the grid agrees with the values here or with the values in HepData?

@enocera
Contributor

enocera commented Nov 1, 2024

Dear @achiefa (cc @scarlehoff), I am a little confused about the description of this data set. There must indeed be 6 bins in the invariant mass of the final state, as @achiefa correctly mentions above. All of these are then single-differential in pT (and not in rapidity, as @achiefa seems to suggest above). The bin corresponding to the Z-mass peak (66.0 < m_ll < 116.0) also comes double differential in pT and in y. So what we did in the legacy implementation is the following: we took all the invariant mass bins, except the one on the Z-mass peak, that are single-differential in pT and we put them into the ATLAS_Z0J_8TEV_PT-M data set. We then took the double differential (pT-y) Z-mass peak bin and put it in the ATLAS_Z0J_8TEV_PT-Y data set. This is because the more differential we are the better for PDF determination. If we had put the Z-mass peak single-differential bin in ATLAS_Z0J_8TEV_PT-M we would have had double counting with ATLAS_Z0J_8TEV_PT-Y. I hope that this clarifies the situation.

@achiefa
Contributor Author

achiefa commented Nov 1, 2024

Hi @enocera, I apologise. I meant to say pT and not rapidity, as you pointed out.

Note that the invariant mass bins are in the kinematic region $0 < y_{ll} < 2.4$, including the one at the Z-mass peak. So I'm confused, because $0 < y_{ll} < 2.4$ covers all the bins in rapidity for the PT-Y distribution. How did you include the invariant mass bin at the Z-mass peak in the PT-Y distribution, then?

@enocera
Contributor

enocera commented Nov 1, 2024

@achiefa as you can see from Hepdata
https://www.hepdata.net/record/ins1408516
Tables 29-34 refer to the cross section on the Z-mass peak (66 GeV < m_ll < 116 GeV), which is double-differential in pT and y (each table corresponds to a different y bin). These tables form the ATLAS_Z0J_8TEV_PT-Y data set. Tables 35-40 refer to the cross section in the various mass bins, including the Z-mass peak bin (Table 39), which is single-differential in pT (each table corresponds to a different invariant mass bin). These tables, except Table 39, form the ATLAS_Z0J_8TEV_PT-M data set. We could have put Table 39 in this data set and forgotten about the other data set (no double counting), but, as I said, the more differential we are the better. Is this clearer?

@achiefa
Contributor Author

achiefa commented Nov 1, 2024

Ok, now I understand. Thank you so much.

@achiefa
Contributor Author

achiefa commented Nov 4, 2024

Could you check whether the grid agrees with the values here or with the values in hepdata?

The FK table for the last range in the invariant mass is ATLASZPT8TEVMDIST-ATLASZPT8TEV-MLLBIN6_ptZ. Is there a way to check that using the pineappl CLI, @scarlehoff?

@scarlehoff
Member

Yes, please, rebase on top of master so that you can use the nice bot that @Radonirinaunimi added to regenerate the data.

And ensure that you are using the yaml prettifier so that things like 6.000000000001 are rounded

@achiefa
Contributor Author

achiefa commented Nov 5, 2024

Yes, please, rebase on top of master so that you can use the nice bot that @Radonirinaunimi added to regenerate the data.

I did, but I think there's a problem with the EICC rawdata (see the bot report).

And ensure that you are using the yaml prettifier so that things like 6.000000000001 are rounded

How can I do that?

@scarlehoff
Member

scarlehoff commented Nov 5, 2024

If the EIC data is breaking the tests (and it was not broken before), it must be something that has changed in this branch.

The normal tests are also broken.

RE the prettifier, look e.g. at the filter.py of the jet data; basically, it means adding a function for the yaml parser.
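The rounding idea behind such a prettifier can be sketched as follows. This is a stdlib-only illustration, not the repo's actual implementation (which, as described above, hooks a function into the yaml parser as in the jet-data filter.py):

```python
# Illustrative sketch of the prettifier's rounding step: trim floating-point
# noise such as 6.000000000001 by rounding to a number of significant digits.
def prettify_float(x: float, sig: int = 10) -> float:
    """Round a float to `sig` significant digits to drop FP noise."""
    return float(f"{x:.{sig}g}")

print(prettify_float(6.000000000001))  # -> 6.0
```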

@achiefa
Contributor Author

achiefa commented Nov 5, 2024

I fixed the test for the commondata, but test_overfit_chi2 keeps failing. I don't know what is causing that, though.

Member

@scarlehoff scarlehoff left a comment

Try to run the test locally to find out what broke the test.

Btw, could you also remove the jupyter notebook for the filter?

@achiefa
Contributor Author

achiefa commented Nov 8, 2024

It seems to be test_overfit_chi2 and in particular:

assert False
E        +  where False = <built-in method all of numpy.ndarray object at 0x16c07cdb0>()
E        +    where <built-in method all of numpy.ndarray object at 0x16c07cdb0> = array([0.00040202, 0.00034751, 0.00020966]) < 0.0001.all
E        +      where array([0.00040202, 0.00034751, 0.00020966]) = abs((array([1.86702616, 1.97033108, 1.84250167]) - [1.867428183555603, 1.9699835777282715, 1.842711329460144]))

The order of magnitude for the tolerance is fine though.

@achiefa
Contributor Author

achiefa commented Nov 8, 2024

Ok, I figured out the problem. It is due to the rounding of the digits in the HepData table:

I think that is because the central data points in HepData are rounded to 3 decimal places, while the legacy version kept digits up to the fifth decimal place (or something similar).

Indeed, I tried using the legacy variant for the central data, and the tests succeeded. The legacy data points can be gathered from the source files. What do you think I should do? @scarlehoff

@scarlehoff
Member

For the chi2, that's good.

For the tests, just regenerate the regression test with the new data if that's the only difference.

Btw, could you remove from the rawdata all tables that are not being used by the filter (if any)?

@achiefa achiefa marked this pull request as ready for review November 8, 2024 18:08
@achiefa
Contributor Author

achiefa commented Nov 11, 2024

Hi @scarlehoff, the PR is ready to be reviewed now.

Note that all tests passed in the previous commit, but now they fail because of a Gateway error:

Error: : Failed to fetch remote theories index https://nnpdf.web.cern.ch/nnpdf/tables/theorydata.json: 504 Server Error: Gateway Time-out for url: https://nnpdf.web.cern.ch/nnpdf/tables/theorydata.json
Error: : Failed to fetch remote theories index https://nnpdf.web.cern.ch/nnpdf/tables_box/theorydata.json: 504 Server Error: Gateway Time-out for url: https://nnpdf.web.cern.ch/nnpdf/tables_box/theorydata.json
Error: : Resource not in the remote repository: Theory 41100010 not available.
Error: : Failed processing key theoryid.
Error: : Bad configuration encountered:
Could not find theory 41100010. Folder '/usr/share/miniconda/envs/test/share/NNPDF/theories/theory_41100010' not found
Instead of '41100010', did you mean one of the following?
 - 704
 - 162
 - 398
 - 399
 - 40000000
 - 708
 - 712

I suspect that this is not due to me because, as I said, all tests passed last week, and the last commit just corrected a nan value in one of the uncertainty variants.

@scarlehoff
Member

When there is only one test failing, it might be due to an internet problem. Just resubmit the test (I just did).

@achiefa
Contributor Author

achiefa commented Nov 11, 2024

Ok, all tests passed!

Member

@scarlehoff scarlehoff left a comment

Thanks for this. Just a final comment: please remove all the files from hepdata that are not explicitly used in the filter, otherwise the size of the repository will grow a lot without need.

@scarlehoff scarlehoff added the Done PRs that are done but waiting on something else to merge/approve label Nov 12, 2024
@achiefa
Contributor Author

achiefa commented Nov 12, 2024

Ok, green light from me.

Member

@scarlehoff scarlehoff left a comment

Great! Thanks

Member

@scarlehoff scarlehoff left a comment

Sorry, one last minor comment before merging: could you remove the kinematics_legacy files, since those are not needed anymore?

Thanks!

@achiefa
Contributor Author

achiefa commented Nov 13, 2024

Yes, I can. Do you also want me to remove the legacy data?

@scarlehoff
Member

Yes, please, since it is not being loaded anymore.

@RoyStegeman
Member

Should these commits be squashed into a single one? In particular, I'm looking at f9bcdc9, and it would be nice not to have it in the history on master.

@achiefa
Contributor Author

achiefa commented Nov 13, 2024

Good to go!

First commit

Kin and central data

Dataset implemented

Correct uncertainty definitions + pre-commit

Add mid value of the bins

Change name kinematics

Minor adjustments

Remove Table 39 from PT-M distribution

Remove sqrts from kinematic_coverage

Include m_ll in process options

Add DY_PTRAP process

Correct nnpdf31 process

Add docstring, correct process_option for PT-Y, use m_ll2

Restoring EICC files

Add yaml prettifier

Remove

Remove jupyter nb

Add units to PT-M dist

Add tolerance to test against legacy

Remove unused tables

Regenerate fits for tests in validphys

Remove  from config card for test

Add variant with Monte Carlo uncertainties

Correct naming error

Remove last unused files

Correct nan value for mc uncertainties

Add eta to process

Add module docstring

Remove unused files

Remove kinematics legacy

Remove legacy data

Remove tests against legacy data
@scarlehoff
Member

I've recommitted the squash to remove the merge commits; let's wait for the tests, then merge. Thanks for this!

@scarlehoff scarlehoff merged commit 6297f4e into master Nov 13, 2024
7 checks passed
@scarlehoff scarlehoff deleted the new_ATLAS_Z0J_8TEV branch November 13, 2024 15:33
Labels: data toolchain, Done