Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single_crystal with order=1 produce different data when split used #1725

Open
mads-bertelsen opened this issue Oct 8, 2024 · 9 comments
Open

Comments

@mads-bertelsen
Copy link
Contributor

The Single_crystal component has a computationally heavy step when calculating the cross section for scattering and finding the possible reflections for a particular velocity vector. If the next ray has the same velocity vector, these results can be reused. By forcing the ray to only scatter once and using split, this optimization can be triggered reliably. Though I just noticed that using split in this situation actually change the output significantly. The results below are 5E8 without split and 5E6 with split 100 on our templateLaue example.

Used latest conda release and Single_crystal.comp from main branch.

Peak intensities are affected.
center_line_sum

Artifacts appear around top and bottom of 4 PI detector.
single_crystal_laue_split

Will work on finding the cause of the problem and hopefully get the optimization functional again.

@willend
Copy link
Contributor

willend commented Oct 8, 2024

@mads-bertelsen thanks for noticing and reporting! For the sake of reproducibility, could you include your instrument files and settings used to generate the outputs?

( Also, once you get to dig deeper in the issue, perhaps think of ways to enrich e.g. Unittest_SPLIT_sample? - That instrument currently seems to behave OK in a similar setting, but the explanation of that is probably hidden in the fine print... )

@mads-bertelsen
Copy link
Contributor Author

The difference in total scattered intensity is only 0.4%, so the unittest doesn't see it as it's a PSD_monitor_4PI. Can add an extra monitor or two that look at some subset of the data.

/*******************************************************************************
*
* Instrument: templateLaue
*
* %Identification
* Written by: K. Nielsen
* Date: June 2nd, 2010
* Origin: ILL
* Modified by: EF, 
* %INSTRUMENT_SITE: Templates
*
* A simple Laue diffractometer
*
* %Description
* A single crystal sample is illuminated with a white cold beam.
* Based on a Laue tutorial written by K. Nielsen, Feb 7, 2000.
*
* %Example: templateLaue reflections=YBaCuO.lau Detector: det_I=1.14351e+07
* %Example: templateLaue reflections=leucine.lau Detector: det_I=4.63548e+08
*
* %Parameters
* reflections: [string] Single crystal reflection list
*
* %End
*******************************************************************************/

/* Change name of instrument and input parameters with default values */
DEFINE INSTRUMENT templateLaue(string reflections="leucine.lau")

TRACE

COMPONENT Origin = Progress_bar()
  AT (0,0,0) ABSOLUTE

COMPONENT source = Source_simple(
  radius=0.02, focus_xw=0.01, focus_yh=0.01, 
  lambda0=7, dlambda=5, flux=1e12)
AT (0,0,0) ABSOLUTE

COMPONENT slit = Slit(
  xwidth=0.01, yheight=0.01)
AT (0,0,5) RELATIVE source

SPLIT 100 COMPONENT sample = Single_crystal(
          xwidth=0.01, yheight=0.01, zdepth=0.01,
          delta_d_d=1e-4, mosaic = 5, sigma_inc=0,
          reflections=reflections, order=1)
AT (0,0,0.10) RELATIVE slit
EXTEND %{
  if (!SCATTERED) ABSORB; /* perfect beam stop */
%}

COMPONENT det= PSD_monitor_4PI(radius=1, nx=360,ny=180,filename="psd")
AT (0,0,0) RELATIVE sample

END

@mads-bertelsen
Copy link
Contributor Author

The optimization assumes that the data written to the tau_list structure is untouched between the first ray calculates possible reflections and the next ray needs those same results. This is however not true as the first ray needs to calculate cross sections as it leaves the crystal, and that is the data that's left in the tau_list structure, which is then reused by the next ray (and this data doesn't apply as its for a different wave vector).

One possible fix is to allocate the structure that saves this data twice, one reserved for the first scattering, then a different for subsequent scattering. Then the pointer is switched depending on the event_counter. In that way the optimization actually extends to any order and even when disabling the order parameter (though of course the amount of time saved diminishes with higher orders.) I have tested this solution and it fixes the issue. Of course it doubles memory used for the buffer.

This means that the desired speed up from this optimization was never actually achieved, as each ray from the split still must calculate its cross section when leaving the crystal, so this optimization is at best a factor of 2 in performance, and not a factor of SPLIT count. We could make a mode that disregards secondary extinction to avoid that each ray calls hkl_search twice.

@willend
Copy link
Contributor

willend commented Oct 8, 2024

@mads-bertelsen good point, this makes sense.

I think the better solution would in fact be that the initial tau_list structure became a component-induced USERVAR.

This would of course result in even more massive memory usage in GPU settings, but potential allow the optimisation to function also there. On CPU only a single thread (pr. mpi-process) is running at a time - and keeping it to a USERVAR would underline that in fact such a list is a neutron-dependent quantity.

@mads-bertelsen
Copy link
Contributor Author

As far as I understand it being in DECLARE or a component USERVAR makes no difference on the CPU side? We can certainly go for that solution. On GPU I guess there is already a tau_list buffer per thread, so it couldn't do more than double the memory consumption.

Just a word on performance:
5E8 without split: 8 min
5E6 split 100: 4 min
5E6 split 100, no secondary extinction: 1.5 min

Comparison here between no split and skipping secondary extinction. This may be worth it in some cases if the user is aware of the change in background shape.
Screenshot 2024-10-08 at 13 46 54

@willend
Copy link
Contributor

willend commented Oct 8, 2024

Indeed in a CPU-only setting / memory-wise there is no difference between a DECLARE and USERVARS variable. But using USERVARS we outline the importance of keeping neutron-derived / calculated quantities to a thread-safe memory structure, in line with e.g. GPU usage.

@mads-bertelsen
Copy link
Contributor Author

I am getting an error when making this a uservar, maybe because a define is used to set the size of the array. Starting a branch with the code.

@willend
Copy link
Contributor

willend commented Oct 8, 2024

Not a fully thought-through solution, but I guess something like this could potentially work - also on GPU - with relevant initialisation of numrefs and a malloc() in TRACE

USERVARS %{
  int numrefs;
  struct tau_data *MyTau_list;
  #pragma acc shape(MyTau_List[0:numrefs]) init_needed(numrefs)
%}

(One added complexity is of course that the component number is added to the USERVAR internally in the neutron structure, see e.g. Monochromator_curved)

@mads-bertelsen
Copy link
Contributor Author

Made a pull request where we can collaborate on the fix. The Unittest is now sensitive through such issues, mainly through the transmission channel. #1726

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants