Single_crystal with order=1 produce different data when split used #1725

mads-bertelsen · 2024-10-08T10:15:49Z

The Single_crystal component has a computationally heavy step when calculating the cross section for scattering and finding the possible reflections for a particular velocity vector. If the next ray has the same velocity vector, these results can be reused. By forcing the ray to only scatter once and using split, this optimization can be triggered reliably. Though I just noticed that using split in this situation actually change the output significantly. The results below are 5E8 without split and 5E6 with split 100 on our templateLaue example.

Used latest conda release and Single_crystal.comp from main branch.

Peak intensities are affected.

Artifacts appear around top and bottom of 4 PI detector.

Will work on finding the cause of the problem and hopefully get the optimization functional again.

willend · 2024-10-08T10:41:44Z

@mads-bertelsen thanks for noticing and reporting! For the sake of reproducibility, could you include your instrument files and settings used to generate the outputs?

( Also, once you get to dig deeper in the issue, perhaps think of ways to enrich e.g. Unittest_SPLIT_sample? - That instrument currently seems to behave OK in a similar setting, but the explanation of that is probably hidden in the fine print... )

mads-bertelsen · 2024-10-08T11:09:33Z

The difference in total scattered intensity is only 0.4%, so the unittest doesn't see it as it's a PSD_monitor_4PI. Can add an extra monitor or two that look at some subset of the data.

/*******************************************************************************
*
* Instrument: templateLaue
*
* %Identification
* Written by: K. Nielsen
* Date: June 2nd, 2010
* Origin: ILL
* Modified by: EF, 
* %INSTRUMENT_SITE: Templates
*
* A simple Laue diffractometer
*
* %Description
* A single crystal sample is illuminated with a white cold beam.
* Based on a Laue tutorial written by K. Nielsen, Feb 7, 2000.
*
* %Example: templateLaue reflections=YBaCuO.lau Detector: det_I=1.14351e+07
* %Example: templateLaue reflections=leucine.lau Detector: det_I=4.63548e+08
*
* %Parameters
* reflections: [string] Single crystal reflection list
*
* %End
*******************************************************************************/

/* Change name of instrument and input parameters with default values */
DEFINE INSTRUMENT templateLaue(string reflections="leucine.lau")

TRACE

COMPONENT Origin = Progress_bar()
  AT (0,0,0) ABSOLUTE

COMPONENT source = Source_simple(
  radius=0.02, focus_xw=0.01, focus_yh=0.01, 
  lambda0=7, dlambda=5, flux=1e12)
AT (0,0,0) ABSOLUTE

COMPONENT slit = Slit(
  xwidth=0.01, yheight=0.01)
AT (0,0,5) RELATIVE source

SPLIT 100 COMPONENT sample = Single_crystal(
          xwidth=0.01, yheight=0.01, zdepth=0.01,
          delta_d_d=1e-4, mosaic = 5, sigma_inc=0,
          reflections=reflections, order=1)
AT (0,0,0.10) RELATIVE slit
EXTEND %{
  if (!SCATTERED) ABSORB; /* perfect beam stop */
%}

COMPONENT det= PSD_monitor_4PI(radius=1, nx=360,ny=180,filename="psd")
AT (0,0,0) RELATIVE sample

END

mads-bertelsen · 2024-10-08T11:38:58Z

The optimization assumes that the data written to the tau_list structure is untouched between the first ray calculates possible reflections and the next ray needs those same results. This is however not true as the first ray needs to calculate cross sections as it leaves the crystal, and that is the data that's left in the tau_list structure, which is then reused by the next ray (and this data doesn't apply as its for a different wave vector).

One possible fix is to allocate the structure that saves this data twice, one reserved for the first scattering, then a different for subsequent scattering. Then the pointer is switched depending on the event_counter. In that way the optimization actually extends to any order and even when disabling the order parameter (though of course the amount of time saved diminishes with higher orders.) I have tested this solution and it fixes the issue. Of course it doubles memory used for the buffer.

This means that the desired speed up from this optimization was never actually achieved, as each ray from the split still must calculate its cross section when leaving the crystal, so this optimization is at best a factor of 2 in performance, and not a factor of SPLIT count. We could make a mode that disregards secondary extinction to avoid that each ray calls hkl_search twice.

willend · 2024-10-08T11:45:37Z

@mads-bertelsen good point, this makes sense.

I think the better solution would in fact be that the initial tau_list structure became a component-induced USERVAR.

This would of course result in even more massive memory usage in GPU settings, but potential allow the optimisation to function also there. On CPU only a single thread (pr. mpi-process) is running at a time - and keeping it to a USERVAR would underline that in fact such a list is a neutron-dependent quantity.

mads-bertelsen · 2024-10-08T11:54:38Z

As far as I understand it being in DECLARE or a component USERVAR makes no difference on the CPU side? We can certainly go for that solution. On GPU I guess there is already a tau_list buffer per thread, so it couldn't do more than double the memory consumption.

Just a word on performance:
5E8 without split: 8 min
5E6 split 100: 4 min
5E6 split 100, no secondary extinction: 1.5 min

Comparison here between no split and skipping secondary extinction. This may be worth it in some cases if the user is aware of the change in background shape.

willend · 2024-10-08T12:05:15Z

Indeed in a CPU-only setting / memory-wise there is no difference between a DECLARE and USERVARS variable. But using USERVARS we outline the importance of keeping neutron-derived / calculated quantities to a thread-safe memory structure, in line with e.g. GPU usage.

mads-bertelsen · 2024-10-08T12:28:52Z

I am getting an error when making this a uservar, maybe because a define is used to set the size of the array. Starting a branch with the code.

willend · 2024-10-08T12:57:45Z

Not a fully thought-through solution, but I guess something like this could potentially work - also on GPU - with relevant initialisation of numrefs and a malloc() in TRACE

USERVARS %{
  int numrefs;
  struct tau_data *MyTau_list;
  #pragma acc shape(MyTau_List[0:numrefs]) init_needed(numrefs)
%}

(One added complexity is of course that the component number is added to the USERVAR internally in the neutron structure, see e.g. Monochromator_curved)

mads-bertelsen · 2024-10-08T13:38:20Z

Made a pull request where we can collaborate on the fix. The Unittest is now sensitive through such issues, mainly through the transmission channel. #1726

willend added the mcstas-3.5.1 label Oct 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Single_crystal with order=1 produce different data when split used #1725

Single_crystal with order=1 produce different data when split used #1725

mads-bertelsen commented Oct 8, 2024

willend commented Oct 8, 2024

mads-bertelsen commented Oct 8, 2024

mads-bertelsen commented Oct 8, 2024

willend commented Oct 8, 2024

mads-bertelsen commented Oct 8, 2024

willend commented Oct 8, 2024

mads-bertelsen commented Oct 8, 2024

willend commented Oct 8, 2024 •

edited

Loading

mads-bertelsen commented Oct 8, 2024

Single_crystal with order=1 produce different data when split used #1725

Single_crystal with order=1 produce different data when split used #1725

Comments

mads-bertelsen commented Oct 8, 2024

willend commented Oct 8, 2024

mads-bertelsen commented Oct 8, 2024

mads-bertelsen commented Oct 8, 2024

willend commented Oct 8, 2024

mads-bertelsen commented Oct 8, 2024

willend commented Oct 8, 2024

mads-bertelsen commented Oct 8, 2024

willend commented Oct 8, 2024 • edited Loading

mads-bertelsen commented Oct 8, 2024

willend commented Oct 8, 2024 •

edited

Loading