
Understand performance difference in cudacpp between SA and madevent #461

@valassi

Description

This is a followup to PR #454, where I am doing some first tests (with logs) of madevent+cudacpp MEs, comparing physics results and performance against standalone madevent and standalone cudacpp.

I see that the cudacpp ME calculation is faster in standalone (bridge!) mode than in madevent. In principle I thought the standalone bridge mode was exactly equivalent?

See for instance
https://github.com/madgraph5/madgraph4gpu/blob/d1fd8b201803b366579f925017f09a85001ce9f8/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt

DATE: 2022-05-20_15:00:32

Working directory: /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg

*** EXECUTE MADEVENT (create results.dat) ***
 [XSECTION] Cross section = 80.03
 [COUNTERS] PROGRAM TOTAL          :    4.2070s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2606s
 [COUNTERS] Fortran MEs      ( 1 ) :    3.9463s for       96 events => throughput is 2.43E+01 events/s

*** EXECUTE CMADEVENT_CUDACPP (create results.dat) ***
 [XSECTION] Cross section = 80.03
 [MERATIOS] ME ratio CudaCpp/Fortran: MIN = 0.99997121 = 1 - 2.9e-05
 [MERATIOS] ME ratio CudaCpp/Fortran: MAX = 1.00014921 = 1 + 0.00015
 [MERATIOS] ME ratio CudaCpp/Fortran: NENTRIES =     96
 [MERATIOS] ME ratio CudaCpp/Fortran - 1: AVG = 1.3e-05 +- 2.9e-06
 [COUNTERS] PROGRAM TOTAL          :    4.2596s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2727s
 [COUNTERS] Fortran MEs      ( 1 ) :    3.5988s for       96 events => throughput is 2.67E+01 events/s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.3881s for       96 events => throughput is 2.47E+02 events/s

*** EXECUTE CHECK -p 2 32 1 --bridge ***
Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 10.2.0] [inlineHel=0] [hardcodePARAM=0]
Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+BRDHST/512y+CXVBRK
EvtsPerSec[MECalcOnly] (3a) = ( 2.888841e+02                 )  sec^-1

*** EXECUTE CHECK -p 2 32 1 ***
Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 10.2.0] [inlineHel=0] [hardcodePARAM=0]
Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
EvtsPerSec[MECalcOnly] (3a) = ( 2.892185e+02                 )  sec^-1

*** EXECUTE GMADEVENT_CUDACPP (create results.dat) ***
 [XSECTION] Cross section = 404.9
 [MERATIOS] ME ratio CudaCpp/Fortran: MIN = 0.99997121 = 1 - 2.9e-05
 [MERATIOS] ME ratio CudaCpp/Fortran: MAX = 1.00007834 = 1 + 7.8e-05
 [MERATIOS] ME ratio CudaCpp/Fortran: NENTRIES =     64
 [MERATIOS] ME ratio CudaCpp/Fortran - 1: AVG = 1.1e-05 +- 2.9e-06
 [COUNTERS] PROGRAM TOTAL          :    3.6001s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.5031s
 [COUNTERS] Fortran MEs      ( 1 ) :    2.4043s for       64 events => throughput is 2.66E+01 events/s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.6928s for       64 events => throughput is 9.24E+01 events/s

*** EXECUTE GCHECK -p 2 32 1 --bridge ***
Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 11.6.124 (gcc 10.2.0)] [inlineHel=0] [hardcodePARAM=0]
Workflow summary            = CUD:DBL+THX:CURHST+RMBHST+BRDDEV/none+NAVBRK
EvtsPerSec[MECalcOnly] (3a) = ( 2.032552e+02                 )  sec^-1

*** EXECUTE GCHECK -p 2 32 1 ***
Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 11.6.124 (gcc 10.2.0)] [inlineHel=0] [hardcodePARAM=0]
Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
EvtsPerSec[MECalcOnly] (3a) = ( 2.367983e+02                 )  sec^-1

TEST COMPLETED

This is ggttggg. In madevent+cuda I get 9.2E1 events per second, while in standalone cuda (bridge mode) I get 2.0E2 events per second. Is the factor of two just helicity filtering?
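To make the "factor two" explicit, here is a small sketch that recomputes the ME-only throughputs and their ratio from the numbers quoted in the log excerpts above. Everything is copied from the logs; the `throughput` helper is hypothetical and nothing is measured here.

```python
# Recompute ME-only throughputs from the logged figures above (ggttggg, CUDA).
# All numbers are taken verbatim from the log excerpts; this only does the arithmetic.

def throughput(events: int, seconds: float) -> float:
    """Events per second, as reported on a [COUNTERS] line."""
    return events / seconds

# GMADEVENT_CUDACPP: "CudaCpp MEs ( 2 ) : 0.6928s for 64 events"
madevent_cuda = throughput(64, 0.6928)   # roughly 92 events/s, matching 9.24E+01

# GCHECK -p 2 32 1 --bridge: quoted EvtsPerSec[MECalcOnly]
standalone_cuda_bridge = 2.032552e+02    # roughly 203 events/s

ratio = standalone_cuda_bridge / madevent_cuda
print(f"madevent+cuda: {madevent_cuda:.1f} ev/s, "
      f"standalone bridge: {standalone_cuda_bridge:.1f} ev/s, "
      f"ratio: {ratio:.2f}")
```

The ratio comes out close to 2.2, which is what motivates asking whether helicity filtering alone accounts for it.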
