BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210916T132446Z
LOCATION:Ernesto Bertarelli
DTSTART;TZID=Europe/Stockholm:20210705T133000
DTEND;TZID=Europe/Stockholm:20210705T140000
UID:submissions.pasc-conference.org_PASC21_sess106_pap118@linklings.com
SUMMARY:In-Situ Assessment of Device-Side Compute Work for Dynamic Load Ba
 lancing in a GPU-Accelerated PIC Code
DESCRIPTION:Paper\n\nIn-Situ Assessment of Device-Side Compute Work for Dy
 namic Load Balancing in a GPU-Accelerated PIC Code\n\nRowan, Huebl, Gott, 
 Deslippe, Thévenet...\n\nMaintaining computational load balance is importa
 nt to the performant behavior of codes which operate under a distributed c
 omputing model. This is especially true for GPU architectures, which can s
 uffer from memory oversubscription if improperly load balanced. We present
  enhancements to traditional load balancing approaches and explicitly targ
 et GPU architectures, exploring the resulting performance. A key component
  of our enhancements is the introduction of several GPU-amenable strategie
 s for assessing compute work. These strategies are implemented and benchma
 rked to find the optimal data collection methodology for in-situ asse
 ssment of GPU compute work. For the fully kinetic particle-in-cell c
 ode WarpX, which supports MPI+CUDA parallelism, we investigate the perform
 ance of the improved dynamic load balancing via a strong scaling-based per
 formance model and show that, for a laser-ion acceleration test problem ru
 n with up to 6144 GPUs on Summit, the enhanced dynamic load balancing achi
 eves 62% to 74% (88% when running on 6 GPUs) of the theoretically predi
 cted maximum speedup; for the 96-GPU case, we find that dynamic load balan
 cing improves performance relative to baselines without load balancing (3.
 8x speedup) and with static load balancing (1.2x speedup). Our results pro
 vide important insights into dynamic load balancing and performance assess
 ment, and are particularly relevant in the context of distributed memory a
 pplications run on GPUs.\n\nDomain: Chemistry and Materials, Physics
END:VEVENT
END:VCALENDAR
