BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210916T132446Z
LOCATION:Ernesto Bertarelli
DTSTART;TZID=Europe/Stockholm:20210705T143000
DTEND;TZID=Europe/Stockholm:20210705T150000
UID:submissions.pasc-conference.org_PASC21_sess106_pap123@linklings.com
SUMMARY:Memory Reduction Using a Ring Abstraction over GPU RDMA for Distri
 buted Quantum Monte Carlo Solver
DESCRIPTION:Paper\n\nMemory Reduction Using a Ring Abstraction over GPU RD
 MA for Distributed Quantum Monte Carlo Solver\n\nWei, D'Azevedo, Huck, Cha
 tterjee, Hernandez...\n\nScientific applications that run on leadership co
 mputing facilities often face the challenge of being unable to fit leading
  science cases onto accelerator devices due to memory constraints (memory-
 bound applications). In this work, the authors studied one such US Departm
 ent of Energy mission-critical condensed matter physics application, Dynam
 ical Cluster Approximation (DCA++), and this paper discusses how device me
 mory-bound challenges were successfully reduced by proposing an effective 
 "all-to-all" communication method: a ring communication a
 lgorithm. This implementation takes advantage of acceleration on GPUs and 
 remote direct memory access for fast data exchange between GPUs. Additiona
 lly, the ring algorithm was optimized with sub-ring communicators and mult
 i-threaded support to further reduce communication overhead and expose mor
 e concurrency, respectively. The computation and communication were also p
 rofiled by using the Autonomic Performance Environment for Exascale (APEX)
  profiling tool, and this paper discusses the performance trade-off for th
 e ring algorithm implementation. The memory analysis on the ring algorithm
 shows that the allocation size for the authors' most memory-intensi
 ve data structure per GPU is now reduced to 1/p of the original si
 ze, where p is the number of GPUs in the ring communicator. The co
 mmunication analysis suggests that the distributed Quantum Monte Carlo exe
 cution time grows linearly as sub-ring size increases, and the cost of mes
 sages passing through the network interface connector could be a limiting 
 factor.\n\nDomain: Chemistry and Materials, Physics
END:VEVENT
END:VCALENDAR
