BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210916T132446Z
LOCATION:Ernesto Bertarelli
DTSTART;TZID=Europe/Stockholm:20210708T153000
DTEND;TZID=Europe/Stockholm:20210708T160000
UID:submissions.pasc-conference.org_PASC21_sess175_pap126@linklings.com
SUMMARY:Solving DWF Dirac Equation Using Multi-splitting Preconditioned Co
 njugate Gradient with Tensor Cores on NVIDIA GPUs
DESCRIPTION:Paper\n\nSolving DWF Dirac Equation Using Multi-splitting Prec
 onditioned Conjugate Gradient with Tensor Cores on NVIDIA GPUs\n\nTu, Clar
 k, Jung, Mawhinney\n\nWe show that using the multi-splitting algorithm as 
 a preconditioner for the domain wall Dirac linear operator, arising in lat
 tice QCD, effectively reduces the inter-node communication cost, at the ex
 pense of performing more on-node floating point and memory operations. Cor
 rectly including the boundary "snake" terms, the preconditioner is impl
 emented in the QUDA framework, where it is found that utilizing kernel
  fusion and the tensor cores on NVIDIA GPUs is necessary to achieve a suff
 iciently performant preconditioner. A reduced-dimension (reduced-$L_s$) st
 rategy is also proposed and tested for the preconditioner. We find the met
 hod achieves lower time to solution than regular CG at high node count des
 pite the additional local computational requirements from the preconditi
 oner. This method could be useful for supercomputers with more on-node fl
 ops and memory bandwidth than inter-node communication bandwidth.\n\nDoma
 in: Physics
END:VEVENT
END:VCALENDAR
