BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210916T132451Z
LOCATION:Henry Dunant
DTSTART;TZID=Europe/Stockholm:20210707T140000
DTEND;TZID=Europe/Stockholm:20210707T143000
UID:submissions.pasc-conference.org_PASC21_sess152_msa299@linklings.com
SUMMARY:Programmable Infrastructure for Diverse, Scalable Learning at Exas
 cale
DESCRIPTION:Minisymposium\n\nProgrammable Infrastructure for Diverse, Scal
 able Learning at Exascale\n\nWozniak\n\nA wide range of problems in deep l
 earning can be addressed with ensembles of training runs, including hyperp
 arameter sweeps and optimization, robustness, sensitivity, and statistical
  studies, and incremental training approaches to study the underlying data
 .  Many of these cases can benefit from thousands of concurrent runs.
   Applying leadership machines like the 24,000-GPU Summit system is c
 hallenging due to the exotic programming environment that makes community 
 approaches difficult, as well as simply managing the sheer scale of the ma
 chine.  The approach presented here and taken by the Cancer Distribut
 ed Learning Environment (CANDLE) project is to apply a scalable workflow p
 rogramming language compatible with MPI-based systems like Summit (and oth
 er HPC machines) and build on a reusable library of common deep learning f
 unctionality, such as training, data manipulation, checkpoint/restart, etc
 .  Thus, common workflows can be simply reused from our collection, e
 xtended, or developed rapidly from scratch.  In this presentation, we
  will give an overview of the CANDLE infrastructure, with details on new f
 eatures such as support for data-parallel training, checkpoint/restart, an
 d our model training abstraction.  We intend that participants from a
  range of fields (not just cancer) will benefit from our approach and poss
 ibly reuse our solutions for other problems in scientific machine learning
  or data analysis.\n\nDomain: CS and Math, Life Sciences
END:VEVENT
END:VCALENDAR
