ICCS 2016 Main Track (MT) Session 1

Time and Date: 10:35 - 12:15 on 6th June 2016

Room: KonTiki Ballroom

Chair: David Abramson

19	Performance Analysis and Optimization of a Hybrid Seismic Imaging Application Abstract: Applications to process seismic data are computationally expensive and, therefore, employ scalable parallel systems to produce timely results. Here we describe our experiences of using performance analysis tools to gain insight into an MPI+OpenMP code developed by Shell that performs Reverse Time Migration on a cluster to produce models of the subsurface. Tuning MPI+OpenMP programs for modern platforms is difficult, and, therefore, assistance is required from performance analysis tools. These tools provided us with insights into the effectiveness of the domain decomposition strategy, the use of threaded parallelism, and functional unit utilization in individual cores. By applying insights obtained from Rice University's HPCToolkit and hardware performance counters, we were able to improve the performance of Shell's prototype distributed-memory Reverse Time Migration code by roughly 30 percent.	Sri Raj Paul, Mauricio Araya-Polo, John Mellor-Crummey, Detlef Hohl
33	Portable Application-level Checkpointing for Hybrid MPI-OpenMP Applications Abstract: As parallel machines increase their number of processors, so does the failure rate of the global system; thus, long-running applications will need fault tolerance techniques to ensure successful completion. Most current HPC systems are built as clusters of multicores. The hybrid MPI-OpenMP paradigm provides numerous benefits on these systems. This paper presents a checkpointing solution for hybrid MPI-OpenMP applications, in which checkpoint consistency is guaranteed by an intra-node coordination protocol, while no inter-node coordination is needed. The proposal reduces network utilization and storage resources in order to optimize the I/O cost of fault tolerance, while minimizing the checkpointing overhead. In addition, the portability of the solution and the dynamic parallelism provided by OpenMP enable applications to be restarted on machines with different architectures, operating systems and/or numbers of cores, adapting the number of running OpenMP threads to make the best use of the available resources. Extensive evaluation using hybrid MPI-OpenMP applications from the ASC Sequoia Benchmark Codes and NERSC-8/Trinity benchmarks is presented, showing the effectiveness and efficiency of the approach.	Nuria Losada, María J. Martín, Gabriel Rodríguez, Patricia González
38	Checkpointing of Parallel MPI Applications using MPI One-sided API with Support for Byte-addressable Non-volatile RAM Abstract: The increasing size of computational clusters results in an increasing probability of failures, which in turn requires application checkpointing in order to survive those failures. Traditional checkpointing requires data to be copied from application memory into a persistent storage medium, which increases application execution time as it is usually done in a separate step. In this paper we propose to use emerging byte-addressable non-volatile RAM (NVRAM) as a persistent storage medium, and we analyze various methods of making consistent checkpoints with the support of the MPI one-sided API in order to minimize checkpointing overhead. We test our solution on two applications: the HPCCG benchmark and the PageRank algorithm. Our experiments showed that NVRAM-based checkpointing performs much better than the traditional disk-based approach. We also simulated different possible latencies and bandwidths of future NVRAM, and our experiments showed that only bandwidth had a visible impact on application execution time.	Piotr Dorożyński, Pawel Czarnul, Artur Malinowski, Krzysztof Czuryło, Łukasz Dorau, Maciej Maciejewski, Paweł Skowron
57	Acceleration of Tear Film Map Definition on Multicore Systems Abstract: Dry eye syndrome is a public health problem, and one of the most common conditions seen by eye care specialists. Among the clinical tests for its diagnosis, the evaluation of the interference patterns observed in the tear film lipid layer is often employed. In this context, tear film maps illustrate the spatial distribution of the patterns over the whole tear film and provide useful information to practitioners. However, the creation of a single map usually takes tens of minutes. Medical experts currently demand applications with lower response times in order to provide a faster diagnosis for their patients. In this work, we explore different parallel approaches to accelerate the definition of the tear film map by exploiting the power of today's ubiquitous multicore systems. These approaches can be executed on any multicore system without special software or hardware requirements. The experimental evaluation determines the best approach (on-demand with dynamic seed distribution) and shows that it can significantly decrease the runtime. For instance, the average runtime of our experiments with 50 real-world images on a system with AMD Opteron processors is reduced from more than 20 minutes to one minute and 12 seconds.	Jorge González-Domínguez, Beatriz Remeseiro, María J. Martín
99	Modeling and Implementation of an Asynchronous Approach to Integrating HPC and Big Data Analysis Abstract: With the emergence of exascale computing and big data analytics, many important scientific applications require the integration of computationally intensive modeling and simulation with data-intensive analysis to accelerate scientific discovery. In this paper, we create an analytical model to steer the optimization of the end-to-end time-to-solution for the integrated computation and data analysis. We also design and develop an intelligent data broker that efficiently intertwines the computation stage and the analysis stage to achieve, in practice, the optimal time-to-solution predicted by the analytical model. We perform experiments on both synthetic applications and real-world computational fluid dynamics (CFD) applications. The experiments show that the analytical model exhibits an average relative error of less than 10%, and the applications’ performance can be improved by up to 131% for the synthetic programs and by up to 78% for the real-world CFD application.	Yuankun Fu, Fengguang Song, Luoding Zhu