ICCS 2016 Main Track (MT) Session 4
Time and Date: 10:15 - 11:55 on 7th June 2016
Room: KonTiki Ballroom
Chair: Alfredo Tirado-Ramos
287 | Embedded real-time stereo estimation via Semi-Global Matching on the GPU Abstract: Dense, robust and real-time computation of depth information from stereo-camera systems is a computationally demanding requirement for robotics, advanced driver assistance systems (ADAS) and autonomous vehicles. Semi-Global Matching (SGM) is a widely used algorithm that propagates consistency constraints along several paths across the image. This work presents a real-time system producing reliable disparity estimation results on the new embedded energy-efficient GPU devices. Our design runs on a Tegra X1 at 42 frames per second (fps) for an image size of 640×480, 128 disparity levels, and using 4 path directions for the SGM method. |
Daniel Hernández Juárez, Alejandro Chacón, Antonio Espinosa, David Vázquez, Juan Carlos Moure, Antonio M. López |
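As an illustration of the path-wise propagation that SGM performs, the following sequential Python sketch aggregates matching costs along a single left-to-right path using the standard SGM recurrence. The function name, the penalty values `p1` and `p2`, and the toy cost values are assumptions chosen for illustration; this is not the authors' GPU implementation.

```python
def aggregate_path(cost, p1=1, p2=4):
    """Aggregate matching costs along one left-to-right scanline.

    cost[x][d] is the matching cost of pixel x at disparity d.
    p1 penalises a disparity change of one level, p2 any larger change
    (both values are illustrative assumptions).
    Returns a list of the same shape holding the aggregated path costs.
    """
    num_disp = len(cost[0])
    aggregated = [list(cost[0])]  # the first pixel has no predecessor
    for x in range(1, len(cost)):
        prev = aggregated[-1]
        prev_min = min(prev)
        row = []
        for d in range(num_disp):
            # Same disparity as the predecessor, or a large jump (penalty p2).
            best = min(prev[d], prev_min + p2)
            # Neighbouring disparities cost the smaller penalty p1.
            if d > 0:
                best = min(best, prev[d - 1] + p1)
            if d + 1 < num_disp:
                best = min(best, prev[d + 1] + p1)
            # Subtracting prev_min keeps the values bounded along the path.
            row.append(cost[x][d] + best - prev_min)
        aggregated.append(row)
    return aggregated


if __name__ == "__main__":
    toy_cost = [[0, 5, 5], [5, 0, 5]]  # two pixels, three disparity levels
    print(aggregate_path(toy_cost))
```

A full SGM pipeline would repeat this aggregation for each path direction (four in the configuration reported above), sum the results, and select the disparity with the lowest total cost at each pixel.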
314 | Multivariate Polynomial Multiplication on GPU Abstract: Multivariate polynomial multiplication is a fundamental operation used in many scientific domains, for example in the optics code for particle accelerator design at CERN. We present a novel and efficient algorithm for multiplying multivariate polynomials with double-precision floating-point coefficients on GPUs, implemented using the CUDA parallel programming platform. We obtain very good speedups over another multivariate polynomial multiplication library for GPUs (up to 548x), and over the implementation of our algorithm for multi-core machines using OpenMP (up to 7.46x). |
Diana Andreea Popescu, Rogelio Tomas Garcia |
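To make the operation concrete, here is a naive sequential Python sketch of multivariate polynomial multiplication. The sparse dictionary representation (exponent tuple mapped to coefficient) and the function name are assumptions for illustration only; the abstract does not describe the data layout or the parallel scheme of the GPU algorithm.

```python
from collections import defaultdict


def multiply(p, q):
    """Multiply two multivariate polynomials.

    Each polynomial is a dict mapping an exponent tuple, one entry per
    variable, to a floating-point coefficient.
    """
    result = defaultdict(float)
    for exp_p, coef_p in p.items():
        for exp_q, coef_q in q.items():
            # Multiplying monomials adds exponents variable by variable.
            exponent = tuple(a + b for a, b in zip(exp_p, exp_q))
            result[exponent] += coef_p * coef_q
    return dict(result)


if __name__ == "__main__":
    x_plus_y = {(1, 0): 1.0, (0, 1): 1.0}
    x_minus_y = {(1, 0): 1.0, (0, 1): -1.0}
    print(multiply(x_plus_y, x_minus_y))  # coefficients of (x + y)(x - y)
```

Every pair of terms is independent until the accumulation step, which is where the parallelism in this operation comes from.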
329 | CUDA Optimization of Non-Local Means Extended to Wrapped Gaussian Distributions for Interferometric Phase Denoising Abstract: Interferometric Synthetic Aperture Radar (InSAR) captures hundreds of millions of phase measurements with a single image, which can be differenced with a subsequent matching image to measure the Earth’s physical properties such as atmosphere, topography, and ground instability. Each pixel in an InSAR image lies somewhere between perfect information and complete noise; deriving useful measurements from InSAR is therefore predicated upon estimating the quality (coherence) of each pixel, while also enhancing the information-bearing pixels through filtering. Rejecting noisy pixels at the outset and filtering the available information without introducing artifacts is crucial for generating accurate and spatially dense measurements. A capable filtering strategy must accommodate the diversity of manmade and natural ground cover exhibiting noise spawned by vegetation and water interwoven with usable signals echoed by infrastructure, rocks, and bare ground. Traditional filtering strategies assuming spatial homogeneity have lately been replaced by filters that honor discontinuities in ground cover, but two key improvements are needed: a) techniques must be adapted to enhance phase rather than amplitude, and b) runtime needs to be reduced to support deployment for operational land-information products. We present a new algorithm for wrapped phase filtering based on the non-local means algorithm (NLM) of Buades et al. (2005) and the non-local InSAR (NL-InSAR) algorithm of Deledalle et al. (2011). The new filter, wrapped-NLM (WNLM), extends NLM to wrapped phase data that is inherently lossy due to an unknown integer number of phase ambiguities per pixel. 
The filter is similar to that of NL-InSAR in that we adopt their procedure of iteratively improving the filtered phase estimates by updating the Bayesian prior based on the previously filtered data (2009). Our filter differs from NL-InSAR in that it does not assume the Goodman model (1963) nor that of speckle noise (Goodman J. W., 2007), which were found to suffer in some areas due to having too many degrees of freedom; instead we use a more general assumption that the phase noise distribution is additive wrapped Gaussian, making the filter more robust to a larger variety of input data. This also simplifies the algorithm, making it possible to implement an efficient parallel algorithm on the GPU using CUDA. |
Aaron Zimmer, Parwant Ghuman |
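The notion of wrapped phase in the abstract above can be illustrated with a short Python sketch. It shows how a phase value loses its integer number of 2π cycles when wrapped, and why wrapped samples are usually averaged on the unit circle rather than arithmetically. Both helper names are hypothetical, and the weighted circular mean is a standard building block offered as an assumption; it is not the WNLM filter itself.

```python
import math


def wrap(phase):
    """Wrap a phase in radians into the interval [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi


def circular_mean(phases, weights=None):
    """Weighted mean of wrapped phases, computed on the unit circle."""
    if weights is None:
        weights = [1.0] * len(phases)
    real = sum(w * math.cos(p) for p, w in zip(phases, weights))
    imag = sum(w * math.sin(p) for p, w in zip(phases, weights))
    return math.atan2(imag, real)


if __name__ == "__main__":
    # Three full cycles are lost by wrapping: only the fractional part remains.
    print(wrap(0.5 + 3 * 2 * math.pi))
    # Samples on either side of the wrap point average to about +/- pi,
    # whereas their arithmetic mean would be 0.
    print(circular_mean([3.1, -3.1]))
```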
449 | A Performance Prediction and Analysis Integrated Framework for SpMV on GPUs Abstract: This paper presents unique modeling algorithms for performance prediction of sparse matrix-vector multiplication (SpMV) on GPUs. Based on the algorithms, we develop a framework that is able to predict SpMV kernel performance and to analyze the reported prediction results. We make the following contributions: (1) We provide a theoretical basis for the generation of benchmark matrices according to the hardware features of a given specific GPU. (2) Given a sparse matrix, we propose a quantitative method to collect some features representing its matrix settings. (3) We propose four performance modeling algorithms to accurately predict kernel performance for SpMV computing using CSR, ELL, COO, and HYB SpMV kernels. We evaluate the accuracy of our framework with 8 widely used sparse matrices (32 test cases in total) on an NVIDIA Tesla K80 GPU. In our experiments, the average performance differences between the predicted and measured SpMV kernel execution times for CSR, ELL, COO, and HYB SpMV kernels are 5.1%, 5.3%, 1.7%, and 6.1%, respectively. |
Ping Guo, Chung-Wei Lee |
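For readers unfamiliar with the storage formats named above, the following sequential Python sketch shows SpMV in the CSR (compressed sparse row) format. The function and argument names are assumptions for illustration, and the sketch says nothing about the performance models proposed in the paper.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    row_ptr[i]:row_ptr[i + 1] delimits the non-zeros of row i inside
    col_idx (column indices) and values (the matching entries).
    """
    y = []
    for row in range(len(row_ptr) - 1):
        total = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_idx[k]]
        y.append(total)
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0],
    #      [4, 0, 5]]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))
```

Rows hold different numbers of non-zeros, so the inner loop length varies from row to row; this irregularity is one reason SpMV kernel time depends on the matrix structure as well as its size.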
71 | A Multi-GPU Fast Iterative Method for Eikonal Equations using On-the-fly Adaptive Domain Decomposition Abstract: Recent research on Eikonal solvers focuses on employing state-of-the-art parallel computing technology, such as GPUs. Although previous work on GPU-based parallel Eikonal solvers exists, there is little literature on multi-GPU Eikonal solvers because of the complexity of data and work management. In this paper, we propose a novel on-the-fly, adaptive domain decomposition method for efficient implementation of the Block-based Fast Iterative Method on a multi-GPU system. The proposed method is based on dynamic domain decomposition so that the region to be processed by each GPU is determined on-the-fly when the solver is running. In addition, we propose an efficient domain assignment algorithm that minimizes communication overhead while maximizing load balancing between GPUs. The proposed method scales well, up to 6.17x for eight GPUs, and can handle large computing problems that do not fit in limited GPU memory. We assess the parallel efficiency and runtime performance of the proposed method on various distance computation examples using up to eight GPUs. |
Sumin Hong, Won-Ki Jeong |