If you prefer to download the complete LARCC bibliography as a single file, you can find it here. You can also follow new publications via RSS.
The publications are also listed on the LARCC profile on Google Scholar.
2020
Stein, Charles M; Rockenbach, Dinei A; Griebler, Dalvan; Torquati, Massimo; Mencagli, Gabriele; Danelutto, Marco; Fernandes, Luiz G. Latency-aware adaptive micro-batching techniques for streamed data compression on graphics processing units. Journal Article. Concurrency and Computation: Practice and Experience, e5786, Wiley, 2020. DOI: https://doi.org/10.1002/cpe.5786. Tags: GPGPU, Stream processing.
Abstract: Stream processing is a parallel paradigm used in many application domains. With the advance of graphics processing units (GPUs), their usage in stream processing applications has increased as well. Efficient utilization of GPU accelerators in streaming scenarios requires batching input elements into micro-batches, whose computation is offloaded to the GPU, leveraging data parallelism within the same batch of data. Since data elements are continuously received at the input rate, the bigger the micro-batch size, the higher the latency to fully buffer it and start the processing on the device. Unfortunately, stream processing applications often have strict latency requirements, which demands finding the best micro-batch size and adapting it dynamically based on the workload conditions as well as on the characteristics of the underlying device and network. In this work, we aim to implement latency-aware adaptive micro-batching techniques and algorithms for streaming compression applications targeting GPUs. The evaluation is conducted using the Lempel-Ziv-Storer-Szymanski (LZSS) compression application under different input workloads. As a general result of our work, we noticed that algorithms with elastic adaptation factors respond better to stable workloads, while algorithms with narrower targets respond better to highly unbalanced workloads.
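To illustrate the general idea only (this is not the paper's algorithm), a minimal sketch of a latency-aware micro-batch size controller might look as follows; the latency target, adaptation factor, and size bounds are assumptions made for the example.

```cpp
// Minimal sketch (not the paper's algorithm): a feedback loop that grows or
// shrinks the micro-batch size to keep per-batch latency near a target.
#include <algorithm>
#include <cstddef>

struct MicroBatchController {
    std::size_t batch_size = 1024;    // current micro-batch size (items)
    std::size_t min_size   = 64;
    std::size_t max_size   = 1 << 20;
    double      target_ms  = 5.0;     // latency budget per batch (assumed)
    double      factor     = 1.25;    // multiplicative adaptation factor (assumed)

    // Called after each batch has been buffered and processed on the GPU.
    void update(double observed_ms) {
        if (observed_ms > target_ms)
            batch_size = std::max(min_size,
                                  static_cast<std::size_t>(batch_size / factor));
        else
            batch_size = std::min(max_size,
                                  static_cast<std::size_t>(batch_size * factor));
    }
};
```

In the adaptive setting the paper describes, such a controller would be driven by the measured buffering-plus-kernel latency of each micro-batch.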
de Araujo, Gabriell Alves; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo. Efficient NAS Parallel Benchmark Kernels with CUDA. Inproceedings. 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP'20), pp. 9-16, IEEE, Västerås, Sweden, 2020. DOI: https://doi.org/10.1109/PDP50117.2020.00009. Tags: Benchmark, GPGPU.
Abstract: The NAS Parallel Benchmarks (NPB) are one of the standard benchmark suites used to evaluate parallel hardware and software. Many research efforts have provided parallel versions beyond the original OpenMP and MPI implementations. For GPU accelerators, only OpenCL and OpenACC versions are available as consolidated implementations. Our goal is to provide an efficient parallel implementation of the five NPB kernels with CUDA. Our contribution covers different aspects. First, best parallel programming practices were followed to implement the NPB kernels in CUDA. Second, support for larger workloads (classes B and C) allows stressing and investigating the memory of robust GPUs. Third, we show that it is possible to make NPB efficient and suitable for GPUs, even though the benchmarks were originally designed for CPUs. In some cases we achieve twice the performance of the state of the art, along with efficient memory usage. Fourth, we discuss new experiments comparing performance and memory usage against the state-of-the-art OpenACC and OpenCL versions on a relatively new GPU architecture. The experimental results also revealed that our version is the best one for all the NPB kernels compared with OpenACC and OpenCL; the greatest differences were observed for the FT and EP kernels.
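For context, CUDA ports of NPB-style kernels commonly combine grid-stride loops with block-level reductions. The sketch below shows only that generic pattern and is not the authors' code; the kernel name and launch configuration are assumptions.

```cpp
// Minimal CUDA sketch (not the authors' code): a grid-stride loop feeding a
// block-level tree reduction, a building block typical of NPB-style kernel ports.
// Launch with a power-of-two block size and blockDim.x * sizeof(double) bytes of
// dynamic shared memory, e.g. partial_sum<<<grid, 256, 256 * sizeof(double)>>>(...).
__global__ void partial_sum(const double* in, double* block_sums, size_t n) {
    extern __shared__ double sdata[];
    double acc = 0.0;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)blockDim.x * gridDim.x)
        acc += in[i];                                      // grid-stride accumulation
    sdata[threadIdx.x] = acc;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction in shared memory
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = sdata[0];  // one partial sum per block
}
```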
2019
Rockenbach, Dinei A; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo. High-Level Stream Parallelism Abstractions with SPar Targeting GPUs. Inproceedings. Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing (ParCo'19), vol. 36, pp. 543-552, IOS Press, Prague, Czech Republic, 2019. DOI: https://doi.org/10.3233/APC200083. Tags: GPGPU, Stream processing.
Abstract: The combined exploitation of stream and data parallelism has shown encouraging performance results in the literature for heterogeneous architectures, which are present in every computer system today. However, providing parallel software that efficiently targets those architectures requires significant programming effort and expertise. The SPar domain-specific language already addresses this problem by providing proven high-level programming abstractions for multi-core architectures. In this paper, we enrich the SPar language with support for GPUs. New transformation rules are designed to generate parallel code using stream and data parallel patterns. Our experiments revealed that these transformation rules improve performance while the high-level programming abstractions are maintained.
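To give a flavor of the programming model, the sketch below shows the SPar annotation style on a generic streaming loop; it requires the SPar compiler, and the Item type, helper functions, and worker count are stand-ins for this example rather than the paper's code. The paper's new transformation rules additionally generate data-parallel GPU code for stages like the replicated one.

```cpp
// Illustrative SPar annotation sketch (needs the SPar compiler; types and
// helpers below are hypothetical stand-ins, not the paper's code).
#include <vector>

struct Item { std::vector<float> data; bool eos = false; };
Item read_item();                  // hypothetical stream source
Item heavy_compute(Item item);     // the work a GPU-targeted stage would offload
void write_item(const Item& item); // hypothetical stream sink

void pipeline(int workers) {
    [[spar::ToStream]] while (true) {
        Item item = read_item();
        if (item.eos) break;
        [[spar::Stage, spar::Input(item), spar::Output(item), spar::Replicate(workers)]]
        { item = heavy_compute(item); }
        [[spar::Stage, spar::Input(item)]]
        { write_item(item); }
    }
}
```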
Rockenbach, Dinei A; Stein, Charles Michael; Griebler, Dalvan; Mencagli, Gabriele; Torquati, Massimo; Danelutto, Marco; Fernandes, Luiz Gustavo. Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges. Inproceedings. International Parallel and Distributed Processing Symposium Workshops (IPDPSW'19), pp. 834-841, IEEE, Rio de Janeiro, Brazil, 2019. DOI: https://doi.org/10.1109/IPDPSW.2019.00137. Tags: GPGPU, Stream processing.
Abstract: The stream processing paradigm is used in several scientific and enterprise applications to continuously compute results from data items coming from data sources such as sensors. Fully exploiting the potential parallelism offered by current heterogeneous multi-cores equipped with one or more GPUs is still a challenge in the context of stream processing applications. In this work, our main goal is to present the parallel programming challenges the programmer faces when exploiting CPU and GPU parallelism at the same time using traditional programming models. We highlight the parallelization methodology in two use cases (the Mandelbrot Streaming benchmark and PARSEC's Dedup application) to demonstrate the issues and benefits of using heterogeneous parallel hardware. The experiments conducted demonstrate how a high-level parallel programming model targeting stream processing, such as the one offered by SPar, can reduce the programming effort while still offering a good level of performance compared with state-of-the-art programming models.
Stein, Charles M; Stein, Joao V; Boz, Leonardo; Rockenbach, Dinei A; Griebler, Dalvan. Mandelbrot Streaming para Sistemas Multi-core com GPUs. Inproceedings. 19th Escola Regional de Alto Desempenho da Região Sul (ERAD/RS), Sociedade Brasileira de Computação, Três de Maio, RS, Brazil, 2019. URL: http://larcc.setrem.com.br/wp-content/uploads/2019/04/192109.pdf. Tags: GPGPU, Stream processing.
Abstract (translated from Portuguese): This work explores parallelism in the Mandelbrot Streaming application for multi-core architectures with GPUs, using the FastFlow and TBB libraries and SPar with CUDA. The parallel implementation was based on the farm pattern, achieving a speedup of 16x on the multi-core system and of 77x in a multi-core environment with two GPUs. The results show better performance when using GPUs, although opportunities for future improvements were identified.
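As an illustration of the per-item work that such a farm distributes, a minimal CUDA kernel computing one Mandelbrot row could look like the sketch below; the pixel-to-plane mapping and parameters are assumptions for the example, not the published implementation.

```cpp
// Minimal CUDA sketch of a per-row Mandelbrot kernel (illustrative only).
__global__ void mandelbrot_line(unsigned char* out, int width, int height,
                                int row, int max_iter) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    float cr = -2.0f + 3.0f * col / width;   // map the pixel to the complex plane
    float ci = -1.5f + 3.0f * row / height;  // (assumed viewport)
    float zr = 0.0f, zi = 0.0f;
    int it = 0;
    while (zr * zr + zi * zi < 4.0f && it < max_iter) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++it;
    }
    out[col] = static_cast<unsigned char>(255 * it / max_iter);  // escape-time shade
}
```

In a farm of the kind described, each worker would receive a row index as a stream item and launch one such kernel on its assigned GPU.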
Stein, Charles M; Rockenbach, Dinei A; Griebler, Dalvan. Paralelização do Dedup para Sistemas Multi-core com GPUs. Inproceedings. 19th Escola Regional de Alto Desempenho da Região Sul (ERAD/RS), Sociedade Brasileira de Computação, Três de Maio, RS, Brazil, 2019. URL: http://larcc.setrem.com.br/wp-content/uploads/2019/04/192087.pdf. Tags: GPGPU.
Abstract (translated from Portuguese): The growing volume of data generated, transferred, and processed increases the demand for more processing power and for efficient compression algorithms. This work explores stream parallelism for multi-core architectures with GPUs in the Dedup application, using SPar with CUDA and OpenCL. Although the performance was not as expected, the paper contributes a detailed analysis of the results and suggestions for future improvements.
Stein, Charles Michael; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo. Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs. Inproceedings. 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP'19), pp. 247-251, IEEE, Pavia, Italy, 2019. DOI: https://doi.org/10.1109/EMPDP.2019.8671624. Tags: GPGPU, Stream processing.
Abstract: GPUs have been used to accelerate different data-parallel applications. The challenge is to use GPUs to accelerate stream processing applications. Our goal is to investigate and evaluate whether stream-parallel applications may benefit from parallel execution on both CPU and GPU cores. In this paper, we introduce new parallel algorithms for the Lempel-Ziv-Storer-Szymanski (LZSS) data compression application. We implemented the algorithms targeting both CPUs and GPUs. GPUs have been used with CUDA and OpenCL to exploit the inner algorithm's data parallelism. Outer stream parallelism has been exploited on CPU cores through SPar. The parallel implementation of LZSS achieved a 135-fold speedup using a multi-core CPU and two GPUs. We also observed speedups in applications where we were not expecting them, using the same combined data-stream parallel exploitation techniques.
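A minimal sketch of the combined scheme, where a CPU-side stream stage offloads its batch to the GPU, is shown below; the kernel body is a placeholder and the memory management is deliberately simplified, so this is not the paper's LZSS code.

```cpp
// Sketch of outer stream parallelism (CPU stage) combined with inner data
// parallelism (GPU kernel). Illustrative only; not the paper's LZSS kernels.
#include <cuda_runtime.h>
#include <vector>

__global__ void process_batch(const char* in, char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // placeholder for the per-element compression work
}

// Body of a replicated worker stage: copy the batch in, launch, copy out.
void worker_stage(const std::vector<char>& batch, std::vector<char>& result) {
    char *d_in = nullptr, *d_out = nullptr;
    int n = static_cast<int>(batch.size());
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, batch.data(), n, cudaMemcpyHostToDevice);
    process_batch<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    result.resize(n);
    cudaMemcpy(result.data(), d_out, n, cudaMemcpyDeviceToHost);  // implicit sync
    cudaFree(d_in);
    cudaFree(d_out);
}
```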
2018
Stein, Charles. Programação Paralela para GPU em Aplicações de Processamento Stream. Undergraduate Thesis, Sociedade Educacional Três de Maio (SETREM), Três de Maio, RS, Brazil, 2018. URL: http://larcc.setrem.com.br/wp-content/uploads/2018/11/TCC_SETREM__Charles_Stein_1.pdf. Tags: GPGPU, Stream processing.
Abstract: Stream processing applications are used in many areas. They usually require real-time processing and have a high computational load, so parallelizing this type of application is necessary. The use of GPUs can hypothetically increase the performance of these stream processing applications. This work presents the study and parallel software implementation for GPUs of stream processing applications. Applications from different areas were chosen and parallelized for CPU and GPU. A set of experiments was conducted and the results were analyzed. The Sobel, LZSS, Dedup, and Black-Scholes applications were parallelized. The Sobel filter did not gain performance, while LZSS, Dedup, and Black-Scholes obtained speedups of 36x, 13x, and 6.9x, respectively. In addition to performance, the source lines of code of the CUDA and OpenCL implementations were measured in order to analyze code intrusion. The tests performed showed that in some applications the use of the GPU is advantageous, while in others there are no significant gains compared with the parallel CPU versions.
Stein, Charles M; Griebler, Dalvan. Explorando o Paralelismo de Stream em CPU e de Dados em GPU na Aplicação de Filtro Sobel. Inproceedings. 18th Escola Regional de Alto Desempenho do Estado do Rio Grande do Sul (ERAD/RS), pp. 137-140, Sociedade Brasileira de Computação, Porto Alegre, RS, Brazil, 2018. URL: http://larcc.setrem.com.br/wp-content/uploads/2018/04/LARCC_ERAD_IC_Stein_2018.pdf. Tags: GPGPU, Stream processing.
Abstract (translated from Portuguese): The goal of this study is the combined parallelization of the stream on the CPU and of the data on the GPU in a Sobel filter application. A performance evaluation of OpenCL, OpenACC, and CUDA was carried out with a matrix multiplication algorithm in order to choose the tool to be used with SPar. We conclude that, although the GPU achieves an 11.81x speedup with CUDA, the exclusive use of the CPU with SPar is more advantageous for this application.
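For illustration, a basic CUDA Sobel kernel, the kind of data-parallel code this CPU-stream/GPU-data combination targets, might look as follows; the grayscale image layout and border handling are simplified assumptions, not the code evaluated in the paper.

```cpp
// Minimal CUDA Sobel sketch (illustrative only): one thread per interior pixel
// of a single-channel, row-major image.
__global__ void sobel(const unsigned char* in, unsigned char* out,
                      int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;  // skip borders
    int gx = -in[(y - 1) * width + (x - 1)] + in[(y - 1) * width + (x + 1)]
             - 2 * in[y * width + (x - 1)]  + 2 * in[y * width + (x + 1)]
             - in[(y + 1) * width + (x - 1)] + in[(y + 1) * width + (x + 1)];
    int gy = -in[(y - 1) * width + (x - 1)] - 2 * in[(y - 1) * width + x]
             - in[(y - 1) * width + (x + 1)] + in[(y + 1) * width + (x - 1)]
             + 2 * in[(y + 1) * width + x]   + in[(y + 1) * width + (x + 1)];
    int mag = (gx < 0 ? -gx : gx) + (gy < 0 ? -gy : gy);  // L1 gradient approximation
    out[y * width + x] = mag > 255 ? 255 : static_cast<unsigned char>(mag);
}
```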