Abstract
In this article, we first characterize register operand value locality in shader programs of modern gaming applications and observe that there is a high likelihood of one of the register operands of several multiply, logical-and, and similar operations being zero, dynamically. We provide intuition, examples, and a quantitative characterization for how zeros originate dynamically in these programs. Next, we show that this dynamic behavior can be gainfully exploited with a profile-guided code optimization called Zeroploit that transforms targeted code regions into a zero-(value-)specialized fast path and a default slow path. The fast path benefits from zero-specialization in two ways, namely: (a) the backward slice of the other operand of a given multiply or logical-and can be skipped dynamically, provided the only use of that other operand is in the given instruction, and (b) the forward slice of instructions originating at the given instruction can be zero-specialized, potentially triggering further backward slice specializations from operations of that forward slice as well. Such specialization helps the fast path avoid redundant dynamic computations as well as memory fetches, while the fast-slow versioning transform helps preserve functional correctness. With an offline value profiler and manually optimized shader programs, we demonstrate that Zeroploit is able to achieve an average speedup of 35.8% for targeted shader programs, amounting to an average frame-rate speedup of 2.8% across a collection of modern gaming applications on an NVIDIA® GeForce RTX™ 2080 GPU.
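The fast-slow versioning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which targets shader code: the function names (`expensive_lighting`, `shade_default`, `shade_versioned`), the constant `0.25`, and the call counter are all hypothetical, chosen only to show how a zero check on one multiply operand lets the fast path skip the backward slice of the other operand and fold the forward slice. The sketch assumes the skipped computation has no side effects, always yields a finite value, and is used only by that multiply.

```python
# Illustrative sketch only; all names and constants here are hypothetical.

def expensive_lighting(x: float, counter: list) -> float:
    """Stand-in for the backward slice of the 'other' operand."""
    counter[0] += 1  # record how often the slice actually executes
    return (x * x + 1.0) ** 0.5


def shade_default(mask: float, x: float, counter: list) -> float:
    """Original code: always evaluates both operands of the multiply."""
    light = expensive_lighting(x, counter)
    return mask * light + 0.25


def shade_versioned(mask: float, x: float, counter: list) -> float:
    """Versioned code: zero-specialized fast path plus default slow path."""
    if mask == 0.0:
        # Fast path: skip the backward slice of 'light'; the forward slice
        # (0 * light + 0.25) is specialized to the constant 0.25.
        return 0.25
    # Slow path: the unmodified original computation.
    return shade_default(mask, x, counter)


if __name__ == "__main__":
    masks = [0.0, 0.0, 1.0, 0.0, 0.5]
    slow_calls, fast_calls = [0], [0]
    for i, m in enumerate(masks):
        assert shade_default(m, float(i), slow_calls) == \
            shade_versioned(m, float(i), fast_calls)
    print(slow_calls[0], fast_calls[0])
```

Both versions return identical results for these inputs, while the versioned one runs the expensive slice only when `mask` is nonzero. The finite-value assumption matters: under IEEE 754, zero times infinity or NaN is NaN, so the fast path is only equivalent when such values cannot reach the multiply.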
Details
Published: Aug 03, 2020
Vol/Issue: 17(3)
Pages: 1-26
Cite This Article
Ram Rangan, Mark W. Stephenson, Aditya Ukarande, et al. (2020). Zeroploit. ACM Transactions on Architecture and Code Optimization, 17(3), 1-26. https://doi.org/10.1145/3394284