%0 Journal Article %J IEEE Transactions on Computing %D 2009 %T A New Basis for Shifters in General-Purpose Processors for Existing and Advanced Bit Manipulations %A Hilewitz, Yedidya %A Lee, Ruby %N 8 %V 58 %X This paper describes a new basis for the implementation of the shifter functional unit in microprocessors that can implement new advanced bit manipulations as well as standard shifter operations. Our design is based on the inverse butterfly and butterfly datapath circuits, rather than the barrel shifter or log-shifter designs currently used. We show how this new shifter can implement the standard shift and rotate operations, as well as more advanced extract, deposit and mix operations found in some processors. Furthermore, it can perform important new classes of even more advanced bit manipulation instructions like arbitrary bit permutations, bit gather (or parallel extract) and bit scatter (or parallel deposit) instructions. Thus, our new functional unit performs the functionality of three functional units ? the basic shifter, the multimedia-mix unit and the advanced bit manipulation functional unit, while having a latency only slightly longer than that of the log-shifter. %Z Available online since November 2008. %8 08/2009 %0 Conference Paper %B Proceedings of 19th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP ?08) %D 2008 %T Bit Matrix Multiplication in Commodity Processors %A Hilewitz, Yedidya %A Lauradoux, Cédric %A Lee, Ruby %8 July 2008 %0 Conference Paper %B Proceedings of the 15th Fast Software Encryption Workshop (FSE) %D 2008 %T Accelerating the Whirlpool Hash Function Using Parallel Table Lookup and Fast Cyclical Permutation %A Hilewitz, Yedidya %A Yin, Yiqun Lisa %A Lee, Ruby %C Lausanne, Switzerland %X Hash functions are an important building block in almost all security applications. In the past few years, there have been major advances in the cryptanalysis of hash functions, especially the MDx family, and it has become important to select new hash functions for next-generation security applications. One of the potential candidates is Whirlpool, an AES-based hash function. Whirlpool adopts a very different design approach from MDx, and hence it has withstood all the latest attacks. However, its slow software performance has made it less attractive for practical use. In this paper, we present a new software implementation of Whirlpool that is significantly faster than previous ones. Our optimization leverages new ISA extensions, in particularly Parallel Table Lookup (PTLU), which has previously been proposed to accelerate block ciphers like AES and DES, multimedia and other applications. We also show a novel cyclical permutation algorithm that can concurrently convert rows of a matrix to diagonals. We obtain a speedup of 8.8x and 13.9x over a basic RISC architecture using 64-bit and 128-bit PTLU modules, respectively. This is equivalent to rates of 11.4 and 7.2 cycles/byte, respectively, which makes our Whirlpool implementation faster than the fastest published rate of 12 cycles/byte for SHA-2 in software. %8 February 2008 %0 Journal Article %J Journal of Signal Processing Systems %D 2008 %T Fast Bit Gather, Bit Scatter and Bit Permutation Instructions for Commodity Microprocessors %A Hilewitz, Yedidya %A Lee, Ruby %I Springer New York %N 1-2 %P 145-169 %V 53 %8 11/2008 %0 Thesis %B Department of Electrical Engineering %D 2008 %T Advanced Bit Manipulation Instructions: Architecture, Implementation and Applications %A Hilewitz, Yedidya %I Princeton University %X Advanced bit manipulation operations are not efficiently supported by commodity word-oriented microprocessors. Programming tricks are typically devised to shorten the long sequence of instructions needed to emulate these complicated operations. As these bit manipulation operations are relevant to applications that are becoming increasingly important, we propose direct support for them in microprocessors. In particular, we propose fast bit gather (or parallel extract), bit scatter (or parallel deposit) and bit matrix multiply instructions, building on previous work which focused solely on instructions for accelerating general bit permutations.
We show that the bit gather and bit scatter instructions can be implemented efficiently using the fast butterfly and inverse butterfly network datapaths. We define static, dynamic and loop-invariant versions of the instructions, with static versions utilizing a much simpler functional unit than dynamic or loop-invariant versions. We show how a hardware decoder can be implemented for the dynamic and loop-invariant versions to generate, dynamically, the control signals for the butterfly and inverse butterfly datapaths. We propose a new advanced bit manipulation functional unit to support bit gather, bit scatter and bit permutation instructions and then show how this functional unit can be extended to subsume the functionality of the standard shifter unit. This new unit represents an evolution in the design of shifters.
We also consider the bit matrix multiply instruction. This instruction multiplies two n x n bit matrices and can be used to accelerate parity computation and is a powerful bit manipulation primitive. Bit matrix multiply is currently only supported by supercomputers and we investigate simpler bmm primitive instructions suitable for implementation in a commodity processor. We consider smaller units that perform submatrix multiplication and the use of the Parallel Table Lookup module to speed up bmm.
Additionally, we perform an analysis of a variety of different application kernels taken from domains including binary compression, image manipulation, communications, random number generation, bioinformatics, integer compression and cryptology. We show that usage of our proposed instructions yields significant speedups over a basic RISC architecture ? parallel extract and parallel deposit speed up applications 2.4x on average, while applications that benefit from bmm instructions are accelerated up to 4.0x on average for the various bmm solutions. %9 PhD Thesis %0 Generic %D 2007 %T Fast Bit Matrix Multiplication in Commodity Microprocessors %A Hilewitz, Yedidya %A Lauradoux, Cedric %A Lee, Ruby %C Princeton University Department of Electrical Engineering Technical Report CE-L2007-011 %8 November 2007 %0 Generic %D 2007 %T PAX: A Cryptographic Processor with Parallel Table Lookup and Wordsize Scalability %A Lee, Ruby %A Fiskiran, Murat %A Wang, Michael %A Hilewitz, Yedidya %A Chen, Yu-Yuan %C Princeton University Department of Electrical Engineering Technical Report CE-L2007-010 %8 November 2007 %0 Conference Paper %B Proceedings of the 18th IEEE Symposium on Computer Arithmetic (ARITH-18) %D 2007 %T Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors %A Hilewitz, Yedidya %A Lee, Ruby %C Montpellier, France %K shifter, rotations, permutations, bit manipulations, arithmetic, processor %P 251-260 %X This paper describes a new basis for the implementation of a shifter functional unit. We present a design based on the inverse butterfly and butterfly datapath circuits that performs the standard shift and rotate operations, as well as more advanced extract, deposit and mix operations found in some processors. Additionally, it also supports important new classes of even more advanced bit manipulation instructions recently proposed: these include arbitrary bit permutations, bit scatter and bit gather instructions. The new functional unit?s datapath is comparable in latency to that of the classic barrel shifter. It replaces two existing functional units - shifter and mix - with a much more powerful one. %8 June 2007 %0 Generic %D 2007 %T A New Basis for Shifters in General-Purpose Processors for Existing and Advanced Bit Manipulations %A Hilewitz, Yedidya %A Lee, Ruby %C Princeton University Department of Electrical Engineering Technical Report CE-L2007-004 %8 July 2007 %0 Generic %D 2007 %T Accelerating the Whirlpool Hash Function using On-Chip Lookup Tables %A Hilewitz, Yedidya %A Lee, Ruby %C Princeton University Department of Electrical Engineering Technical Report CE-L2007-001 %8 February 2007 %0 Generic %D 2007 %T Achieving Very Fast Bit Matrix Multiplication in Commodity Microprocessors %A Hilewitz, Yedidya %A Lee, Ruby %C Princeton University Department of Electrical Engineering Technical Report CE-L2007-006 %8 August 2007 %0 Conference Paper %B Proceedings of the IEEE 17th International Conference on Application-Specific Systems, Architectures and Processors (ASAP) %D 2006 %T Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions %A Hilewitz, Yedidya %A Lee, Ruby %P 65-72 %U http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04019493 %X Current microprocessor instruction set architectures are word oriented, with some subword support. Many important applications, however, can realize substantial performance benefits from bitoriented instructions. We propose the parallel extract (pex) and parallel deposit (pdep) instructions to accelerate compressing and expanding selections of bits. We show that these instructions can be implemented by the fast inverse butterfly and butterfly network circuits. We evaluate latency and area costs of alternative functional units for implementing subsets of advanced bit manipulation instructions. We show applications exhibiting significant speedup, 3.41 %Z (Best Paper Award) %8 11/09/2006 %0 Conference Paper %B Proceedings of the 38th Annual Asilomar Conference on Signals, Systems, and Computers %D 2004 %T Comparing Fast Implementations of Bit Permutation Instructions %A Hilewitz, Yedidya %A Shi, Zhijie Jerry %A Lee, and Ruby %C Pacific Grove, California, USA %P 1856-1863 %U http://palms.ee.princeton.edu/PALMSopen/hilewitz04comparing_with_cit.pdf %8 Nov. 2004