2 Hilewitz, Yedidya 2008 Advanced Bit Manipulation Instructions: Architecture, Implementation and Applications Department of Electrical Engineering Princeton University Advanced bit manipulation operations are not efficiently supported by commodity word-oriented microprocessors. Programming tricks are typically devised to shorten the long sequence of instructions needed to emulate these complicated operations. As these bit manipulation operations are relevant to applications that are becoming increasingly important, we propose direct support for them in microprocessors. In particular, we propose fast bit gather (or parallel extract), bit scatter (or parallel deposit) and bit matrix multiply instructions, building on previous work which focused solely on instructions for accelerating general bit permutations. <p> We show that the bit gather and bit scatter instructions can be implemented efficiently using the fast butterfly and inverse butterfly network datapaths. We define static, dynamic and loop-invariant versions of the instructions, with static versions utilizing a much simpler functional unit than dynamic or loop-invariant versions. We show how a hardware decoder can be implemented for the dynamic and loop-invariant versions to generate, dynamically, the control signals for the butterfly and inverse butterfly datapaths. We propose a new advanced bit manipulation functional unit to support bit gather, bit scatter and bit permutation instructions and then show how this functional unit can be extended to subsume the functionality of the standard shifter unit. This new unit represents an evolution in the design of shifters. <p> We also consider the bit matrix multiply instruction. This instruction multiplies two n x n bit matrices and can be used to accelerate parity computation and is a powerful bit manipulation primitive. Bit matrix multiply is currently only supported by supercomputers and we investigate simpler bmm primitive instructions suitable for implementation in a commodity processor. We consider smaller units that perform submatrix multiplication and the use of the Parallel Table Lookup module to speed up bmm. <p> Additionally, we perform an analysis of a variety of different application kernels taken from domains including binary compression, image manipulation, communications, random number generation, bioinformatics, integer compression and cryptology. We show that usage of our proposed instructions yields significant speedups over a basic RISC architecture ? parallel extract and parallel deposit speed up applications 2.4x on average, while applications that benefit from bmm instructions are accelerated up to 4.0x on average for the various bmm solutions.