M.Srujana B.ParameshwariAssistant Professor Assistant ProfessorDepartment of Department ofElectronics and Electronics andCommunication Engineering, Communication Engineering,CJITS, Yeshwanthapur,T.S CJITS,Yeshwanthapur,T.
SE-mail:[email protected] E-mail:[email protected]— In this paper we present a high-performance, highthroughput, and area efficient architecture for the VLSI imple-mentation of the AES algorithm. The subkeys, required for eachround of the Rijndael algorithm, are generated in real-time bythe key-scheduler module by expanding the initial secret key,thus reducing the amount of storage for buffering.
Moreover,pipelining is used after each standard round to enhance thethroughput. A prototype chip implemented using 0.35µ CMOStechnology resulted in a throughput of 232M bps for iterativearchitecture and 1.83ttbps for pipelining architecture.I. INTRODUCTIONSeveral techniques, such as cryptography, steganography,watermarking, and scrambling, have been developed to keepdata secure, private, and copyright protected 1, 2. Cryptog-raphy is an essential tool underlying virtually all networkingand computer protection, traditionally used for military andespionage. However, the need for secure transactions in e-commerce, private networks, and secure messaging has movedencryption into the commercial realm 3.
Advanced encryption standard (AES) was issued as FederalInformation Processing Standards (FIPS) by National Instituteof Standards and Technology (NIST) as a successor to dataencryption standard (DES) algorithms. In recent literature,a number of architectures for the VLSI implementation ofAES Rijndael algorithm are reported 4, 5, 6, 7, 8.It can be observed that some of these architectures are of lowperformance and some provide low throughput.
Further, manyof the architectures are not area efficient and can result inhigher cost when implemented in silicon.In this paper, we propose a high performance, high through-put and area efficient VLSI architecture for Rijndeal algo-rithm that is suitable for low cost silicon implementation.The proposed architecture is optimized for high throughputin terms of the encryption and decryption data rates usingpipelining. Polynomial multiplication is implemented usingXOR operation instead of using multipliers to decrease thehardware complexity. In the proposed architecture both theencryption and decryption modes use common hardware re-sources, thus making the design area efficient. Selective use oflook-up tables and combinational logic further enhances the ar-chitecture’s memory optimization, area, and performance.
Animportant feature of our proposed architecture is an effectivesolution of online (real-time) round key generation needingsignificantly less storage for buffering.II.RIJNDAEL ALGORITHMRijndael algorithm is an iterated block cipher 9 supportinga variable data block and a variable key length of 128, 192 or256 bits. The algorithm consists of three distinct phases: (i)an initial data/key addition, (ii) nine (128-bits), eleven (192-bits) or thirteen (256-bits) standard rounds, (iii) a final roundwhich is a variation of a standard round.
The number ofstandard rounds depends on the data block and key length.If the maximum length of the datablock or key is 128, 192 or256, then the number of rounds is 10, 12 or 14, respectively.The initial key is expanded to generate the round keys, eachof size equal to block length.
Each round of the algorithmreceives a new round key from the key schedule module.Each standard round includes four fundamental algebraicfunction transformations on arrays of bytes. These transforma-tions are: byte substitution, shift row, mix column, and roundkey addition. The final round of the algorithm is similar tothe standard round, except that it does not have MixColumnoperation. Decryption is performed by the application of theinverse transformations of the round functions. The sequenceof operations for the standard round function differs fromencryption.
The computational performance differs betweenencryption and decryption because the inverse transformationsin the round function is more complex than the correspondingtransformation for encryption.III. THE PROPOSED VLSI ARCHITECTURE FOR RIJNDAELThe proposed architecture showing the order of operationand control between the transformations is shown in Fig. 1(a).A. Architecture of the Data UnitThe data unit consists of: the initial round of key addition,N r 1 standard rounds, and a final round. The architecturefor a standard round composed of four basic blocks is shownin Fig.
1(b). For each block, both the transformation and theinverse transformation needed for encryption and decryption,respectively are performed using the same hardware resources.This implementation generates one set of subkey and reusesit for calculating all other subkeys in real-time.1) ByteSub: In this architecture each block is replaced byits substitution in an S-Box table consisting of the multiplica-tive inverse of each byte of the block state in the finite fieldttF (28). In order to overcome the performance bottleneck,×???? ????Where, In2T rans(K) is the multiplication of the byte by X(hexadecimal value 02) over ttF (28).
In0 is the leastsignificant 8 bits of a column of a matrix. Architecture ofdifferent units are shown in Fig. 2 and the architecture ofMixColumn transformation is shown in the Fig. 3.InputOutputIN0IN18IN28IN3T0888 8XOR8XOR8XOR88XORTrans88 XOR8XOR 8Trans8XOR8Trans8 TransMULTIPLEXER (Encryption, Decryption)8 8(a) Rijndael Algorithm Data and Con- (b) Architecture for the Standard 8 8trol FlowRound in the Data UnitFig. 1.
Top Level View of the RijndaelXOR Y Z88(a) For Computation of Y , Zthe implementation of multiplicative inverses is carried outusing look-up tables (stored in a table of 8 256). Theimplementation includes the affine mapping of the input inboth encryption and decryption processes as follows:a7 a6 a5a4 a3 a2a1 a0 0Affine Mapping: Outi = Ini In(i + 4)mod8 In(i +5)mod8 In(i+ 6)mod8 In(i+ 7)mod8 CEi, where CE= 01100011 is a constant, leftmost bit the being MSB.Inverse Affine Mapping: Outi = In(i + 2)mod8 In(i +5)mod8 In(i + 7)mod8 CDi, where CD = 00000101is a constant, leftmost bit being the MSB.2) ShiftRow: In this transformation the rows of the blockstate are shifted over different offsets. The amount of shiftsis determined by the block length. The proposed architectureimplements the shift row operation using combinational logicconsidering the offset by which a row should be shifted.
3) MixColumn: In this transformation each column of theblock state is considered as a polynomial over ttF (28). It ismultiplied with a constant polynomial C(x) or D(x) over afinite field in encryption or decryption, respectively. Inhardware, the multiplication by the corresponding polynomialis done by XOR operations and multiplication of a block byX. This is implemented using a multiplexer, the control beingthe MSB is 1 or 0. The equations implemented in hardwarefor MixColumn in encryption and decryption are as follows.
In encryption process,Y = In0 ? In1 ? In2 ? In3 and Z = Y .In decryption process, T 0 = In0 ? In1 ? In2 ? In3, T 1 =T 0 ? In2T rans(In2T rans(T 0)),Y = T 1 ? In2T rans(In2T rans(In0 ? In2)), and Z =T 1 ? In2T rans(In2T rans(In1 ? In3)).Out0 = In0 ? Y ? In2T rans(In0 ? In1), Out1 =In1 ? Z ? In2T rans(In1 ? In2)Out2 = In2 ? Y ? In2T rans(In2 ? In3), and Out3 =In3 ? Z ? In2T rans(In3 ? In0).(b) For Multiplication by X (hex “02”)Fig. 2.
Architecture for Units used in Mix Column TransformationFig. 3. Architecture for Mix Column Transformation for 128 bits4) AddRoundKey: In this transformation (architecture rep-resented in Fig.
4), the round key obtained from the keyscheduler is XORed with the block state obtained from theMixColumn transformation or ShiftRow transformation basedon the type of round being implemented. In the standardround, the round key is XORed with the output obtained fromthe MixColumn transformation. In the final round the roundkey is XORed with the output obtained from the ShiftRowtransformation. In the initial round, bitwise XOR operation isInitial KeyEncryption /DecryptionFinal Round /Initial RoundMixColumnTransformationShiftRowTransformationByteSubTransformationAddRoundKeyTransformation128128Round Key AdditionRound key128128XOR blockY Mix ColumnXOR blockMUXZ128ByteSub128Shift RowInverseAffineMappingMUX AffineMappingmultiplicative inverse8 * 256 ROM cellsMultiplexer BlockXOR XOR XOR XOR XOR XOR XOR XORMultiplexer Multiplexer Multiplexer Multiplexer Multiplexer Multiplexer Multiplexer Multiplexer(a7 ; a7′) (a7 ; a7′) (a7 ; a7′) (a7 ; a7′) (a7 ; a7′) (a7 ; a7′) (a7 ; a7′) (a7 ; a7′)Y0Z 032Y1Z 132Second Column of State MatrixY2Z 232Third Column of State MatrixY3Z 332Fourth Column of State MatrixOut33 Out32Out31 Out30Out23 Out22Out21 Out20Out13 Out12Out11 Out10First Column of State MatrixOut03 Out02 Out01 Out00MIXCOLUMNCOMPUTATIONOFY,Z(Encryption,Decryption)×××performed between the initial round key and the initial stateblock.generate the last key value. Since the Rijndael algorithm allowsdifferent key lengths and block lengths, each round key iscarefully set to have the same length as the data block. InK 00(i)B 00(i)K 01(i)B 01(i)K 02(i)B 02(i)K 03(i)B 03(i)K 31(i)B 31(i)B 32(i)K 32(i)15th byteK 33(i)B 33(i)the case where key length and the block length are not equal,previous, current and also the next round keys are needed inorder to generate the appropriate set of round keys that arefed into the encryption module, which is performed by a “keyB 00(i+1)B 01(i+1)B 02(i+1)B 03(i+1)B 31(i+1)B 32(i+1)B 33(i+1)alignment unit”.
Fig. 4. Architecture for Round Key Addition TransformationB. Architecture for Key SchedulingIn the key scheduling module (Fig. 5), the initial key isexpanded and the generated round keys are stored in four 32-bit registers. Both the forward and reverse key scheduling aredone in the same device.
The ByteSub required in the keyexpansion unit is implemented using the S-Boxes. Four S-Boxes are needed for a 128-bit key and 128-bit data blockimplemented using 8 256 ROM cells. Multiplexers are usedas a control signal to distinguish between the initial key andthe round key (obtained from the initial key using a “keyexpansion unit”). The least significant 32 bits of the 128-bitkey is cyclically shifted to the left by a byte, implementedusing combinational logic. The resulting word after the leftshift operation is sent through the S-boxes and the affinemapping operation, in order to perform ByteSub. The keyresulting from the ByteSub is XORed with the Round Constant(RCON). In this architecture, the round constant is generatedusing the combinational logic. The round constant should besymmetric with the round key being generated.
initialkey0 initialkey1 initialkey2 initialkey3Encryption32Decryptioninitial_norrmal initial_norrmal initial_norrmal initial_norrmalIV.A PROPOTYPE CHIP IMPLEMENTATION FOR RIJNDAELWe now present the methodology used to design, simulate,and verify the proposed architecture.A. Implementation AnalysisIt is evident that the Rijndael’s S-Boxes are the dominantelement of the round function in terms of required logicresources. Each Rijndael round requires sixteen copies ofthe S-Boxes, each of which is an 8-bit 8-bit look-up- table,requiring more hardware resources.
However, the remainingcomponents of the Rijndael round function – byte swapping,constant multiplication by an element of Galois Field , and keyaddition – were found to be simpler in structure, resulting inthese elements of the round function requiring fewer hardwareresources. Additionally, it was found that the synthesis toolscould not minimize the overall size of a Rijndael roundsufficiently to allow for a fully unrolled or fully pipelinedimplementation of the entire ten rounds of the algorithm withinthe target FPGA 10. Partially pipelined implementation withone sub-pipeline stage provided one area-optimized solution.
As compared to a one-stage implementation with no sub-pipelining, the addition of a sub-pipeline stage afforded thesynthesis tool greater flexibility in its optimizations, resultingin a more area efficient implementation. The 2-stage loopunrolling was found to yield the highest throughput whenMUX MUXMUXMUXoperating in FeedBack (FB) mode.Register 0 Register 1 Register 2 Register 3B.Memory OptimizationRound Key 0 Round Key 1 Round Key 2 Round Key 3MUX enc_decSincethe design is based on one clock cycle for eachenc_decMUXFeedbackenc_decMUX128enc_decMUXleftshiftS?boxround constantencryption round, the memory modules had to be duplicated.For example, in the ByteSub, the S-boxes need to be duplicated16 times. Consequently, the choice of memory architecture isvery critical. Since all the table entries are fixed and definedin the standard, the usage of ROM is preferred.
Specifically,the architecture requires several small ROM modules insteadof one large module, since each lookup will only be based onFig. 5. Architecture for Key Scheduling UnitThe total number of round constants that need to be gener-ated is equal to the number of rounds. The round constantis obtained in real-time by multiplying the previous roundconstant by X. This is amenable for implementation in thehardware using XOR operations. For the reverse key schedul-ing, the last round key should be generated with forward keyscheduling for the first time. The last round key is expandedto generate the reverse round keys. Decryption requires morecycles than encryption because it needs pre-scheduling toa maximum of 8-bit address, which translates to 256 entries.
We implemented the multiplicative inverse function using thelook-up table of size 8 256. We have a total of 20 copies ofthe S-boxes in our design; 16 of them in encryption moduleand 4 in the key scheduling module.C. Design FlowThe proposed architecture was implemented using the CA-DENCE virtuoso layout design tool. The method adopted wasa custom designed at the transistor level based on a custom celllibrary of 0.
35µ CMOS primitive standard cells. A hierarchical16th byte14th byte4th byte3rd byte2nd byteIst byte32 32 32. ?total clock cycles 11total delay 70nsTABLE Iput = . block size?clock frequency ? = . 128?20MHz ? =232.7M bps, where, clock frequency = 1 / clock period forthe critical path. When the pipelining technique is usedinstead of the iterative feedback logic, the standard rounds areduplicated for N r times cascaded by the pipelining registers.This increases the effective area.
At a particular clock cycle,N r blocks of data can be encrypted or decrypted using thepipelining technique. Based on the critical path obtained usingour implementation, the throughput achieved with pipeliningcan be calculated as: Throughput = block size = 128TABLE IISUMMARY OF THE PERFORMANCE OF THE AES-128 MODULEArchitecture Clock cycles Throughput MbpsProposed Architecture 11 232Mangard et. al. 7-Standard 64 128Mangard et. al. 7-High Performance 34 241approach was followed in the implementation.
The layoutfor each module was generated and later integrated to obtainthe final chip. The generated netlist was then simulated withHSPICE using the MOSIS CMOS model. Once the creationof layout design is finished, the I/O pins have to be added tothe circuit. The design layouts of different architectural unitsare shown in the Fig. 6.(a) Multiplicative Inverse (b) Affine Mapping (c) Round ConstantsFig. 6.
Layout of Different Architectural UnitsD. Performance EvaluationAn AES-128 encryption / decryption of a 128-bit block wasdone in 11 clock cycles using the feedback logic. In each clockcycle, one transformation is executed and, at the same time,the appropriate key for the next round is calculated. The wholeprocess concludes after 10 rounds of transformations. Theanalysis of the components used for the proposed architectureis shown in the Table I. The architecture proposed by Mangardet. al. 7 uses multipliers for the implementation of theMixColumn, while ours uses XOR, multiplexors, invertersetc.
to reduce the complexity. Kuo et. al. 4 uses lookuptable for the implementation of the shift row module and forthe generation of the round constants in the key schedulingmodule, but we used the combinational logic instead of thelook-up tables, thus reducing the area.
The frequency of the external clock with which thearchitecture operates was 20M Hz; the critical delay be-ing 50ns. The throughput is calculated as: Through-= 1.83ttbps, where, total delay is the delay of the singleround including the delays caused by the pipelined registers.The summary of the performance in the Table II shows thatour proposed architecture minimizes the needed number ofclock cycles and achieves high throughput.V. DISCUSSIONS AND CONCLUSIONSWe have presented a VLSI architecture for the Rijndael AESalgorithm that performs both the encryption and decryption.S-boxes are used for the implementation of the multiplicativeinverses and shared between encryption and decryption. Theround keys needed for each round of the implementation aregenerated in real-time.
The forward and reverse key schedulingis implemented on the same device, thus allowing efficientarea minimization. Although the algorithm is symmetrical, thehardware required is not, with the encryption algorithm beingless complex than the decryption algorithm. The implementa-tion of the key unit in the proposed architecture, can be scaledfor the keys of length 192 and 256 bits easily.REFERENCES1 S. P. Mohanty, K.
R. Ramakrishnan, and M. S. Kankanhalli, “A DCTDomain Visible Watermarking Technique for Images,” in Proc of theIEEE International Conf on Multimedia and Expo, 2000, pp.
1029–1032.2 M. S. Kankanhalli and T. T.
Guan, “Compressed-Domain Scrambler/ Descrambler for Digital Video,” IEEE Transactions on ConsumerElectronics, vol. 48, no. 2, pp. 356–365, May 2002.3 B. M. Macq and J.
J. Quisquater, “Cryptography for Digital TVBroadcasting,” Proceedings of the IEEE, vol. 83, no. 6, pp. 944–957,Jun 1995.4 H. Kuo and I.
Verbauwhede, “Architectural Optimization for a 1.82Gbits/sec VLSI Implementation of the AES Rijndael Algorithm,” inProceedings of the Workshop on Cryptographic Hardware and Embed-ded Systems, 2001, vol. 2162, pp.
51–64.5 M. McLoone and J.
V. McCanny, “Rijndael FPGA ImplementationUtilizing Look-up Tables,” in Proceedings of the IEEE Workshop onSignal Processing Systems, 2001, pp. 349–360.6 A. Satoh, S. Morioka, K. Takano, and S. Munetoh, “A Compact RijndaelHardware Architecture with S-Box Optimization,” in Proceedings ofAdvances in Cryptology – ASIACRYPT 2001, 2001, pp.
171–184.7 S. Mangard, M. Aigner, and S. Dominikus, “A Highly Regularand Scalable AES Hardware Architecture,” IEEE Transactions onComputers, vol. 52, no.
4, pp. 483–491, April 2003.8 T. Sodon O. J. Hernandez and M.
Adel, “Low-Cost Advanced En-cryption Standard (AES) VLSI Architecture: A Minimalist Bit-SerialApproach,” in Proc of IEEE Southeast Conference, 2005, pp. 121–125.9 J. Daemen and V. Rijmen, The Design of Rijndael, Springer-Verlag,2002.10 A. J.
Elbirt, W. Yip, B. Chetwynd, and Christof Paar, “An FPGAImplementation and