How does word length affect the performance and operation of a CPU?


About a year ago, I came across a question on Super User titled “How much faster is a 64-bit CPU than a 32-bit CPU?”, which was promptly closed and deleted since it’s a very open-ended question. However, the author (a software developer) referred to benchmarks regarding system performance in 32-bit versus 64-bit. The purpose of this blog post is to investigate how a computer’s performance is affected by the processor’s word length.

Two Microcontrollers (8-bit Atmel), A Raspberry Pi (32-bit ARM), and my laptop (64-bit Intel)

8-bit, 32-bit, and 64-bit all refer to the word length of the processor, which can be thought of as its “fundamental data type”. Often, this is the number of bits transferred to/from the system’s RAM, and the width of pointers (although nothing stops you from using software to access more RAM than a single pointer can reach). A word length can be any number of bits, but is usually a power of two.

Assuming a constant clock speed (and everything else in the architecture held constant), and assuming memory reads/writes take the same time (we assume 1 clock cycle here, though this is far from the case in real life), there is no direct speed advantage in moving from a 32-bit to a 64-bit processor, except when higher-precision values are used or many memory reads/writes are required. For example, if I need to add two 64-bit numbers, I can do it in a single clock cycle on a 64-bit machine (three if you count fetching the numbers from RAM):

     ADDA [NUM1], [NUM2]

However, on a 32-bit machine, this takes many clock cycles, since I first need to add the lower 32 bits, then compensate for any carry, and then add the upper 32 bits:

     ADDA [NUM1.LO], [NUM2.LO] ; Add the lower 32 bits of each number.
     BRNO CMPS                 ; Branch to CMPS if there was no overflow (carry).
     ADDA #1                   ; If there was overflow, carry 1 into the upper sum.
CMPS ADDA [NUM1.HI], [NUM2.HI] ; Add the upper 32 bits.

Going through my made-up assembly syntax, you can easily see how higher-precision operations take several times longer on a machine with a shorter word length, with the gap widening as the precision grows. This is the real key to 64-bit and 128-bit processors: they allow us to handle larger numbers of bits in a single operation.
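The same multi-word addition can be sketched in C using two 32-bit halves. This is purely illustrative (the struct and names are invented), but it mirrors the kind of carry-propagating sequence a compiler emits when asked to do 64-bit arithmetic on a 32-bit target:

```c
#include <stdint.h>

/* Illustrative sketch: a 64-bit value held as two 32-bit words,
 * added using only 32-bit operations (names are invented). */
typedef struct { uint32_t lo, hi; } u64pair;

static u64pair add64(u64pair a, u64pair b)
{
    u64pair r;
    r.lo = a.lo + b.lo;              /* add the lower 32 bits          */
    uint32_t carry = (r.lo < a.lo);  /* 1 if the low add wrapped around */
    r.hi = a.hi + b.hi + carry;      /* add the upper 32 bits + carry   */
    return r;
}
```

On a 64-bit machine the whole thing collapses into one ordinary integer addition.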

Likewise, if we had to make a copy of some data in memory, everything else being constant, we could copy twice as many bits per cycle on a 64-bit machine as on a 32-bit one. This is why 64-bit versions of many image/video editing programs outperform their 32-bit counterparts.
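As a rough sketch in C (hypothetical, assuming aligned buffers and a length that is a multiple of the word size), the word-at-a-time copy makes one eighth as many trips through the loop as the byte-at-a-time copy on a 64-bit machine:

```c
#include <stdint.h>
#include <stddef.h>

/* Copy n_bytes one 64-bit word at a time (assumes aligned buffers
 * and that n_bytes is a multiple of 8). */
static void copy_words(uint64_t *dst, const uint64_t *src, size_t n_bytes)
{
    for (size_t i = 0; i < n_bytes / sizeof(uint64_t); i++)
        dst[i] = src[i];    /* one 64-bit transfer per iteration */
}

/* The same copy one byte at a time: eight times as many iterations. */
static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n_bytes)
{
    for (size_t i = 0; i < n_bytes; i++)
        dst[i] = src[i];
}
```

Real memcpy implementations add alignment handling and tail loops, but the core advantage is exactly this: more bits moved per operation.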

Back to high-precision operations: even if you give a 32-bit processor the ability to add two 64-bit numbers in a single clock cycle, *you still need more than one clock cycle* to fetch those numbers from RAM, since the word length (again) is often the fundamental size of memory operations. So, let’s assume we have two 64-bit registers (A64 and B64), and an operation called ADDAB64 which adds A64 and B64 and stores the result in A64:

     LDAA64 [NUM1]   ; Takes 2 clock cycles, since this number is fetched 32 bits at a time.
     LDAB64 [NUM2]   ; Again, two more clock cycles.
     ADDAB64         ; This only takes 1.
     STAA64 [RESULT] ; Two cycles again, since we store the 64-bit result 32 bits at a time.

As you can see, even a hardware implementation of 64-bit addition on a 32-bit processor still takes 7 clock cycles at minimum (and this assumes all memory reads/writes take a single clock cycle). The same reasoning has performance implications specifically for pointers.

On a 32-bit machine, pointers can address ~4 GB of RAM, whereas on a 64-bit machine they can address over 16.7 million TB. If you needed to address past 4 GB on a 32-bit machine, you would have to compensate, much like we added a 64-bit number on our 32-bit machine above. You would spend many extra clock cycles fetching and combining those wider addresses, whereas those operations are much quicker on a processor that can handle an address in a single word.
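To make that concrete, here is a hypothetical sketch of a two-word “far pointer” on a 32-bit machine, in the spirit of bank switching or PAE (the names and layout are invented). Note the extra compare-and-carry work on every pointer adjustment, which a 64-bit machine avoids entirely:

```c
#include <stdint.h>

/* A hypothetical two-word pointer: a bank word extends the reachable
 * range beyond the 4 GB limit of a single 32-bit offset. */
typedef struct { uint32_t bank; uint32_t offset; } far_ptr;

/* Advance the pointer by n bytes, carrying into the bank word. */
static far_ptr far_add(far_ptr p, uint32_t n)
{
    uint32_t off = p.offset + n;
    if (off < p.offset)   /* 32-bit wraparound: extra cycles spent here */
        p.bank += 1;
    p.offset = off;
    return p;
}
```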

Also, while increasing the number of bits in an arithmetic and logic unit (ALU) increases propagation delays for most operations, this delay is very manageable in today’s processors (or else we couldn’t keep the same clock speeds as the 32-bit variants). In a synchronous digital circuit everything is clocked together, so as long as the longest propagation delay fits within one clock period, the wider ALU runs at full speed; if the delay were too long, the processor would simply malfunction (which is also why there are limits to overclocking).

The bottom line: a larger word length means we can process more data per cycle in the processor, which is greatly needed as we advance computing technology. This is why so many instruction set extensions (MMX, SSE, etc.) have been created: to process larger amounts of data in less time.

A larger word length in a processor does not directly increase the performance of the system, but when larger (or higher-precision) values are involved, substantial performance gains can be realized. While the average consumer may not notice these increases, they are greatly appreciated in the fields of numeric computing, scientific analysis, video encoding, and encryption/compression.


  1. Matshidiso Makeke, October 16, 2012 8:13 AM

    Hello, I am an IT student at an M.Sc college (Middelburg Campus), and I have an assignment that I have been struggling with. Could you provide assistance with the following questions?

    1. What needs to happen in a fetch cycle?
    2. Why would word length have an effect on processor performance?
    3. What would be the effect of an incorrect instruction length?

    • Brandon Castellano, October 17, 2012 8:13 AM

      Hello Matshidiso; in regards to your questions, I have outlined some answers below:

      1. In a fetch cycle, the value in the program counter/instruction pointer is used as the address of the next instruction. This address is sent to a memory interface unit, which will then retrieve the next instruction word at the address. The program counter is then incremented, and the instruction is executed.

      2. I updated this blog post with significantly more information, which should fully answer this question.

      3. It depends on your architecture. On a variable-length instruction word CPU, this may halt the CPU, as it would begin executing (presumably) “random” code. On an x86 machine, inserting bytes in a compiled executable would also likely corrupt it (causing the program to crash or CPU to halt), as certain non-instruction bytes may be interpreted as machine code.
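      The fetch cycle in (1) can be sketched as a toy simulator in C (everything here, from the register set to the memory size, is invented purely for illustration):

```c
#include <stdint.h>

/* Toy model of the fetch step: the program counter addresses memory,
 * the instruction word is read out, and the PC is incremented.
 * (All names and sizes are invented for illustration.) */
typedef struct {
    uint32_t pc;        /* program counter / instruction pointer */
    uint32_t mem[256];  /* word-addressed instruction memory     */
} toy_cpu;

static uint32_t fetch(toy_cpu *cpu)
{
    uint32_t instr = cpu->mem[cpu->pc]; /* memory interface read   */
    cpu->pc += 1;                       /* point at the next word  */
    return instr;                       /* ready to decode/execute */
}
```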

      If you have any further questions, please let me know. Thanks!

  2. Matshidiso Makeke, October 19, 2012 7:16 AM

    Thank you so much Brandon, this information really helped me a lot; thanks for your generosity. You were a great help. Till next time.


  3. Craig, November 28, 2012 12:43 AM

    I’m reading this to try and fully understand the following statement:
    -strlen code implementation

    Rationale: it is generally much more efficient to do word length
    operations and avoid branches on modern computer systems, as
    compared to byte-length operations with a lot of branches.

    Can anyone explain how this works?

    • Brandon Castellano, December 8, 2012 10:11 AM

      Hi Craig;

      You need to remember that modern computer processors are pipelined, with a very large pipeline depth. When a branch is required, the CPU will use branch prediction (i.e. assume the branch is taken or not taken before it actually knows, a form of speculative execution) to reduce the performance penalty. However, if the prediction turns out to be wrong, the speculatively executed instructions must be discarded and the pipeline refilled from the correct path, creating a bubble in the pipeline.

      In terms of actual word-length versus byte-length operations, it’s generally more efficient to do word-length operations (assuming your word is larger than a byte!) since memory operations are expensive in terms of time. This comes back to the idea of fetching; we can fetch one word from RAM at a time, and a single word holds 4 characters on a 32-bit system (or 8 on a 64-bit system). Instead of fetching a word and loading only one relevant character from it into an 8-bit register (and repeating this step for subsequent characters), we get much higher throughput by reading in multiple characters at a time and using the appropriate bit-wise operations to extract the values.

      In terms of the strlen implementation you linked to above, it indeed uses these concepts to avoid expensive branching/fetching operations. The code scans through a string one word at a time and checks whether any of the bytes in the word evaluate to zero (the null byte). If so, it then uses bit-masking operations to determine which byte was the null byte, and uses this to determine the string length. In this way, the code can efficiently iterate through a string one word at a time, as opposed to one character at a time.
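      A minimal sketch of that idea in C, assuming a 4-byte-aligned string and using the well-known zero-byte bit trick (this is a simplification of what real strlen implementations do, which also handle alignment and unrolling):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: walk a string one 32-bit word at a time, stopping when a
 * word contains a zero byte (assumes the string is 4-byte aligned). */
static size_t word_strlen(const char *s)
{
    const uint32_t *w = (const uint32_t *)s;
    size_t len = 0;
    for (;;) {
        uint32_t v = *w++;
        /* Nonzero exactly when some byte of v is zero (bit trick). */
        if ((v - 0x01010101u) & ~v & 0x80808080u) {
            const char *p = (const char *)(w - 1);
            while (*p++)       /* find which byte is the terminator */
                len++;
            return len;
        }
        len += 4;              /* no null byte: skip the whole word */
    }
}
```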

      If you have any further questions, please let me know!

  4. Nicole Hamilton, December 21, 2012 4:27 PM

    A lot of computing is about sloshing data from one place to another, strcpy’ing strings and so on. In these simple cases, the advantage of 64 bits isn’t greater precision, it’s a bigger pipe to memory and i/o.

    • Brandon Castellano, December 21, 2012 4:52 PM

      Hello Nicole; a very valid point. I only touched on that briefly at the end of the article, but I agree it has just as many implications for performance as high-precision operations do. I’ve updated the article to reflect this point better.


