Alright, so first off: what does IEEE 754-1985 actually mean? IEEE stands for the Institute of Electrical and Electronics Engineers, and 754 is the number of the standard they adopted in 1985. It is still the basis for how modern computers represent decimal numbers in binary. Before I go any further, this tutorial assumes you already know how to perform written calculations in both binary and hexadecimal, without the use of a calculator. If you cannot do these things, please leave while you still can. Alright, now onto the good stuff.
Float values are stored in 32 bits, meaning they use 32 0's and 1's (bits) to represent a number. Here's the IEEE 754 single-precision format:
Alright, time to break it apart. The sign bit uses sign-magnitude representation: a 1 signifies a negative number, a 0 a positive one. Now, to cover the rest, let's choose a number... Let's try -6.125, for instance. The first step is to place a 1 in the sign bit to mark the number as negative:
- Code:
1 | XXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXX
Sign| Exponent | Mantissa/Significand
Next, convert 6.125 itself to binary. The integer part, 6, is 110:
- Code:
110.XXX
The fractional part, .125, is 1/8 = 2^-3, which is .001 in binary:
- Code:
110.001
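If you'd like to double-check a conversion like this with code, here's a minimal Python sketch (the helper name `to_binary` and the digit count are my own choices, not part of the standard):

```python
# Convert a positive number like 6.125 to a binary string:
# the integer part via bin(), the fraction by repeated doubling.
def to_binary(x, frac_bits=6):
    whole = int(x)
    frac = x - whole
    digits = bin(whole)[2:] + "."
    for _ in range(frac_bits):
        frac *= 2
        digits += "1" if frac >= 1 else "0"
        if frac >= 1:
            frac -= 1
    return digits

print(to_binary(6.125))  # 110.001000
```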
Now, we have to "normalize" this to make it work. Sure, we're happy writing binary numbers in this unnormalized form, but a computer does not store them that way. What we do is move the binary point to the left as many places as it takes until it sits just after the leading 1-bit, so exactly one 1 remains in front of it. It turns out like this:
- Code:
1.10001
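You can sanity-check this normalization in Python with `math.frexp` (a sketch; note that `frexp` normalizes to the range [0.5, 1) rather than IEEE's [1, 2), so it needs a small adjustment):

```python
import math

# math.frexp returns (m, e) with x = m * 2**e and 0.5 <= m < 1;
# IEEE 754 instead normalizes to 1.f * 2**(e-1).
m, e = math.frexp(6.125)
print(m, e)          # 0.765625 3
print(m * 2, e - 1)  # 1.53125 2  -> 1.10001 in binary, times 2^2
```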
Now, we drop the leading 1 and the binary point (the format always assumes that leading 1, so it is never stored), and get this:
- Code:
10001
"Pad" the remaining 18 bits with 0's, and we get this... We know this due to (Number of bits) - (Number in "Normalized Strand" 23-5=18 remaining bits
- Code:
1 | XXXXXXXX| 10001000000000000000000
Sign| Exponent | Significand/Mantissa
Now, we find the exponent. Remember how many times we moved the binary point to the left to "normalize": two. To that we add what is called the "bias". For an 8-bit exponent, the bias is 127, the highest positive number a signed (+/-) 8-bit value can hold (2^7 - 1 = 127). We now add the exponent (2) to 127 and get 129. Now, all we do is write out 129 in binary and load it into the exponent bits. 129 = 10000001 in binary, so we load that into the exponent... Our full float-notation number is:
- Code:
11000000110001000000000000000000
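You can confirm this whole 32-bit pattern (and the exponent field of 129) with Python's standard `struct` module; a quick sketch:

```python
import struct

# Pack -6.125 into IEEE 754 single precision, then reinterpret
# the same 4 bytes as a 32-bit unsigned integer to see the bits.
bits = struct.unpack(">I", struct.pack(">f", -6.125))[0]
print(f"{bits:032b}")       # 11000000110001000000000000000000

# The exponent field alone: 2 (shift count) + 127 (bias) = 129.
print((bits >> 23) & 0xFF)  # 129
```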
And we're done! Now for double notation. I included it in the same lesson due to its similarities. The only differences are these:
The bias is now 1023 (with 11 exponent bits, 1023 = 2^10 - 1 is the highest signed value), and the significand holds 52 bits, giving a precision of 1/2^52 instead of float's 1/2^23. Keep in mind that binary is God-Awful for numbers whose fractional part is not a sum of powers of 2 (0.1, for example): their binary expansions never terminate, they have to be rounded in the end, and every significand bit gets used just to represent that rounded value. In the next post, I'll cover some extra terminology, but for now, this is how you can calculate in float! Enjoy!
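Both points are easy to see in Python; here's a sketch showing the double-precision bias at work on our -6.125, and the rounding that 0.1 suffers:

```python
import struct

# Doubles: 1 sign bit, 11 exponent bits (bias 1023), 52 significand bits.
# For -6.125 = 1.10001 * 2^2, the exponent field is 2 + 1023 = 1025.
bits = struct.unpack(">Q", struct.pack(">d", -6.125))[0]
print((bits >> 52) & 0x7FF)  # 1025

# 0.1 has no finite binary expansion, so the stored double is rounded:
print(f"{0.1:.20f}")  # 0.10000000000000000555
```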
A float Calculator to check your work:
http://www.h-schmidt.net/FloatConverter/IEEE754.html
Last edited by Reclaimer Shawn on 3/24/2018, 8:56 pm; edited 2 times in total