BullyWiiHacks
Gaming, Modding & Programming


IEEE 754-1985 Float Calculation


#1 IEEE 754-1985 Float Calculation (2/25/2016, 1:53 pm)

Reclaimer Shawn
Code Creator

Alright, so first off: what exactly does IEEE 754-1985 stand for anyway? IEEE is the Institute of Electrical and Electronics Engineers, 754 is the number of its floating-point standard, and 1985 is the year that standard was adopted. Later revisions (IEEE 754-2008 and IEEE 754-2019) kept the same 32-bit and 64-bit binary formats, so this is still how modern computers represent numbers with a fractional part in binary. Before I go any further: this tutorial assumes you already know how to perform written calculations in both binary and hexadecimal without the use of a calculator. If you cannot do these things, please leave while you still can. Alright, now onto the good stuff.

Float values are stored in 32 bits, meaning they use 32 0's and 1's (bits) to represent a number. They are not integers; the 32 bits are split into three fields. Here's the IEEE 754 format:

[Image: IEEE 754 float (32-bit) layout: 1 sign bit, 8 exponent bits, 23 mantissa bits]

Alright, time to break it apart. The sign bit uses sign-magnitude representation: a 1 signifies a negative number, while a 0 signifies a positive one. To cover the rest, let's pick a number and work through it. Let's try -6.125, for instance. The first step is to place a 1 in the sign bit, since the number is negative:
Code:

1    |   XXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXX
Sign|   Exponent  | Mantissa/Significand
Now, we find the mantissa. First, we work out the whole-number part of the number. We know by now that 6 = 110 in binary, so we'll place this down here:

Code:

110.XXX
Now, you know how with binary we do powers of 2? To the right of the point we use negative powers of 2 to represent the fractional part, like in scientific notation. The first place is 2^-1, or 1/2^1. The next is 1/2^2, and so on. Our fractional part is .125. Does 1/2^1 = .5 go in? Nope, so we place a zero. Does 1/2^2 = .25 go in? Nope, so we place another zero. Now we try 1/2^3, which is .125. That goes in and leaves nothing over, so we place a one and stop there. Our "denormalized" number (I just mean "not normalized yet" here; in the standard itself that word is reserved for very small subnormal values) is as such:
Code:

110.001
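
The "does it go in?" steps above can be sketched in Python roughly like this. The function name fraction_to_bits and the max_bits cutoff are made up for this illustration and are not part of any library:

```python
def fraction_to_bits(fraction, max_bits=23):
    """Convert a fraction in [0, 1) to binary digits using negative powers of 2."""
    bits = ""
    for power in range(1, max_bits + 1):
        place_value = 2.0 ** -power   # 0.5, 0.25, 0.125, ...
        if fraction >= place_value:   # the place value "goes in"
            bits += "1"
            fraction -= place_value
        else:
            bits += "0"
        if fraction == 0:             # nothing left over, so stop
            break
    return bits

whole_bits = format(6, "b")              # "110"
fraction_bits = fraction_to_bits(0.125)  # "001"
fixed_point = whole_bits + "." + fraction_bits
print(fixed_point)                       # 110.001
```

For a fraction that never reaches zero (0.1, for example), the loop simply gives up after max_bits digits, which is a cut-off rather than proper rounding.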

Now, we have to "normalize" this to make it work. Sure, we can read binary fractions in the format above, but a computer stores them normalized. What we do is move the binary point to the left as many times as it takes to place it right after the first (leftmost) 1-bit. It turns out like this:

Code:

1.10001

Now, we drop the leading 1 and the binary point (every normalized number starts with 1, so the format does not bother storing it), and get this:

Code:

10001

"Pad" the remaining bits with 0's on the right. The mantissa field has 23 bits and our normalized strand uses 5 of them, so 23 - 5 = 18 bits remain to be filled with zeros:

Code:

1    | XXXXXXXX| 10001000000000000000000
Sign| Exponent  | Significand/Mantissa
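
If you want to check the normalizing and padding steps, here is a minimal Python sketch. The helper normalize is hypothetical, and it assumes the whole-number part is at least 1, so the point only ever moves left:

```python
def normalize(fixed_point):
    """Return (shift, mantissa_field) for a binary string such as '110.001'.

    Assumes the whole part is at least 1, so the point only moves left.
    """
    whole, _, fraction = fixed_point.partition(".")
    whole = whole.lstrip("0")
    shift = len(whole) - 1                 # places the point moved to the left
    dropped = whole[1:] + fraction         # drop the leading 1 and the point
    # Pad with zeros on the right to 23 bits (longer inputs are cut, not rounded).
    mantissa_field = dropped.ljust(23, "0")[:23]
    return shift, mantissa_field

shift, mantissa_field = normalize("110.001")
print(shift)           # 2
print(mantissa_field)  # 10001000000000000000000
```

The shift value is the number you need for the exponent.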

Now, we find the exponent. Remember how many times we moved the point to the left to "normalize" the number: two times, so our exponent is 2. To this we add what is called the "bias". The bias is the highest number we get in a signed (+/-) system of that many bits; with 8 bits that is 2^7 - 1 = 127. We add the exponent (2) to 127 and get 129. Now, all we do is write out 129 in binary and load it into the exponent bits. 129 = 10000001 in binary, so we load that into the exponent. Our full float notation number is:

Code:

11000000110001000000000000000000
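
To double-check the hand calculation, the three fields can be glued together in Python and compared with what the machine itself produces. This is only a sketch using the standard struct module; the variable names are my own:

```python
import struct

sign_bit = "1"                            # negative number
exponent_bits = format(2 + 127, "08b")    # shift of 2 plus the bias of 127
mantissa_bits = "10001".ljust(23, "0")    # normalized bits padded to 23

manual = sign_bit + exponent_bits + mantissa_bits
print(manual)                             # 11000000110001000000000000000000
print(hex(int(manual, 2)))                # 0xc0c40000

# Ask the machine for the same value as a big-endian 32-bit float.
(as_int,) = struct.unpack(">I", struct.pack(">f", -6.125))
machine = format(as_int, "032b")
print(machine == manual)                  # True
```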

And we're done! Now, on to double notation. I included this in the same lesson due to its similarities. The only difference is the size of the fields:

[Image: IEEE 754 double (64-bit) layout: 1 sign bit, 11 exponent bits, 52 mantissa bits]

The bias is now 1023 (with 11 exponent bits, the highest signed number is 2^10 - 1 = 1023), and the significand holds 52 bits, allowing precision down to 1/2^52 instead of 1/2^23 in float. Keep in mind that both formats are God-awful for fractions that are not sums of negative powers of 2 (0.1, for example): those will have to be rounded in the end, and EVERY bit will be used just to represent that rounded number. In the next post, I'll add some extra terminology, but for now, this is how you can calculate in float! Enjoy!
A float calculator to check your work:
http://www.h-schmidt.net/FloatConverter/IEEE754.html
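
For the double layout described above, the same kind of check can be sketched with Python's struct module. The slicing positions just follow the 1/11/52 split, and the 0.1 test at the end is my own example of a value that has to be rounded:

```python
import struct

(bits64,) = struct.unpack(">Q", struct.pack(">d", -6.125))
pattern = format(bits64, "064b")
sign, exponent, mantissa = pattern[0], pattern[1:12], pattern[12:]
print(sign)                         # 1
print(exponent)                     # 10000000001  (2 + 1023 = 1025)
print(mantissa[:5], len(mantissa))  # 10001 52

# 0.1 is not a finite sum of powers of 2, so it has to be rounded.
# Squeezing it through a 32-bit float loses bits a double would keep.
(tenth32,) = struct.unpack(">f", struct.pack(">f", 0.1))
print(tenth32 == 0.1)               # False
```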



Last edited by Reclaimer Shawn on 3/24/2018, 8:56 pm; edited 2 times in total

#2 Terminology and Other Factoids (2/25/2016, 2:15 pm)

Reclaimer Shawn
Code Creator

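Truncation: Dropping the fractional part of a number, which rounds it toward zero. (12.6 becomes 12, and -12.6 becomes -12.)

Rounding: Bringing a number to the nearest whole number. (If the first digit after the point is 0, 1, 2, 3, or 4, the fractional part is dropped. With 5+, the number moves one step away from zero.)

Flooring: Rounding a value down. (Bringing it to the floor, as I like to think.)

Ceiling: Rounding a value up. (Raising it up to the ceiling.)

For example:

Number:       -12.4    12.6   -12.6    12.4
Flooring:     -13      12     -13      12
Ceiling:      -12      13     -12      13
Truncating:   -12      12     -12      12
Rounding:     -12      13     -13      12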
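
These rounding methods can be tried out with Python's standard math module. This is just a quick sketch; note that Python's built-in round() sends exact .5 ties to the nearest even number, which does not matter for the four values used here:

```python
import math

values = [-12.4, 12.6, -12.6, 12.4]
print([math.floor(v) for v in values])  # [-13, 12, -13, 12]
print([math.ceil(v) for v in values])   # [-12, 13, -12, 13]
print([math.trunc(v) for v in values])  # [-12, 12, -12, 12]
print([round(v) for v in values])       # [-12, 13, -13, 12]
```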

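Not a Number (NaN)
A NaN has all exponent bits set to 1 and a nonzero mantissa.
Types of NaNs
Quiet NaN (QNaN): A NaN that simply results from an undefined or erroneous calculation and passes quietly through later calculations. Examples are 0x7FC00000 and 0x7FFFFFFF; the latter is usually the highest number in a signed 32-bit integer system, but here, it's an error.
Signalling NaN (SNaN): Used for either debugging purposes or flagging illegal program operations, since using one in a calculation signals an invalid-operation exception. On most hardware (PowerPC and x86 included) the top mantissa bit is 0 for an SNaN and 1 for a QNaN, so an SNaN might be 0x7FA00000.

Special Operations in IEEE 754:
Finite number/(+/-)Infinity = (+/-)0
(+/-)Infinity*(+/-)Infinity = (+/-)Infinity
(+/-)Nonzero number/0 = (+/-)Infinity
(+/-)Infinity/0 = (+/-)Infinity
(+/-)0/(+/-)0 = NaN
Infinity-Infinity = NaN
(+/-)Infinity/(+/-)Infinity = NaN
(+/-)0*(+/-)Infinity = NaN

Special Numbers in IEEE 754:
0x7F800000 = Infinity
0xFF800000 = -Infinity
0x7FC00000 = QNaN (one of many possible NaN patterns)
0x80000000 = Negative Zero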
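
A few of these special values and operations can be poked at from Python. This is only a sketch: float_from_bits and bits_from_float are helper names I made up, built on the standard struct module. Also keep in mind that Python raises an exception on float division by zero instead of giving the IEEE 754 default result, so those cases cannot be shown directly here:

```python
import math
import struct

def float_from_bits(bits):
    """Reinterpret a 32-bit integer pattern as a float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def bits_from_float(value):
    """Return the 32-bit pattern of a float as an integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

print(float_from_bits(0x7F800000))              # inf
print(float_from_bits(0xFF800000))              # -inf
print(math.isnan(float_from_bits(0x7FC00000)))  # True
print(hex(bits_from_float(-0.0)))               # 0x80000000

inf = math.inf
print(5.0 / inf)                                # 0.0
print(inf * -inf)                               # -inf
print(math.isnan(inf - inf))                    # True

# Python raises here rather than returning Infinity or NaN.
try:
    1.0 / 0.0
except ZeroDivisionError:
    print("ZeroDivisionError")
```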
