In this article we take a closer look at how floating-point numbers are stored in the Go language.

What is the result of the following simple program computing 0.3 + 0.6? Some would naively expect 0.9, but the actual output is 0.8999999999999999 (Go 1.13.5):
```go
var f1 float64 = 0.3
var f2 float64 = 0.6
fmt.Println(f1 + f2)
```
The problem is that most decimal fractions are approximate and infinite when expressed in binary. Take 0.1 as an example. It may be one of the simplest decimal numbers you can think of, but in binary it looks complicated: 0.0001100110011001100... It is an infinitely repeating sequence (how to convert to binary is described later).
The absurdity of this result tells us that we must deeply understand how floating-point numbers are stored in computers, and their nature, in order to handle numeric calculations correctly.
Golang, like many other languages (C, C++, Python), uses the IEEE-754 standard to store floating-point numbers.
The IEEE-754 standard stores floating-point numbers in a special base-2 scientific notation.
Decimal number | Scientific notation | Exponent form | Coefficient | Base | Exponent | Fraction |
---|---|---|---|---|---|---|
700 | 7e+2 | 7 * 10^2 | 7 | 10 | 2 | 0 |
4,900,000,000 | 4.9e+9 | 4.9 * 10^9 | 4.9 | 10 | 9 | .9 |
5362.63 | 5.36263e+3 | 5.36263 * 10^3 | 5.36263 | 10 | 3 | .36263 |
-0.00345 | -3.45e-3 | -3.45 * 10^-3 | 3.45 | 10 | -3 | .45 |
0.085 | 1.36 * 2^-4 | 1.36 * 2^-4 | 1.36 | 2 | -4 | .36 |
Differences between 32-bit single-precision and 64-bit double-precision floating-point numbers
Precision | Sign | Exponent | Fraction | Bias |
---|---|---|---|---|
Single (32 Bits) | 1 [31] | 8 [30-23] | 23 [22-00] | 127 |
Double (64 Bits) | 1 [63] | 11 [62-52] | 52 [51-00] | 1023 |
Sign bit: 1 means negative, 0 means positive.

Exponent bits: store the actual exponent plus the bias; the bias is what makes it possible to express negative exponents.

Fraction bits: store the fractional part of the coefficient, exactly or as the closest representable value.
Take the number 0.085 for example.
Sign bit | Exponent bits (123) | Fraction bits (.36) |
---|---|---|
0 | 0111 1011 | 010 1110 0001 0100 0111 1011 |
Take 0.36 for example: the fraction bits 010 1110 0001 0100 0111 1011 encode 0.36 (the first bit represents 1/2, the second 1/4, and so on).
The calculation steps after decomposition are:
Bit | Value | Fraction | Decimal | Total |
---|---|---|---|---|
2 | 4 | 1⁄4 | 0.25 | 0.25 |
4 | 16 | 1⁄16 | 0.0625 | 0.3125 |
5 | 32 | 1⁄32 | 0.03125 | 0.34375 |
6 | 64 | 1⁄64 | 0.015625 | 0.359375 |
11 | 2048 | 1⁄2048 | 0.00048828125 | 0.35986328125 |
13 | 8192 | 1⁄8192 | 0.0001220703125 | 0.3599853515625 |
17 | 131072 | 1⁄131072 | 0.00000762939453 | 0.35999298095703 |
18 | 262144 | 1⁄262144 | 0.00000381469727 | 0.3599967956543 |
19 | 524288 | 1⁄524288 | 0.00000190734863 | 0.35999870300293 |
20 | 1048576 | 1⁄1048576 | 0.00000095367432 | 0.35999965667725 |
22 | 4194304 | 1⁄4194304 | 0.00000023841858 | 0.35999989509583 |
23 | 8388608 | 1⁄8388608 | 0.00000011920929 | 0.36000001430512 |
Displaying Floating-Point Bits in Go - Verifying the Theory
math.Float32bits returns the binary representation of a number for us. The code below prints the binary representation of 0.085, then works backwards from that representation to recover the original decimal 0.085, verifying the theory above.
```go
package main

import (
	"fmt"
	"math"
)

func main() {
	var number float32 = 0.085
	fmt.Printf("Starting Number: %f\n\n", number)

	// Float32bits returns the IEEE 754 binary representation.
	bits := math.Float32bits(number)
	binary := fmt.Sprintf("%.32b", bits)

	// Print the bit pattern split into sign | exponent | fraction.
	fmt.Printf("Bit Pattern: %s | %s %s | %s %s %s %s %s %s\n\n",
		binary[0:1],
		binary[1:5], binary[5:9],
		binary[9:12], binary[12:16], binary[16:20],
		binary[20:24], binary[24:28], binary[28:32])

	bias := 127
	sign := bits >> 31                    // top bit: 0 positive, 1 negative
	exponentRaw := int(bits >> 23 & 0xFF) // the 8 exponent bits
	exponent := exponentRaw - bias        // remove the bias

	// Sum the contributions of the 23 fraction bits: bit n adds 1/2^n.
	var mantissa float64
	for index, bit := range binary[9:32] {
		if bit == '1' {
			position := index + 1
			bitValue := math.Pow(2, float64(position))
			fractional := 1 / bitValue
			mantissa = mantissa + fractional
		}
	}

	// Restore the value: (1 + fraction) × 2^exponent.
	value := (1 + mantissa) * math.Pow(2, float64(exponent))

	fmt.Printf("Sign: %d Exponent: %d (%d) Mantissa: %f Value: %f\n\n",
		sign,
		exponentRaw,
		exponent,
		mantissa,
		value)
}
```
Output:

```
Starting Number: 0.085000

Bit Pattern: 0 | 0111 1011 | 010 1110 0001 0100 0111 1011

Sign: 0 Exponent: 123 (-4) Mantissa: 0.360000 Value: 0.085000
```
Classic question: how can you tell whether a floating-point number actually stores an integer?

Think about it for 10 seconds....

The following code determines whether a floating-point number is an integer. Let's analyze the function line by line; it will deepen our understanding of floating-point numbers.
```go
// IsInt reports whether the float32 whose IEEE-754 bits are given
// stores an integer value. bias is 127 for single precision.
func IsInt(bits uint32, bias int) {
	// Actual exponent minus 23: how far the binary point sits
	// past the last fraction bit.
	exponent := int(bits>>23) - bias - 23
	// Fraction bits with the implicit leading 1 restored.
	coefficient := (bits & ((1 << 23) - 1)) | (1 << 23)
	// The low -exponent bits would remain below the binary point;
	// they must all be zero for the value to be an integer.
	intTest := coefficient & (1<<uint32(-exponent) - 1)

	fmt.Printf("\nExponent: %d Coefficient: %d IntTest: %d\n",
		exponent,
		coefficient,
		intTest)

	if exponent < -23 {
		fmt.Printf("NOT INTEGER\n")
		return
	}
	if exponent < 0 && intTest != 0 {
		fmt.Printf("NOT INTEGER\n")
		return
	}
	fmt.Printf("INTEGER\n")
}
```
For a floating-point number to store an integer, an important condition is that the stored exponent be at least 127: a stored exponent greater than 127 means the actual exponent is greater than 0, and vice versa. Let's take the number 234523 as an example:
```
Starting Number: 234523.000000

Bit Pattern: 0 | 1001 0000 | 110 0101 0000 0110 1100 0000

Sign: 0 Exponent: 144 (17) Mantissa: 0.789268 Value: 234523.000000

Exponent: -6 Coefficient: 15009472 IntTest: 0
INTEGER
```
The first step is to compute the exponent. Since 23 is subtracted, the threshold in the first check is exponent < -23.

```go
exponent := int(bits>>23) - bias - 23
```
Step 2: bits & ((1 << 23) - 1) extracts the fraction bits.

```go
coefficient := (bits & ((1 << 23) - 1)) | (1 << 23)
```
```
Bits:                   01001000011001010000011011000000
(1 << 23) - 1:          00000000011111111111111111111111
bits & ((1 << 23) - 1): 00000000011001010000011011000000
```
ORing with (1 << 23) restores the implicit leading 1 in front of the fraction:

```
bits & ((1 << 23) - 1): 00000000011001010000011011000000
(1 << 23):              00000000100000000000000000000000
coefficient:            00000000111001010000011011000000
```

1 + fraction = coefficient.
The third step computes intTest. The value is an integer only if the exponent is large enough to lift every set fraction bit above the binary point. Here the actual exponent is 17, which cannot lift the last 23 - 17 = 6 fraction bits (bit 18 and beyond, i.e. 1/2^18 and smaller); those 6 bits must all be 0 for the value to be an integer. In this example they are, so 234523 is an integer:

```
exponent: (144 - 127 - 23) = -6
1 << uint32(-exponent):       1000000
(1 << uint32(-exponent)) - 1:  111111

coefficient:                  00000000111001010000011011000000
(1 << uint32(-exponent)) - 1: 00000000000000000000000000111111
intTest:                      00000000000000000000000000000000
```
Wikipedia explains:
In computing, a normal number is a non-zero number in a floating-point representation which is within the balanced range supported by a given floating-point format: it is a floating-point number that can be represented without leading zeros in its significand.

What does that mean? The exponent field in IEEE-754 carries a bias, which is what allows negative exponents to be expressed. For 0.085 in single precision, for example, the actual exponent is -4, while 123 is stored in the exponent field.
So there is a lower limit to the exponent that can be expressed. For single precision that limit is 2^-126. Anything smaller, 2^-127 for example, has to be written as 0.1 * 2^-126 (in binary), so the coefficient no longer has a leading 1. Such numbers are called denormal (or subnormal) numbers.

Numbers whose coefficient keeps the leading 1 are called normal numbers.
Precision is a complex concept; what we discuss here is the decimal precision of binary floating-point numbers.

A precision of D means: within some range, if we convert a D-digit decimal number (expressed in scientific notation) to binary and then convert that binary back to a D-digit decimal, no data is lost. That range then has D digits of precision.

Precision matters because when data is converted from one representation to the other, it is matched not exactly but to the nearest representable number.
For the time being, we will not go into any further discussion here, but will draw a conclusion:
- float32 has a precision of 6-8 decimal digits.
- float64 has a precision of 15-17 decimal digits.
- Precision is dynamic and varies from range to range. A simple hint: the powers of 2 and the powers of 10 do not line up evenly.
This article introduced how floating-point numbers are stored under the IEEE-754 standard used by the Go language, using snippets of real code and a classic brainteaser to help readers understand that storage, and presented two important concepts: normal numbers and precision.