WAV files can store PCM audio (WAVE_FORMAT_PCM). The WAV file format specification says:
The data format and maximum and minimums values for PCM waveform samples of various sizes are as follows:
Sample Size | Data Format | Maximum Value | Minimum Value |
---|---|---|---|
One to eight bits | Unsigned integer | 255 (0xFF) | 0 |
Nine or more bits | Signed integer i | Largest positive value of i | Most negative value of i |

For example, the maximum, minimum, and midpoint values for 8-bit and 16-bit PCM waveform data are as follows:

Format | Maximum Value | Minimum Value | Midpoint Value |
---|---|---|---|
8-bit PCM | 255 (0xFF) | 0 | 128 (0x80) |
16-bit PCM | 32767 (0x7FFF) | -32768 (-0x8000) | 0 |

Both the signed and unsigned formats are asymmetrical: each has one more code below the midpoint than above it. How should the asymmetry be handled? The signed version is two's complement representation, and AES17 defines the meaning of full-scale amplitude in this case:
amplitude of a 997-Hz sine wave whose positive peak value reaches the positive digital full scale, leaving the negative maximum code unused.
NOTE In 2's-complement representation, the negative peak is 1 LSB away from the negative maximum code.
As does IEC 61606-3:
amplitude of a 997 Hz sinusoid whose peak positive sample just reaches positive digital full-scale (in 2’s-complement a binary value of 0111…1111 to make up the word length) and whose peak negative sample just reaches a value one away from negative digital full-scale (1000…0001 to make up the word length) leaving the maximum negative code (1000…0000) unused
So, for example, for 16-bit audio, a signal that just reaches +32,767 and −32,767 would be full-scale, while one that reaches −32,768 exceeds full-scale.
The midpoint example for 8-bit (128, not 127.5) shows that unsigned data has the same asymmetry as signed data. So, for 8-bit data, a signal that reaches from 1 to 255 would be full-scale, and the value 0 exceeds full-scale.
WAVE Audio File Format Specifications says:
For float data, full scale is 1.
So, to correctly convert signed ints to float, divide by `2**(b-1) - 1`, where `b` is the number of bits.
To correctly convert unsigned ints to float, subtract `2**(b-1)`, then, similarly, divide by `2**(b-1) - 1`.
The float representation will then be limited to +1.0 full-scale in the positive direction, but can exceed −1.0 full-scale in the negative direction.
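As a minimal sketch of the conversion rules above: the function name `pcm_to_float` is hypothetical, and the code assumes the WAV rule that samples of 8 bits or fewer are unsigned while wider samples are signed.

```python
def pcm_to_float(value, bits):
    """Convert one integer PCM sample to float with positive full scale at +1.0.

    Hypothetical helper for illustration; `value` is the sample after it has
    been extracted from its container, `bits` is the sample size.
    """
    if bits < 2:
        # With 1 bit the divisor 2**(b-1) - 1 would be zero.
        raise ValueError("this sketch needs at least 2 bits per sample")
    midpoint = 2 ** (bits - 1)
    full_scale = midpoint - 1
    if bits <= 8:
        # Unsigned data: remove the midpoint offset first.
        value -= midpoint
    return value / full_scale


print(pcm_to_float(255, 8))      # positive full scale for 8-bit
print(pcm_to_float(0, 8))        # exceeds negative full scale
print(pcm_to_float(-32768, 16))  # exceeds negative full scale
```

The two extreme negative inputs come out slightly below −1.0, which is the behaviour described above rather than an error.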
WAV format actually allows for fewer than 8 bits per sample, stored in a larger container:
The bits that represent the sample amplitude are stored in the most significant bits of i, and the remaining bits are set to zero.
So I'll show 2-bit audio first (wBitsPerSample = 2), because it's simpler to follow:
WAV | Sample | int | float | Comment |
---|---|---|---|---|
0xC0 | 0b11 | 3 | +1.0 | full-scale |
0x80 | 0b10 | 2 | 0.0 | midpoint |
0x40 | 0b01 | 1 | −1.0 | full-scale |
0x00 | 0b00 | 0 | −2.0 | exceeds full-scale |
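The table can be reproduced with a short sketch. It assumes the sample sits in the most significant bits of a one-byte container, as the quoted rule says; the function name `decode_small_unsigned` is made up for this example.

```python
def decode_small_unsigned(container_byte, bits):
    """Return (sample, float_value) for an unsigned sample of 2..8 bits.

    Hypothetical helper: the sample occupies the top `bits` bits of the
    byte and the remaining low bits are expected to be zero.
    """
    if not 2 <= bits <= 8:
        raise ValueError("this sketch handles 2 to 8 bits per sample")
    sample = container_byte >> (8 - bits)  # drop the zero padding bits
    midpoint = 2 ** (bits - 1)
    full_scale = midpoint - 1
    return sample, (sample - midpoint) / full_scale


for raw in (0xC0, 0x80, 0x40, 0x00):
    sample, value = decode_small_unsigned(raw, 2)
    print(f"0x{raw:02X} | 0b{sample:02b} | {sample} | {value}")
```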
For 8-bit audio, as mentioned above, 255 is full-scale, 128 is midpoint, 1 is negative full-scale, and 0 exceeds full-scale:
WAV | Sample | int | float | Comment |
---|---|---|---|---|
0xFF | 0b1111_1111 | 255 | +1.000 | full-scale |
0xFE | 0b1111_1110 | 254 | +0.992 | |
0xFD | 0b1111_1101 | 253 | +0.984 | |
... | ... | ... | ... | |
0x82 | 0b1000_0010 | 130 | +0.016 | |
0x81 | 0b1000_0001 | 129 | +0.008 | |
0x80 | 0b1000_0000 | 128 | 0.000 | midpoint |
0x7F | 0b0111_1111 | 127 | −0.008 | |
0x7E | 0b0111_1110 | 126 | −0.016 | |
... | ... | ... | ... | |
0x03 | 0b0000_0011 | 3 | −0.984 | |
0x02 | 0b0000_0010 | 2 | −0.992 | |
0x01 | 0b0000_0001 | 1 | −1.000 | full-scale |
0x00 | 0b0000_0000 | 0 | −1.008 | exceeds full-scale |
For 16-bit audio, the interpretation is signed:
WAV | Sample | int | float | Comment |
---|---|---|---|---|
0x7FFF | 0b0111_1111_1111_1111 | +32,767 | +1.00000 | full-scale |
0x7FFE | 0b0111_1111_1111_1110 | +32,766 | +0.99997 | |
0x7FFD | 0b0111_1111_1111_1101 | +32,765 | +0.99994 | |
... | ... | ... | ... | |
0x0002 | 0b0000_0000_0000_0010 | +2 | +0.00006 | |
0x0001 | 0b0000_0000_0000_0001 | +1 | +0.00003 | |
0x0000 | 0b0000_0000_0000_0000 | 0 | 0.00000 | midpoint |
0xFFFF | 0b1111_1111_1111_1111 | −1 | −0.00003 | |
0xFFFE | 0b1111_1111_1111_1110 | −2 | −0.00006 | |
... | ... | ... | ... | |
0x8003 | 0b1000_0000_0000_0011 | −32,765 | −0.99994 | |
0x8002 | 0b1000_0000_0000_0010 | −32,766 | −0.99997 | |
0x8001 | 0b1000_0000_0000_0001 | −32,767 | −1.00000 | full-scale |
0x8000 | 0b1000_0000_0000_0000 | −32,768 | −1.00003 | exceeds full-scale |
As is 9-bit audio:
WAV | Sample | int | float | Comment |
---|---|---|---|---|
0x7F80 | 0b0111_1111_1 | +255 | +1.000 | full-scale |
0x7F00 | 0b0111_1111_0 | +254 | +0.996 | |
0x7E80 | 0b0111_1110_1 | +253 | +0.992 | |
... | ... | ... | ... | |
0x0100 | 0b0000_0001_0 | +2 | +0.008 | |
0x0080 | 0b0000_0000_1 | +1 | +0.004 | |
0x0000 | 0b0000_0000_0 | 0 | 0.000 | midpoint |
0xFF80 | 0b1111_1111_1 | −1 | −0.004 | |
0xFF00 | 0b1111_1111_0 | −2 | −0.008 | |
... | ... | ... | ... | |
0x8180 | 0b1000_0001_1 | −253 | −0.992 | |
0x8100 | 0b1000_0001_0 | −254 | −0.996 | |
0x8080 | 0b1000_0000_1 | −255 | −1.000 | full-scale |
0x8000 | 0b1000_0000_0 | −256 | −1.004 | exceeds full-scale |
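For the signed case, here is a sketch of how a 9-bit sample might be pulled out of its 16-bit container. It assumes the container is a little-endian signed 16-bit word (WAV data is little-endian) with the sample in the most significant bits; `decode_signed_in_16` is a hypothetical name.

```python
import struct


def decode_signed_in_16(raw_bytes, bits):
    """Return (sample, float_value) for a signed sample of 9..16 bits.

    Hypothetical helper: `raw_bytes` is the 2-byte little-endian container.
    """
    if not 9 <= bits <= 16:
        raise ValueError("this sketch handles 9 to 16 bits per sample")
    (word,) = struct.unpack("<h", raw_bytes)  # signed 16-bit, little-endian
    # Python's >> on a negative int is an arithmetic shift, so the sign
    # is preserved while the zero padding bits are dropped.
    sample = word >> (16 - bits)
    return sample, sample / (2 ** (bits - 1) - 1)


# 0x7F80, 0x8080 and 0x8000 from the table, written in file byte order.
for raw in (b"\x80\x7f", b"\x80\x80", b"\x00\x80"):
    print(decode_signed_in_16(raw, 9))
```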
@gavingc Basically there are two different interpretations of PCM data, used when converting to float. The two standards I listed above treat the largest positive value as full-scale, so it maps to +1.0, and the most negative number exceeds full-scale and is <−1.0. However, other sources (Android spec, USB Audio spec) interpret PCM as a fixed-point number with the binary point after the first bit. In that reading the most negative number is full-scale and maps to −1.0, so the most positive number is less than full-scale, and is <+1.0.
Do you have examples of other audio standards that this can be compared with?
In my code, I'm supporting both interpretations. For the one that allows negative values to exceed full-scale, I'm just keeping the values as they are, so the most negative code gets a float value more negative than −1.0. scipy/scipy#12507
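To make the difference concrete, here is a rough sketch of the two interpretations for signed samples. It is not the code from the linked PR; the function name and the `convention` strings are invented for this example.

```python
def signed_to_float(sample, bits, convention="positive-peak"):
    """Scale a signed PCM sample to float under one of two conventions.

    Hypothetical helper:
    - "positive-peak": largest positive code maps to +1.0 (AES17 / IEC
      61606-3 style); the most negative code ends up below -1.0.
    - "fixed-point": most negative code maps to -1.0; the largest
      positive code ends up just below +1.0.
    """
    if convention == "positive-peak":
        divisor = 2 ** (bits - 1) - 1
    elif convention == "fixed-point":
        divisor = 2 ** (bits - 1)
    else:
        raise ValueError(f"unknown convention: {convention!r}")
    return sample / divisor


for name in ("positive-peak", "fixed-point"):
    print(name, signed_to_float(32767, 16, name), signed_to_float(-32768, 16, name))
```

Neither branch clips, so the out-of-range value survives the conversion and the caller decides what to do with it.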