Table of Contents for Programming Languages: a survey


from John Bode [1] [2]:


Here are several different ways you can interpret the values of three bits:

Bits    Unsigned    Sign-Magnitude    1's Complement    2's Complement
----    --------    -------------     --------------    -------------- 
 000           0                0                  0                 0
 001           1                1                  1                 1
 010           2                2                  2                 2
 011           3                3                  3                 3
 100           4               -0                 -3                -4
 101           5               -1                 -2                -3
 110           6               -2                 -1                -2
 111           7               -3                 -0                -1


Floating point

floating point:

IEEE 754-1985, IEEE 754-2008

"several fallacies, the first being that all IEEE 754 systems must deliver identical results for the same program. We have focused on differences between extended-based systems and single/double systems, but there are further differences among systems within each of these families. For example, some single/double systems provide a single instruction to multiply two numbers and add a third with just one final rounding. This operation, called a fused multiply-add, can cause the same program to produce different results across different single/double systems, and, like extended precision, it can even cause the same program to produce different results on the same system depending on whether and when it is used." -- What Every Computer Scientist Should Know About Floating-Point Arithmetic (see also C 2011, ISO/IEC 9899:201x)

the difficulties of floating point determinism:

some notes on 64-bit floats:

proposed alternatives to IEEE754 floating point:

"Go on, ask me what the largest integer is, such that it and all smaller integers can be stored in IEEE 64-bit doubles without losing precision. An IEEE 64-bit double has 52 bits of mantissa, so I think it's 2^53..."


" Floating point numbers are the optimal minimum message length method of representing reals with an improper Jeffreys prior distribution. A Jeffreys prior is a prior that is invariant under reparameterization, which is a mandatory property for approximating the reals.

In this case, it is where Prob(log(x)) is proportional to a constant. " -- lenticular

Asymmetry between inequality signs

" Q: Why do we need four different ordered comparisons? Wouldn't < and <= suffice with appropriately swapped operands? A: No, because for floating-point comparisons (x < y) is not the same as not (x >= y) in the presence of NaNs. " -- [5]

Printing floating point numbers (dtoa)

Software floating point implementations

De-facto IEEE 754

A lot of things have a headline claim to provide IEEE 754, but in the footnotes admit that they implement only part of the standard. In my opinion the fault lies with IEEE 754; it requires more functionality than most projects need, so of course people implement only a subset. IEEE 754 appears to be de-facto partitionable into a common core that most implementations provide, and extensions that many implementations do not provide; the standard should be updated to specify such extensions/profiles.

IEEE 754 does in fact distinguish 'recommended' from 'required' functionality (e.g. log, sin, and cos are recommended but not required), but many platforms don't directly implement even all of the 'required' functionality, which is why I think there is, de-facto, an even smaller common core.

Even platforms that consider themselves compliant often don't exactly comply with the wording of the IEEE 754 standard, which requires a large number of separate operations; implementations tend to omit these as long as there is a short way to reproduce their effect with the provided primitives. For example, the standard requires both a 'class' function and an 'isInfinite' predicate, but in practice you can tell whether something is infinite by seeing what 'class' returns for it, so RISC-V provides an FCLASS instruction but has no isInfinite. (However, some might argue that the standard countenances this sort of thing: although the standard says "A conforming implementation of a supported arithmetic format shall provide all the operations of this standard defined in Clause 5", it also says, "In this standard, operations are written as named functions; in a specific programming environment they might be represented by operators, or by families of format-specific functions, or by operations or functions whose names might differ from those in this standard".)

Another problem is that it costs money to purchase an official copy of the standard.

So really, in my opinion, what is needed is a new standard for floating point that is:

Until that time, this section contains notes on which subsets of IEEE 754 are implemented by various platforms, in hopes of helping the reader to identify what they think the common core is/should be.


The WebAssembly Spec 1.0 states:

" Floating-point arithmetic follows the IEEE 754-2008 standard, with the following qualifications:

ARM Cortex M4F

The ARM Cortex M4 Technical Reference Manual says:

" 7.2.5. Complete implementation of the IEEE 754 standard

The Cortex-M4F floating point instruction set does not support all operations defined in the IEEE 754-2008 standard. Unsupported operations include, but are not limited to the following:

The Cortex-M4 FPU supports fused MAC operations as described in the IEEE standard. For complete implementation of the IEEE 754-2008 standard, floating-point functionality must be augmented with library functions. "

ARM Cortex-A9 NEON

[Cortex-A9 NEON Media Processing Engine Technical Reference Manual] says:

" 2.3. IEEE754 standard compliance

The IEEE754 standard provides a number of implementation choices. The ARM Architecture Reference Manual describes the choices that apply to the Advanced SIMD and VFPv3 architectures.

The Cortex-A9 NEON MPE implements the ARMv7 Advanced SIMD and VFP extensions. It does not provide hardware support for the following IEEE754 operations:

    round floating-point number to nearest integer-valued in floating-point number
    binary-to-decimal conversion
    decimal-to-binary conversion
    direct comparison of single-precision and double-precision values
    any extended-precision operations. "


The Java Virtual Machine Specification, Java SE 10 Edition says:

" 2.8.1. Java Virtual Machine Floating-Point Arithmetic and IEEE 754

The key differences between the floating-point arithmetic supported by the Java Virtual Machine and the IEEE 754 standard are:

    The floating-point operations of the Java Virtual Machine do not throw exceptions, trap, or otherwise signal the IEEE 754 exceptional conditions of invalid operation, division by zero, overflow, underflow, or inexact. The Java Virtual Machine has no signaling NaN value.
    The Java Virtual Machine does not support IEEE 754 signaling floating-point comparisons.
    The rounding operations of the Java Virtual Machine always use IEEE 754 round to nearest mode. Inexact results are rounded to the nearest representable value, with ties going to the value with a zero least-significant bit. This is the IEEE 754 default mode. But Java Virtual Machine instructions that convert values of floating-point types to values of integral types round toward zero. The Java Virtual Machine does not give any means to change the floating-point rounding mode.
    The Java Virtual Machine does not support either the IEEE 754 single extended or double extended format, except insofar as the double and double-extended-exponent value sets may be said to support the single extended format. The float-extended-exponent and double-extended-exponent value sets, which may optionally be supported, do not correspond to the values of the IEEE 754 extended formats: the IEEE 754 extended formats require extended precision as well as extended exponent range."


The software floating point library Qfplib-m3 for ARM Cortex M3 says:

" Implementation of the IEEE 754 standard

Qfplib correctly treats signed zeros, denormals, infinities and NaNs according to the IEEE 754 standard. The results of the addition, subtraction, multiplication, division and square root operations are correctly rounded (to nearest, even-on-tie). Other rounding modes and traps are not supported. "

Qfplib (M0)

The software floating point library Qfplib for ARM Cortex M0 says:

" Limitations and deviations from the IEEE 754 standard

Except as noted below, on input and output, NaNs are converted to infinities, denormals are flushed to zero, and negative zero is converted to positive zero. The result of the square root function is not always correctly rounded according to IEEE 754; see the next section for more on function accuracy. "

C language

The December 2, 2010 Committee Draft of the C language standard has an Annex F starting on page 503 (PDF page 521) discussing IEEE 754 (IEC 60559, which is identical) arithmetic.

Some excerpts:

" F.2.1 Infinities, signed zeros, and NaNs This specification does not define the behavior of signaling NaNs. 346) It generally uses the term NaN to denote quiet NaNs. The NAN and INFINITY macros and the nan functions in <math.h> provide designations for IEC 60559 NaNs and infinities. "

Sneftel's comment

In a Stackoverflow answer, a user named Sneftel says:

" IEEE-754 except for blah. That is, they mostly implement 754, but cheap out on some of the more expensive and/or fiddly bits.

The most common cheap-outs:

BUUUUT... even those except for blah architectures still use IEEE-754's representation of numbers. Other than byte ordering issues, the bits describing a float or double on architecture A are essentially guaranteed to have the same meaning on architecture B.

So as long as all you care about is the representation of values, you're totally fine. If you care about cross-platform consistency of operations, you may need to do some extra work.

EDIT: As Chux mentions in the comments, a common extra source of inconsistency between platforms is the use of extended precision, such as the x87's 80-bit internal representation. That's the opposite of a cheap-out, and (with proper treatment) fully conforms to both IEEE-754 and the C standard, but it will likewise cause results to differ between architectures, and even between compiler versions and following apparently minor and unrelated code changes. However: a particular x86/x64 executable will NOT produce different results on different processors due to extended precision. "


Other IEEE 754 implementation compliance links


In Perl6:


IEEE 754-2008 decimal64


Crockford's Dec64 proposal



Posits and Unums

John Gustafson has proposed two new number formats: Unum [7] [8] [9] [10] [11] and Posit [12] [13].

A response from Kahan of IEEE 754 fame:


" smaddox 9 months ago

It's interesting that booleans are stored as 0 or -1 [1]. I wonder what drove that decision. I just checked, and Rust's LLVM stores booleans as 0 or 1.


KMag 9 months ago

Well, for one, if true is -1 (using two's complement), then a bunch of conditionals and bit-twiddling become simpler. For instance:

    x = some_bool ? a : b;

can be compiled as a branchless:

    x = (some_bool & a) | (~some_bool & b);

This is why Forth also represents true as -1.

pcwalton 9 months ago

Also why SIMD fairly universally uses -1 for true.

C arguably picked the wrong semantics here. Hardware picks up the slack, however, with the setcc instructions on x86, csinc on AArch64, etc. " -- [14]

Note: " (Big parentheses: Until Lua 4.0, all order operators were translated to a single one, by translating a <= b to not (b < a). However, this translation is incorrect when we have a partial order, that is, when not all elements in our type are properly ordered. For instance, floating-point numbers are not totally ordered in most machines, because of the value Not a Number (NaN). According to the IEEE 754 standard, currently adopted by virtually every hardware, NaN represents undefined values, such as the result of 0/0. The standard specifies that any comparison that involves NaN should result in false. That means that NaN <= x is always false, but x < NaN is also false. That implies that the translation from a <= b to not (b < a) is not valid in this case.) " --

" . "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991), David Goldberg (every computer scientist should read this)
    . 'extracts from an expansion upon David Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic"' (really lazy ones could just read this)
    . "Floating Point Arithmetic: Issues and Limitations", from the Python docs
    . 'Musings on Lua Integer and Enum Support'. May be worth reading for some people. It has a slightly different opinion on the essentiality of double-precision FP than this text. Also tries to find middle ground for those non-FPU CPUs. -- AskoKauppi
    . "What Every Programmer Should Know About Floating-Point Arithmetic, or Why don't my numbers add up?" "

--- extracted from [15]

some annoying things about (IEEE 754) floating-point arithmetic:


See also number constructs.