# proj-plbook-plChArithmeticImpl

### Integers

from John Bode [1] [2]:

"

Here are several different ways you can interpret the values of three bits:

```
Bits    Unsigned    Sign-Magnitude    1's Complement    2's Complement
----    --------    -------------     --------------    --------------
000           0                0                  0                 0
001           1                1                  1                 1
010           2                2                  2                 2
011           3                3                  3                 3
100           4               -0                 -3                -4
101           5               -1                 -2                -3
110           6               -2                 -1                -2
111           7               -3                 -0                -1
```

"

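The four interpretations in John Bode's table above can be reproduced mechanically. The following is a minimal Python sketch; the function name `interpret` and the choice to return every value as a string (so that "-0" can be shown) are assumptions made for illustration and are not part of the quoted answer.

```python
def interpret(bits: int, width: int = 3) -> dict:
    """Interpret a `width`-bit pattern under four integer representations."""
    mask = (1 << width) - 1                 # e.g. 0b111 for width 3
    sign = bits >> (width - 1)              # top bit of the pattern
    magnitude = bits & (mask >> 1)          # remaining low bits
    return {
        "unsigned": str(bits),
        # Sign-magnitude: top bit is the sign, the rest is the magnitude.
        "sign_magnitude": ("-" if sign else "") + str(magnitude),
        # One's complement: a negative value is the bitwise inverse of its magnitude.
        "ones_complement": "-" + str(~bits & mask) if sign else str(bits),
        # Two's complement: the top bit carries weight -(2**(width-1)).
        "twos_complement": str(bits - (1 << width) if sign else bits),
    }

# Print one row per bit pattern, in the same column order as the table.
for pattern in range(8):
    row = interpret(pattern)
    print(format(pattern, "03b"), *(value.rjust(3) for value in row.values()))
```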
### Floating point

Relevant standards: IEEE 754 (1985, revised 2008) and ISO/IEC 10967:

https://en.wikipedia.org/wiki/IEEE_754-2008

https://en.wikipedia.org/wiki/ISO/IEC_10967

"several fallacies, the first being that all IEEE 754 systems must deliver identical results for the same program. We have focused on differences between extended-based systems and single/double systems, but there are further differences among systems within each of these families. For example, some single/double systems provide a single instruction to multiply two numbers and add a third with just one final rounding. This operation, called a fused multiply-add, can cause the same program to produce different results across different single/double systems, and, like extended precision, it can even cause the same program to produce different results on the same system depending on whether and when it is used." -- What Every Computer Scientist Should Know About Floating-Point Arithmetic

http://www.lsi.upc.edu/~robert/teaching/master/material/p5-goldberg.pdf
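The effect of a fused multiply-add described in the quote above can be imitated without special hardware. The sketch below is an illustration only: it assumes Python floats are IEEE 754 doubles, and it simulates the single final rounding by computing the result exactly with `fractions.Fraction` and converting to float once. The operand values are chosen for this example and do not come from the paper.

```python
from fractions import Fraction

# Operands chosen so that the exact product a*b = 1 - 2**-60
# is not representable as a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Two roundings: a*b is rounded to a double (giving 1.0), then the sum is rounded.
unfused = a * b + c

# One rounding: compute a*b + c exactly, then round once on conversion to float.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print(unfused)                 # 0.0
print(fused == -(2.0**-60))    # True: the single rounding keeps the tiny residue
```

The same source expression therefore yields 0.0 or -(2**-60) depending on whether the multiply and add are fused, which is the kind of cross-system difference the quote describes.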

the difficulties of floating point determinism:

some notes on 64-bit floats:

proposed alternatives to IEEE754 floating point:

• posits: see "Beating Floats at Their Own Game" by Gustafson at [3]

"Go on, ask me what the largest integer is, such that it and all smaller integers can be stored in IEEE 64-bit doubles without losing precision. An IEEE 64-bit double has 52 bits of mantissa, so I think it's 2^53..."

[4]
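The 2^53 figure in the quote above can be checked directly. This is a small sketch that assumes Python floats are IEEE 754 64-bit doubles; the variable name `limit` is just for illustration.

```python
import sys

limit = 2**53

# 52 stored mantissa bits plus one implicit leading bit give 53 bits of precision.
print(sys.float_info.mant_dig)          # 53

print(float(limit) == limit)            # True: 2**53 is stored exactly
print(float(limit + 1) == limit + 1)    # False: 2**53 + 1 is not representable
print(int(float(limit + 1)))            # 9007199254740992, i.e. rounded back to 2**53
```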

" Floating point numbers are the optimal minimum message length method of representing reals with an improper Jeffery's prior distribution. A Jeffery's prior is a prior that is invariant under reparameterization, which is a mandatory property for approximating the reals.

In this case, it is where Prob(log(x)) is proportional to a constant.

" -- lenticular

http://geocar.sdf1.org/numbers.html

#### Asymmetry between inequality signs

" Q: Why do we need four different ordered comparisons? Wouldn't < and <= suffice with appropriately swapped operands? A: No, because for floating-point comparisons (x < y) is not the same as not (x >= y) in the presence of NaNs. " -- [5]

### De-facto IEEE 754

Many platforms make a headline claim of IEEE 754 support but admit in the footnotes that they implement only part of the standard. In my opinion the fault lies with IEEE 754: it requires more functionality than most projects need, so of course people implement only a subset. IEEE 754 appears to be de-facto partitionable into a common core that most implementations provide and extensions that many implementations do not provide; the standard should be updated to specify extensions/profiles.

IEEE 754 does in fact distinguish 'recommended' from 'required' functionality (e.g. log, sin, and cos are recommended but not required), but many platforms don't directly implement even all of the 'required' functionality, which is why I think there is, de facto, an even smaller common core.

Even platforms that consider themselves compliant often do not comply exactly with the wording of the IEEE 754 standard. The standard requires a large number of separate operations, and implementations tend to omit any operation whose effect can be reproduced cheaply with the primitives they do provide. For example, the standard requires both a 'class' function and an 'isInfinite' predicate, but in practice you can tell whether something is infinite by applying 'class' to it, so RISC-V provides an FCLASS instruction but has no isInfinite. (However, some might argue that the standard countenances this sort of thing: although it says "A conforming implementation of a supported arithmetic format shall provide all the operations of this standard defined in Clause 5", it also says, "In this standard, operations are written as named functions; in a specific programming environment they might be represented by operators, or by families of format-specific functions, or by operations or functions whose names might differ from those in this standard".)
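To illustrate how a predicate such as isInfinite falls out of a 'class' primitive, here is a rough Python sketch. The function names `classify` and `is_infinite` are hypothetical, the sketch assumes 64-bit IEEE 754 doubles, and it assumes the common convention that a set top fraction bit marks a quiet NaN. It is not how any particular platform implements FCLASS.

```python
import struct

def classify(x: float) -> str:
    """Return an IEEE 754-style class name for a 64-bit double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF and fraction != 0:
        # Assumption: the top fraction bit distinguishes quiet from signaling NaNs.
        return "quietNaN" if fraction >> 51 else "signalingNaN"
    if exponent == 0x7FF:
        kind = "Infinity"
    elif exponent == 0:
        kind = "Zero" if fraction == 0 else "Subnormal"
    else:
        kind = "Normal"
    return ("negative" if sign else "positive") + kind

def is_infinite(x: float) -> bool:
    """Derived predicate: no separate primitive is needed once classify exists."""
    return classify(x) in ("positiveInfinity", "negativeInfinity")

print(classify(-0.0))               # negativeZero
print(classify(5e-324))             # positiveSubnormal
print(is_infinite(float("-inf")))   # True
print(is_infinite(1e308))           # False
```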

Another problem is that it costs money to purchase an official copy of the standard.

So really, in my opinion, what is needed is a new standard for floating point that:

• requires only a small core subset of what IEEE 754 requires

Until that time, this section contains notes on which subsets of IEEE 754 are implemented by various platforms, in hopes of helping the reader to identify what they think the common core is/should be.

#### WASM IEEE 754

The WebAssembly Spec 1.0 states:

" Floating-point arithmetic follows the IEEE 754-2008 standard, with the following qualifications:

• All operators use round-to-nearest ties-to-even, except where otherwise specified. Non-default directed rounding attributes are not supported.
• Following the recommendation that operators propagate NaN payloads from their operands is permitted but not required.
• All operators use “non-stop” mode, and floating-point exceptions are not otherwise observable. In particular, neither alternate floating-point exception handling attributes nor operators on status flags are supported. There is no observable difference between quiet and signalling NaNs. "

#### ARM Cortex M4F

" 7.2.5. Complete implementation of the IEEE 754 standard

The Cortex-M4F floating point instruction set does not support all operations defined in the IEEE 754-2008 standard. Unsupported operations include, but are not limited to the following:

• remainder
• round floating-point number to integer-valued floating-point number
• binary-to-decimal conversions
• decimal-to-binary conversions
• direct comparison of single-precision and double-precision values.

The Cortex-M4 FPU supports fused MAC operations as described in the IEEE standard. For complete implementation of the IEEE 754-2008 standard, floating-point functionality must be augmented with library functions. "

#### ARM Cortex-A9 NEON

[Cortex-A9 NEON Media Processing Engine Technical Reference Manual http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409f/CIHEJAGA.html] says:

" 2.3. IEEE754 standard compliance

The IEEE754 standard provides a number of implementation choices. The ARM Architecture Reference Manual describes the choices that apply to the Advanced SIMD and VFPv3 architectures.

The Cortex-A9 NEON MPE implements the ARMv7 Advanced SIMD and VFP extensions. It does not provide hardware support for the following IEEE754 operations:

• remainder
• round floating-point number to nearest integer-valued in floating-point number
• binary-to-decimal conversion
• decimal-to-binary conversion
• direct comparison of single-precision and double-precision values
• any extended-precision operations.

"

#### JVM

" 2.8.1. Java Virtual Machine Floating-Point Arithmetic and IEEE 754

The key differences between the floating-point arithmetic supported by the Java Virtual Machine and the IEEE 754 standard are:

• The floating-point operations of the Java Virtual Machine do not throw exceptions, trap, or otherwise signal the IEEE 754 exceptional conditions of invalid operation, division by zero, overflow, underflow, or inexact. The Java Virtual Machine has no signaling NaN value.
• The Java Virtual Machine does not support IEEE 754 signaling floating-point comparisons.
• The rounding operations of the Java Virtual Machine always use IEEE 754 round to nearest mode. Inexact results are rounded to the nearest representable value, with ties going to the value with a zero least-significant bit. This is the IEEE 754 default mode. But Java Virtual Machine instructions that convert values of floating-point types to values of integral types round toward zero. The Java Virtual Machine does not give any means to change the floating-point rounding mode.
• The Java Virtual Machine does not support either the IEEE 754 single extended or double extended format, except insofar as the double and double-extended-exponent value sets may be said to support the single extended format. The float-extended-exponent and double-extended-exponent value sets, which may optionally be supported, do not correspond to the values of the IEEE 754 extended formats: the IEEE 754 extended formats require extended precision as well as extended exponent range. "

#### Qfplib-M3

The software floating point library Qfplib-m3 for ARM Cortex M3 says:

" Implementation of the IEEE 754 standard

Qfplib correctly treats signed zeros, denormals, infinities and NaNs according to the IEEE 754 standard. The results of the addition, subtraction, multiplication, division and square root operations are correctly rounded (to nearest, even-on-tie). Other rounding modes and traps are not supported. "

#### Qfplib (M0)

The software floating point library Qfplib for ARM Cortex M0 says:

" Limitations and deviations from the IEEE 754 standard

Except as noted below, on input and output, NaNs are converted to infinities, denormals are flushed to zero, and negative zero is converted to positive zero. The result of the square root function is not always correctly rounded according to IEEE 754; see the next section for more on function accuracy. "

#### C language

The December 2, 2010 Committee Draft of the C language standard has an Annex F starting on page 503 (PDF page 521) discussing IEEE 754 (IEC 60559, which is identical) arithmetic.

Some excerpts:

" F.2.1 Infinities, signed zeros, and NaNs

This specification does not define the behavior of signaling NaNs. It generally uses the term NaN to denote quiet NaNs. The NAN and INFINITY macros and the nan functions in <math.h> provide designations for IEC 60559 NaNs and infinities. "

#### Sneftel's comment

" IEEE-754 except for blah. That is, they mostly implement 754, but cheap out on some of the more expensive and/or fiddly bits.

The most common cheap-outs:

• Flushing denormals to zero. This invalidates certain sometimes-useful theorems (in particular, the theorem that a-b can be exactly represented if 0 <= a/2 <= b <= a*2), but in practice it's generally not going to be an issue.
• Failure to recognize inf and NaN as special. These architectures will fail to follow the rules regarding inf and NaN as operands, and may not saturate to inf, instead producing numbers that are larger than FLT_MAX, which will generally be recognized by other architectures as NaN.
• Proper rounding of division and square root. It's a whole lot easier to guarantee that the result is within 1-3 ulps of the exact result than within 1/2 ulp. A particularly common case is for division to be implemented as reciprocal+multiplication, which loses you one bit of precision.
• Fewer or no guard digits. This is an unusual cheap-out, but means that other operations can be 1-2 ulps off.

BUUUUT... even those except for blah architectures still use IEEE-754's representation of numbers. Other than byte ordering issues, the bits describing a float or double on architecture A are essentially guaranteed to have the same meaning on architecture B.

So as long as all you care about is the representation of values, you're totally fine. If you care about cross-platform consistency of operations, you may need to do some extra work.

EDIT: As Chux mentions in the comments, a common extra source of inconsistency between platforms is the use of extended precision, such as the x87's 80-bit internal representation. That's the opposite of a cheap-out, and (with proper treatment) fully conforms to both IEEE-754 and the C standard, but it will likewise cause results to differ between architectures, and even between compiler versions and following apparently minor and unrelated code changes. However: a particular x86/x64 executable will NOT produce different results on different processors due to extended precision.

"

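The subtraction theorem mentioned in the first bullet of the comment above (often called Sterbenz's lemma) can be spot-checked near the bottom of the exponent range, where it depends on denormals. This is a sketch under the assumption that Python floats are IEEE 754 doubles with gradual underflow enabled; the helper `subtraction_is_exact` is a hypothetical name introduced for this example.

```python
from fractions import Fraction
import sys

def subtraction_is_exact(a: float, b: float) -> bool:
    """True if the floating-point result a - b equals the exact difference."""
    return Fraction(a - b) == Fraction(a) - Fraction(b)

tiny = sys.float_info.min          # smallest positive normal double (2**-1022)
a, b = 1.5 * tiny, tiny            # satisfies a/2 <= b <= a*2
diff = a - b                       # 2**-1023, which is a denormal value

print(subtraction_is_exact(a, b))  # True when denormals are kept
print(0.0 < diff < tiny)           # True: the result is denormal, not zero

# Outside the range a/2 <= b <= a*2 the guarantee does not apply.
print(subtraction_is_exact(1.0, 2.0**-60))  # False: 1 - 2**-60 rounds to 1.0
```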
In Perl6:

### Decimals

Arithmetic:

#### Crockford's Dec64 proposal

http://dec64.org/

## Endianness

• Endianness
• the original article that coined the term Endian: http://www.ietf.org/rfc/ien/ien137.txt . Makes the point that big-endian makes for efficient comparisons and division, whereas little-endian makes for more efficient addition and multiplication.
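The comparison half of that claim can be illustrated in a few lines. The sketch below assumes unsigned, fixed-width (4-byte) integers; the helper names `be` and `le` are made up for this example. With big-endian byte order, a plain lexicographic byte comparison agrees with numeric order, while with little-endian it does not.

```python
def be(n: int) -> bytes:
    """4-byte big-endian encoding: most significant byte first."""
    return n.to_bytes(4, "big")

def le(n: int) -> bytes:
    """4-byte little-endian encoding: least significant byte first."""
    return n.to_bytes(4, "little")

pairs = [(1, 256), (255, 256), (65535, 65536), (7, 7)]

# Byte-wise comparison of big-endian encodings matches numeric comparison.
print(all((x < y) == (be(x) < be(y)) for x, y in pairs))  # True

# Little-endian encodings compare by the low byte first, which gives the wrong order.
print(le(1) < le(256))  # False, although 1 < 256
```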

## Posits and Unums

John Gustafson has recently proposed two new number formats: Unum [7] [8] [9] [10] [11] and Posit [12] [13].

A response from Kahan of IEEE 754 fame:

## Booleans

" smaddox 9 months ago [-]

It's interesting that booleans are stored as 0 or -1 [1]. I wonder what drove that decision. I just checked, and Rust's LLVM stores booleans as 0 or 1.

KMag 9 months ago [-]

Well, for one, if true is -1 (using two's complement), then a bunch of conditionals and bit-twiddling become more simple. For instance:

`    x = some_bool ? a : b;`

can be compiled as a branchless:

`    x = (some_bool & a) | (~some_bool & b);`

This is why Forth also represents true as -1.

pcwalton 9 months ago [-]

Also why SIMD fairly universally uses -1 for true.

C arguably picked the wrong semantics here. Hardware picks up the slack, however, with the setcc instructions on x86, csinc on AArch64, etc. " -- [14]
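The branchless select from the thread above can be tried out in Python, whose integers behave like two's complement values of unlimited width under bitwise operators. This is only a sketch; `select`, `TRUE` and `FALSE` are names chosen for illustration.

```python
def select(mask: int, a: int, b: int) -> int:
    """Branchless select: returns a if mask is -1 (all ones), b if mask is 0."""
    return (mask & a) | (~mask & b)

TRUE, FALSE = -1, 0

print(select(TRUE, 10, 20))   # 10
print(select(FALSE, 10, 20))  # 20

# With true represented as 1, the same expression no longer selects a.
print(select(1, 10, 20))      # 20
```

With true stored as 1, the mask would first have to be widened to all ones (for example by negating it) before the same expression works.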

Note: " (Big parentheses: Until Lua 4.0, all order operators were translated to a single one, by translating a <= b to not (b < a). However, this translation is incorrect when we have a partial order, that is, when not all elements in our type are properly ordered. For instance, floating-point numbers are not totally ordered in most machines, because of the value Not a Number (NaN). According to the IEEE 754 standard, currently adopted by virtually every hardware, NaN represents undefined values, such as the result of 0/0. The standard specifies that any comparison that involves NaN should result in false. That means that NaN <= x is always false, but x < NaN is also false. That implies that the translation from a <= b to not (b < a) is not valid in this case.) " -- https://www.lua.org/pil/13.2.html
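The failure of that translation is easy to reproduce. A minimal Python sketch, assuming IEEE 754 comparison semantics (which Python floats follow):

```python
import math

x, y = 1.0, math.nan

# Every ordered comparison involving NaN is false.
print(x < y)          # False
print(x <= y)         # False
print(x > y)          # False
print(x >= y)         # False

# So negating the opposite comparison gives a different answer.
print(not (x >= y))   # True, whereas x < y is False
print(not (y < x))    # True, whereas x <= y is False
```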

"

• http://citeseer.ist.psu.edu/goldberg91what.html . "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991) David Goldberg (every computer scientist should read this)
• http://www.validlab.com/goldberg/addendum.html . 'extracts from an expansion upon David Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic"' (really lazy ones could just read this)
• "Floating Point Arithmetic: Issues and Limitations" from Python docs. http://docs.python.org/tut/node16.html
• "Musings on Lua Integer and Enum Support". http://kotisivu.dnainternet.net/askok/articles/articles/article5.html may be worth reading for some people. It has a slightly different opinion on the essentiality of double-precision FP as this text. Also, tries to find middle ground for those non-FPU CPU's. -- AskoKauppi
• http://floating-point-gui.de/ . "What Every Programmer Should Know About Floating-Point Arithmetic or Why don't my numbers add up?"

"

--- extracted from [15]

some annoying things about (IEEE 754) floating-point arithmetic:

• there is both +0.0 and -0.0, and they are distinct; for example, 1/+0.0 = inf, but 1/(-0.0) = -inf
• even though they are distinct, +0.0 == -0.0. This means that when you talk about floating-point arithmetic, you sometimes have to make a verbal distinction between two quantities being equal in the sense that they are the same value, vs. equal in the sense that they compare equal under the floating-point equality operator
• (NaN == NaN) evaluates to false. This means that the floating-point equality operator does not obey the mathematical property of reflexivity and hence, mathematically speaking, the floating-point equality operator is not in fact an "equivalence relation" at all.
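The difference between "same value" and "compares equal" can be made concrete by comparing bit patterns. The sketch below assumes IEEE 754 doubles; `same_value` is a hypothetical helper, and because Python raises ZeroDivisionError for 1/0.0, the sign of zero is shown with `math.copysign` instead of division.

```python
import math
import struct

def same_value(a: float, b: float) -> bool:
    """Compare the raw IEEE 754 bit patterns rather than using ==."""
    return struct.pack(">d", a) == struct.pack(">d", b)

nan = math.nan

print(0.0 == -0.0)                 # True: the zeros compare equal
print(same_value(0.0, -0.0))       # False: the sign bit differs
print(math.copysign(1.0, -0.0))    # -1.0: the sign of -0.0 is observable

print(nan == nan)                  # False: equality is not reflexive
print(same_value(nan, nan))        # True: identical bit patterns
```

Note that two NaNs with different payloads would not count as the same under this bitwise test, so it is stricter than "both are NaN".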

https://0.30000000000000004.com/

---

https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Integer-Overflow-Basics.html