Table of Contents for Programming Languages: a survey


from John Bode [1] [2]:


Here are several different ways you can interpret the values of three bits:

Bits    Unsigned    Sign-Magnitude    1's Complement    2's Complement
----    --------    -------------     --------------    -------------- 
 000           0                0                  0                 0
 001           1                1                  1                 1
 010           2                2                  2                 2
 011           3                3                  3                 3
 100           4               -0                 -3                -4
 101           5               -1                 -2                -3
 110           6               -2                 -1                -2
 111           7               -3                 -0                -1


Floating point

floating point:

IEEE 754-1985, IEEE 754-2008

"several fallacies, the first being that all IEEE 754 systems must deliver identical results for the same program. We have focused on differences between extended-based systems and single/double systems, but there are further differences among systems within each of these families. For example, some single/double systems provide a single instruction to multiply two numbers and add a third with just one final rounding. This operation, called a fused multiply-add, can cause the same program to produce different results across different single/double systems, and, like extended precision, it can even cause the same program to produce different results on the same system depending on whether and when it is used." -- What Every Computer Scientist Should Know About Floating-Point Arithmetic (see also C 2011, ISO/IEC 9899:201x)

the difficulties of floating point determinism:

some notes on 64-bit floats:

proposed alternatives to IEEE754 floating point:

"Go on, ask me what the largest integer is, such that it and all smaller integers can be stored in IEEE 64-bit doubles without losing precision. An IEEE 64-bit double has 52 bits of mantissa, so I think it's 2^53..."


" Floating point numbers are the optimal minimum message length method of representing reals with an improper Jeffreys prior distribution. A Jeffreys prior is a prior that is invariant under reparameterization, which is a mandatory property for approximating the reals.

In this case, it is where Prob(log(x)) is proportional to a constant. " -- lenticular

Asymmetry between inequality signs

" Q: Why do we need four different ordered comparisons? Wouldn't < and <= suffice with appropriately swapped operands? A: No, because for floating-point comparisons (x < y) is not the same as not (x >= y) in the presence of NaNs. " -- [5]

Printing floating point numbers (dtoa)

Software floating point implementations

De-facto IEEE 754

A lot of things have a headline claim to provide IEEE 754, but in the footnotes admit that they implement only part of the standard. In my opinion the fault lies with IEEE 754; it requires more functionality than most projects need, so of course people implement only a subset. IEEE 754 appears to be de-facto partitionable into a common core that most implementations provide, and extensions that many implementations do not provide; the standard should be updated to specify such extensions/profiles.

IEEE 754 does in fact distinguish 'recommended' from 'required' functionality (e.g. log, sin, and cos are recommended but not required), but many platforms don't directly implement even all of the 'required' functionality, which is why I think there is, de-facto, an even smaller common core.

Even platforms that consider themselves compliant often don't exactly comply with the wording of the IEEE 754 standard, which requires a large number of separate operations; implementations tend to omit these as long as there is a short way to reproduce their effect with the provided primitives. For example, the standard requires both a 'class' function and an 'isInfinite' predicate, but in practice you can tell whether something is infinite by seeing what 'class' returns for it, so RISC-V provides an FCLASS instruction but has no isInfinite. (However, some might argue that the standard countenances this sort of thing: although the standard says "A conforming implementation of a supported arithmetic format shall provide all the operations of this standard defined in Clause 5", it also says, "In this standard, operations are written as named functions; in a specific programming environment they might be represented by operators, or by families of format-specific functions, or by operations or functions whose names might differ from those in this standard".)

Another problem is that it costs money to purchase an official copy of the standard.

So really, in my opinion, what is needed is a new standard for floating point that is:

Until that time, this section contains notes on which subsets of IEEE 754 are implemented by various platforms, in hopes of helping the reader to identify what they think the common core is/should be.


The WebAssembly Spec 1.0 states:

" Floating-point arithmetic follows the IEEE 754-2008 standard, with the following qualifications:

ARM Cortex M4F

The ARM Cortex M4 Technical Reference Manual says:

" 7.2.5. Complete implementation of the IEEE 754 standard

The Cortex-M4F floating point instruction set does not support all operations defined in the IEEE 754-2008 standard. Unsupported operations include, but are not limited to the following:

The Cortex-M4 FPU supports fused MAC operations as described in the IEEE standard. For complete implementation of the IEEE 754-2008 standard, floating-point functionality must be augmented with library functions. "

ARM Cortex-A9 NEON

[Cortex-A9 NEON Media Processing Engine Technical Reference Manual] says:

" 2.3. IEEE754 standard compliance

The IEEE754 standard provides a number of implementation choices. The ARM Architecture Reference Manual describes the choices that apply to the Advanced SIMD and VFPv3 architectures.

The Cortex-A9 NEON MPE implements the ARMv7 Advanced SIMD and VFP extensions. It does not provide hardware support for the following IEEE754 operations:

    round floating-point number to nearest integer-valued in floating-point number
    binary-to-decimal conversion
    decimal-to-binary conversion
    direct comparison of single-precision and double-precision values
    any extended-precision operations. "


The Java Virtual Machine Specification, Java SE 10 Edition says:

" 2.8.1. Java Virtual Machine Floating-Point Arithmetic and IEEE 754

The key differences between the floating-point arithmetic supported by the Java Virtual Machine and the IEEE 754 standard are:

    The floating-point operations of the Java Virtual Machine do not throw exceptions, trap, or otherwise signal the IEEE 754 exceptional conditions of invalid operation, division by zero, overflow, underflow, or inexact. The Java Virtual Machine has no signaling NaN value.
    The Java Virtual Machine does not support IEEE 754 signaling floating-point comparisons.
    The rounding operations of the Java Virtual Machine always use IEEE 754 round to nearest mode. Inexact results are rounded to the nearest representable value, with ties going to the value with a zero least-significant bit. This is the IEEE 754 default mode. But Java Virtual Machine instructions that convert values of floating-point types to values of integral types round toward zero. The Java Virtual Machine does not give any means to change the floating-point rounding mode.
    The Java Virtual Machine does not support either the IEEE 754 single extended or double extended format, except insofar as the double and double-extended-exponent value sets may be said to support the single extended format. The float-extended-exponent and double-extended-exponent value sets, which may optionally be supported, do not correspond to the values of the IEEE 754 extended formats: the IEEE 754 extended formats require extended precision as well as extended exponent range."


The software floating point library Qfplib-m3 for ARM Cortex M3 says:

" Implementation of the IEEE 754 standard

Qfplib correctly treats signed zeros, denormals, infinities and NaNs according to the IEEE 754 standard. The results of the addition, subtraction, multiplication, division and square root operations are correctly rounded (to nearest, even-on-tie). Other rounding modes and traps are not supported. "

Qfplib (M0)

The software floating point library Qfplib for ARM Cortex M0 says:

" Limitations and deviations from the IEEE 754 standard

Except as noted below, on input and output, NaNs are converted to infinities, denormals are flushed to zero, and negative zero is converted to positive zero. The result of the square root function is not always correctly rounded according to IEEE 754; see the next section for more on function accuracy. "

C language

The December 2, 2010 Committee Draft of the C language standard has an Annex F starting on page 503 (PDF page 521) discussing IEEE 754 (IEC 60559, which is identical) arithmetic.

Some excerpts:

" F.2.1 Infinities, signed zeros, and NaNs This specification does not define the behavior of signaling NaNs. 346) It generally uses the term NaN to denote quiet NaNs. The NAN and INFINITY macros and the nan functions in <math.h> provide designations for IEC 60559 NaNs and infinities. "

Sneftel's comment

In a Stackoverflow answer, a user named Sneftel says:

" IEEE-754 except for blah. That is, they mostly implement 754, but cheap out on some of the more expensive and/or fiddly bits.

The most common cheap-outs:

BUUUUT... even those except for blah architectures still use IEEE-754's representation of numbers. Other than byte ordering issues, the bits describing a float or double on architecture A are essentially guaranteed to have the same meaning on architecture B.

So as long as all you care about is the representation of values, you're totally fine. If you care about cross-platform consistency of operations, you may need to do some extra work.

EDIT: As Chux mentions in the comments, a common extra source of inconsistency between platforms is the use of extended precision, such as the x87's 80-bit internal representation. That's the opposite of a cheap-out, and (with proper treatment) fully conforms to both IEEE-754 and the C standard, but it will likewise cause results to differ between architectures, and even between compiler versions and following apparently minor and unrelated code changes. However: a particular x86/x64 executable will NOT produce different results on different processors due to extended precision. "


Other IEEE 754 implementation compliance links


In Perl6:


IEEE 754-2008 decimal64


Crockford's Dec64 proposal



Posits and Unums

John Gustafson has proposed two new number formats: Unum [7] [8] [9] [10] [11] and Posit [12] [13].

A response from Kahan of IEEE 754 fame:


" smaddox 9 months ago

It's interesting that booleans are stored as 0 or -1 [1]. I wonder what drove that decision. I just checked, and Rust's LLVM stores booleans as 0 or 1.


KMag 9 months ago

Well, for one, if true is -1 (using two's complement), then a bunch of conditionals and bit-twiddling become simpler. For instance:

    x = some_bool ? a : b;

can be compiled as a branchless:

    x = (some_bool & a) | (~some_bool & b);

This is why Forth also represents true as -1.

pcwalton 9 months ago

Also why SIMD fairly universally uses -1 for true.

C arguably picked the wrong semantics here. Hardware picks up the slack, however, with the setcc instructions on x86, csinc on AArch64, etc. " -- [14]

Note: " (Big parentheses: Until Lua 4.0, all order operators were translated to a single one, by translating a <= b to not (b < a). However, this translation is incorrect when we have a partial order, that is, when not all elements in our type are properly ordered. For instance, floating-point numbers are not totally ordered in most machines, because of the value Not a Number (NaN). According to the IEEE 754 standard, currently adopted by virtually every hardware, NaN represents undefined values, such as the result of 0/0. The standard specifies that any comparison that involves NaN should result in false. That means that NaN <= x is always false, but x < NaN is also false. That implies that the translation from a <= b to not (b < a) is not valid in this case.) " --

" . "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991), David Goldberg (every computer scientist should read this)
    . 'extracts from an expansion upon David Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic"' (really lazy ones could just read this)
    . "Floating Point Arithmetic: Issues and Limitations", from the Python docs
    . 'Musings on Lua Integer and Enum Support'. May be worth reading for some people. It has a slightly different opinion on the essentiality of double-precision FP than this text. Also tries to find middle ground for those non-FPU CPUs. -- AskoKauppi
    . "What Every Programmer Should Know About Floating-Point Arithmetic, or Why don't my numbers add up?" "

--- extracted from [15]

some annoying things about (IEEE 754) floating-point arithmetic:


See also number constructs.