Synthesizing arithmetic operations using bit-shifting tricks

By Joel Yliluoma, January 2014

How to implement various arithmetic and logical operations on platforms without native support for those operations.

Two's complement is assumed.

Table of contents [expand all] [collapse all]

ADD and SUB
SGN / Testing if value is negative
NEG
- Arithmetic negation
NOT
- Logical negation (NOT / invert operation)
AND and OR
- Logical bit-wise AND
- Logical bit-wise OR
XOR
- Logical bit-wise XOR
SHL/LSL/ASL/SL: Logical/binary/arithmetic bit-shift to left by 1
- Larger-size bit-shifting to left
SHR/LSR/SR: Logical/binary bit-shift to right by 1
ASR/SRA: Arithmetic bit-shift to the right
- When shifting by 1 bit
  - When shifting uses carry
  - Using bit-test
- When shifting by N bits
MUL: Unsigned multiplication
MLI/IMUL/MULI: Signed multiplication
DIV and MOD: Division and modulo
Exponentiation
- Exponentiation without any bit-operations
- Optimization for constant exponents
Comparisons
Appendix: Floating-point mathematics cookbook

ADD and SUB

Addition without carry

Using subtraction

Addition can be synthesized from subtraction by negating the source operand.

Requirements:

Subtraction
Negation

	uint_type a      = original_1;
	uint_type b      = original_2;
	b = -b;
	uint_type result = a - b;

Using bitwise operations, with XOR

Addition can be synthesized bit by bit using the logical XOR operation.

Requirements:

Logical bit-wise XOR
Binary bit-shift to left by 1
Logical bit-test

	uint16 a      = original_1;
	uint16 b      = original_2;
	uint16 result = 0;
	uint16 mask   = 1;
	while(mask != 0)
	{
	    uint16 nextmask = mask << 1;
	    if(a & mask) // If A has that bit set?
	    {
	        // If result also has that bit set, set carry (next bit in result)
	        if(result & mask) result ^= nextmask; // This could also be |= instead of ^= .
	        // Flip this bit in result
	        result ^= mask;
	    }
	    if(b & mask) // If B has that bit set?
	    {
	        // If result also has that bit set, set carry (next bit in result)
	        if(result & mask) result ^= nextmask; // This could also be |= instead of ^= .
	        // Flip this bit in result
	        result ^= mask;
	    }
	    mask = nextmask;
	}

The following version does potentially fewer iterations by applying carry calculations to all bits simultaneously.

Requirements:

Logical bit-wise XOR
Logical bit-wise AND
Binary bit-shift to left by 1

	uint_type a      = original_1;
	uint_type b      = original_2;
        uint_type carry  = a & b;
        uint_type result = a ^ b;
        while(carry != 0)
        {
            // If you need the mathematical carry from addition,
            // check the overflow from this shift.
            uint_type shiftedcarry = carry << 1;
            carry  = result & shiftedcarry;
            result = result ^ shiftedcarry;
        }

Other

If you don't have the logical bit-wise XOR operation, see the examples with ADC, below.

ADC: Addition with carry

Using addition & comparison

For platforms without add-with-carry, carry can be synthesized by comparing the result to the original.

Requirements:

Addition without carry
Unsigned less-than comparison

	uint_type a      = original_1;
	uint_type b      = original_2;
	uint_type result = a + b;
	bool      carry  = result < a; // result < b works too.

Using addition with larger operands

Alternatively, you can get carry by calculating with integer sizes larger than the source operands are.

Requirements:

Addition without carry
Logical bit shift right & bit-and, or addressing individual byte components

	uint16 a      = original_1;
	uint16 b      = original_2;
	uint32 temp   = a + b;

	// Alternative 1:
	uint16 tempc[] = temp;       // Treat temp as two 16-bit components
	uint16 result  = tempc[0];   // Take low word
	bool   carry   = tempc[1];   // Take high word

	// Alternative 2: Bit-wise operations
	bool   carry  = temp >> 16;    // Bit-wise logical shift right
	uint16 result = temp & 0xFFFF; // Bit-wise AND operation

Using bitwise operations, without XOR

Alternatively, if you don't have ADD or bit-wise XOR, you can use bit-wise OR and some comparison logic instead. This produces a carry flag, too.

Requirements:

Logical bit-wise OR
Binary bit-shift to left by 1
Logical bit-test
Logical bit-wise OR and AND (either as a test or calculation)

	uint16 a      = original_1;
	uint16 b      = original_2;
	uint16 result = 0;
	uint16 mask   = 1;
	bool carry = false;
	while(mask != 0)
	{
            bool bit0 = a & mask;
            bool bit1 = b & mask;

            if(carry)
                { if(bit0 == bit1) result |= mask; carry = bit0 | bit1; }
            else
                { if(bit0 != bit1) result |= mask; carry = bit0 & bit1; }
        
            mask = mask << 1;
        }

Using bitwise shift-with-carry or rotate-with-carry operations

If you don't have any OR or AND operations, but you do have a shift-to-right-with-carry or a rotate-to-right-with-carry, you can implement addition using those.

Requirements:

Rotation-to-right with carry
Either a logical right-shift that produces a carry, or a rotation-to-right that produces a carry

	uint16 a      = original_1;
	uint16 b      = original_2;
	uint16 result = 0;
        bool math_carry = false;

        for(int bit=0; bit<16; ++bit)
        {
            bool bit0 = a&1; a >>= 1; // Shift-to-right, get carry
            bool bit1 = b&1; b >>= 1; // Shift-to-right, get carry

            if(math_carry)
            {
                bool shift_carry = bit0 == bit1;
                // Shift-to-right with carry: sets msb=carry
                result >>= 1; if(shift_carry) result |= 0x80;

                if(bit0) math_carry = true;
                else if(bit1) math_carry = true;
                else math_carry = false;
                // ^ Same as: math_carry = (bit0 || bit1);
            }
            else
            {
                bool shift_carry = bit0 != bit1;
                // Shift-to-right with carry: sets msb=carry
                result >>= 1; if(shift_carry) result |= 0x80;

                math_carry = false;
                if(bit0)
                    if(bit1)
                        math_carry = true;
                // ^ Same as: math_carry = (bit0 && bit1);
            }
        }

Alternatively, as a state machine:

	uint16 a      = original_1;
	uint16 b      = original_2;
	uint16 result = 0;
        bool math_carry = false;
        for(int bit=0; bit<16; ++bit)
            switch(math_carry)
            {
                case false:
                    { bool bit0 = a&1; a>>=1; // Rotate-right A, get carry
                      if(bit0) goto sum1_x; }
                sum0_x:
                    { bool bit1 = b&1; b>>=1; // Rotate-right B, get carry
                      if(bit1) goto sum1; }
                sum0:
                    result >>= 1; // Clear math_carry, rotate-right result
                    math_carry=false; continue;
                sum1_x:
                    { bool bit1 = b&1; b>>=1; // Rotate-right B, get carry
                      if(bit1) goto sum2; }
                sum1:
                    result >>= 1; result |= 0x8000; // Set carry, rotate-right result
                    math_carry=false; continue;
                sum2:
                    result >>= 1; // Clear math_carry, rotate-right result
                    math_carry=true; continue;
                case true:
                    { bool bit0 = a&1; a>>=1; // Rotate-right A, get carry
                      if(bit0) goto sum2_x; }
                    goto sum1_x;
                sum2_x:
                    { bool bit1 = b&1; b>>=1; // Rotate-right B, get carry
                      if(bit1) goto sum3; }
                    goto sum2;
                sum3:
                    result >>= 1; result |= 0x8000; // Set carry, rotate-right result
                    math_carry=true; continue;
            }

Which would be equivalent to this 6502 code (8-bit operands), assuming the 6502 didn't already have an addition operation:

	; Inputs: a_operand = operand 1
	;         b_operand = operand 2
	;	  C         = carry
	; Outputs:
	;         result    = operand 1 + operand 2 + carry
	;         C         = carry from addition
	;         A         = 8
	;         a_operand = 0
	;         b_operand = 0
	;         bits_remaining = 0
	lda	#8
	sta	bits_remaining
	bcs	next_bit_carryset
	bcc	next_bit_carryclear
sum0_x:
	lsr	b_operand	; Sum=0. Get next LSB from B_operand
	; Arrived here when math_carry=0, A=0, B=x. (sum = 0 or 1)
	; Put B-bit (whatever it is) in result, and keep math_carry=false.
	ror	result	; Rotate-right result
	;jmp	next_bit_carryclear	; Unconditional jump.
next_bit_carryclear:
	dec	bits_remaining	; while(--bits_remaining > 0)
	beq	done_add_clc	; {
	;bne	carry_false
;carry_false:
	lsr	a_operand	; Sum=0. Get next LSB from A_operand
	bcc	sum0_x	        ; Goto sum1_x if set, sum0_x if clear
sum1_x:
	lsr	b_operand	; Sum=1. Get next LSB from B_operand
	bcs	sum2a	        ; Goto sum2a if set, sum1 if clear
	;bcc	sum1
;sum1:
	; Arrived here when math_carry=0, A=1, B=0 (sum=1)
	; Arrived here when math_carry=1, A=0, B=0 (sum=1)
	sec		; Set carry
	ror	result	; Rotate-right result
	bne	next_bit_carryclear	; Unconditional jump.
sum2a:
	; Arrived here when math_carry=0, A=1, B=1 (sum=2)
	; Arrived here when math_carry=1, A=0, B=1 (sum=2)
	clc		; Clear carry
	ror	result	; Rotate-right result
	bpl	next_bit_carryset	; Unconditional jump.

carry_true:
	lsr	a_operand	; Sum=1. Get next LSB from A_operand
	bcc	sum1_x	        ; Goto sum2_x if set, sum1_x if clear
	;bcs	sum2_x
;sum2_x:
	lsr	b_operand	; Sum=2. Get next LSB from B_operand
	; Arrived here when math_carry=1, A=1, B=x. (sum= 2 or 3)
	; Put B-bit (whatever it is) in result, and keep math_carry=true.
	ror	result	; Rotate-right result
	;jmp	next_bit_carryset	; Unconditional jump.
next_bit_carryset:
	dec	bits_remaining	; if(!--bits_remaining)
	bne	carry_true
	;beq	done_add_sec    ;
;done_add_sec:
	sec
	bcs	done_add
done_add_clc:
	clc
	;bcc	done_add
done_add:			; }

Addition of larger integers than native register size

When you need to add together integers that are larger than your native register size, you need to utilize the carry.

Requirements:

Addition with carry

	uint_type a[]     = original_1[];
	uint_type b[]     = original_2[];
	uint_type result[];
	
	bool carry = false;    // Clear carry
	for(n = 0; n < 4; ++n) // example if you're doing 32-bit addition using 8-bit integers
	{
	    result[n]     = a[n] + b[n] + carry;
	    carry         = /* save carry from this calculation*/;
	}

Subtraction without carry

Using addition

Subtraction can be synthesized from addition by negating the source operand.

Requirements:

Addition
Arithmetic negation

	uint_type a      = original_1;
	uint_type b      = original_2;
	b = -b;
	uint_type result = a + b;

SBC: Subtraction with carry

For platforms without sub-with-carry, carry can be synthesized by comparing the result to the original.

Requirements:

Subtraction without carry
Unsigned less-than comparison

	uint_type a      = original_1;
	uint_type b      = original_2;
	uint_type result = a - b;
	bool      carry  = a < result;

SGN / Testing if value is negative

Testing if value is negative can be done with a comparison:

	int_type a    = original;
	bool negative = a < 0;

or by testing the sign bit:

	uint16 a = original;
	bool negative = a & 0x8000; // The highest-order bit of 16-bit integer

Testing the sign bit can also be done on some platforms with a left-shift operation, if the operation produces a carry.

	uint16 a = original;
	a = a << 1; // Binary left-shift by 1
	bool carry    = /* save carry from this calculation*/;
	bool negative = carry;

And some platforms, such as the 6502, produce a is-negative flag simply by loading the value.

	lda value ; Load the highest-order byte of the value.
	; The N flag now contains the sign bit.
	bmi got_negative ;branch if N set
	bpl got_positive ;branch if N clear

Similarly on i386:

	mov eax, value   ; Load the highest-order word of the value.
	or eax, eax      ; Test its bits
	; test eax, eax  ; Alternative way of testing the bits
	js  got_negative ; branch if SF set
	jns got_positive ; branch if SF clear

NEG

Arithmetic negation

Two's complement negation can be synthesized as follows:

Requirements:

Logical negation (NOT operation / inversion)
Unsigned addition of 1

	uint_type a      = original;
	uint_type result = ~a;       // NOT operation
	result = result + 1;

Alternatively, you can implement it by subtracting from zero:

Requirements:

Subtraction

	uint_type a      = original;
	uint_type result = 0 - a;

NOT

Logical negation (NOT / invert operation)

Bit-wise NOT, i.e. logical negation, can be synthesized in various ways.

Using XOR

XOR the operand with an integer where every bit is 1.

	uint16 a      = original;
	uint16 result = a ^ 0xFFFF; // 0xFFFF has 16 bits

Using subtraction

Subtract the operand from an integer where every bit is 1.

	uint16 a      = original;
	uint16 result = 0xFFFF - a; // 0xFFFF has 16 bits

Using arithmetic negation

Negate the operand and subtract one. (Depends on two's complement arithmetic.)

	uint_type a      = original;
	uint_type result = -a;
	result = result - 1;

AND and OR

Logical bit-wise AND

Bit-wise AND, i.e. logical conjunction, can be synthesized in various ways.

Using NOT and OR

a & b = ~(~a | ~b)

Requirements:

Logical negation (NOT operation / inversion)
Logical disjunction (OR operation)

	uint_type a      = original_1;
	uint_type b      = original_2;
	a = ~a; // Logical NOT
	b = ~b; // Logical NOT
	uint_type result = a | b; // Logical OR
	result = ~result; // Invert result

Doing it bit-by-bit

If you have an operation that tests whether a particular bit is set, you can implement AND using a loop.

Requirements:

Logical bit-test
Logical bit-set

	uint16 a = original_1;
	uint16 b = original_2;
	uint16 result = 0;
	for(n = 0; n < 16; ++n)
	{
	    // Test whether the bit is set in both operands
	    if(a & (1 << n)) // Test whether bit n is set in a
	    if(b & (1 << n)) // Test whether bit n is set in b
	        result |= (1 << n); // Set bit N in target value
	}

Alternative:

Logical bit-test
Binary bit-shift to left by 1
Addition

	uint16 a = original_1;
	uint16 b = original_2;
	uint16 result = 0;
	for(uint16 mask = 1; mask != 0; mask <<= 1)
	{
	    // Test whether the bit is set in both operands
	    if(a & mask) // Test A and mask have common bits
	    if(b & mask) // Test B and mask have common bits
	        result = result + mask;
	}

Bit-by-bit using arithmetic operations only

Unsigned integer division (by 2), round towards zero
Unsigned integer modulo (by 2) (division remainder)
Integer addition
Integer equality comparison against zero

	uint16 a = original_1;
	uint16 b = original_2;
	uint16 result = 0;
	uint16 mask   = 1;
	for(int bit=0; bit<16; ++bit)
	{
	    if((a % 2) != 0 && (b % 2) != 0) result = result + mask;
	    a = a / 2; // Integer division, round towards zero
	    b = b / 2; // Integer division, round towards zero
	    mask = mask + mask; // Multiply mask by 2
	}

Logical bit-wise OR

Bit-wise OR, i.e. logical conjunction, can be synthesized in various ways.

Using NOT and AND

a | b = ~(~a & ~b)

Requirements:

Logical negation (NOT operation / inversion)
Logical conjunction (AND operation)

	uint_type a      = original_1;
	uint_type b      = original_2;
	a = ~a; // Logical NOT
	b = ~b; // Logical NOT
	uint_type result = a & b; // Logical AND
	result = ~result; // Invert result

Doing it bit-by-bit

If you have an operation that tests whether a particular bit is set, you can implement OR using a loop.

Requirements:

Logical bit-test
Logical bit-set

	uint16 a = original_1;
	uint16 b = original_2;
	uint16 result = 0;
	for(n = 0; n < 16; ++n)
	{
	    // Test whether the bit is set in either operand
	    if(a & (1 << n)) // Test whether bit n is set in a
	        result |= (1 << n); // Set bit N in target value
	    if(b & (1 << n)) // Test whether bit n is set in b
	        result |= (1 << n); // Set bit N in target value
	}

Alternative:

Logical bit-test
Binary bit-shift to left by 1
Subtraction

	uint16 a = original_1;
	uint16 b = original_2;
	uint16 result = 0xFFFF;
	for(uint16 mask = 1; mask != 0; mask <<= 1)
	{
	    // Deduct bit, if it's CLEAR in both operands
	    if(!(a & mask)) // Test A and mask don't have common bits
	    if(!(b & mask)) // Test B and mask don't have common bits
	        result = result - mask;
	}

Bit-by-bit using arithmetic operations only

Unsigned integer division (by 2), round towards zero
Unsigned integer modulo (by 2) (division remainder)
Integer addition
Integer equality comparison against zero

	uint16 a = original_1;
	uint16 b = original_2;
	uint16 result = 0;
	uint16 mask   = 1;
	for(int bit=0; bit<16; ++bit)
	{
	    if((a % 2) != 0 || (b % 2) != 0) result = result + mask;
	    a = a / 2; // Integer division, round towards zero
	    b = b / 2; // Integer division, round towards zero
	    mask = mask + mask; // Multiply mask by 2
	}

XOR

Logical bit-wise XOR

Bit-wise XOR, i.e. logical bit-wise exclusive-OR, can be synthesized in various ways.

Using NOT, OR and AND

a ^ b = (a & ~b) | (b & ~a)
Two unary operations, three binary operations. Max sequence depth: 3

	uint_type a = original_1;
	uint_type b = original_2;
	uint_type nota = ~a; // Logical NOT
	uint_type notb = ~b; // Logical NOT
	uint_type result = (a & notb) | (b & nota);

a ^ b = (a | b) &~ (a & b)
One unary operation, three binary operations. Max sequence depth: 3

	uint_type a = original_1;
	uint_type b = original_2;
	uint_type result = (a | b) & ~(a & b);

Doing it bit-by-bit

You can test bit by bit whether the bit is different in both operands.

Requirements:

Logical bit-test
Logical bit-set

	uint16 a = original_1;
	uint16 b = original_2;
	uint16 result = 0;
	for(n = 0; n < 16; ++n)
	{
	    // Test whether the bit is set in a but not b
	    if(a & (1 << n))    // Test whether bit n is set in a
	    if(!(b & (1 << n))) // Test whether bit n is not set in b
	        result |= (1 << n); // Set bit N in target value

	    // Test whether the bit is set in b but not a
	    if(b & (1 << n))    // Test whether bit n is set in b
	    if(!(a & (1 << n))) // Test whether bit n is not set in a
	        result |= (1 << n); // Set bit N in target value
	}

Alternative:

Logical bit-test
Binary bit-shift to left by 1
Subtraction

	uint16 a = original_1;
	uint16 b = original_2;
	uint16 result = 0xFFFF;
	for(uint16 mask = 1; mask != 0; mask <<= 1)
	{
	    // Deduct bit, if it's CLEAR in both operands
	    if(!(a & mask)) // Test A and mask don't have common bits
	    if(!(b & mask)) // Test B and mask don't have common bits
	        result = result - mask;

	    // Deduct bit, if it's SET in both operands
	    if(a & mask) // Test A and mask have common bits
	    if(b & mask) // Test B and mask have common bits
	        result = result - mask;
	}

Bit-by-bit using arithmetic operations only

Unsigned integer division (by 2), round towards zero
Unsigned integer modulo (by 2) (division remainder)
Integer addition
Integer equality comparison

	uint16 a = original_1;
	uint16 b = original_2;
	uint16 result = 0;
	uint16 mask   = 1;
	for(int bit=0; bit<16; ++bit)
	{
	    if(((a % 2) + (b % 2)) == 1) result = result + mask;
	    a = a / 2; // Integer division, round towards zero
	    b = b / 2; // Integer division, round towards zero
	    mask = mask + mask; // Multiply mask by 2
	}

SHL/LSL/ASL/SL: Logical/binary/arithmetic bit-shift to left by 1

Logical bit-shifting to the left can be accomplished by multiplying the value by 2, i.e. adding the value to itself.

	uint_type a      = original;
	uint_type result = a + a;
	// This result may also produce a carry. It's often needed.

Larger-size bit-shifting to left

	uint8 a[] = original[];
	
	bool carry = false;    // Clear carry
	for(n = 0; n < 4; ++n) // example if you're doing 32-bit shift using 8-bit integers
	{
	    bool new_carry = (a[n] & 0x80) ? 1 : 0; // Check the highest-order bit
	    result[n] = (a[n] << 1) | carry;
	    carry     = new_carry; /* save carry from this calculation*/
	}

SHR/LSR/SR: Logical/binary bit-shift to right by 1

Logical bit-shifting to the right can be accomplished by dividing the value by 2, assuming division always rounds towards zero.

	uint_type a      = original;
	uint_type result = a / 2;

Of course, often that operation is not available, so...

Without divisions

Logical bit-shifting to the right can also be accomplished through logical bit-shifting to the LEFT by inverse the amount. This requires the use of a register that is double the original's size.

Requirements:

Logical bit shift LEFT
Addressing individual byte components

	uint16 a        = original;
	uint_type amount  = shift_amount;

	amount = 16-amount; // Invert the shift amount

	uint32 temp = original << amount; // Shift LEFT

	uint16 tempc[] = temp;       // Treat temp as two 16-bit components
	uint16 result  = tempc[1];   // Take high word

Larger-size bit-shifting to right

	uint8 a[] = original[];
	
	bool carry = false;    // Clear carry
	for(n = 0; n < 4; ++n) // example if you're doing 32-bit shift using 8-bit integers
	{
	    bool new_carry = a[n] & 1; // Check the lowest-order bit
	    result[n] = (a[n] >> 1) | (carry ? 0x80 : 0x00);
	    carry     = new_carry; /* save carry from this calculation*/
	}

When you have arithmetic bit-shift to the right

If you have an arithmetic bit-shifting operator but not a logical one, you can synthesize the logical one by clearing the top-order bits.

Requirements:

Arithmetic bit-shift to right
Logical AND operation

	uint16 a      = original;
	uint16 result = a >> 1;
	result = result & 0x7FFF;  // Keep all bits except the topmost one.

Generalized to shifting by any amount:

Requirements:

Logical bit-shift to right
Logical bit-shift to left
Logical AND operation
Subtract by 1

	uint16 a       = original;
	uint_type amount = shift_amount;
	uint16 result  = a >> amount;
	uint16 mask    = (1 << (16-amount)) - 1; // Change 16 to the bit-width of result
	result = result & mask;

When you do not have any bit-operations

If you do not have bit-shifts or bit-tests, the following algorithm can still be used to shift one bit to the right, i.e. divide by two:

Requirements:

Addition and subtraction
Unsigned greater-than comparison

Note that you must know the upper limit of your input beforehand. E.g. if your values are in range 0–99, you will need six comparisons (2,4,8,16,32,64), but you do not need 128 or higher. Comparing against 1 is also unnecessary.

	uint_type a      = original;
	uint_type result = 0;
	// Repeat these lines for powers of two, according
	// to the upper limit of known value of A.
        if(a >=64) { result = result + 32; a = a - 64; }
        if(a >=32) { result = result + 16; a = a - 32; }
        if(a >=16) { result = result +  8; a = a - 16; }
        if(a >= 8) { result = result +  4; a = a -  8; }
        if(a >= 4) { result = result +  2; a = a -  4; }
        if(a >= 2) { result = result +  1; a = a -  2; }

ASR/SRA: Arithmetic bit-shift to the right

Whereas logical bit-shiting to the right clears the most significant bits, i.e. it shifts in zeroes, arithmetic bit-shifting to the right will make copies of the most significant bit.

Examples:

00101111 >> 1 = 00010111
00101111 >> 3 = 00000101
10101111 >> 1 = 11010111
10101111 >> 3 = 11110101

When shifting by 1 bit

When shifting uses carry

When your platform's right-shift operation shifts in the carry flag instead of a zero-bit, you can use this to your advantage.

Example on 6502

	cmp #0	; Compare value to zero. Sets carry if it's negative.
	ror a   ; Shift to the right.
	        ; Carry becomes the new highest bit,
	        ; lowest bit becomes the new carry.

Using bit-test

Requirements:

Binary bit-shift to right by 1
Logical bit-test
Logical bit-set

	uint16 a      = original;
	uint16 result = a >> 1; // Logical bit-shift to the right
	// Copy the sign bit from the original.
	result = result | (a & 0x8000);
	// ^ This can also be done with a bit-test & bit-set operation
	//   if it is cheaper, or if OR and AND operations are not available.

When shifting by N bits

	uint16 a         = original;
	uint_type amount = shift_amount;

	uint16 result = a >> amount;
	if(a & 0x8000) // If the number was signed
	{
	    // This shift depends on leftshiting discarding
	    // the excess bits:
	    uint16 signbits = 0xFFFF << (16-amount);

	    // Alternative way:
	    //uint16 signbits = ~((1 << (16-amount)) - 1);

	    // Patch in the sign bits:
	    result = result | signbits;
	}

MUL: Unsigned multiplication

Multiplication without multiplication operations

Unsigned multiplication can be performed with additions and logical bit-shifts.

Requirements:

Binary bit-shift to right by 1 (Note: This must be LOGICAL, not arithmetic shift.)
Least significant bit test
Addition

	uint_type a      = original_1;
	uint_type b      = original_2;
	uint_type result = 0;
	while(a != 0)
	{
	    if(a & 1) // If the lowest order bit is set in A?
	    {
	        result = result + b;
	    }
	    a = a >> 1; // Note: This must set highest order bit ZERO. It must not copy the sign bit.
	    b = b + b;  // Alternatively, left-shift by 1 bit
	}

If you need to do multiplication with larger integer sizes, just implement this algorithm using additions and shifts for larger integer sizes.

Multiplication without any bit-operations

If you do not have bit-shifts or bit-tests, the following algorithm can still be used:

Requirements:

Addition and subtraction
Unsigned greater-than comparison
Random access memory or a stack

Note that you must know the upper limit of one of the parameters beforehand. E.g. if your values are in range 0–99, you will need a stack of seven slots and seven comparisons (1,2,4,8,16,32,64), but you do not need 128 or higher.

	uint_type a = original_1;
	uint_type b = original_2;
        uint_type stack[7];
        stack[0] = b; b = b+b; // b*1
        stack[1] = b; b = b+b; // b*2
        stack[2] = b; b = b+b; // b*4
        stack[3] = b; b = b+b; // b*8
        stack[4] = b; b = b+b; // b*16
        stack[5] = b; b = b+b; // b*32
        stack[6] = b;          // b*64
	uint_type result = 0;
	// Repeat these lines for powers of two, according
	// to the upper limit of known value of A.
        if(a >=64) { result = result + stack[6]; a = a - 64; } else { /* ignore stack[6] */ }
        if(a >=32) { result = result + stack[5]; a = a - 32; } else { /* ignore stack[5] */ }
        if(a >=16) { result = result + stack[4]; a = a - 16; } else { /* ignore stack[4] */ }
        if(a >= 8) { result = result + stack[3]; a = a -  8; } else { /* ignore stack[3] */ }
        if(a >= 4) { result = result + stack[2]; a = a -  4; } else { /* ignore stack[2] */ }
        if(a >= 2) { result = result + stack[1]; a = a -  2; } else { /* ignore stack[1] */ }
        if(a >= 1) { result = result + stack[0];             } else { /* ignore stack[0] */ }

This algorithm is faster than a raw addition-loop for all but the smallest multipliers.

If no extra memory or a stack is available, then an addition loop is the only option.

Requirements:

Addition and subtraction
Is-zero comparison

	uint_type a = original_1;
	uint_type b = original_2;

        /* Optional: Ensure minimal number of loops by swapping operands if necessary */
        if(a > b) { uint_type temp = a; a = b; b = temp; }

        uint_type result = 0;
        while(a > 0)
        {
            result = result + b;
            a      = a - 1;
        }

Multiplication of larger integers

You can construct larger-size multiplications by hand-crafting the multiplication using bitshifts and additions for larger integer sizes. Alternatively, you can use actual multiplications in the same way as you would do it on paper:

      15              wx        EA=x*z   GF=w*z
     *74             *yz        JH=x*y   LK=w*y
    ----            ----
   =  20           =  EA        B = E+F+H
   +  4            + GF         C = G+J   + carry from B
   + 35            + JH         D = L     + carry from C
   + 7             +LK
    ----            ----
    1110            DCBA

Example in x86_64 assembler, producing 128-bit * 128-bit = 256-bit multiplication with 64-bit registers:

	mov rsi, 0  ; use a zero as a constant sometimes.
	; DCBA now contains ????.
	; store 00EA:
	mov rax, x  ; low 64 bits of operand 1
	mul      z  ; low 64 bits of operand 2
	mov A, rax  ; store low 64 bits of result (A)
	mov rbx, rdx  ; store high 64 bits of result (E)
	mov rcx, rsi  ; clear C and D
	mov rdi, rsi
	; add 0GF0:
	mov rax, x  ; low 64 bits of operand 1
	mul      y  ; high 64 bits of operand 2
	add rbx, rax  ; store low 64 bits of result (F)
	adc rcx, rdx  ; store high 64 bits of result (G)
	;adc rdi, rsi ; not needed here, EA+GF0 never overflows to fourth digit
	; add 0JH0:
	mov rax, w  ; high 64 bits of operand 1
	mul      z  ; low 64 bits of operand 2
	add rbx, rax  ; store low 64 bits of result (H)
	adc rcx, rdx  ; store high 64 bits of result (J)
	adc rdi, rsi
	; add LK00:
	mov rax, w  ; high 64 bits of operand 1
	mul      y  ; high 64 bits of operand 2
	add rcx, rax  ; store low 64 bits of result (K)
	adc rdi, rdx  ; store high 64 bits of result (L)
	; store to actual DCB
	mov B, rbx
	mov C, rcx
	mov D, rdi

MLI/IMUL/MULI: Signed multiplication

The unsigned multiplication works for signed multiplication too, if the target value is of the same bit-width as the source-values and two’s complement is used. No special attention is needed for the sign bits.

If these conditions are not met, then signed multiplication can be accomplished by doing an unsigned multiplication and taking manually care of the negative-signs.

Requirements:

Unsigned multiplication
Testing if value is negative
Arithmetic negation
Logical inversion

	int_type    a = original_1;
	int_type    b = original_2;
	bool sign = false;
	if(a < 0)
	{
	    sign = !sign; // Toggle the signed flag
	    a = -a; // Negate a
	}
	if(b < 0)
	{
	    sign = !sign; // Toggle the signed flag
	    b = -b; // Negate b
	}
	// a and b are both positive now
	int_type result = a * b; // Do unsigned multiplication
	if(sign)
	{
	    // Negate the result
	    result = -result;
	}

DIV and MOD: Division and modulo

Unsigned integer division and modulo can be performed with additions, subtractions and logical bit-shifts.

Requirements:

Binary bit-shift to right by 1 (Note: This must be LOGICAL, not arithmetic shift.)
Addition
Subtraction
Unsigned greater-or-equal-than comparison

	if(divisor == 0)
	{
	    // TODO: Do whatever should happen when dividing by zero.
	    //       Don't run the rest of this algorithm if divisor = 0.
	    return;
	}
        
	uint_type scaled_divisor = divisor;  // The right-hand side of division
	uint_type remain         = dividend; // The left-hand side of division, i.e. what is being divided
	uint_type result   = 0;
	uint_type multiple = 1;
        
	while(scaled_divisor < dividend)
	// Alternative: while(!(scaled_divisor & 0x8000)) // For 16-bit, test highest order bit.
	// Alternative: while(not_signed(scaled_divisor)) // Same as above.
	{
	    scaled_divisor = scaled_divisor + scaled_divisor; // Multiply by two.
	    multiple       = multiple       + multiple;       // Multiply by two.
	    // You can also use binary shift-by-left here (i.e. multiple = multiple << 1).
	}
	do {
	    if(remain >= scaled_divisor)
	    {
	        remain = remain - scaled_divisor;
	        result = result + multiple;
	    }
	    scaled_divisor = scaled_divisor >> 1; // Divide by two.
	    multiple       = multiple       >> 1;
	} while(multiple != 0);
	
	// Now: result = division result (quotient)
	//      remain = division remainder (modulo)

If you need to do division or modulo with larger integer sizes, just implement this algorithm using additions and shifts for larger integer sizes.

Explanation

To understand how the division algorithm works, consider this simple algorithm:

Requirements:

Addition
Subtraction
Unsigned greater-or-equal-than comparison

	uint_type remain  = dividend; // The left-hand side of division, i.e. what is being divided
	uint_type result  = 0;
	while(remain >= divisor)
	{
	    remain = remain - divisor;
	    result = result + 1;
	}

This is how you might do divisions, if you had no bit-shift operations available. You simply measure how many times the divisor can be subtracted from the dividend before it becomes negative.

However, suppose your dividend is 1500 and your divisor is 2. This algorithm will loop for 750 iterations. How to improve it? Well, you could first divide by sixteen, i.e. two times eight.

	uint_type remain  = dividend;
	uint_type result  = 0;
	while(remain >= divisor*8)
	{
	    remain = remain - divisor*8;
	    result = result + 8;
	}
	while(remain >= divisor*1)
	{
	    remain = remain - divisor*1;
	    result = result + 1;
	}

Now this algorithm will loop for 99 iterations, and produces the same correct result. But we could do better! How about dividing first by 128? And maybe we could subdivide that 16 into 4 and 8.

The first algorithm automates this subdivision process, doing the compare-and-subtract test for all power-of-two multiples of the divisor, converging towards the O(log2(n)) goal. It begins with the largest multiple of the divisor that is not larger than the dividend.

Implementing with a stack, without bit-shift operations

	uint_type remain         = dividend; // The left-hand side of division, i.e. what is being divided
	uint_type scaled_divisor = divisor;  // The right-hand side of division
	uint_type result   = 0;
	uint_type multiple = 1;
	// TODO: Check for zero divisor

	stack_push(0);
	while(scaled_divisor < dividend)
	{
	    stack_push(multiple);
	    stack_push(scaled_divisor);
	    scaled_divisor = scaled_divisor + scaled_divisor; // Multiply by two.
	    multiple       = multiple       + multiple;       // Multiply by two.
	}
	while( (scaled_divisor = stack_pop()) != 0 )
	{
	    multiple = stack_pop();
	    if(remain >= scaled_divisor)
	    {
	        remain = remain - scaled_divisor;
	        result = result + multiple;
	    }
	}
	// Now: result = division result (quotient)
	//      remain = division remainder (modulo)

DVI/IDIV/DIVI: Signed division

Signed division can be accomplished by doing an unsigned division and taking manually care of the negative-signs.

Requirements:

Unsigned division
Testing if value is negative
Arithmetic negation
Logical inversion

	int_type    a = original_1;
	int_type    b = original_2;
	bool sign = false;
	if(a < 0)
	{
	    sign = !sign; // Toggle the signed flag
	    a = -a; // Negate a
	}
	if(b < 0)
	{
	    sign = !sign; // Toggle the signed flag
	    b = -b; // Negate b
	}
	// a and b are both positive now
	int_type result = a / b; // Do unsigned division
	if(sign)
	{
	    // Negate the result
	    result = -result;
	}

MDI/IMOD/MODI: Signed modulo (remainder)

Signed modulo can be accomplished by calculating an unsigned modulo and taking manually care of the negative-signs.

Requirements:

Unsigned modulo
Testing if value is negative
Arithmetic negation
Logical inversion

	int_type    a = original_1;
	int_type    b = original_2;
	bool sign = false;
	if(a < 0)
	{
	    sign = !sign; // Toggle the signed flag
	    a = -a; // Negate a
	}
	if(b < 0)
	{
	    // Exception: Do NOT toggle the signed flag.
	    b = -b; // Negate b
	}
	// a and b are both positive now
	int_type result = a % b; // Do unsigned modulo
	if(sign)
	{
	    // Negate the result
	    result = -result;
	}

Exponentiation

From https://en.wikipedia.org/wiki/Exponentiation_by_squaring

Exponentiation is repeated multiplication. For positive integer exponents:

Requirements:

Multiplication
Least significant bit test
Binary bit-shift to right by 1
Unsigned greater-than comparison

	uint_type a = original_exponent;
	int_type  b = original_value;
	int_type result = 1;
	while(a > 1)
	{
	    if(a & 1) // A is odd?
	    {
	        result = result * b;
	    }
	    b = b * b;
	    a >>= 1; // right-shift, i.e. divide by 2, round down
	}
	if(a > 0)
	{
	    result = result * b;
	}

Exponentiation without any bit-operations

If you do not have bit-shifts or bit-tests, the following algorithm can still be used:

Requirements:

Multiplication
Subtraction
Unsigned greater-than comparison
Random access memory or a stack

Note that you must know the upper limit of the exponents beforehand. E.g. if your exponents are in range 0–9, you will need a stack of four slots and four comparisons (1,2,4,8), but you do not need 16 or higher.

	uint_type a = original_exponent;
	int_type  b = original_value;
        int_type stack[4];
        stack[0] = b; b = b*b; // b^1
        stack[1] = b; b = b*b; // b^2
        stack[2] = b; b = b*b; // b^4
        stack[3] = b;          // b^8
	int_type result = 1;
	// Repeat these lines for powers of two, according
	// to the upper limit of known value of A.
        if(a >= 8) { result = result * stack[3]; a = a -  8; } else { /* ignore stack[3] */ }
        if(a >= 4) { result = result * stack[2]; a = a -  4; } else { /* ignore stack[2] */ }
        if(a >= 2) { result = result * stack[1]; a = a -  2; } else { /* ignore stack[1] */ }
        if(a >= 1) { result = result * stack[0];             } else { /* ignore stack[0] */ }

This algorithm is faster than a multiplication-loop for exponents larger than 4.

It is not necessary to use the extra memory or stack or temporary variables, but then you would need to recalculate the same values many times. It is still more efficient than a multiplication loop if the multiplier is large enough and has few bits set.

	uint_type a = original_exponent;
	int_type  b = original_value;
	int_type  result = 1;
	// Repeat these lines for powers of two, according
	// to the upper limit of known value of A.
        if(a >= 8) { int_type tmp=b; b=b*b; b=b*b; b=b*b; result = result * tmp; a = a -  8; }
        if(a >= 4) { int_type tmp=b; b=b*b; b=b*b;        result = result * tmp; a = a -  4; }
        if(a >= 2) { int_type tmp=b; b=b*b;               result = result * tmp; a = a -  2; }
        if(a >= 1) {                                      result = result * b;               }

A multiplication loop is the simplest and slowest option.

Requirements:

Multiplication
Subtraction
Is-zero comparison

	uint_type a = original_exponent;
	uint_type b = original_value;

        uint_type result = 1;
        while(a > 0)
        {
            result = result * b;
            a      = a - 1;
        }

Optimization for constant exponents

If you know the exponent beforehand, there is a slightly more efficient way to subdivide the exponentiation, especially if the exponent is large.

For example, with this algorithm quoted above, n^15 evaluates into the following sequence of six multiplications:

     result = b          // b
     b = b*b             // b^2
     result = result * b // b^3 (b * b^2)
     b = b*b             // b^4
     result = result * b // b^7 (b^3 * b^4)
     b = b*b             // b^8
     result = result * b // b^15 (b^7 * b^8)

But n^15 can in fact be done with just five multiplications:

     result1 = b                // b
     result2 = result1*result1  // b^2
     result3 = result1*result2  // b^3
     result6 = result3*result3  // b^6
     result9 = result6*result3  // b^9
     result15 = result9*result6 // b^15

This works by recursively subdividing the exponent into two parts, and is done automatically by e.g. GCC. The GCC source code gives this reference for the algorithm:

   /* Expand power operator to optimal multiplications when a value is raised
      to an constant integer n. See section 4.6.3, "Evaluation of Powers" of
      Donald E. Knuth, "Seminumerical Algorithms", Vol. 2, "The Art of Computer
      Programming", 3rd Edition, 1998.  */

It features a 256-element lookup table. For exponent n, this table gives value i = table[n]. It means the following: x^n can be calculated as x^i * x^(n-i).

You can find the table in the GCC source code (search for “powi_table”), or in that book, or alternatively you can use this C++ program to generate the table in whichever size you desire: http://bisqwit.iki.fi/jkp/powi-table-generator.cc

The first 16 values of that table are: 0, 1, 1, 2, 2, 3, 3, 4, 4, 6, 5, 6, 6, 10, 7, and 9. For exponent 15, this means x^9 * x^6. For exponent 9, this means x^6 * x^3. And so on.

Comparisons

Equality comparison

Equality can be tested in various ways:

Non-equality and inversion:

   result = !(a != b);

Less-than operation:

   result = !(a < b) && !(b < a);

Subtraction and compare to zero:

   result = a-b; // Zero if equal

XOR and compare to zero:

   result = a^b; // Zero if equal

Non-equality comparison

Non-equality can be tested in various ways:

Equality and inversion:

   result = !(a == b);

Less-than operation:

   result = a < b || b < a;

Subtraction and compare to zero:

   result = a-b; // Non-zero if equal

XOR and compare to zero:

   result = a^b; // Non-zero if equal

Greater-than comparison

Less-or-equal-than and inversion:

   result = !(a <= b);

Less-than operation:

   result = b < a;

Greater-or-equal-than comparison

Less-than operation:

   result = !(a < b);

Less-than comparison

Greater-or-equal-than and inversion:

   result = !(a >= b);

Less-or-equal-than comparison

Greater-than and inversion:

   result = !(a > b);

Less-than operation:

   result = !(b < a);

Appendix: Floating-point mathematics cookbook

Often you also need floating point mathematics, and if the hardware does not have an integer multiplication unit, it most probably does not have floating point hardware either.

And if it does, it probably cannot calculate trigonometry at hardware level.

The list of standard trigonometric functions provided by a floating point math library may look intimidating, but a careful study reveals fascinating equivalencies, that enable reducing nearly everything into a set of just few functions: sqrt, log, exp and sin.

Reducing complicated functions into simpler components using complex arithmetic

When you operate using complex-number mathematics, everything can be reduced into sqrt, log and exp.

Operation Reduction Other equivalent functions

asinh(x) log(x + sqrt(x² + 1)) = −asinh(−x)
= −1 i* asin(−1i*x)
= 1 i*asin(1i*x)

acosh(x) log(x + sqrt(x² − 1)) = 1 i*acos(x) when tan(arg(x)) > 0
= −1 i*acos(x) when tan(arg(x)) < 0

asin(x) 1i *log(1i*(x + sqrt(x²−1) = −asin(−x)
= −1 i*asinh(−1i*x)
= 1 i*asinh(1i*x)

acos(x) −1 i *log(x + sqrt(x²−1)) or −1i*log(x − sqrt(²−1)) = 1 i*acosh(x) when tan(arg(x)) < 0
= −1 i*acosh(x) when tan(arg(x)) > 0

atanh(x) log( (1+x) / (1−x)) / 2 = i * atan(−1i * x)
= sometimes also −1i*atan(1i*x)

atan(x) −0.5 i*log((1+1i*x) / (1−1i*x)) = −1 i*atanh(1i*x)
= sometimes also 1 i*atanh(−1i*x)

sinh(x) 0.5 * (exp(x) − exp(−x)) = −sinh(−x)
= 1i*sin(−1i*x)
= −1i*sin(1i*x)

cosh(x) 0.5 *(exp(x) + exp(−x)) = cosh(−x)
= cos(1i*x)
= cos(−1i*x)

sin(x) 0.5 i*(exp(x*−1i) − exp(x*1i)) = −sin(−x)
= 1i*sinh(−1i *x)
= −1i*sinh(1i*x)

cos(x) 0.5 *(exp(x*−1i) − exp(x*1i)) = cos(−x)
= cosh(1i*x)
= cosh(−1i*x)

tanh(x) sinh(x) / cosh(x)
= −1 * (1−exp( 2*x)) / (1+exp( 2*x)) = −tanh(−x)
= tan(1i*x)
= −tan(−1i*x)

tan(x) sin(x) / cos(x)
= 1i * (1−exp(2i*x)) / (1+exp(2i*x)) = −tan(−x)
= tanh(1i*x)
= −tanh(−1i*x)

sqrt(x) exp(0.5 * log(x)) = pow(x,0.5)

exp(x) = pow(e, x) (where e is euler's constant)

pow(x,y) exp(y * log(x)) Note: If x is negative, log(x) will produce complex numbers.

log10(x) log(x) / log(10)

x/y = x * pow(y,-1)
= x * exp(-log(y))

Operation	Reduction	Other equivalent functions
asinh(x)	log(x + sqrt(x² + 1))	= −asinh(−x) = −1 i* asin(−1ix) = 1 iasin(1i*x)
acosh(x)	log(x + sqrt(x² − 1))	= 1 iacos(x) when tan(arg(x)) > 0 = −1 iacos(x) when tan(arg(x)) < 0
asin(x)	1i log(1i(x + sqrt(x²−1)	= −asin(−x) = −1 iasinh(−1ix) = 1 iasinh(1ix)
acos(x)	−1 i log(x + sqrt(x²−1)) or −1ilog(x − sqrt(²−1))	= 1 iacosh(x) when tan(arg(x)) < 0 = −1 iacosh(x) when tan(arg(x)) > 0
atanh(x)	log( (1+x) / (1−x)) / 2	= i * atan(−1i * x) = sometimes also −1iatan(1ix)
atan(x)	−0.5 ilog((1+1ix) / (1−1i*x))	= −1 iatanh(1ix) = sometimes also 1 iatanh(−1ix)
sinh(x)	0.5 * (exp(x) − exp(−x))	= −sinh(−x) = 1isin(−1ix) = −1isin(1ix)
cosh(x)	0.5 *(exp(x) + exp(−x))	= cosh(−x) = cos(1ix) = cos(−1ix)
sin(x)	0.5 i(exp(x−1i) − exp(x*1i))	= −sin(−x) = 1isinh(−1i x) = −1isinh(1ix)
cos(x)	0.5 (exp(x−1i) − exp(x*1i))	= cos(−x) = cosh(1ix) = cosh(−1ix)
tanh(x)	sinh(x) / cosh(x) = −1 * (1−exp( 2x)) / (1+exp( 2x))	= −tanh(−x) = tan(1ix) = −tan(−1ix)
tan(x)	sin(x) / cos(x) = 1i * (1−exp(2ix)) / (1+exp(2ix))	= −tan(−x) = tanh(1ix) = −tanh(−1ix)
sqrt(x)	exp(0.5 * log(x))	= pow(x,0.5)
exp(x)		= pow(e, x) (where e is euler's constant)
pow(x,y)	exp(y * log(x))	Note: If x is negative, log(x) will produce complex numbers.
log10(x)	log(x) / log(10)
x/y		= x * pow(y,-1) = x * exp(-log(y))

Implementing complex-number functions

Operation	Equivalent function using real functions only	What happens in special cases
sin(x)	sin(Xr)cosh(Xi) + icos(Xr)*sinh(Xi)	sin(Xr) if Xi=0 (duh) i*sinh(Xi) if Xr=0
cos(x)	cos(Xr)cosh(Xi) − isin(Xr)*sinh(Xi)	cos(Xr) if Xi=0 cos(Xi) if Xr=0
sinh(x)	sinh(Xr)cos(Xi) + icosh(Xr)*sin(Xi)	sinh(Xr) if Xi=0 i*sin(Xi) if Xr=0
cosh(x)	cosh(Xr)cos(Xi) + isinh(Xr)*sin(Xi)	cosh(Xr) if Xi=0 cos(Xi) if Xr=0
tan(x)	(sin(2Xr) + isinh(2Xi)) / (cos(2Xr) + cosh(2*Xi))	tan(Xr) if Xi=0 i*tanh(Xi) if Xr=0
tanh(x)	(sinh(2Xr) + isin(2Xi)) / (cos(2Xr) + cosh(2*Xi))	tanh(Xr) if Xi=0 i*tan(Xi) if Xr=0
exp(x)	exp(Xr)cos(Xi) + iexp(Xr)*sin(Xi)	exp(Xr) if Xi=0 i*sin(Xi) if Xr=0
log(x)	log(abs(X)) + i * arg(X)	log(Xr) if Xi=0 and Xr>=0 log(-Xr) − πi if Xi=0 and Xr<0 (doesn't reduce into real) -iπ/2 if Xr=0 and Xi<0 iπ/2 if Xr=0 and Xi>0.
abs(x)	sqrt(Xr² + Xi²)	Xr if Xi=0 Xi if Xr=0
arg(x)	atan2(Xi,Xr)	0 if Xi=0 and Xr>0 π if Xi=0 and Xr<0 -iπ/2 if Xr=0 and Xi<0 iπ/2 if Xr=0 and Xi>0.
atan(x)	0.25ilog((Xr²+Xi²+2Xi+1) / (Xr²+Xi² −2Xi+1)) + 0.5 (atan2(Xr,Xi+1)+atan2(Xr,1−Xi))	atan(Xr) if Xi=0
atanh(x)	0.25 log((Xr²+Xi²+2Xr+1) / (Xr²+Xi² −2Xr+1)) + 0.5i (atan2(Xi,Xr+1)+atan2(Xi,1−Xr))	atanh(Xr) if Xi=0
x+y	Xr + iXi + Yr + iYi
x−y	Xr + iXi − Yr − iYi
x*y	(XrYr − XiYi) + i(XiYr + Xr*Yi)	i*i = −1
x/y	(XrYr + XiYi) / (Xr² + Yi²) + i((XiYr − Xr*Yi) / (Yr² + Yi²))
sqrt(x)	sqrt((abs(x) + Xr) / 2) + isgn(Xi)sqrt((abs(x) − Xr) / 2) where sgn(n) = -1, 0 or 1 depending on n's sign.

It is especially interesting to see that complex-number sqrt, log and exp can all be implemented in terms of real-number sqrt, log, exp, sin/cos and atan. This defines the minimum set of floating point functions that must be implemented.

Implementing real-number functions

Operation	Alternatives
sin(x)	−sin(−x) cos(π/2 − x) tan(x) * cos(x) $http://upload.wikimedia.org/math/5/4/6/546ecab719ce73dfb34a7496c942972b.png$ Read: http://en.wikipedia.org/wiki/Trigonometric_function, http://en.wikipedia.org/wiki/Taylor_series
cos(x)	cos(−x) sin(π/2 − x) sin(x) / tan(x)
tan(x)	−tan(−x) sign(x) * sqrt(1 / cos(x)²-1) sin(x) / cos(x)
asin(x)	−asin(−x) π/2 − acos(x) atan(x / sqrt(1 − x²)) $http://upload.wikimedia.org/math/1/1/a/11a93a7c7a5d23f74cdbab5a1e513123.png$
acos(x)	π − acos(−x) π/2 − asin(x) π/2 − atan(x / sqrt(1 − x²)) asin(sqrt(1−x²)) if 0<=x<=1
atan(x)	−atan(−x) asin(x² / sqrt(x²+1)) 2 * atan((sqrt(1+x²) − 1) / x) 2 * atan(x / (1 + sqrt(1 + x²))) atan(y) + atan((x−y)/(1+x*y)) when sign(y)=sign(x) $http://upload.wikimedia.org/math/c/0/1/c01ed76e4510e32f09a91f34660dcd78.png$ Read: http://en.wikipedia.org/wiki/Inverse_trigonometric_functions
exp(x)	pow(e, x) cosh(x) + sinh(x) cosh(−x) − sinh(−x) pow(1 + x/y, y) when y approaches infinity exp(x−y) * exp(y) for any y $http://upload.wikimedia.org/math/0/b/c/0bc08045195dc823c22d1fa283cb0759.png$ Read: http://en.wikipedia.org/wiki/Exponential_function
log(x)	log10(x) / log10(e) log2(x) / log2(e) asinh((x²+1) / (2x)) acosh((x²−1) / (2x)) 2*atanh((x−1) / (x+1)) log(x/y) + log(y) for any y>0 $http://upload.wikimedia.org/math/6/d/b/6dbf56b2e5f34d298c1e71ddd93b3448.png$ Read: http://en.wikipedia.org/wiki/Natural_logarithm
log2(x)	log10(x) / log10(2) log(x) / log(2)
log10(x)	log2(x) / log2(10) log(x) / log(10)
pow(x,y)	1 / pow(x,−y) exp(y*log(x)) if y is integer, a special algorithm may be applicable always pay attention to special cases
sqrt(x)	pow(x,0.5) exp(0.5 * log(x)) Read: http://en.wikipedia.org/wiki/Methods_of_computing_square_roots, http://en.wikipedia.org/wiki/Newton%27s_method
sinh(x)	−sinh(−x) exp(x) − cosh(x) cosh(x) − exp(−x) 2sinh(x/2)cosh(x/2) 1/2 * (exp(x) − exp(−x)) (exp(2x)−1) / (2exp(x)) tanh(x) * cosh(x) $http://upload.wikimedia.org/math/4/3/0/43047f57ab3e2183d7875764566580f4.png$
cosh(x)	cosh(−x) exp(x) − sinh(x) exp(−x) + sinh(x) 1/2 * (exp(x) + exp(−x)) (exp(2x)+1) / (2exp(x)) sinh(x) / tanh(x) $http://upload.wikimedia.org/math/8/4/a/84aa1c9392c1d950fc6ab5a51043fc02.png$
tanh(x)	−tanh(−x) (exp(2x)−1) / (exp(2x)+1) sinh(x) / cosh(x) $http://upload.wikimedia.org/math/a/1/6/a1635f7fd0e9e031ad7f92b27271226a.png$
asinh(x)	log(x + sqrt(x² + 1)) 1/2 * acosh(2*x² + 1)
acosh(x)	log(x + sqrt(x² − 1)) 1/2 * acosh(2*x² − 1)
atanh(x)	1/2 * log(x+1) − 1/2 * log(1−x) log( (1+x) / (1−x)) / 2 $http://upload.wikimedia.org/math/4/7/3/47383305b98c57bef4dad00d1490dd84.png$

Note: Even though more complicated trigonometric functions such as sinh() and atanh() can be implemented in terms of the basic functions, it may still be beneficial to implement them separately using Taylor series for better accuracy.

Floating point addition

TODO

Floating point subtraction

TODO

Floating point multiplication

	int_type a_exponent = original_1_exponent;
	int_type a_mantissa = original_1_mantissa;
	bool     a_sign     = original_1_sign;
	int_type b_exponent = original_2_exponent;
	int_type b_mantissa = original_2_mantissa;
	bool     b_sign     = original_2_sign;

	int_type result_exponent = original_1_exponent + original_2_exponent;
	int_type result_mantissa = original_1_mantissa * original_2_mantissa;
	bool     result_sign     = a_sign ^ b_sign; // XOR
	
	// After this, comes renormalization, which is the adjusting of
	// the exponent & mantissa until the mantissa is in proper range.

Floating point division

	int_type a_exponent = original_1_exponent;
	int_type a_mantissa = original_1_mantissa;
	bool     a_sign     = original_1_sign;
	int_type b_exponent = original_2_exponent;
	int_type b_mantissa = original_2_mantissa;
	bool     b_sign     = original_2_sign;

	int_type result_exponent = original_1_exponent - original_2_exponent;
	int_type result_mantissa = original_1_mantissa / original_2_mantissa;
	bool     result_sign     = a_sign ^ b_sign; // XOR
	
	// After this, comes renormalization, which is the adjusting of
	// the exponent & mantissa until the mantissa is in proper range.

Bit mathematics cookbook

ADD and SUB

Addition without carry

Using subtraction

Using bitwise operations, with XOR

Other

ADC: Addition with carry

Using addition & comparison

Using addition with larger operands

Using bitwise operations, without XOR

Using bitwise shift-with-carry or rotate-with-carry operations

Addition of larger integers than native register size

Subtraction without carry

Using addition

SBC: Subtraction with carry

SGN / Testing if value is negative

NEG

Arithmetic negation

NOT

Logical negation (NOT / invert operation)

Using XOR

Using subtraction

Using arithmetic negation

AND and OR

Logical bit-wise AND

Using NOT and OR

Doing it bit-by-bit

Bit-by-bit using arithmetic operations only

Logical bit-wise OR

Using NOT and AND

Doing it bit-by-bit

Bit-by-bit using arithmetic operations only

XOR

Logical bit-wise XOR

Using NOT, OR and AND

Doing it bit-by-bit

Bit-by-bit using arithmetic operations only

SHL/LSL/ASL/SL: Logical/binary/arithmetic bit-shift to left by 1

Larger-size bit-shifting to left

SHR/LSR/SR: Logical/binary bit-shift to right by 1

Without divisions

Larger-size bit-shifting to right

When you have arithmetic bit-shift to the right

When you do not have any bit-operations

ASR/SRA: Arithmetic bit-shift to the right

When shifting by 1 bit

When shifting uses carry

Using bit-test

When shifting by N bits

MUL: Unsigned multiplication

Multiplication without multiplication operations

Multiplication without any bit-operations

Multiplication of larger integers

MLI/IMUL/MULI: Signed multiplication

DIV and MOD: Division and modulo

Explanation

Implementing with a stack, without bit-shift operations

DVI/IDIV/DIVI: Signed division

MDI/IMOD/MODI: Signed modulo (remainder)

Exponentiation

Exponentiation without any bit-operations

Optimization for constant exponents

Comparisons

Equality comparison

Non-equality comparison

Greater-than comparison

Greater-or-equal-than comparison

Less-than comparison

Less-or-equal-than comparison

Appendix: Floating-point mathematics cookbook

Reducing complicated functions into simpler components using complex arithmetic

Implementing complex-number functions

Implementing real-number functions

Floating point addition

Floating point subtraction

Floating point multiplication

Floating point division