Talk:Binary GCD algorithm/Archive 1

Latest comment: 1 year ago by Ohconfucius in topic Citing uutils
Archive 1

ARM implementation

The ARM implementation lacks the initial u = 0 test. The code also does not appear to make good use of the conditional instruction set, so I have some optimizations to suggest:

remove_twos_loop optimization

gcd:
    @ Arguments arrive in registers r0 and r1
    mov     r3, #0            @ Initialize r3, the power of 2 in the result
    orr     r12, r0, r1       @ r12 = (r0 | r1)
    tst     r12, #1           @ Test least significant bit of (r0 | r1)
    bne     gcd_loop          @ If nonzero, either r0 or r1 are odd, jump into middle of next loop
remove_twos_loop:
    mov     r0, r0, lsr #1    @ Shift r0 right, dividing it by 2
    mov     r1, r1, lsr #1    @ Shift r1 right, dividing it by 2
    add     r3, r3, #1        @ Increment r3
    tst     r0, #1            @ Check least significant bit of r0
    bne     gcd_loop          @ If nonzero, r0 is odd, jump into middle of next loop
    tst     r1, #1            @ Check least significant bit of r1
    beq     remove_twos_loop  @ If zero, r0 and r1 still even, keep looping

optimization 1

gcd:
    @ Arguments arrive in registers r0 and r1
    mov     r3, #0            @ Initialize r3, the power of 2 in the result
    orr     r12, r0, r1       @ r12 = (r0 | r1)
remove_twos_loop:
    tst     r12, #1           @ Test least significant bit of (r0 | r1)
    moveq   r0, r0, lsr #1    @ Shift r0 right, dividing it by 2
    moveq   r1, r1, lsr #1    @ Shift r1 right, dividing it by 2
    moveq   r12, r12, lsr #1  @ Shift r12 right, dividing it by 2
    beq     remove_twos_loop  @ If zero, r0 and r1 still even, keep looping

optimization 2

gcd:
    @ Arguments arrive in registers r0 and r1
    mov     r3, #0            @ Initialize r3, the power of 2 in the result
remove_twos_loop:
    tst     r0, #1            @ Test least significant bit of r0
    tsteq   r1, #1            @ Test least significant bit of r1
    moveq   r0, r0, lsr #1    @ Shift r0 right, dividing it by 2
    moveq   r1, r1, lsr #1    @ Shift r1 right, dividing it by 2
    beq     remove_twos_loop  @ If zero, r0 and r1 still even, keep looping

gcd_loop optimization

gcd_loop:                     @ Loop finding gcd of r0, r1 not both even
    tst     r0, #1            @ Check least significant bit of r0
    moveq   r0, r0, lsr #1    @ If r0 is even, shift r0 right, dividing it by 2, and...
    beq     gcd_loop_continue @ ... continue to next iteration of loop.
    tst     r1, #1            @ Check least significant bit of r1
    moveq   r1, r1, lsr #1    @ If r1 is even, shift it right by 1 and...
    beq     gcd_loop_continue @ ... continue to next iteration of loop.
    cmp     r0, r1            @ Compare r0 to r1
    subcs   r2, r0, r1        @ If r0 >= r1, take r0 minus r1, and...
    movcs   r0, r2, lsr #1    @ ... shift it right and put it in r0.
    subcc   r2, r1, r0        @ If r0 < r1, take r1 minus r0, and...
    movcc   r1, r2, lsr #1    @ ... shift it right and put it in r1.
gcd_loop_continue:
    cmp     r0, #0            @ Has r0 dropped to zero?
    bne     gcd_loop          @ If not, iterate
    mov     r0, r1, asl r3    @ Put r1 << r3 in the result register
    bx      lr                @ Return to caller

optimization

gcd_loop:                     @ Loop finding gcd of r0, r1 not both even
    tst     r0, #1            @ Check least significant bit of r0
    moveq   r0, r0, lsr #1    @ If r0 is even, shift r0 right, dividing it by 2, and...
    beq     gcd_loop          @ ... continue to next iteration of loop.
gcd_loop_2:                   @ Loop finding gcd of r0, r1
    tst     r1, #1            @ Check least significant bit of r1
    moveq   r1, r1, lsr #1    @ If r1 is even, shift it right by 1 and...
    beq     gcd_loop_2        @ ... continue to next iteration of loop.
    cmp     r0, r1            @ Compare r0 to r1
    subcc   r2, r1, r0        @ If r0 < r1, take r1 minus r0, and...
    movcc   r1, r2, lsr #1    @ ... shift it right and put it in r1.
    bcc     gcd_loop_2        @ ... continue to next iteration of loop (r0 is still odd).
    sub     r2, r0, r1        @ If r0 >= r1, take r0 minus r1, and...
    movs    r0, r2, lsr #1    @ ... shift it right and put it in r0.
    bne     gcd_loop          @ Has r0 dropped to zero? If not, iterate
    mov     r0, r1, asl r3    @ Put r1 << r3 in the result register
    bx      lr                @ Return to caller

— Preceding unsigned comment added by 80.175.250.218 (talk) 17:09, 31 October 2005 (UTC)

Sounds good to me. The point of this code sample is just to illustrate how efficient binary GCD is on a real machine. Feel free to plug in the most optimized version you can imagine. Deco 19:50, 31 October 2005 (UTC)

unsigned integers

If don't see, why the algorithm is restricted to unsigned integers. As far as I can see it works equally on negative numbers. Even more the algorithm would be simplified because no v>=u comparison is necessary. 134.102.210.237 11:54, 31 March 2006 (UTC)

Implementation in assembly

Thanks Balabiot for pointing out the incorrect result (always zero) when u=0 or v=0. I have fixed the code to return the sum of the inputs when either or both are zero. The code now matches the output of the Euclidean algorithm.

I have reverted movne/movne to movs/movnes to ensure that r0 and r1 are both nonzero on entry to check_two_r0. (The previous code hung e.g. with u=0, v=3.)

I didn't understand why the case where u and v are powers of two needed fixing, could you please elaborate? -- Regregex (talk) 17:02, 10 October 2008 (UTC)

I don't remember, probably I just screwed up. :-) Balabiot (talk) 17:03, 23 October 2008 (UTC)

GCD in hardware

The comparison of the Binary GCD with GCD implemented by hardware division lead me to the question, how fast GCD could be when implemented as hardware instruction. I assume it must be as fast or slow as one division. HenningThielemann (talk) 20:33, 2 October 2009 (UTC)

gcd(0,0)

gcd(x,y) is the integer, that is divided by all common divisors of x and y. Thus gcd(0,0)=0, because the common divisors of 0 and 0 are all natural numbers, and these are also divisors of 0. HenningThielemann (talk) 20:22, 2 October 2009 (UTC)

A quality source would be needed to confirm this. Cursory examination of the literature shows that, indeed, gcd(0,0) is typically not defined. (See, for instance, Niven, Montgomery, Zuckerman, "Introduction to the theory of numbers".) 71.182.236.76 (talk) 20:55, 31 October 2009 (UTC)

Implementations

I've removed the ML implementation, because it teaches more about ML than about the algorithm. I have my doubts about the value of the assembly version, but I can't really assess it as I can't read it. IMHO, the C implementation is important because it exemplifies the use of bitwise operations for efficiency. Qwertyus 22:31, 13 April 2006 (UTC)

I think you did the right thing. I just restored the C and ARM implementations after User:Arvindn blanked them, for these reasons: IMHO, C isn't important for showing bitwise operations; it should be obvious that you'd use bitwise ops where appropriate. But the C implementation is easier to read than the English algorithm at the moment, so it needs to stay at least until there's an appropriate substitute (pseudocode? Knuth-style algorithm description?). IMHO, the ARM implementation is really enlightening, because it shows the real size and speed benefit of the algorithm in a way that no higher-level language's compiler even approaches. (In particular, it totally stomps the C implementation.) Without some hand-optimized assembly implementation, the advantage of binary GCD over Euclid is not obvious, IMHO. --Quuxplusone 21:18, 1 January 2007 (UTC)

The following pseudocode describes the algoritm probably best:

Input: Positive integers u and v
Output: gcd(u,v)
Note: division and multiplication by 2 can in many languages most efficiently be done by bitshifting.
START 
s:=1
while even(u) and even(v) do
   u:=u/2;v:=v/2;s:=s*2;
while even(u) do u:=u/2
while even(v) do v:=v/2
while u<>v do
begin
  if u>v then
     begin
        u:=u-v
        while even(u) do u:=u/2
     end
     else
     begin
        v:=v-u
        while even(v) do v:=v/2
     end
end
return(u*s)
END

Hkleinnl (talk) 20:07, 12 August 2010 (UTC)

Why (u | v) & 1 == 0 is the correct test

Regarding edits by 94.43.147.234 (talk): Step 2 states:

If u and v are both even, then gcd(u, v) = 2·gcd(u/2, v/2), because 2 is a common divisor.

Therefore, before u and v may both be halved and shift incremented, the LSB of u must be zero, and the LSB of v must be zero. The test (in line 12 of the C code fragment) would look like this (avoiding the short-circuit operators for simplicity):

    for (shift = 0; (u & 1) == 0 & (v & 1) == 0; ++shift) {

This is an expression of the form  , which by de Morgan's laws can be rewritten as  . In other words, Halve u and v and increment shift if the following is false: that u is odd and/or v is odd. Notice the Boolean negation ! inserted at the front:

    for (shift = 0; !((u & 1) == 1 | (v & 1) == 1); ++shift) {

As the only non-zero result of x & 1 is 1, the equalities disappear:

    for (shift = 0; !((u & 1) | (v & 1)); ++shift) {

& is distributive over | so the two & 1 operations can be combined:

    for (shift = 0; !(((u) | (v)) & 1); ++shift) {

Comparing with zero is equivalent to the Boolean negation. After removing redundant parentheses we obtain the test as originally written:

    for (shift = 0; ((u | v) & 1) == 0; ++shift) {

As a further test, the code fragment as it stood (comprising ((u & v) & 1) == 0) was compiled, and gcd(1,2) was called. As u and v have no two bits set in the same column, the test never became false and the code hung in a loop between lines 12–15. u | v also corresponds to the ARM code (ORRS) which has been tested. – Regregex (talk) 16:28, 11 July 2011 (UTC)

bug in wikipedia underlying framework?

It appears that there are two pages:

a) http://en.wiki.x.io/wiki/Binary_gcd

(last modified: "4 January 2012 at 17:09.")

trying to edit that page will take you to:

  http://en.wiki.x.io/w/index.php?title=Binary_GCD_algorithm&action=edit

b) http://en.wiki.x.io/wiki/Binary_GCD_algorithm

(last modified: "12 January 2012 at 15:34.")

trying to edit that page will take you to:

  http://en.wiki.x.io/w/index.php?title=Binary_GCD_algorithm&action=edit

However, a) and b) appear different as a) does not reflects changes to b).

By manually modifying the "edit" URL from a) to:

  http://en.wiki.x.io/w/index.php?title=Binary_gcd&action=edit

It appears that a) is just "#REDIRECTBinary GCD algorithm"

However, it does not seems to work.

I hope this helps (I could not find a "report a bug" thing so I'm posting here) — Preceding unsigned comment added by 78.31.46.6 (talk) 16:16, 12 January 2012 (UTC)

Binary gcd is a redirect to Binary GCD algorithm. That link is intentional; anybody looking for the first is redirected to the article at the second. There is no need to edit Binary gcd. Glrx (talk) 18:58, 13 January 2012 (UTC)

Assembly code

I have once again reverted the addition of assembly code to this article.[1] It is the duty of the person trying to insert material to garner consensus; see WP:BRD.

The material is inappropriate for the article. It is irrelevant for order statistics because it deals with a fixed size number. GCD for 32-bit or 64-bit numbers is O(1). The issue for algorithms is behavior for very large numbers (e.g., arbitrary precision).

The assembly code is also too detailed for this article. Algorithms should be given in psuedo code. Higher level languages can also be used. ARM assembly is just too distant.

The code wants to be kept because the branch predication article referred to it; branch prediction is an irrelevant technique when subtracting 1000 digit numbers. The branch predication article should contain its own examples.

Glrx (talk) 22:43, 11 February 2014 (UTC)

The sample code was present for some considerable time before you removed it. Therefore the onus was on you to start the discussion and garner consensus to remove it. Several different editors have replaced it but you have repeatedly removed it. So as not to perpetuate an edit war I have refrained from reverting your edit, but you might like to self-revert until the matter is settled here. 81.136.202.93 (talk) 16:45, 18 February 2014 (UTC)
There's a discussion on the table. I've given reasons why the material is inappropriate for this article. You are silent on that issue. Why do you think the unsourced ARM assembly code is necessary for this article about binary GCD? Glrx (talk) 22:19, 18 February 2014 (UTC)
I support removal of the assembly code for these reasons: (1) It is an obvious violation of WP:NOR. (2) It is a violation of WP:WEIGHT, since the number of people who care about such details is miniscule. (3) Wikipedia is not a programming manual, especially not a programming manual for assembly hobbyists. McKay (talk) 04:24, 19 February 2014 (UTC)
I agree that the ARM assembler example was overbearing on this page. Someone not using his real name (talk) 14:19, 2 March 2014 (UTC)

Is that page obsolete ?

I ran 10 000 000 random tests on Mac OS X.6.4 with C Code and got the following results:

  • Euclide : 6180 ms
  • Stein : 9038 ms

and compiling with option -O3

  • Euclide : 4728 ms
  • Stein : 5302 ms
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define N 10000000
#define MAX_UINTplus1 0x100000000LLU
#define uint64 unsigned long long
#define INIT_RANDOM srandom(3)

uint64 pgcdEuclide(uint64,uint64);
uint64 pgcdStein(uint64,uint64);


int main(void) {
	int i;
	uint64 a,b,d,t0,t1;
	struct  timeval tv;
	
	// Version Euclide
	INIT_RANDOM;
	gettimeofday(&tv,0);
	t0 = tv.tv_sec* 1000 *1000+tv.tv_usec;
	for(i=0;i<N;i++){
		a=random()*MAX_UINTplus1+random();
		b=random()*MAX_UINTplus1+random();		
		d=pgcdEuclide(a,b);
	}	
	gettimeofday(&tv,0);
	t1 = tv.tv_sec*1000*1000+tv.tv_usec;	
	printf("%llu,Euclide : %llu ms\n",d,(t1-t0)/1000);
	
	// Version Stein
	INIT_RANDOM;
	gettimeofday(&tv,0);
	t0 = tv.tv_sec* 1000 *1000+tv.tv_usec;
	for(i=0;i<N;i++){
		a=(uint64)random()*MAX_UINTplus1+random();
		b=(uint64)random()*MAX_UINTplus1+random();		
		d=pgcdStein(a,b);
	}	
	gettimeofday(&tv,0);
	t1 = tv.tv_sec*1000*1000+tv.tv_usec;	
	printf("%llu,Stein : %llu ms\n",d,(t1-t0)/1000);	
	
	return 0;
}

uint64 pgcdEuclide(uint64 a,uint64 b){
	if(b==0) return a;
	return pgcdEuclide(b,a%b);
}

uint64 pgcdStein(uint64 u,uint64 v){
		int shift;
		
		/* GCD(0,x) := x */
		if (u == 0 || v == 0)
			return u | v;
		
		/* Let shift := lg K, where K is the greatest power of 2
		 dividing both u and v. */
		for (shift = 0; ((u | v) & 1) == 0; ++shift) {
			u >>= 1;
			v >>= 1;
		}
		
		while ((u & 1) == 0)
			u >>= 1;
		
		/* From here on, u is always odd. */
		do {
			while ((v & 1) == 0)  /* Loop X */
				v >>= 1;
			
			/* Now u and v are both odd, so diff(u, v) is even.
			 Let u = min(u, v), v = diff(u, v)/2. */
			if (u < v) {
				v -= u;
			} else {
				uint64 diff = u - v;
				u = v;
				v = diff;
			}
			v >>= 1;
		} while (v != 0);
		
		return u << shift;
}

This is terrrible because mostly the random number generator is being timed. McKay (talk) 07:56, 11 July 2012 (UTC)

RNGs are super slow. Also note that you only went up to numbers sized 2 ** 64, meaning that arithmetic operations will take constant time if you're on a 64 bit hardware/os/etc. To do an emperical comparison you should use a bigint library (eg, ints in Python) and test numbers which are >> your CPU's native ALU (or implement your own arithmetic routines so the processor is not used directly) — Preceding unsigned comment added by 131.215.176.115 (talk) 18:06, 16 March 2013 (UTC)

As McKay mentionned, you are timing your RNG so the microbenchmark is invalid. I made a more-precise benchmark using the criterion statistical tool, when evaluating Stein's algorithm for use in an integer factoring application, and found it to be ~60% faster than Euclid's, on random 64b numbers. Interestingly, this matches the theory (Stein's algorithm uses 60% fewer bit operations than Euclid's[1]) nicoo (talk) 10:06, 3 November 2020 (UTC)

References

  1. ^ Akhavi, Ali; Vallée, Brigitte (2000), "Average Bit-Complexity of Euclidean Algorithms", Proceedings ICALP'00, Lecture Notes Computer Science 1853: 373–387, CiteSeerX: 10.1.1.42.7616, archived from the original on 2006-10-02

ctz version

I think the unreferenced section "Iterative version in C++ using ctz (count trailing zeros)" should also be removed. It makes vague claims of efficiency, but as far as I can tell these rely on hardware parallelism for ctz. If you look at Count trailing zeros, the typical (x86) ctz instructions have limited bit width, so the claims to improved performance suffer from the same problem as the ARM assembly example, i.e. they are not asymptotic. Someone not using his real name (talk) 14:26, 2 March 2014 (UTC)

Totally agreed, went ahead and WP:BOLDly deleted this section. Count trailing zeros has more to do with compilers using the instruction as part of the platform optimization, rather than with a C++ implementation which may or may not end up using that instruction. — Dsimic (talk | contribs) 17:44, 2 March 2014 (UTC)
  • Keep. The article confuses the notion of an algorithm with the bit-bumming aspects of programming. Using machine arithmetic types (int) to illustrate the algorithm is one thing, but it should not be taken so literally. The algorithm and not the programming is the important thing. The cycles in a hardware divide instruction are not interesting. The availability of an x86 instruction to count zeros is also uninteresting. GCD on arbitrary-precision numbers (bignums) is a different story; think million-digit numbers. Binary GCD can be done with shifts and adds -- operations that are linear in the number of bits. Division is not. It may be that bignum packages use binary GCD for that reason. CTZ applied to the last digit of a bignum would offer a reasonable speed optimization (and one could use a lookup table to implement CTZ). I think there's value. Glrx (talk) 23:27, 2 March 2014 (UTC)

People looking for a fast way to compute GCD, are disappointed by this article, when they find out that the C code is slower than euclidean algorithm. It would be nice to point out that: Actually this C code may be faster than euclidean algorithm, when replacing while((v&1)==0) v>>=1; by v>>=__builtin_ctz(v); This makes sense even for big numbers, since time of the new version is linear in the number of bits, while time of the first version is multiplied by the number of trailing zeros. Of course the actual reason why it is faster, is the removal of a short loop which is made twice in average, and involves a lot of mispredicted branches. But there is no need to mention it. It would just be cool to have a really fast function. — Preceding unsigned comment added by 78.196.87.18 (talk) 03:39, 12 April 2020 (UTC)

78.196.87.18, unfortunately there is no way for a C implementation to be both portable and efficient (__builtin_ctz isn't part of the standard). That's part of why I replaced the iterative implementation with a Rust one, as it lends itself to faster and more-readable code while still being portable. nicoo (talk) 10:15, 3 November 2020 (UTC)

mod 4 optimisation

I removed the “Variant” section, originally added by 64.44.80.252 :

As mentioned in the Rust code above, the difference of two odd values uv is always even, so it may be divided by two unconditionally, or a following while loop to remove factors of two may be changed to a do while loop.

A common optimization is to add an additional reduction depending on the values of u and v modulo 4. If uv (mod 4), then their difference is divisible by 4 and the loop may immediately proceed to |uv|/4. If they are unqual, then the sum must be a multiple of 4, and one may use (u + v)/4 instead.

The implementation must be careful to avoid integer overflow during the addition. Since it is known that (u mod 4) + (v mod 4) = 4, one way to compute the result is u/4 + v/4 + 1. A second is to compute (uv)/2, and if it is odd, proceed to ((uv)/2 + v)/2.

I did it because:

  • The optimizations described are either a repetition of what's just above (moving the exit condition check to the end of the look) or subsumed by strictly more-powerful ones (working mod 4, whereas the Rust version just removes all power-2 factors in a single step: v >>= v.trailing_zeroes())
  • It is doubtful that using mod 4 leads to any speedup:
    • it prevents some optimizations mentioned in the article, like using branch-free code;
    • the conditionals involved will be hard-to-predict for a concrete machine's branch predictor;
    • the same number of arithmetic operations is performed.
  • The claim is not supported by any citation, and I couldn't find anything supporting it with a quick web search, so this is looking like original research.

nicoo (talk) 18:46, 10 January 2021 (UTC)

“WP is not a code repository!” / 91.56.241.188

91.56.241.188 removed the iterative Rust version, with “WP is not a code repository!” as a stated reason... then reintroduced an x86 assembly version that used the same techniques, a week later. This is obviously incoherent behaviour, and conveyed much less clearly the point of the iterative implementation (showing where the concrete speedups are, beyond the algorithm-as-written); as such, I simply rolled back their changes.

However, I do agree that WP is not a code repository, and that code samples that do not fulfill a didactic purpose should not be kept; I had shied from removing the C example initially, even though it doesn't convey additional information compared to the algorithm's description, but I will go and boldly rework the “Implementation” section.

nicoo (talk) 13:02, 5 October 2021 (UTC)

Reverted 1054325962 “Implementation in C” for the same reasons:
- no didactic purpose: restated the same thing as the Rust example right above, was probably a direct translation copied verbatim from Lemire's blog ;
- lack of clarity: code wasn't commented, used “bit tricks” like ctz(u | v) = min(ctz(u), ctz(v)) for no particular reason.
Furthermore, I doubt it is possible to implement this algorithm in C in a way that is both standard-compliant, realistic (“nobody” would implement this without a ctz primitive), and readable. A code example that doesn't meet those criteria doesn't seem fit for encyclopedic purposes. nicoo (talk) 12:46, 12 November 2021 (UTC)

Arbitrary unique factorization domains

As far as I can see, the cited paper only provides an algorithm for integer rings of number fields, not for arbitrary UFDs. Or am I missing something? Mgs2804 (talk) 18:41, 4 April 2021 (UTC)

Good catch. <3
I seem to recall spending a while looking for the relevant literature, when I overhauled the page, so I suspect I made a mistake and cited the wrong paper. I will annotate as “ref. needed” for now, as I do not currently have the energy to properly fix it, so feel free to fix it. nicoo (talk) 10:00, 17 February 2022 (UTC)
I couldn't easily find the relevant citation, and don't really have a lot of energy to invest in that specific detail, so I am simply removing the mention of UFDs under the assumption this was a mistake.
Sorry it took so long to resolve this ^^' nicoo (talk) 09:46, 15 February 2023 (UTC)

Citing uutils

The Rust implementation in the article is derived from the one I wrote in the uutils project, which is mentioned upfront.

User:Ohconfucius removed the link to the original in rev. 1084500718 "Script-assisted style fixes" so I am assuming it was an issue with that particular script rather than a violation of the MoS.

Still, I am not keen on the inline link, so I'd welcome another way to cite source code from open projects? I know of {{Cite web}} but that doesn't seem quite appropriate? nicoo (talk) 10:03, 15 February 2023 (UTC)