20051103, 19:58  #1 
∂^{2}ω=0
Sep 2002
República de California
2×3×29×67 Posts 
Mfactor sieve code thread
Dmitri Gribenko (a.k.a. gribozavr) has kindly built an AMD64/linux binary of my Mfactor mersennesieving code for me. Binaries are available at
http://hogranch.com/mayer/bin/mfactor_amd64.tar.gz or, using ftp: ftp://hogranch.com/pub/mayer/bin/mfactor_amd64.tar.gz The zip archive contains two executables, a dynamicallylinked and a staticallylinked one. These should run on both plain AMD64 and Opteron systems. The speed of the 2 is virtually equal, so I suggest simply using the staticallylinked binary, as that should run on the widest variety of setups. To give a rough idea of speed, testing a gigabitsize M(p) to a bit depth of 70 should need roughly 3 hours on a 2GHz system.  24 March 2006: An Itanium/Linux binary is available at http://hogranch.com/mayer/bin/mfactor_ia64_linux.gz  24 March 2006: A binary for Alpha (should run under both TruUnix and Linux) is available at http://hogranch.com/mayer/bin/mfactor_alpha.gz  22 March 2006: A Win32 binary is available at http://hogranch.com/mayer/bin/mfactor_win32.zip  The simple way to invoke the code (assuming you've called your local executable "Mfactor") is Mfactor m {exponent} bmax {some positive number} & The code allows p up to 2^50 and q's up to 2^96 (i.e. bmax as large as 96). The exponent must be an odd prime. The argument following the bmax flag is the maximum bit depth of factoring, i.e. log2(qmax), where we use q = 2*k*p+1 (k a positive int) to denote factor candidates. The bmax argument may have a fractional part, e.g. to sieve M(3321928381) to depth 2^(72.5) enter Mfactor m 3321928381 bmax 72.5 The program can also start at a nonminimal bit depth; for instance if you (or someone else) had already sieved the above exponent to 2^70 and you wanted to sieve from 2^70 to 2^73, you'd enter Mfactor m 3321928381 bmin 70 bmax 73 Like bmax, bmin may also have a fractional part. Note that if your run is of this type but bmin is more than 2 or 3 bits smaller than bmax I suggest just specifiying bmax (i.e. running from scratch)  that will cost little extra runtime over running from bmin, and will have the added benefit of doublechecking the results (or nonresults) of the previous, shallower sieving run. The code will print some diagnostic output (much of will eventually be suppressed, but as this is a beta I want it for now) and then passbypass info about progress through each of the default 16 factoring passes (each of which tests a set of q's having a distinct value of k mod 60  you don't really need to worry about that, just know that a full factoring run needs to do all 16 passes.) To suppress this onscreen output (e.g. if you want to run quietly in background or on a remote system) you can either divert the output to a file (e.g. by appending ' > foo.txt' to the above) or use the initfile method described a few paragraphs below. As it runs, the code periodically (every few minutes on a fast system) writes a small checkpoint file named t{exponent} (brackets for emphasis only), so if running in background mode you can examine that file if you want to see which of the 16 passes it's currently doing and which k it's at relative to kmax for each pass. This file has the following format: Code:
exponent bmin <== log2(minimum factor to try) bmax <== log2(maximum factor to try) KNow = (some int) <== Current k of this pass KMax = (some int) <== Maximum k of this pass PassNow = (some int <= 15) <== Current pass PassMax = (some int <= 15) <== Maximum pass (typically 15) 3321928381 0.0000000 73.000013 KNow = 1154259025940 KMax = 1421586566400 PassNow = 4 PassMax = 15 That tells me the program is roughly 80% of the way through the 5th pass (counting from 0) of the 16. This file tells the code where to resume factoring in the event of a restartfrominterrupt. To resume you simply reenter the same command line you used to start the run, or (since the restart file contains all the needed information about the bmin/bmax bounds) simply enter Mfactor m {exponent} & i.e. simply provide the exponent, so the program knows which restart file to read. You can also use the same restart file format to initiate a run, if you don't want to deal with the screenoutput kludge mentioned above: To start a run of exponent {p} from bit bounds {bmin} to {bmax}, create a file t{p} containing (the {} indicate fields you need to fill in) {p} {bmin} {bmax} KNow = 0 KMax = 0 PassNow = 0 PassMax = 15 and invoke the program using Mfactor m {p} & In that case all screen output except for the startofrun diagnostics will be suppressed, i.e. after starting the run you can log out and walk away. The code will automatically generate the KMin and KMax values corresponding to your bmin and bmax bounds and write them to the checkpoint file at the time of the next fileupdate. Results are written to a file named results.txt  at the moment the code also writes some small additional debug information (every 2^26th q that was tried) to the file; for kicks you could check that each of these intermediate q's is either prime or has no factor less than the smallfactorsieve bound of 611957 (that's the 50000th odd prime, in case you were wondering), but I expect most of you will mainly be interested in the lines containing the word "factor". I'll be adding more binaries as the need arises and time permits  the code also runs extremely fast on Alpha and Itanium systems (which have similarly good 64integer MUL throughput as the AMD64), and should be within a factor of 2 of that (in terms of percycle throughput) on the Mac G5, with some tuning of the key modmul macros. Pentium users will have to be patient  the pentium has poor integer MUL capability relative to the aforementioned platforms, which means I'll be using a floatingpoint modmul algorithm I recently wrote (that allows q's up to 78 bits), but that will need to be SSE2ified to give decent performance. Have fun, let me know if you have any problems. Ernst Last fiddled with by ewmayer on 20060324 at 17:23 
20051106, 21:24  #2 
∂^{2}ω=0
Sep 2002
República de California
2×3×29×67 Posts 
Does anybody have an AMD64 running Windows and with gcc installed which they could use to do a Windows build for me? If so, PM me and I'll send you a zip of the source code and simple build and test instructions.
Thanks, Ernst 
20051107, 08:20  #3 
May 2003
3·7·11 Posts 
I pulled up 'shark' for the first time yesterday, and it seems quite informative. I have no idea if it can be used over an SSH session, as it seems gooey only. Perhaps Mark can enlighten us.
I think that a 3x25bit pure FPU version may have some potential, the a=+/(b*c)+/d instruction should help pack those operations in. 
20051107, 13:37  #4  
"Mark"
Apr 2003
Between here and the
2·5·643 Posts 
Quote:
Remote  a CHUD remote client program calls chudStartRemotePerfMonitor()/chudStopRemotePerfMonitor() to start/stop sampling. If you are using shark over a remote login (rlogin, telnet, ssh, etc.) connection you will need to run it and the remote client from the same session. at Apple 

20051107, 19:42  #5  
∂^{2}ω=0
Sep 2002
República de California
2×3×29×67 Posts 
Quote:
I'll probably try a G5 build later this week  my plan is to do timings of pure64bitint vs. purefloat versions on various platforms, and then design routines that interleave various numbers of modpow operations of each type in an attempt to keep both the IU and FPU pipelines as full as possible, while having both sets of parallel powerings finish in roughly the same number of total cycles. For instance on Alpha/Itanium/AMD64 a 2:1 ratio of int/float modpows looks about right (because the former are so fast), whereas on the G5 the ratio will probably be inverted, because 64bit integer mul has 1/3 the throughput. I must say, the AMD64 has been a pleasant revelation in this regard  here I was working under the mistaken assumption that only the Opteron version of that chip can do screamingly fast 64x64=>128bit MULs, when in fact all AMD64s can. 

20051108, 09:50  #6  
May 2003
3×7×11 Posts 
Quote:
For limbs a,b,c, it's nice to be able to exactly represent (2*a*c+b^2) without exceeding the precision of the FPU. Quote:
Quote:
Phil 

20051108, 10:36  #7  
May 2003
3·7·11 Posts 
Quote:


20051108, 13:57  #8  
Tribal Bullet
Oct 2004
3·1,181 Posts 
Quote:
jasonp 

20051108, 14:41  #9  
May 2003
3×7×11 Posts 
Quote:
(OK, everything that was on my next 2 or 3 weeks' todo list.) I better ask  does it guarantee <0.5 ulp? Phil 

20051108, 19:36  #10  
∂^{2}ω=0
Sep 2002
República de California
2×3×29×67 Posts 
Quote:
Quote:
In fact, it appears that this scheme with input magnitude bounds of 2^{25,25,...,26} should work all the way up to 6vector inputs, i.e. inputs as large as 156 bits, as long as the hardware properly supports 53mantissabit IEEE doubles. Of course for polynomials of length 4 and above it becomes worthwhile to consider an optimized Nussbaumerstyle shortlength convolution rather than straight grammarschool, and for the latter the input bounds may be slightly less lenient  I haven't done the detailed math, though, so right now I'm just talking off the top of my head w.r.to the latter. Quote:
Last fiddled with by ewmayer on 20051108 at 19:38 

20051109, 13:59  #11  
Tribal Bullet
Oct 2004
3×1,181 Posts 
Quote:
A while ago I experimented with using this technique to implement 50bit modular multiplication. On the G5 a single modmul takes ~50 clocks, and I wrote a massively pipelined vectorized version that is supposed to finish one 50bit modmul, with arbitrary modulus, in 7 clocks. Unfortunately the actual throughput is 12.5 clocks. For reference, years ago I wrote an allinteger vectorized modmul on the Alpha that sustained one modmul, with 62bit modulus, in 6 clocks. Of course, the G5 has a much faster clock and so ended up faster in realworld terms. Phil, before you spend as much time on this as I did, the following is a reference modmul. The quantity 'magic' is 3*2^51. n has an upper limit of about 2^50; I've tried higher and it occaisionally doesn't work. Code:
#include <ppc_intrinsics.h> double fast_modmul(double a, double b, double n, double recip_n, double magic) { double q, r, nq_hi, nq_lo; q = a * b; /* compute high 53 bits of a * b */ q = __fmadd(q, recip_n, magic); /* multiply by the reciprocal */ q = q  magic; /* round to nearest integer */ nq_hi = n * q; nq_lo = __fmsub(n, q, nq_hi); /* nq_hi + nq_lo = n * q exactly */ r = __fmsub(a, b, nq_hi); r = r  nq_lo; /* compute a * b  n * q */ q = r + n; return __fsel(r, r, q); /* return r, or r+n if r < 0 */ } Happy crunching, jasonp Last fiddled with by ewmayer on 20051111 at 01:17 Reason: (With Jasonp's approval) "Magic" FP rounding constant is 3*2^51, not 3*2^52 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Easiest working Quadratic Sieve code?  skan  Programming  4  20180324 09:54 
Sieve Benchmark Thread  Historian  Twin Prime Search  105  20130205 01:35 
Factor5 source code thread  ET_  Operation Billion Digits  10  20080917 12:28 
Where I should write C code (thread moved)  maqableh  Programming  9  20060512 16:22 
New Sieve Thread Discussion  Citrix  Prime Sierpinski Project  15  20050829 13:56 