Efficient long division via Montgomery multiply
Abstract
We present a novel right-to-left long division algorithm based on the Montgomery modular multiply, consisting of separate highly efficient loops with simple carry structure for computing first the remainder (x mod q) and then the quotient floor(x/q). These loops are ideally suited for the case where x occupies many more machine words than the divide modulus q, and are strictly linear time in the "bitsize ratio" lg(x)/lg(q). For the paradigmatic performance test of multiword dividend and single 64-bit-word divisor, exploitation of the inherent data-parallelism of the algorithm effectively mitigates the long latency of hardware integer MUL operations, as a result of which we are able to achieve respective costs for remainder-only and full-DIV (remainder and quotient) of 6 and 12.5 cycles per dividend word on the Intel Core 2 implementation of the x86_64 architecture, in single-threaded execution mode. We further describe a simple "bit-doubling modular inversion" scheme, which allows the entire iterative computation of the mod-inverse required by the Montgomery multiply at arbitrarily large precision to be performed with cost less than that of a single Newtonian iteration performed at the full precision of the final result. We also show how the Montgomery-multiply-based powering can be efficiently used in Mersenne and Fermat-number trial factorization via direct computation of a modular inverse power of 2, without any need for explicit radix-mod scalings.
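The two-loop structure described in the abstract can be illustrated with a minimal Python sketch for an odd single-word divisor. It is not the paper's implementation: the function names (`inverse_mod_word`, `rtl_pass`, `divrem`) are hypothetical, the inverse uses the textbook Newton (Hensel) iteration at single-word precision, and the final radix-power scaling of the remainder uses Python's built-in `pow` instead of a Montgomery-multiply-based powering. The divisor is assumed odd and smaller than 2^64.

```python
WORD_BITS = 64
BASE = 1 << WORD_BITS
MASK = BASE - 1

def inverse_mod_word(q):
    """Inverse of odd q modulo 2**64; each Newton step doubles the valid bits."""
    if q % 2 == 0:
        raise ValueError("q must be odd")
    inv, bits = q, 3              # q*q == 1 (mod 8), so q is its own 3-bit inverse
    while bits < WORD_BITS:
        bits = min(2 * bits, WORD_BITS)
        inv = (inv * (2 - q * inv)) & ((1 << bits) - 1)
    return inv

def rtl_pass(words, q, qinv, carry):
    """One right-to-left pass over the dividend words (least significant first).

    Produces words y with x - carry_in == q*y - carry_out * BASE**len(words).
    """
    out = []
    for w in words:
        t = w - carry
        borrow = 1 if t < 0 else 0
        y = ((t & MASK) * qinv) & MASK              # low word of (w - carry) / q
        carry = ((y * q) >> WORD_BITS) + borrow     # high word of y*q, plus borrow
        out.append(y)
    return out, carry

def divrem(x, q):
    """Return (x // q, x % q) for x >= 0 and odd 0 < q < 2**64."""
    words = []
    t = x
    while True:                   # split x into 64-bit words, at least one word
        words.append(t & MASK)
        t >>= WORD_BITS
        if t == 0:
            break
    n = len(words)
    qinv = inverse_mod_word(q)

    # Loop 1: remainder. The final carry c satisfies x == -c * BASE**n (mod q).
    _, c = rtl_pass(words, q, qinv, 0)
    r = (-c * pow(BASE, n, q)) % q

    # Loop 2: quotient. Seeding the carry with r makes the division exact.
    quo_words, c_final = rtl_pass(words, q, qinv, r)
    assert c_final == 0
    quotient = sum(y << (WORD_BITS * i) for i, y in enumerate(quo_words))
    return quotient, r
```

A call such as `divrem(3**500 + 12345, (1 << 64) - 59)` can be checked against Python's built-in `divmod`. Each pass touches every dividend word once, which matches the linear-time behavior claimed above; the per-word multiplies within a pass form a serial carry chain, so the data-parallelism mentioned in the abstract would have to come from a restructuring that this sketch does not attempt.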
 Publication:

arXiv e-prints
 Pub Date:
 March 2013
 arXiv:
 arXiv:1303.0328
 Bibcode:
 2013arXiv1303.0328M
 Keywords:

 Computer Science - Data Structures and Algorithms;
 11Y16 (Primary);
 68Q25;
 68W40 (Secondary);
 G.1.0;
 B.2.4;
 F.2.1
 E-Print:
 23 pages