久久久久久久av_日韩在线中文_看一级毛片视频_日本精品二区_成人深夜福利视频_武道仙尊动漫在线观看

模塊化算法和 NTT(有限域 DFT)優化

Modular arithmetics and NTT (finite field DFT) optimizations(模塊化算法和 NTT(有限域 DFT)優化)
本文介紹了模塊化算法和 NTT(有限域 DFT)優化的處理方法,對大家解決問題具有一定的參考價值,需要的朋友們下面隨著小編來一起學習吧!

問題描述

我想使用 NTT 進行快速平方(請參閱快速 bignum 平方計算),但即使對于非常大的數字……超過 12000 位.

I wanted to use NTT for fast squaring (see Fast bignum square computation), but the result is slow even for really big numbers .. more than 12000 bits.

所以我的問題是:

  1. 有沒有辦法優化我的 NTT 轉換?我并不是要通過并行(線程)來加速它;這只是低級層.
  2. 有沒有辦法加快我的模塊化算術的速度?

這是我在 C++ 中為 NTT 編寫的(已經優化的)源代碼(它是完整的并且 100% 在 C++ 中工作,不需要第三方庫,并且還應該是線程安全的.注意源數組被用作臨時!!!, 也不能將數組轉化為自身).

This is my (already optimized) source code in C++ for NTT (it's complete and 100% working in C++ whitout any need for third-party libs and should also be thread-safe. Beware the source array is used as a temporary!!!, Also it cannot transform the array to itself).

//---------------------------------------------------------------------------
class fourier_NTT                                    // Number theoretic transform
    {

public:
    DWORD r,L,p,N;
    DWORD W,iW,rN;
    fourier_NTT(){ r=0; L=0; p=0; W=0; iW=0; rN=0; }

    // main interface
    void  NTT(DWORD *dst,DWORD *src,DWORD n=0);               // DWORD dst[n] = fast  NTT(DWORD src[n])
    void INTT(DWORD *dst,DWORD *src,DWORD n=0);               // DWORD dst[n] = fast INTT(DWORD src[n])

    // Helper functions
    bool init(DWORD n);                                       // init r,L,p,W,iW,rN
    void  NTT_fast(DWORD *dst,DWORD *src,DWORD n,DWORD w);    // DWORD dst[n] = fast  NTT(DWORD src[n])

    // Only for testing
    void  NTT_slow(DWORD *dst,DWORD *src,DWORD n,DWORD w);    // DWORD dst[n] = slow  NTT(DWORD src[n])
    void INTT_slow(DWORD *dst,DWORD *src,DWORD n,DWORD w);    // DWORD dst[n] = slow INTT(DWORD src[n])

    // DWORD arithmetics
    DWORD shl(DWORD a);
    DWORD shr(DWORD a);

    // Modular arithmetics
    DWORD mod(DWORD a);
    DWORD modadd(DWORD a,DWORD b);
    DWORD modsub(DWORD a,DWORD b);
    DWORD modmul(DWORD a,DWORD b);
    DWORD modpow(DWORD a,DWORD b);
    };

//---------------------------------------------------------------------------
void fourier_NTT:: NTT(DWORD *dst,DWORD *src,DWORD n)
    {
    if (n>0) init(n);
    NTT_fast(dst,src,N,W);
//    NTT_slow(dst,src,N,W);
    }

//---------------------------------------------------------------------------
void fourier_NTT::INTT(DWORD *dst,DWORD *src,DWORD n)
    {
    if (n>0) init(n);
    NTT_fast(dst,src,N,iW);
    for (DWORD i=0;i<N;i++) dst[i]=modmul(dst[i],rN);
       //    INTT_slow(dst,src,N,W);
    }

//---------------------------------------------------------------------------
bool fourier_NTT::init(DWORD n)
    {
    // (max(src[])^2)*n < p else NTT overflow can ocur !!!
    r=2; p=0xC0000001; if ((n<2)||(n>0x10000000)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x30000000/n; // 32:30 bit best for unsigned 32 bit
//    r=2; p=0x78000001; if ((n<2)||(n>0x04000000)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x3c000000/n; // 31:27 bit best for signed 32 bit
//    r=2; p=0x00010001; if ((n<2)||(n>0x00000020)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x00000020/n; // 17:16 bit best for 16 bit
//    r=2; p=0x0a000001; if ((n<2)||(n>0x01000000)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x01000000/n; // 28:25 bit
     N=n;                // size of vectors [DWORDs]
     W=modpow(r,    L);    // Wn for NTT
    iW=modpow(r,p-1-L);    // Wn for INTT
    rN=modpow(n,p-2  );    // scale for INTT
    return true;
    }

//---------------------------------------------------------------------------
void fourier_NTT:: NTT_fast(DWORD *dst,DWORD *src,DWORD n,DWORD w)
    {
    if (n<=1) { if (n==1) dst[0]=src[0]; return; }
    DWORD i,j,a0,a1,n2=n>>1,w2=modmul(w,w);
    // reorder even,odd
    for (i=0,j=0;i<n2;i++,j+=2) dst[i]=src[j];
    for (    j=1;i<n ;i++,j+=2) dst[i]=src[j];
    // recursion
    NTT_fast(src   ,dst   ,n2,w2);    // even
    NTT_fast(src+n2,dst+n2,n2,w2);    // odd
    // restore results
    for (w2=1,i=0,j=n2;i<n2;i++,j++,w2=modmul(w2,w))
        {
        a0=src[i];
        a1=modmul(src[j],w2);
        dst[i]=modadd(a0,a1);
        dst[j]=modsub(a0,a1);
        }
    }

//---------------------------------------------------------------------------
void fourier_NTT:: NTT_slow(DWORD *dst,DWORD *src,DWORD n,DWORD w)
    {
    DWORD i,j,wj,wi,a,n2=n>>1;
    for (wj=1,j=0;j<n;j++)
        {
        a=0;
        for (wi=1,i=0;i<n;i++)
            {
            a=modadd(a,modmul(wi,src[i]));
            wi=modmul(wi,wj);
            }
        dst[j]=a;
        wj=modmul(wj,w);
        }
    }

//---------------------------------------------------------------------------
void fourier_NTT::INTT_slow(DWORD *dst,DWORD *src,DWORD n,DWORD w)
    {
    DWORD i,j,wi=1,wj=1,a,n2=n>>1;
    for (wj=1,j=0;j<n;j++)
        {
        a=0;
        for (wi=1,i=0;i<n;i++)
            {
            a=modadd(a,modmul(wi,src[i]));
            wi=modmul(wi,wj);
            }
        dst[j]=modmul(a,rN);
        wj=modmul(wj,iW);
        }
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::shl(DWORD a) { return (a<<1)&0xFFFFFFFE; }
DWORD fourier_NTT::shr(DWORD a) { return (a>>1)&0x7FFFFFFF; }

//---------------------------------------------------------------------------
DWORD fourier_NTT::mod(DWORD a)
    {
    DWORD bb;
    for (bb=p;(DWORD(a)>DWORD(bb))&&(!DWORD(bb&0x80000000));bb=shl(bb));
    for (;;)
        {
        if (DWORD(a)>=DWORD(bb)) a-=bb;
        if (bb==p) break;
        bb =shr(bb);
        }
    return a;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modadd(DWORD a,DWORD b)
    {
    DWORD d,cy;
    a=mod(a);
    b=mod(b);
    d=a+b;
    cy=(shr(a)+shr(b)+shr((a&1)+(b&1)))&0x80000000;
    if (cy) d-=p;
    if (DWORD(d)>=DWORD(p)) d-=p;
    return d;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modsub(DWORD a,DWORD b)
    {
    DWORD d;
    a=mod(a);
    b=mod(b);
    d=a-b; if (DWORD(a)<DWORD(b)) d+=p;
    if (DWORD(d)>=DWORD(p)) d-=p;
    return d;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modmul(DWORD a,DWORD b)
    {    // b bez orezania !
    int i;
    DWORD d;
    a=mod(a);
    for (d=0,i=0;i<32;i++)
        {
        if (DWORD(a&1))    d=modadd(d,b);
        a=shr(a);
        b=modadd(b,b);
        }
    return d;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modpow(DWORD a,DWORD b)
    {    // a,b bez orezania !
    int i;
    DWORD d=1;
    for (i=0;i<32;i++)
        {
        d=modmul(d,d);
        if (DWORD(b&0x80000000)) d=modmul(d,a);
        b=shl(b);
        }
    return d;
    }
//---------------------------------------------------------------------------

我的 NTT 類的使用示例:

Example of usage of my NTT class:

fourier_NTT ntt;
const DWORD n=32
DWORD x[N]={0,1,2,3,....31},y[N]={32,33,34,35,...63},z[N];

ntt.NTT(z,x,N);    // z[N]=NTT(x[N]), also init constants for N
ntt.NTT(x,y);    // x[N]=NTT(y[N]), no recompute of constants, use last N
// modular convolution y[]=z[].x[]
for (i=0;i<n;i++) y[i]=ntt.modmul(z[i],x[i]);
ntt.INTT(x,y);    // x[N]=INTT(y[N]), no recompute of constants, use last N
// x[]=convolution of original x[].y[]

優化前的一些測量(非 NTT 類):

Some measurements before optimizations (non Class NTT):

a = 0.98765588997654321000 | 389*32 bits
looped 1x times
sqr1[ 3.177 ms ] fast sqr
sqr2[ 720.419 ms ] NTT sqr
mul1[ 5.588 ms ] simpe mul
mul2[ 3.172 ms ] karatsuba mul
mul3[ 1053.382 ms ] NTT mul

我優化后的一些測量(當前代碼、較低的遞歸參數大小/計數和更好的模塊化算法):

Some measurements after my optimizations (current code, lower recursion parameter size/count, and better modular arithmetics):

a = 0.98765588997654321000 | 389*32 bits
looped 1x times
sqr1[ 3.214 ms ] fast sqr
sqr2[ 208.298 ms ] NTT sqr
mul1[ 5.564 ms ] simpe mul
mul2[ 3.113 ms ] karatsuba mul
mul3[ 302.740 ms ] NTT mul

檢查 NTT mul 和 NTT sqr 時間(我的優化加快了 3 倍多一點).它只有 1 倍循環,所以它不是很精確(誤差 ~ 10%),但即使現在加速也很明顯(通常我循環它 1000 倍甚至更多,但我的 NTT 太慢了).

Check the NTT mul and NTT sqr times (my optimizations speed it up little over 3x times). It's only 1x times loop so it's not very precise (error ~ 10%), but the speedup is noticeable even now (normally I loop it 1000x and more, but my NTT is too slow for that).

您可以自由使用我的代碼...只需將我的昵稱和/或指向此頁面的鏈接保留在某處(rem in code、readme.txt、about 或其他內容).我希望它有幫助......(我沒有在任何地方看到快速 NTT 的 C++ 源代碼,所以我不得不自己編寫).對所有接受的 N 都測試了統一根,請參閱 fourier_NTT::init(DWORD n) 函數.

You can use my code freely... Just keep my nick and/or link to this page somewhere (rem in code, readme.txt, about or whatever). I hope it helps... (I did not see C++ source for fast NTTs anywhere so I had to write it by myself). Roots of unity were tested for all accepted N, see the fourier_NTT::init(DWORD n) function.

P.S.:有關 NTT 的更多信息,請參閱從復雜 FFT 到有限域 FFT 的轉換.此代碼基于我在該鏈接中的帖子.

P.S.: For more information about NTT, see Translation from Complex-FFT to Finite-Field-FFT. This code is based on my posts inside that link.

[edit1:] 代碼的進一步變化

通過利用模素數始終為 0xC0000001 并消除不必要的調用,我設法進一步優化了我的模算術.由此產生的加速現在令人驚嘆(超過 40 倍),并且在大約 1500 * 32 位閾值之后,NTT 乘法比 karatsuba 更快.順便說一句,我的 NTT 的速度現在與我在 64 位雙打上優化的 DFFT 的速度相同.

I managed to further optimize my modular arithmetics, by exploiting that modulo prime is allways 0xC0000001 and eliminating unnecessary calls. The resulting speedup is stunning (more than 40x times) now and NTT multiplication is faster than karatsuba after about the 1500 * 32 bits threshold. BTW, the speed of my NTT is now the same as my optimized DFFT on 64-bit doubles.

一些測量:

a = 0.98765588997654321000 | 1553*32bits
looped 10x times
mul2[ 28.585 ms ] karatsuba mul
mul3[ 26.311 ms ] NTT mul

模塊化算術的新源代碼:

New source code for modular arithmetics:

//---------------------------------------------------------------------------
DWORD fourier_NTT::mod(DWORD a)
    {
    if (a>p) a-=p;
    return a;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modadd(DWORD a,DWORD b)
    {
    DWORD d,cy;
    if (a>p) a-=p;
    if (b>p) b-=p;
    d=a+b;
    cy=((a>>1)+(b>>1)+(((a&1)+(b&1))>>1))&0x80000000;
    if (cy ) d-=p;
    if (d>p) d-=p;
    return d;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modsub(DWORD a,DWORD b)
    {
    DWORD d;
    if (a>p) a-=p;
    if (b>p) b-=p;
    d=a-b;
    if (a<b) d+=p;
    if (d>p) d-=p;
    return d;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modmul(DWORD a,DWORD b)
    {
    DWORD _a,_b,_p;
    _a=a;
    _b=b;
    _p=p;
    asm    {
        mov    eax,_a
        mov    ebx,_b
        mul    ebx        // H(edx),L(eax) = eax * ebx
        mov    ebx,_p
        div    ebx        // eax = H(edx),L(eax) / ebx
        mov    _a,edx    // edx = H(edx),L(eax) % ebx
        }
    return _a;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modpow(DWORD a,DWORD b)
    {    // b bez orezania!
    int i;
    DWORD d=1;
    if (a>p) a-=p;
    for (i=0;i<32;i++)
        {
        d=modmul(d,d);
        if (DWORD(b&0x80000000)) d=modmul(d,a);
        b<<=1;
        }
    return d;
    }

//---------------------------------------------------------------------------

如您所見,函數 shlshr?? 不再使用.我認為 modpow 可以進一步優化,但它不是一個關鍵函數,因為它只被調用了很少的次數.最關鍵的函數是 modmul,它似乎處于最佳狀態.

As you can see, functions shl and shr are no more used. I think that modpow can be further optimized, but it's not a critical function because it is called only very few times. The most critical function is modmul, and that seems to be in the best shape possible.

其他問題:

  • 還有其他方法可以加速 NTT 嗎?
  • 我對模塊化算法的優化安全嗎?(結果似乎是一樣的,但我可能會遺漏一些東西.)

[edit2] 新的優化

a = 0.99991970486 | 2000*32 bits
looped 10x
sqr1[  13.908 ms ] fast sqr
sqr2[  13.649 ms ] NTT sqr
mul1[  19.726 ms ] simpe mul
mul2[  31.808 ms ] karatsuba mul
mul3[  19.373 ms ] NTT mul

我從你的所有評論中實現了所有可用的東西(感謝你的洞察力).

I implemented all the usable stuff from all of your comments (thanks for the insight).

加速:

  • 通過移除不必要的安全模組(Mandalf The Beige)+2.5%
  • +34.9% 通過使用預先計算的 W,iW 功率(神秘)
  • +35% 總計

實際完整源代碼:

//---------------------------------------------------------------------------
//--- Number theoretic transforms: 2.03 -------------------------------------
//---------------------------------------------------------------------------
#ifndef _fourier_NTT_h
#define _fourier_NTT_h
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
class fourier_NTT        // Number theoretic transform
    {
public:
    DWORD r,L,p,N;
    DWORD W,iW,rN;        // W=(r^L) mod p, iW=inverse W, rN = inverse N
    DWORD *WW,*iWW,NN;    // Precomputed (W,iW)^(0,..,NN-1) powers

    // Internals
    fourier_NTT(){ r=0; L=0; p=0; W=0; iW=0; rN=0; WW=NULL; iWW=NULL; NN=0; }
    ~fourier_NTT(){ _free(); }
    void _free();                                            // Free precomputed W,iW powers tables
    void _alloc(DWORD n);                                    // Allocate and precompute W,iW powers tables

    // Main interface
    void  NTT(DWORD *dst,DWORD *src,DWORD n=0);                // DWORD dst[n] = fast  NTT(DWORD src[n])
    void iNTT(DWORD *dst,DWORD *src,DWORD n=0);               // DWORD dst[n] = fast INTT(DWORD src[n])

    // Helper functions
    bool init(DWORD n);                                          // init r,L,p,W,iW,rN
    void  NTT_fast(DWORD *dst,DWORD *src,DWORD n,DWORD w);    // DWORD dst[n] = fast  NTT(DWORD src[n])
    void  NTT_fast(DWORD *dst,DWORD *src,DWORD n,DWORD *w2,DWORD i2);

    // Only for testing
    void  NTT_slow(DWORD *dst,DWORD *src,DWORD n,DWORD w);    // DWORD dst[n] = slow  NTT(DWORD src[n])
    void iNTT_slow(DWORD *dst,DWORD *src,DWORD n,DWORD w);    // DWORD dst[n] = slow INTT(DWORD src[n])

    // Modular arithmetics (optimized, but it works only for p >= 0x80000000!!!)
    DWORD mod(DWORD a);
    DWORD modadd(DWORD a,DWORD b);
    DWORD modsub(DWORD a,DWORD b);
    DWORD modmul(DWORD a,DWORD b);
    DWORD modpow(DWORD a,DWORD b);
    };
//---------------------------------------------------------------------------

//---------------------------------------------------------------------------
void fourier_NTT::_free()
    {
    NN=0;
    if ( WW) delete[]  WW;  WW=NULL;
    if (iWW) delete[] iWW; iWW=NULL;
    }

//---------------------------------------------------------------------------
void fourier_NTT::_alloc(DWORD n)
    {
    if (n<=NN) return;
    DWORD *tmp,i,w;
    tmp=new DWORD[n]; if ((NN)&&( WW)) for (i=0;i<NN;i++) tmp[i]= WW[i]; if ( WW) delete[]  WW;  WW=tmp;  WW[0]=1; for (i=NN?NN:1,w= WW[i-1];i<n;i++){ w=modmul(w, W);  WW[i]=w; }
    tmp=new DWORD[n]; if ((NN)&&(iWW)) for (i=0;i<NN;i++) tmp[i]=iWW[i]; if (iWW) delete[] iWW; iWW=tmp; iWW[0]=1; for (i=NN?NN:1,w=iWW[i-1];i<n;i++){ w=modmul(w,iW); iWW[i]=w; }
    NN=n;
    }

//---------------------------------------------------------------------------
void fourier_NTT:: NTT(DWORD *dst,DWORD *src,DWORD n)
    {
    if (n>0) init(n);
    NTT_fast(dst,src,N,WW,1);
//    NTT_fast(dst,src,N,W);
//    NTT_slow(dst,src,N,W);
    }

//---------------------------------------------------------------------------
void fourier_NTT::iNTT(DWORD *dst,DWORD *src,DWORD n)
    {
    if (n>0) init(n);
    NTT_fast(dst,src,N,iWW,1);
//    NTT_fast(dst,src,N,iW);
    for (DWORD i=0;i<N;i++) dst[i]=modmul(dst[i],rN);
//    iNTT_slow(dst,src,N,W);
    }

//---------------------------------------------------------------------------
bool fourier_NTT::init(DWORD n)
    {
    // (max(src[])^2)*n < p else NTT overflow can ocur!!!
    r=2; p=0xC0000001; if ((n<2)||(n>0x10000000)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x30000000/n; // 32:30 bit best for unsigned 32 bit
//    r=2; p=0x78000001; if ((n<2)||(n>0x04000000)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x3c000000/n; // 31:27 bit best for signed 32 bit
//    r=2; p=0x00010001; if ((n<2)||(n>0x00000020)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x00000020/n; // 17:16 bit best for 16 bit
//    r=2; p=0x0a000001; if ((n<2)||(n>0x01000000)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x01000000/n; // 28:25 bit
     N=n;                // Size of vectors [DWORDs]
     W=modpow(r,    L);  // Wn for NTT
    iW=modpow(r,p-1-L);  // Wn for INTT
    rN=modpow(n,p-2  );  // Scale for INTT
    _alloc(n>>1);        // Precompute W,iW powers
    return true;
    }

//---------------------------------------------------------------------------
void fourier_NTT:: NTT_fast(DWORD *dst,DWORD *src,DWORD n,DWORD w)
    {
    if (n<=1) { if (n==1) dst[0]=src[0]; return; }
    DWORD i,j,a0,a1,n2=n>>1,w2=modmul(w,w);

    // Reorder even,odd
    for (i=0,j=0;i<n2;i++,j+=2) dst[i]=src[j];
    for (    j=1;i<n ;i++,j+=2) dst[i]=src[j];

    // Recursion
    NTT_fast(src   ,dst   ,n2,w2);    // Even
    NTT_fast(src+n2,dst+n2,n2,w2);    // Odd

    // Restore results
    for (w2=1,i=0,j=n2;i<n2;i++,j++,w2=modmul(w2,w))
        {
        a0=src[i];
        a1=modmul(src[j],w2);
        dst[i]=modadd(a0,a1);
        dst[j]=modsub(a0,a1);
        }
    }

//---------------------------------------------------------------------------
void fourier_NTT:: NTT_fast(DWORD *dst,DWORD *src,DWORD n,DWORD *w2,DWORD i2)
    {
    if (n<=1) { if (n==1) dst[0]=src[0]; return; }
    DWORD i,j,a0,a1,n2=n>>1;

    // Reorder even,odd
    for (i=0,j=0;i<n2;i++,j+=2) dst[i]=src[j];
    for (    j=1;i<n ;i++,j+=2) dst[i]=src[j];

    // Recursion
    i=i2<<1;
    NTT_fast(src   ,dst   ,n2,w2,i);    // Even
    NTT_fast(src+n2,dst+n2,n2,w2,i);    // Odd

    // Restore results
    for (i=0,j=n2;i<n2;i++,j++,w2+=i2)
        {
        a0=src[i];
        a1=modmul(src[j],*w2);
        dst[i]=modadd(a0,a1);
        dst[j]=modsub(a0,a1);
        }
    }

//---------------------------------------------------------------------------
void fourier_NTT:: NTT_slow(DWORD *dst,DWORD *src,DWORD n,DWORD w)
    {
    DWORD i,j,wj,wi,a;
    for (wj=1,j=0;j<n;j++)
        {
        a=0;
        for (wi=1,i=0;i<n;i++)
            {
            a=modadd(a,modmul(wi,src[i]));
            wi=modmul(wi,wj);
            }
        dst[j]=a;
        wj=modmul(wj,w);
        }
    }

//---------------------------------------------------------------------------
void fourier_NTT::iNTT_slow(DWORD *dst,DWORD *src,DWORD n,DWORD w)
    {
    DWORD i,j,wi=1,wj=1,a;
    for (wj=1,j=0;j<n;j++)
        {
        a=0;
        for (wi=1,i=0;i<n;i++)
            {
            a=modadd(a,modmul(wi,src[i]));
            wi=modmul(wi,wj);
            }
        dst[j]=modmul(a,rN);
        wj=modmul(wj,iW);
        }
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::mod(DWORD a)
    {
    if (a>p) a-=p;
    return a;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modadd(DWORD a,DWORD b)
    {
    DWORD d,cy;
    //if (a>p) a-=p;
    //if (b>p) b-=p;
    d=a+b;
    cy=((a>>1)+(b>>1)+(((a&1)+(b&1))>>1))&0x80000000;
    if (cy ) d-=p;
    if (d>p) d-=p;
    return d;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modsub(DWORD a,DWORD b)
    {
    DWORD d;
    //if (a>p) a-=p;
    //if (b>p) b-=p;
    d=a-b;
    if (a<b) d+=p;
    if (d>p) d-=p;
    return d;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modmul(DWORD a,DWORD b)
    {
    DWORD _a,_b,_p;
    _a=a;
    _b=b;
    _p=p;
    asm    {
        mov    eax,_a
        mov    ebx,_b
        mul    ebx        // H(edx),L(eax) = eax * ebx
        mov    ebx,_p
        div    ebx        // eax = H(edx),L(eax) / ebx
        mov    _a,edx    // edx = H(edx),L(eax) % ebx
        }
    return _a;
    }

//---------------------------------------------------------------------------
DWORD fourier_NTT::modpow(DWORD a,DWORD b)
    {    // b is not mod(p)!
    int i;
    DWORD d=1;
    //if (a>p) a-=p;
    for (i=0;i<32;i++)
        {
        d=modmul(d,d);
        if (DWORD(b&0x80000000)) d=modmul(d,a);
        b<<=1;
        }
    return d;
    }
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
#endif
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------

通過將 NTT_fast 分成兩個函數,仍然有可能使用更少的堆垃圾.一個帶有 WW[],另一個帶有 iWW[],這導致遞歸調用中的一個參數減少.但我對它的期望并不高(僅限 32 位指針),而是有一個功能可以在未來更好地管理代碼.許多函數現在處于休眠狀態(用于測試),例如慢變體、mod 和較舊的快速函數(使用 w 參數代替 *w2,i2).

There is still the possibility to use less heap trashing by separating NTT_fast to two functions. One with WW[] and the other with iWW[] which leads to one parameter less in recursion calls. But I do not expect much from it (32-bit pointer only) and rather have one function for better code management in the future. Many functions are dormant now (for testing) Like slow variants, mod and the older fast function (with w parameter instead of *w2,i2).

為避免大數據集溢出,將輸入數字限制為 p/4 位. 其中 p 是每個 的位數NTT 元素,因此對于此 32 位版本,請使用最大 (32 位/4 -> 8 位) 輸入值.

To avoid overflows for big datasets, limit input numbers to p/4 bits. Where p is number of bits per NTT element so for this 32 bit version use max (32 bit/4 -> 8 bit) input values.

[edit3] 用于測試的簡單字符串 bigint 乘法

[edit3] Simple string bigint multiplication for testing

//---------------------------------------------------------------------------
char* mul_NTT(const char *sx,const char *sy)
    {
    char *s;
    int i,j,k,n;
    // n = min power of 2 <= 2 max length(x,y)
    for (i=0;sx[i];i++); for (n=1;n<i;n<<=1);        i--;
    for (j=0;sx[j];j++); for (n=1;n<j;n<<=1); n<<=1; j--;
    DWORD *x,*y,*xx,*yy,a;
    x=new DWORD[n]; xx=new DWORD[n];
    y=new DWORD[n]; yy=new DWORD[n];

    // Zero padding
    for (k=0;i>=0;i--,k++) x[k]=sx[i]-'0'; for (;k<n;k++) x[k]=0;
    for (k=0;j>=0;j--,k++) y[k]=sy[j]-'0'; for (;k<n;k++) y[k]=0;

    //NTT
    fourier_NTT ntt;
    ntt.NTT(xx,x,n);
    ntt.NTT(yy,y);

    // Convolution
    for (i=0;i<n;i++) xx[i]=ntt.modmul(xx[i],yy[i]);

    //INTT
    ntt.iNTT(yy,xx);

    //suma
    a=0; s=new char[n+1]; for (i=0;i<n;i++) { a+=yy[i]; s[n-i-1]=(a%10)+'0'; a/=10; } s[n]=0;
    delete[] x; delete[] xx;
    delete[] y; delete[] yy;

    return s;
    }
//---------------------------------------------------------------------------

我使用AnsiString,所以我希望將它移植到char*,我沒有做錯.看起來它工作正常(與 AnsiString 版本相比).

I use AnsiString's, so I port it to char* hopefully, I did not do some mistake. It looks like it works properly (in comparison to the AnsiString version).

  • sx,sy 是十進制整數
  • 返回分配的字符串(char*)=sx*sy
  • sx,sy are decadic integer numbers
  • Returns allocated string (char*)=sx*sy

每 32 位數據字只有 ~4 位,因此沒有溢出的風險,但當然速度較慢.在我的 bignum 庫中,我使用二進制表示,并為 NTT 每 32 位 WORD 使用 8 位 塊.如果 N 很大,那么風險更大......

This is only ~4 bit per 32 bit data word so there is no risk of overflow, but it is slower of course. In my bignum lib I use a binary representation and use 8 bit chunks per 32-bit WORD for NTT. More than that is risky if N is big ...

玩得開心

推薦答案

首先,非常感謝您發布并免費使用它.我真的很感激.

First off, thank you very much for posting and making it free to use. I really appreciate that.

我能夠使用一些小技巧來消除一些分支,重新排列主循環并修改程序集,并且能夠獲得 1.35 倍的加速.

I was able to use some bit tricks to eliminate some branching, rearranged the main loop, and modified the assembly, and was able to get a 1.35x speedup.

另外,我為 64 位添加了一個預處理器條件,因為 Visual Studio 不允許在 64 位模式下進行內聯匯編(謝謝微軟;你可以自己搞砸了).

Also, I added a preprocessor condition for 64 bit, seeing as Visual Studio doesn't allow inline assembly in 64 bit mode (thank you Microsoft; feel free to go screw yourself).

在優化 modsub() 函數時發生了一些奇怪的事情.我像 modadd 一樣使用 bit hacks 重寫了它(速度更快).但出于某種原因,modsub 的位明智版本較慢.不知道為什么.可能只是我的電腦.

Something strange happened when I was optimizing the modsub() function. I rewrote it using bit hacks like I did modadd (which was faster). But for some reason, the bit wise version of modsub was slower. Not sure why. Might just be my computer.

//
// Mandalf The Beige
// Based on:
// Spektre
// http://stackoverflow.com/questions/18577076/modular-arithmetics-and-ntt-finite-field-dft-optimizations
//
// This code may be freely used however you choose, so long as it is accompanied by this notice.
//




#ifndef H__OPTIMIZED_NUMBER_THEORETIC_TRANSFORM__HDR
#define H__OPTIMIZED_NUMBER_THEORETIC_TRANSFORM__HDR

#include <string.h>

#ifndef uint32
#define uint32 unsigned long int
#endif

#ifndef uint64
#define uint64 unsigned long long int
#endif


class fast_ntt                                   // number theoretic transform
{
    public:
    fast_ntt()
    {
        r = 0; L = 0;
        W = 0; iW = 0; rN = 0;
    }
    // main interface
    void  NTT(uint32 *dst, uint32 *src, uint32 n = 0);             // uint32 dst[n] = fast  NTT(uint32 src[n])
    void INTT(uint32 *dst, uint32 *src, uint32 n = 0);             // uint32 dst[n] = fast INTT(uint32 src[n])
    // helper functions

    private:
    bool init(uint32 n);                                     // init r,L,p,W,iW,rN
    void NTT_calc(uint32 *dst, uint32 *src, uint32 n, uint32 w);  // uint32 dst[n] = fast  NTT(uint32 src[n])

    void  NTT_fast(uint32 *dst, uint32 *src, uint32 n, uint32 w);  // uint32 dst[n] = fast  NTT(uint32 src[n])
    void NTT_fast(uint32 *dst, const uint32 *src, uint32 n, uint32 w);
    // only for testing
    void  NTT_slow(uint32 *dst, uint32 *src, uint32 n, uint32 w);  // uint32 dst[n] = slow  NTT(uint32 src[n])
    void INTT_slow(uint32 *dst, uint32 *src, uint32 n, uint32 w);  // uint32 dst[n] = slow INTT(uint32 src[n])
    // uint32 arithmetics


    // modular arithmetics
    inline uint32 modadd(uint32 a, uint32 b);
    inline uint32 modsub(uint32 a, uint32 b);
    inline uint32 modmul(uint32 a, uint32 b);
    inline uint32 modpow(uint32 a, uint32 b);

    uint32 r, L, N;//, p;
    uint32 W, iW, rN;

    const uint32 p = 0xC0000001;
};

//---------------------------------------------------------------------------
void fast_ntt::NTT(uint32 *dst, uint32 *src, uint32 n)
{
    if (n > 0)
    {
        init(n);
    }
    NTT_fast(dst, src, N, W);
    //  NTT_slow(dst,src,N,W);
}

//---------------------------------------------------------------------------
void fast_ntt::INTT(uint32 *dst, uint32 *src, uint32 n)
{
    if (n > 0)
    {
        init(n);
    }
    NTT_fast(dst, src, N, iW);
    for (uint32 i = 0; i<N; i++)
    {
        dst[i] = modmul(dst[i], rN);
    }
    //  INTT_slow(dst,src,N,W);
}

//---------------------------------------------------------------------------
bool fast_ntt::init(uint32 n)
{
    // (max(src[])^2)*n < p else NTT overflow can ocur !!!
    r = 2;
    //p = 0xC0000001;
    if ((n < 2) || (n > 0x10000000))
    {
        r = 0; L = 0; W = 0; // p = 0;
        iW = 0; rN = 0; N = 0;
        return false;
    }
    L = 0x30000000 / n; // 32:30 bit best for unsigned 32 bit
    //  r=2; p=0x78000001; if ((n<2)||(n>0x04000000)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x3c000000/n; // 31:27 bit best for signed 32 bit
    //  r=2; p=0x00010001; if ((n<2)||(n>0x00000020)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x00000020/n; // 17:16 bit best for 16 bit
    //  r=2; p=0x0a000001; if ((n<2)||(n>0x01000000)) { r=0; L=0; p=0; W=0; iW=0; rN=0; N=0; return false; } L=0x01000000/n; // 28:25 bit
    N = n;               // size of vectors [uint32s]
    W = modpow(r, L); // Wn for NTT
    iW = modpow(r, p - 1 - L); // Wn for INTT
    rN = modpow(n, p - 2); // scale for INTT
    return true;
}

//---------------------------------------------------------------------------

void fast_ntt::NTT_fast(uint32 *dst, uint32 *src, uint32 n, uint32 w)
{
    if(n > 1)
    {
        if(dst != src)
        {
            NTT_calc(dst, src, n, w);
        }
        else
        {
            uint32* temp = new uint32[n];
            NTT_calc(temp, src, n, w);
            memcpy(dst, temp, n * sizeof(uint32));
            delete [] temp;
        }
    }
    else if(n == 1)
    {
        dst[0] = src[0];
    }
}

void fast_ntt::NTT_fast(uint32 *dst, const uint32 *src, uint32 n, uint32 w)
{
    if (n > 1)
    {
        uint32* temp = new uint32[n];
        memcpy(temp, src, n * sizeof(uint32));
        NTT_calc(dst, temp, n, w);
        delete[] temp;
    }
    else if (n == 1)
    {
        dst[0] = src[0];
    }
}



void fast_ntt::NTT_calc(uint32 *dst, uint32 *src, uint32 n, uint32 w)
{
    if(n > 1)
    {
        uint32 i, j, a0, a1,
        n2 = n >> 1,
        w2 = modmul(w, w);

        // reorder even,odd
        for (i = 0, j = 0; i < n2; i++, j += 2)
        {
            dst[i] = src[j];
        }
        for (j = 1; i < n; i++, j += 2)
        {
            dst[i] = src[j];
        }
        // recursion
        if(n2 > 1)
        {
            NTT_calc(src, dst, n2, w2);  // even
            NTT_calc(src + n2, dst + n2, n2, w2);  // odd
        }
        else if(n2 == 1)
        {
            src[0] = dst[0];
            src[1] = dst[1];
        }

        // restore results

        w2 = 1, i = 0, j = n2;
        a0 = src[i];
        a1 = src[j];
        dst[i] = modadd(a0, a1);
        dst[j] = modsub(a0, a1);
        while (++i < n2)
        {
            w2 = modmul(w2, w);
            j++;
            a0 = src[i];
            a1 = modmul(src[j], w2);
            dst[i] = modadd(a0, a1);
            dst[j] = modsub(a0, a1);
        }
    }
}

//---------------------------------------------------------------------------
void fast_ntt::NTT_slow(uint32 *dst, uint32 *src, uint32 n, uint32 w)
{
    uint32 i, j, wj, wi, a,
        n2 = n >> 1;
    for (wj = 1, j = 0; j < n; j++)
    {
        a = 0;
        for (wi = 1, i = 0; i < n; i++)
        {
            a = modadd(a, modmul(wi, src[i]));
            wi = modmul(wi, wj);
        }
        dst[j] = a;
        wj = modmul(wj, w);
    }
}

//---------------------------------------------------------------------------
void fast_ntt::INTT_slow(uint32 *dst, uint32 *src, uint32 n, uint32 w)
{
    uint32 i, j, wi = 1, wj = 1, a, n2 = n >> 1;

    for (wj = 1, j = 0; j < n; j++)
    {
        a = 0;
        for (wi = 1, i = 0; i < n; i++)
        {
            a = modadd(a, modmul(wi, src[i]));
            wi = modmul(wi, wj);
        }
        dst[j] = modmul(a, rN);
        wj = modmul(wj, iW);
    }
}    


//---------------------------------------------------------------------------
uint32 fast_ntt::modadd(uint32 a, uint32 b)
{
    uint32 d;
    d = a + b;

    if(d < a)
    {
        d -= p;
    }
    if (d >= p)
    {
        d -= p;
    }
    return d;
}

//---------------------------------------------------------------------------
uint32 fast_ntt::modsub(uint32 a, uint32 b)
{
    uint32 d;
    d = a - b;
    if (d > a)
    {
        d += p;
    }
    return d;
}

//---------------------------------------------------------------------------
uint32 fast_ntt::modmul(uint32 a, uint32 b)
{
    uint32 _a = a;
    uint32 _b = b;

    // Original
    uint32 _p = p;
    __asm
    {
        mov eax, _a;
        mul _b;
        div _p;
        mov eax, edx;
    };
}


uint32 fast_ntt::modpow(uint32 a, uint32 b)
{
    //*
    uint64 D, M, A, P;

    P = p; A = a;
    M = 0llu - (b & 1);
    D = (M & A) | ((~M) & 1);

    while ((b >>= 1) != 0)
    {
        A = modmul(A, A);
        //A = (A * A) % P;

        if ((b & 1) == 1)
        {
            //D = (D * A) % P;
            D = modmul(D, A);
        }
    }
    return (uint32)D;
}

新模塊

uint32 fast_ntt::modmul(uint32 a, uint32 b)
{
    uint32 _a = a;
    uint32 _b = b;   

    __asm
    {
    mov eax, a;
    mul b;
    mov ebx, eax;
    mov eax, 2863311530;
    mov ecx, edx;
    mul edx;
    shld edx, eax, 1;
    mov eax, 3221225473;

    mul edx;
    sub ebx, eax;
    mov eax, 3221225473;
    sbb ecx, edx;
    jc addback;

            neg ecx;
            and ecx, eax;
            sub ebx, ecx;

    sub ebx, eax;
    sbb edx, edx;
    and eax, edx;
            addback:
    add eax, ebx;          
    };  
}

Spektre,根據您的反饋,我更改了 modadd &modsub 回到原來的狀態.我也意識到我對遞歸 NTT 函數做了一些我不應該做的更改.

Spektre, based on your feedback I changed the modadd & modsub back to their original. I also realized I made some changes to the recursive NTT function I shouldn't have.

刪除了不需要的 if 語句和按位函數.

Removed unneeded if statements and bitwise functions.

添加了新的 modmul 內聯程序集.

Added new modmul inline assembly.

這篇關于模塊化算法和 NTT(有限域 DFT)優化的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網!

【網站聲明】本站部分內容來源于互聯網,旨在幫助大家更快的解決問題,如果有圖片或者內容侵犯了您的權益,請聯系我們刪除處理,感謝您的支持!

相關文檔推薦

Algorithm to convert RGB to HSV and HSV to RGB in range 0-255 for both(將 RGB 轉換為 HSV 并將 HSV 轉換為 RGB 的算法,范圍為 0-255)
How to convert an enum type variable to a string?(如何將枚舉類型變量轉換為字符串?)
When to use inline function and when not to use it?(什么時候使用內聯函數,什么時候不使用?)
Examples of good gotos in C or C++(C 或 C++ 中好的 goto 示例)
Significance of ios_base::sync_with_stdio(false); cin.tie(NULL);(ios_base::sync_with_stdio(false) 的意義;cin.tie(NULL);)
Is TCHAR still relevant?(TCHAR 仍然相關嗎?)
主站蜘蛛池模板: 99精品国产一区二区青青牛奶 | 亚洲精品一区二区在线观看 | 国产美女自拍视频 | 国内精品视频 | 91电影| 一区二区视频在线观看 | 日韩在线免费视频 | 成人一区二区在线 | 91影库| 极品粉嫩国产48尤物在线播放 | 亚洲视频免费在线观看 | 中文字幕亚洲视频 | 精品美女久久久 | 欧美a级成人淫片免费看 | 超碰美女在线 | 狠狠操电影 | 成人午夜黄色 | 亚洲第一中文字幕 | 国产视频2021 | 国产精品揄拍一区二区久久国内亚洲精 | 欧美在线国产精品 | 欧美激情视频一区二区三区免费 | 日韩区 | 久久久成人网 | 一呦二呦三呦国产精品 | 五月网婷婷 | 国产精品无码久久久久 | 97精品超碰一区二区三区 | 综合色婷婷 | 精品国产1区2区3区 在线国产视频 | 日韩精品一区二区在线观看 | 成人精品一区 | 超碰97免费观看 | 日本久久精品视频 | xxxcom在线观看 | 啪啪精品 | 97精品超碰一区二区三区 | 黄色一级大片在线免费看产 | 国产线视频精品免费观看视频 | 91污在线| 久久久久久久久久久91 |