First download gmp-5.0.1, uncompress it, and go in the gmp-5.0.1 directory.

$ patch -p 0 -i /tmp/fft_patch.gmp-5.0.1 (assuming the patch is in /tmp)
patching file mpn/generic/mul_fft.c
patching file gmp-h.in
patching file tfft.c

$ ./configure; make

$ gcc -I. tfft.c .libs/libgmp.a

$ ./a.out 1000 1000 1500 # performs a product of 1000 limbs by 1000 limbs,
	                 # with a result mod B^r+1 with r>=1500
			 # and B=2^BITS_PER_MP_LIMB
r=1536                   # prints the actual value of r

This patch adds the following interface to GMP (it does not change anything
in the existing interface):

A new structure "mp_fft_t" is defined:

typedef struct
{
  mp_size_t n;     /* multiplication is performed mod 2^(n*GMP_NUMB_BITS)+1 */
  unsigned long k; /* fft length is K = 2^k */
  unsigned long l; /* allow to accumulate up to 2^l products */
  unsigned long M;
  mp_size_t nprime;
  mp_ptr *d;       /* array of K arrays of nprime+1 limbs */
} __mp_fft_struct;
typedef __mp_fft_struct mp_fft_t[1];

Seven new functions are exported (their interface is preliminary and may
change, please send any comments to zimmerma at loria dot fr):

mp_size_t mpn_fft_init (mp_fft_t r, mp_size_t n0, unsigned long l)
Initializes a FFT structure r, to perform multiplication
modulo 2^(n*GMP_NUMB_BITS)+1, with n >= n0. Return the value of n.
The 3rd argument indicates that up to 2^l products may be accumulated
in the Fourier domain; in usual cases l = 0.

void mpn_fft_transform (mp_fft_t r, mp_ptr ap, mp_size_t an)
Puts in the FFT structure r the transform of {ap, an}.
We should have an <= n, where n is the value returned when initializing r.

void mpn_fft_mult (mp_fft_t r, mp_fft_t s)
Puts in the FFT structure r the term to term product of r and s.

void mpn_fft_add (mp_fft_t r, mp_fft_t s)
Puts in the FFT structure r the term to term sum of r and s.

void mpn_fft_sub (mp_fft_t r, mp_fft_t s)
Puts in the FFT structure r the term to term difference of r and s.

void mpn_fft_itransform (mp_ptr ap, mp_fft_t r)
Puts in {ap, n+1} the inverse transform of the FFT structure r,
where n is the value returned when initializing r.

void mpn_fft_clear (mp_fft_t r)
Frees the memory allocated by the FFT structure r.

Examples:

1) a multiplication call mpn_mul (rp, ap, an, bp, bn) can be replaced by:

   mp_fft_t at, bt;
   mp_ptr tp;
   mp_size_t m;

   m = mpn_fft_init (at, an + bn, 0);
   mpn_fft_init (bt, an + bn, 0);
   mpn_fft_transform (at, ap, an);
   mpn_fft_transform (bt, bp, bn);
   mpn_fft_mult (at, bt);
   tp = malloc ((m + 1) * sizeof (mp_limb_t));
   mpn_fft_itransform (tp, at);
   MPN_COPY (rp, tp, an + bn);
   mpn_fft_clear (at);
   mpn_fft_clear (bt);
   free (tp);

2) in some cases, we want to reuse several times a transformed representation.
   However, mpn_fft_mult() destroys its first argument. Here is a copy
   function, which assumes that both r and s were initialized with the same
   2nd argument of mpn_fft_init:

/* copy s into r, assumes that r was initialized with mpn_fft_init(r, n),
   with the same 'n' than for s */
void
mpn_fft_copy (mp_fft_t r, mp_fft_t s)
{
   MPN_COPY (r->d[0], s->d[0], s->K * (s->nprime + 1));
}