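After running CUFFT, I need to multiply every cufftComplex element by 1/N. The data length is 51200*1024 elements.
I tried the two methods below and timed them with CUDA events. Both take about 6100 ms. How can I optimize this? Thanks.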
Method 1:
Launch 51200 blocks of 1024 threads each, with each thread processing one element, as follows:
__global__ static void kernel1(cufftComplex *pSource, float Valpha)
{
    int threadID = threadIdx.x;
    int blockID = blockIdx.x;
    // One element per thread: global index = block offset + thread offset.
    int IndexThread = blockID * blockDim.x + threadID;
    cufftComplex tmp = *(pSource + IndexThread);
    (pSource + IndexThread)->x = tmp.x * Valpha;
    (pSource + IndexThread)->y = tmp.y * Valpha;
}
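The host-side launch and timing code is not shown in this post. As a rough sketch of what event timing around the Method 1 launch might look like, the helper below records events immediately before and after the kernel and also checks the launch status, so that a failed launch can be told apart from a slow kernel. The name `timeKernel1`, its parameters, and the error handling are assumptions for illustration, not the code that produced the 6100 ms figure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

__global__ static void kernel1(cufftComplex *pSource, float Valpha)
{
    int IndexThread = blockIdx.x * blockDim.x + threadIdx.x;
    cufftComplex tmp = pSource[IndexThread];
    pSource[IndexThread].x = tmp.x * Valpha;
    pSource[IndexThread].y = tmp.y * Valpha;
}

// Hypothetical helper: runs kernel1 over numBlocks * threadsPerBlock elements
// already resident at d_data. Returns the elapsed time in milliseconds,
// or -1.0f if event creation, the launch, or the kernel execution failed.
static float timeKernel1(cufftComplex *d_data, int numBlocks,
                         int threadsPerBlock, float alpha)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start);
    kernel1<<<numBlocks, threadsPerBlock>>>(d_data, alpha);
    // Catches launch failures such as an unsupported block size.
    cudaError_t launchErr = cudaGetLastError();
    cudaEventRecord(stop);
    // Blocks until the stop event (and therefore the kernel) has completed.
    cudaError_t syncErr = cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (launchErr == cudaSuccess && syncErr == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    } else {
        fprintf(stderr, "CUDA error: %s\n",
                cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

For the sizes in this post the call would be `timeKernel1(d_data, 51200, 1024, alpha)`, where `d_data` is assumed to be a device buffer holding the CUFFT output. Only the kernel sits between the two events, so host-to-device copies and the FFT itself are excluded from the measurement.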
Method 2:
Launch 512 blocks of 1024 threads each, with each thread processing 100 elements, as follows:
__global__ static void kernel2(cufftComplex *pSource, float Valpha)
{
    int threadID = threadIdx.x;
    int blockID = blockIdx.x;
    int IndexThread = blockID * blockDim.x + threadID;
    cufftComplex tmp;
    int i;
    // Each thread handles 100 elements spaced 1024*512 apart,
    // which is the total number of launched threads.
    for (i = 0; i < 100; i++)
    {
        tmp = *(pSource + IndexThread + i * 1024 * 512);
        (pSource + IndexThread + i * 1024 * 512)->x = tmp.x * Valpha;
        (pSource + IndexThread + i * 1024 * 512)->y = tmp.y * Valpha;
    }
}
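The fixed stride of 1024*512 in Method 2 equals the total number of launched threads, so the same access pattern can also be written as a grid-stride loop that derives the stride from the launch configuration and bounds-checks against the element count. This is only a sketch of that pattern, not a measured improvement; `kernel2Strided`, the extra parameter `n`, and the host helper `scaleOnDevice` are hypothetical names that do not appear in the original code.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical grid-stride variant of kernel2: each thread starts at its
// global index and advances by the total thread count until it reaches n.
__global__ static void kernel2Strided(cufftComplex *pSource, float Valpha, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
    {
        cufftComplex tmp = pSource[i];
        tmp.x *= Valpha;
        tmp.y *= Valpha;
        pSource[i] = tmp;  // write both scaled components back
    }
}

// Hypothetical host helper: scales a host array on the GPU and copies it back.
// Returns true only if every CUDA call succeeded.
static bool scaleOnDevice(std::vector<cufftComplex> &data, float alpha)
{
    cufftComplex *d_data = nullptr;
    size_t bytes = data.size() * sizeof(cufftComplex);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d_data, data.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // Same launch shape as Method 2: 512 blocks of 1024 threads.
        kernel2Strided<<<512, 1024>>>(d_data, alpha, data.size());
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(data.data(), d_data, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_data);
    return ok;
}
```

With `n` set to 51200*1024 and the 512 x 1024 launch, each thread performs the same 100 iterations as kernel2. The index is held in `size_t` so the arithmetic stays valid for larger arrays, and the bounds check makes the kernel safe when `n` is not a multiple of the thread count.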
Both methods take about the same time, but more than 6100 ms is alarmingly slow. Could anyone advise how to optimize this? Thanks.