I wrote a vector-add kernel myself:
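// Computes C[i] = A[i] + (*scale) * B[i] for each of the numElements entries.
__global__ void vectorAdd(const double *A, const double *B, double *scale, double *C, int numElements)
{
    // One thread per element.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // Guard against threads that fall past the end of the arrays.
    if (i < numElements)
    {
        C[i] = A[i] + (*scale) * B[i];
    }
}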
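My data size is 65526 elements. I wrote a dedicated test function for this kernel and measured an execution time of about 10 us. However, after adding the kernel to my project, the time measured between cudaProfilerStart() and cudaProfilerStop() is 1 ms. In addition, when I analyze the project with NVIDIA Visual Profiler, I get the following error: "Unable to collect metric and event values. The order of kernel execution does not match the timeline. To associate events and metrics with the correct kernel, the application must behave identically on each run. Discarding all collected events and metrics." What is the cause of this?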
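To narrow down where the 1 ms comes from, it may help to time the kernel alone with CUDA events inside the project, since cudaProfilerStart() and cudaProfilerStop() only mark the region the profiler collects data for and do not time a single kernel by themselves. The following is a minimal sketch, not the project's actual code: the helper name timeVectorAddMs, the block size of 256, the zero-filled inputs, and the scale value 2.0 are all assumptions made for illustration, and the kernel's bound uses numElements.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Kernel as in the post, with the bound taken from numElements.
__global__ void vectorAdd(const double *A, const double *B, double *scale,
                          double *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        C[i] = A[i] + (*scale) * B[i];
    }
}

// Hypothetical helper (not from the post): average time of one launch in
// milliseconds, or -1.0f if any CUDA call fails (e.g. no usable device).
float timeVectorAddMs(int numElements, int repeats)
{
    assert(numElements > 0 && repeats > 0);
    const size_t bytes = static_cast<size_t>(numElements) * sizeof(double);
    const double hostScale = 2.0;  // arbitrary test value (assumption)
    double *dA = nullptr, *dB = nullptr, *dC = nullptr, *dScale = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float totalMs = -1.0f;
    float result = -1.0f;

    const int threads = 256;  // assumed block size
    const int blocks = (numElements + threads - 1) / threads;

    // Allocate and initialize device buffers, then create the timing events.
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMalloc(&dScale, sizeof(double)) == cudaSuccess
           && cudaMemset(dA, 0, bytes) == cudaSuccess
           && cudaMemset(dB, 0, bytes) == cudaSuccess
           && cudaMemcpy(dScale, &hostScale, sizeof(double),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaEventCreate(&start) == cudaSuccess
           && cudaEventCreate(&stop) == cudaSuccess;

    if (ok)
    {
        // Warm-up launch so one-time startup cost is not measured.
        vectorAdd<<<blocks, threads>>>(dA, dB, dScale, dC, numElements);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    if (ok)
    {
        // The events bracket only the kernel launches on the default stream.
        cudaEventRecord(start);
        for (int r = 0; r < repeats; ++r)
        {
            vectorAdd<<<blocks, threads>>>(dA, dB, dScale, dC, numElements);
        }
        cudaEventRecord(stop);
        ok = cudaEventSynchronize(stop) == cudaSuccess
          && cudaEventElapsedTime(&totalMs, start, stop) == cudaSuccess;
    }
    if (ok)
    {
        result = totalMs / repeats;
    }

    // Release events and device memory.
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    cudaFree(dScale);
    return result;
}
```

Calling timeVectorAddMs(65526, 100) from the project and comparing the result with the standalone 10 us figure would show whether the kernel itself slows down there or whether the 1 ms includes other work inside the profiled region, such as memory copies, synchronization, or other kernels. As for the Visual Profiler message, it says the kernel launch order differed between runs, so making the application launch its kernels in the same order on every run is probably the first thing to check.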