用户
 找回密码
 立即注册
siheng303 该用户已被删除
发表于 2010-7-4 11:02:15
41473
A Simple CUDPP Example
This code sample demonstrates a basic usage of CUDPP for computing the parallel prefix sum of a floating point array on the GPU.

Sample Code Walkthrough
The simpleCUDPP sample is the "hello" world example for CUDPP. Its aim is to show you how to initialize, run, and shut down CUDPP functions.

The main function in simpleCUDPP.cu is runTest().

runTest()
simpleCUDPP uses libCUTIL, a CUDA utility library used in NVIDIA CUDA SDK samples to initialize CUDA and to check for errors. runTest starts by initializing the CUDA device and then declaring the number of elements and the array size for the arrays we plan to scan. It then allocates the host-side (CPU-side) input array, h_idata, and initializes the data it contains with some random values between 0 and 15.


void
runTest( int argc, char** argv)
{
    CUT_DEVICE_INIT(argc, argv);

    unsigned int numElements = 32768;
    unsigned int memSize = sizeof( float) * numElements;

    // allocate host memory
    float* h_idata = (float*) malloc( memSize);
    // initalize the memory
    for (unsigned int i = 0; i < numElements; ++i)
    {
        h_idata = (float) (rand() & 0xf);
    }


After the input data is created on the host, we allocate a device (GPU) array d_idata and copy the input data from the host using cudaMemcpy(). We also allocate a device array for the output results, d_odata.


    // allocate device memory
    float* d_idata;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_idata, memSize));
    // copy host memory to device
    CUDA_SAFE_CALL( cudaMemcpy( d_idata, h_idata, memSize,
                                cudaMemcpyHostToDevice) );

    // allocate device memory for result
    float* d_odata;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_odata, memSize));


CUDPP Plans
Next comes the real CUDPP stuff. First we have to configure CUDPP to scan our array. Configuration of algorithms in CUDPP relies on the concept of the plan. A plan is a data structure that maintains intermediate storage for the algorithm, as well as information that CUDPP may use to optimize execution of the present hardware. When invoked using cudppPlan(), the CUDPP planner takes the configuration details passed to it and generates an internal plan object. It returns a CUDPPHandle -- an opaque pointer type that is used to refer to the plan object -- that must be passed to other CUDPP functions in order to execute algorithms.

In this case we are going to do a forward exclusive float sum-scan of numElements elements. We tell the planner this by filling out a CUDPPConfiguration struct with the algorithm (CUDPP_SCAN), datatype (CUDPP_FLOAT), operation (CUDPP_ADD), and options (CUDPP_OPTION_FORWARD, CUDPP_OPTION_EXCLUSIVE). We then pass this config to cudppPlan along with the maximum number of elements we want to scan, numElements. Finally, we pass 1 and 0 for the numRows and rowPitch parameters, since we only want to scan a one-dimensional array. See the documentation for cudppPlan() for more details on these parameters.


    CUDPPConfiguration config;
    config.op = CUDPP_ADD;
    config.datatype = CUDPP_FLOAT;
    config.algorithm = CUDPP_SCAN;
    config.options = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;
   
    CUDPPHandle scanplan = 0;
    CUDPPResult result = cudppPlan(&scanplan, config, numElements, 1, 0);  

    if (CUDPP_SUCCESS != result)
    {
        printf("Error creating CUDPPPlan\n");
        exit(-1);
    }


We now have a handle to our plan object in scanplan. Next, after making sure that cudppPlan() did not return an error, we put CUDPP to work by invoking cudppScan(), to which we pass our plan handle, the output and input device arrays, and the number of elements to scan.


    // Run the scan
    cudppScan(scanplan, d_odata, d_idata, numElements);


Next, we read the results of the scan from d_odata back to the host, compute a reference solution on the CPU (computeSumScanGold()), and compare the results for correctness.


    // allocate mem for the result on host side
    float* h_odata = (float*) malloc( memSize);
    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_odata, memSize,
                                cudaMemcpyDeviceToHost) );
    // compute reference solution
    float* reference = (float*) malloc( memSize);
    computeSumScanGold( reference, h_idata, numElements, config);

    // check result
    CUTBoolean res = cutComparef( reference, h_odata, numElements);
    printf( "Test %s\n", (1 == res) ? "PASSED" : "FAILED");


Finally, we tell CUDPP to clean up the memory used for our plan object, using cudppDestroyPlan(). Then we free the host and device arrays using free() and cudaFree, respectively. After that we're finished.


    result = cudppDestroyPlan(scanplan);
    if (CUDPP_SUCCESS != result)
    {
        printf("Error destroying CUDPPPlan\n");
        exit(-1);
    }


Using CUDPP for parallel prefix sums is easy!
使用道具 举报 回复
发表于 2010-7-5 11:03:48
这个速度如何?
使用道具 举报 回复 支持 反对
发表于 2010-7-5 12:14:26
原帖由 yyfn风辰 于 2010-7-5 11:03 发表
这个速度如何?


速度肯定是很快的,就是目前功能还比较有限
CUDPPCurrent in CUDPP:
(Segmented) scan,
stream compactRadix sort,
sparse matrix-vector multiply
Random number generation
In progress: Parallel reduction, more sorts, graphs, treesOpen
使用道具 举报 回复 支持 反对
发表于 2011-4-8 20:50:04

回复 #1 siheng303 的帖子

现在用CUDA4.0,怎么编译cudpp_src_1.1.1出lib,我编译老出错
使用道具 举报 回复 支持 反对
发新帖
您需要登录后才可以回帖 登录 | 立即注册