
Friday, July 1, 2022

CUDA PyTorch GPU error: "CUDA capability sm_86 is not compatible" and installing PyTorch with CUDA in Docker

How to fix this GPU-related error


Error message

CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.

This error occurs because the installed PyTorch/CUDA build does not support your GPU's compute capability.


Finding your GPU's compute capability (version)

https://developer.nvidia.com/cuda-gpus


The sm_86 in the error message corresponds to compute capability 8.6.
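
If a PyTorch build is already installed, you can also query this directly. A quick sketch (the printed values are only examples):

# Check the GPU's compute capability and the architectures the installed PyTorch was built for.
import torch

print(torch.cuda.get_device_name(0))         # e.g. an Ampere card
print(torch.cuda.get_device_capability(0))   # e.g. (8, 6)  -> sm_86
print(torch.cuda.get_arch_list())            # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70']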


Finding an appropriate CUDA version

https://en.wikipedia.org/wiki/CUDA

GPUs supported

Supported CUDA level of GPU and card.

  • CUDA SDK 1.0 support for compute capability 1.0 – 1.1 (Tesla)[29]
  • CUDA SDK 1.1 support for compute capability 1.0 – 1.1+x (Tesla)
  • CUDA SDK 2.0 support for compute capability 1.0 – 1.1+x (Tesla)
  • CUDA SDK 2.1 – 2.3.1 support for compute capability 1.0 – 1.3 (Tesla)[30][31][32][33]
  • CUDA SDK 3.0 – 3.1 support for compute capability 1.0 – 2.0 (Tesla, Fermi)[34][35]
  • CUDA SDK 3.2 support for compute capability 1.0 – 2.1 (Tesla, Fermi)[36]
  • CUDA SDK 4.0 – 4.2 support for compute capability 1.0 – 2.1+x (Tesla, Fermi, more?).
  • CUDA SDK 5.0 – 5.5 support for compute capability 1.0 – 3.5 (Tesla, Fermi, Kepler).
  • CUDA SDK 6.0 support for compute capability 1.0 – 3.5 (Tesla, Fermi, Kepler).
  • CUDA SDK 6.5 support for compute capability 1.1 – 5.x (Tesla, Fermi, Kepler, Maxwell). Last version with support for compute capability 1.x (Tesla).
  • CUDA SDK 7.0 – 7.5 support for compute capability 2.0 – 5.x (Fermi, Kepler, Maxwell).
  • CUDA SDK 8.0 support for compute capability 2.0 – 6.x (Fermi, Kepler, Maxwell, Pascal). Last version with support for compute capability 2.x (Fermi) (Pascal GTX 1070Ti Not Supported).
  • CUDA SDK 9.0 – 9.2 support for compute capability 3.0 – 7.2 (Kepler, Maxwell, Pascal, Volta) (Pascal GTX 1070Ti Not Supported. CUDA SDK 9.0 and support CUDA SDK 9.2).
  • CUDA SDK 10.0 – 10.2 support for compute capability 3.0 – 7.5 (Kepler, Maxwell, Pascal, Volta, Turing). Last version with support for compute capability 3.x (Kepler). 10.2 is the last official release for macOS, as support will not be available for macOS in newer releases.
  • CUDA SDK 11.0 support for compute capability 3.5 – 8.0 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere (in part)).[37]
  • CUDA SDK 11.1 – 11.7 support for compute capability 3.5 – 8.6 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere).[38]

Since sm_86 means compute capability 8.6, you need to install CUDA SDK 11.1 – 11.7.


Installing the CUDA SDK

The easiest way is to use Docker.

Search for nvidia/cuda on Docker Hub, pick a version between 11.1 and 11.7, and pull the image.

docker pull nvidia/cuda:11.7.0-runtime-ubuntu20.04


Running Docker

General Docker usage is beyond the scope of this post; for installing Docker, see the link below.

https://swlock.blogspot.com/2022/06/docker-windows-1110-home-docker-docker.html

docker run --gpus all -i -t -v C:\(자신의작업폴더명)\:/work nvidia/cuda:11.7.0-runtime-ubuntu20.04 /bin/bash

After the container has been created and run once this way, connect to it on subsequent runs using docker attach.


Installing Python inside Docker

apt-get update

apt install python3.8

apt install python3.8-dev

apt install python3.8-venv

apt install python3.8-distutils

apt install python3-pip


Creating a Python virtual environment

See the venv documentation for details on how it works.

Since the folder mounted above is /work, proceed as follows.

cd /work

python3 -m venv myenv38 

source myenv38/bin/activate


Installing PyTorch

See https://pytorch.org/get-started/locally/ for the exact command.

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
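
Once installed, a quick sanity check along these lines (the version strings shown are only examples) confirms that PyTorch sees the GPU and can compute on it without the sm_86 warning:

# Verify the CUDA build, GPU visibility, and a small computation on the GPU.
import torch

print(torch.__version__)            # e.g. 1.12.0+cu116
print(torch.version.cuda)           # e.g. 11.6
print(torch.cuda.is_available())    # should be True
x = torch.ones(1000, device="cuda")
print(x.sum().item())               # 1000.0, computed on the GPU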



Saturday, November 8, 2014

Introducing GPUMLib, a CUDA library

I have been interested in CUDA lately, so I installed it and ran some code. While searching for a good existing library, I found one whose CUDA source is also worth studying.

GPUMLib Download: http://sourceforge.net/projects/gpumlib
GPUMLib Homepage: http://gpumlib.sourceforge.net/

It is an open-source GPU machine learning library written in C++ and CUDA. BP (back-propagation) and MBP (multiple back-propagation) are already implemented, so the source code makes a useful reference.
GPUMLib is an open source Graphic Processing Unit Machine Learning Library. This library aims to provide machine learning researchers and practitioners with a high performance library by taking advantage of the GPU enormous computational power. The library is developed in C++ and CUDA.

Looking at the CUDA kernel source, a representative question is how to efficiently sum the elements of an array in CUDA. In BP/MLP, the next node's value is the sum over all previous nodes multiplied by their weights, so this reduction is absolutely necessary. It could be done on the host side, but that would slow down as the number of nodes grows.
The library implements it as follows.

template <int blockSize> __device__ __forceinline__ void SumBeforeWarp(cudafloat * s) {
 if (blockSize >= 1024) {
  if (threadIdx.x < 512) s[threadIdx.x] += s[threadIdx.x + 512];
  __syncthreads();
 }

 if (blockSize >= 512) {
  if (threadIdx.x < 256) s[threadIdx.x] += s[threadIdx.x + 256];
  __syncthreads();
 }

 if (blockSize >= 256) {
  if (threadIdx.x < 128) s[threadIdx.x] += s[threadIdx.x + 128];
  __syncthreads();
 }

 if (blockSize >= 128) {
  if (threadIdx.x < 64) s[threadIdx.x] += s[threadIdx.x + 64];
  __syncthreads();
 }
}

template <int blockSize> __device__ __forceinline__ void SumWarp(volatile cudafloat * s) {
 if (blockSize >= 64) s[threadIdx.x] += s[threadIdx.x + 32];
 if (blockSize >= 32) s[threadIdx.x] += s[threadIdx.x + 16];
 if (blockSize >= 16) s[threadIdx.x] += s[threadIdx.x + 8];
 if (blockSize >= 8) s[threadIdx.x] += s[threadIdx.x + 4];
 if (blockSize >= 4) s[threadIdx.x] += s[threadIdx.x + 2];
 if (blockSize >= 2) s[threadIdx.x] += s[threadIdx.x + 1];
}

In the SumBeforeWarp function above, the buffer is repeatedly folded in half: each active thread adds the element from the upper half to its own, then __syncthreads() is called so every thread waits at the barrier. Halving again and again produces the code above, and once the active index range drops below 32, the SumWarp function takes over.
Below is the calling code, which has the condition threadIdx.x < 32. The comparison with 32 exists because NVIDIA GPUs schedule threads in units of a warp, and a warp is 32 threads, so within a single warp __syncthreads() can be omitted. The caveats are that the warp size is 32 today but could change in the future, and that calling __syncthreads() frequently hurts efficiency.

template <int blockSize> KERNEL FireLayerNeurons(cudafloat * inputs, cudafloat * weights, cudafloat * m, int mOffset, int totalNeuronsWithSelectiveActivation, cudafloat * outputs, int numInputs) {
 extern __shared__ cudafloat iw[];

 iw[threadIdx.x] = CUDA_VALUE(0.0);
 for(int i = threadIdx.x; i <= numInputs; i += blockDim.x) {
  cudafloat i_w = weights[NEURON * (numInputs + 1) + i];
  if (i > BIAS) i_w *= inputs[PATTERN * numInputs + (i - 1)];  
  iw[threadIdx.x] += i_w;
 }
 __syncthreads();

 SumBeforeWarp<blockSize>(iw);

 if (threadIdx.x < 32) {
  SumWarp<blockSize>(iw);
 
  if (threadIdx.x == 0) {
   cudafloat output = CUDA_SIGMOID(iw[0]);
   if (m != nullptr) output *= m[PATTERN * totalNeuronsWithSelectiveActivation + NEURON + mOffset];
   outputs[PATTERN * NUM_NEURONS + NEURON] = output;
  }
 }
}
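
To make the idea easier to see outside of CUDA, here is a rough Python/NumPy sketch of the same halving strategy (CPU-only, purely for illustration; the power-of-two assumption mirrors the blockSize template parameter):

# Each step folds the upper half of the buffer onto the lower half,
# so n elements are summed in log2(n) steps.
import numpy as np

def tree_sum(values):
    s = np.array(values, dtype=np.float64)
    n = len(s)
    assert n & (n - 1) == 0, "sketch assumes a power-of-two length (like blockSize)"
    stride = n // 2
    while stride > 0:
        # On the GPU, all threads with threadIdx.x < stride do this line in parallel,
        # followed by __syncthreads() (until the active threads fit in one warp).
        s[:stride] += s[stride:2 * stride]
        stride //= 2
    return s[0]

print(tree_sum(range(1, 257)))  # 32896.0, same as sum(1..256)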


Thursday, November 6, 2014

Using math functions in CUDA kernels


Using math functions in CUDA kernels

The math functions from the standard C/C++ library can be used in device code; see section D.1.
There are also intrinsic functions, listed in section D.2, which are faster but less accurate.


D. Mathematical Functions

The reference manual lists, along with their description, all the functions of the C/C++ standard library mathematical functions that are supported in device code, as well as all intrinsic functions (that are only supported in device code).
This appendix provides accuracy information for some of these functions when applicable.

D.1. Standard Functions

The functions from this section can be used in both host and device code.
This section specifies the error bounds of each function when executed on the device and also when executed on the host in the case where the host does not supply the function.
The error bounds are generated from extensive but not exhaustive tests, so they are not guaranteed bounds.

Single-Precision Floating-Point Functions

Addition and multiplication are IEEE-compliant, so have a maximum error of 0.5 ulp. However, on the device, the compiler often combines them into a single multiply-add instruction (FMAD) and for devices of compute capability 1.x, FMAD truncates the intermediate result of the multiplication as mentioned in Floating-Point Standard. This combination can be avoided by using the __fadd_[rn,rz,ru,rd]() and __fmul_[rn,rz,ru,rd]() intrinsic functions (see Intrinsic Functions).
The recommended way to round a single-precision floating-point operand to an integer, with the result being a single-precision floating-point number is rintf(), not roundf(). The reason is that roundf() maps to an 8-instruction sequence on the device, whereas rintf() maps to a single instruction. truncf(), ceilf(), and floorf() each map to a single instruction as well.
Table 6. Single-Precision Mathematical Standard Library Functions with Maximum ULP Error. The maximum error is stated as the absolute value of the difference in ulps between a correctly rounded single-precision result and the result returned by the CUDA library function.
Function : Maximum ulp error
x+y : 0 (IEEE-754 round-to-nearest-even), except for devices of compute capability 1.x when addition is merged into an FMAD
x*y : 0 (IEEE-754 round-to-nearest-even), except for devices of compute capability 1.x when multiplication is merged into an FMAD
x/y : 0 for compute capability ≥ 2 when compiled with -prec-div=true; 2 (full range) otherwise
1/x : 0 for compute capability ≥ 2 when compiled with -prec-div=true; 1 (full range) otherwise
rsqrtf(x), 1/sqrtf(x) : 2 (full range); applies to 1/sqrtf(x) only when it is converted to rsqrtf(x) by the compiler
sqrtf(x) : 0 for compute capability ≥ 2 when compiled with -prec-sqrt=true; 3 (full range) otherwise
cbrtf(x) : 1 (full range)
rcbrtf(x) : 2 (full range)
hypotf(x,y) : 3 (full range)
rhypotf(x,y) : 2 (full range)
norm3df(x,y,z) : 3 (full range)
rnorm3df(x,y,z) : 2 (full range)
expf(x) : 2 (full range)
exp2f(x) : 2 (full range)
exp10f(x) : 2 (full range)
expm1f(x) : 1 (full range)
logf(x) : 1 (full range)
log2f(x) : 3 (full range)
log10f(x) : 3 (full range)
log1pf(x) : 2 (full range)
sinf(x) : 2 (full range)
cosf(x) : 2 (full range)
tanf(x) : 4 (full range)
sincosf(x,sptr,cptr) : 2 (full range)
sinpif(x) : 2 (full range)
cospif(x) : 2 (full range)
sincospif(x,sptr,cptr) : 2 (full range)
asinf(x) : 4 (full range)
acosf(x) : 3 (full range)
atanf(x) : 2 (full range)
atan2f(y,x) : 3 (full range)
sinhf(x) : 3 (full range)
coshf(x) : 2 (full range)
tanhf(x) : 2 (full range)
asinhf(x) : 3 (full range)
acoshf(x) : 4 (full range)
atanhf(x) : 3 (full range)
powf(x,y) : 8 (full range)
erff(x) : 2 (full range)
erfcf(x) : 6 (full range)
erfinvf(x) : 3 (full range)
erfcinvf(x) : 4 (full range)
erfcxf(x) : 6 (full range)
normcdff(x) : 6 (full range)
normcdfinvf(x) : 5 (full range)
lgammaf(x) : 6 (outside interval -10.001 ... -2.264; larger inside)
tgammaf(x) : 11 (full range)
fmaf(x,y,z) : 0 (full range)
frexpf(x,exp) : 0 (full range)
ldexpf(x,exp) : 0 (full range)
scalbnf(x,n) : 0 (full range)
scalblnf(x,l) : 0 (full range)
logbf(x) : 0 (full range)
ilogbf(x) : 0 (full range)
j0f(x) : 9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6
j1f(x) : 9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6
jnf(x) : For n = 128, the maximum absolute error is 2.2 x 10^-6
y0f(x) : 9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6
y1f(x) : 9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6
ynf(x) : ceil(2 + 2.5n) for |x| < n; otherwise, the maximum absolute error is 2.2 x 10^-6
fmodf(x,y) : 0 (full range)
remainderf(x,y) : 0 (full range)
remquof(x,y,iptr) : 0 (full range)
modff(x,iptr) : 0 (full range)
fdimf(x,y) : 0 (full range)
truncf(x) : 0 (full range)
roundf(x) : 0 (full range)
rintf(x) : 0 (full range)
nearbyintf(x) : 0 (full range)
ceilf(x) : 0 (full range)
floorf(x) : 0 (full range)
lrintf(x) : 0 (full range)
lroundf(x) : 0 (full range)
llrintf(x) : 0 (full range)
llroundf(x) : 0 (full range)

Double-Precision Floating-Point Functions

The errors listed below only apply when compiling for devices with native double-precision support. When compiling for devices without such support, such as devices of compute capability 1.2 and lower, the double type gets demoted to float by default and the double-precision math functions are mapped to their single-precision equivalents.
The recommended way to round a double-precision floating-point operand to an integer, with the result being a double-precision floating-point number is rint(), not round(). The reason is that round() maps to an 8-instruction sequence on the device, whereas rint() maps to a single instruction. trunc(), ceil(), and floor() each map to a single instruction as well.
Table 7. Double-Precision Mathematical Standard Library Functions with Maximum ULP Error. The maximum error is stated as the absolute value of the difference in ulps between a correctly rounded double-precision result and the result returned by the CUDA library function.
Function : Maximum ulp error
x+y : 0 (IEEE-754 round-to-nearest-even)
x*y : 0 (IEEE-754 round-to-nearest-even)
x/y : 0 (IEEE-754 round-to-nearest-even)
1/x : 0 (IEEE-754 round-to-nearest-even)
sqrt(x) : 0 (IEEE-754 round-to-nearest-even)
rsqrt(x) : 1 (full range)
cbrt(x) : 1 (full range)
rcbrt(x) : 1 (full range)
hypot(x,y) : 2 (full range)
rhypot(x,y) : 1 (full range)
norm3d(x,y,z) : 2 (full range)
rnorm3d(x,y,z) : 1 (full range)
exp(x) : 1 (full range)
exp2(x) : 1 (full range)
exp10(x) : 1 (full range)
expm1(x) : 1 (full range)
log(x) : 1 (full range)
log2(x) : 1 (full range)
log10(x) : 1 (full range)
log1p(x) : 1 (full range)
sin(x) : 1 (full range)
cos(x) : 1 (full range)
tan(x) : 2 (full range)
sincos(x,sptr,cptr) : 1 (full range)
sinpi(x) : 1 (full range)
cospi(x) : 1 (full range)
sincospi(x,sptr,cptr) : 1 (full range)
asin(x) : 2 (full range)
acos(x) : 1 (full range)
atan(x) : 2 (full range)
atan2(y,x) : 2 (full range)
sinh(x) : 1 (full range)
cosh(x) : 1 (full range)
tanh(x) : 1 (full range)
asinh(x) : 2 (full range)
acosh(x) : 2 (full range)
atanh(x) : 2 (full range)
pow(x,y) : 2 (full range)
erf(x) : 2 (full range)
erfc(x) : 4 (full range)
erfinv(x) : 5 (full range)
erfcinv(x) : 6 (full range)
erfcx(x) : 3 (full range)
normcdf(x) : 5 (full range)
normcdfinv(x) : 7 (full range)
lgamma(x) : 4 (outside interval -11.0001 ... -2.2637; larger inside)
tgamma(x) : 8 (full range)
fma(x,y,z) : 0 (IEEE-754 round-to-nearest-even)
frexp(x,exp) : 0 (full range)
ldexp(x,exp) : 0 (full range)
scalbn(x,n) : 0 (full range)
scalbln(x,l) : 0 (full range)
logb(x) : 0 (full range)
ilogb(x) : 0 (full range)
j0(x) : 7 for |x| < 8; otherwise, the maximum absolute error is 5 x 10^-12
j1(x) : 7 for |x| < 8; otherwise, the maximum absolute error is 5 x 10^-12
jn(x) : For n = 128, the maximum absolute error is 5 x 10^-12
y0(x) : 7 for |x| < 8; otherwise, the maximum absolute error is 5 x 10^-12
y1(x) : 7 for |x| < 8; otherwise, the maximum absolute error is 5 x 10^-12
yn(x) : For |x| > 1.5n, the maximum absolute error is 5 x 10^-12
fmod(x,y) : 0 (full range)
remainder(x,y) : 0 (full range)
remquo(x,y,iptr) : 0 (full range)
modf(x,iptr) : 0 (full range)
fdim(x,y) : 0 (full range)
trunc(x) : 0 (full range)
round(x) : 0 (full range)
rint(x) : 0 (full range)
nearbyint(x) : 0 (full range)
ceil(x) : 0 (full range)
floor(x) : 0 (full range)
lrint(x) : 0 (full range)
lround(x) : 0 (full range)
llrint(x) : 0 (full range)
llround(x) : 0 (full range)

D.2. Intrinsic Functions

The functions from this section can only be used in device code.
Among these functions are the less accurate, but faster versions of some of the functions of Standard Functions. They have the same name prefixed with __ (such as __sinf(x)). They are faster as they map to fewer native instructions. The compiler has an option (-use_fast_math) that forces each function in Table 8 to compile to its intrinsic counterpart. In addition to reducing the accuracy of the affected functions, it may also cause some differences in special case handling. A more robust approach is to selectively replace mathematical function calls by calls to intrinsic functions only where it is merited by the performance gains and where changed properties such as reduced accuracy and different special case handling can be tolerated.

Table 8. Functions Affected by -use_fast_math
Operator/Function : Device Function
x/y : __fdividef(x,y)
sinf(x) : __sinf(x)
cosf(x) : __cosf(x)
tanf(x) : __tanf(x)
sincosf(x,sptr,cptr) : __sincosf(x,sptr,cptr)
logf(x) : __logf(x)
log2f(x) : __log2f(x)
log10f(x) : __log10f(x)
expf(x) : __expf(x)
exp10f(x) : __exp10f(x)
powf(x,y) : __powf(x,y)
Functions suffixed with _rn operate using the round to nearest even rounding mode.
Functions suffixed with _rz operate using the round towards zero rounding mode.
Functions suffixed with _ru operate using the round up (to positive infinity) rounding mode.
Functions suffixed with _rd operate using the round down (to negative infinity) rounding mode.

Single-Precision Floating-Point Functions

__fadd_[rn,rz,ru,rd]() and __fmul_[rn,rz,ru,rd]() map to addition and multiplication operations that the compiler never merges into FMADs. By contrast, additions and multiplications generated from the '*' and '+' operators will frequently be combined into FMADs.
The accuracy of floating-point division varies depending on the compute capability of the device and whether the code is compiled with -prec-div=false or -prec-div=true. For devices of compute capability 2.x and higher when the code is compiled with -prec-div=false or for devices of compute capability 1.x, both the regular division / operator and __fdividef(x,y) have the same accuracy, but for 2^126 < y < 2^128, __fdividef(x,y) delivers a result of zero, whereas the / operator delivers the correct result to within the accuracy stated in Table 9. Also, for 2^126 < y < 2^128, if x is infinity, __fdividef(x,y) delivers a NaN (as a result of multiplying infinity by zero), while the / operator returns infinity. On the other hand, the / operator is IEEE-compliant on devices of compute capability 2.x and higher when the code is compiled with -prec-div=true or without any -prec-div option at all since its default value is true.

Table 9. Single-Precision Floating-Point Intrinsic Functions. (Supported by the CUDA Runtime Library with Respective Error Bounds)
Function : Error bounds
__fadd_[rn,rz,ru,rd](x,y) : IEEE-compliant.
__fsub_[rn,rz,ru,rd](x,y) : IEEE-compliant.
__fmul_[rn,rz,ru,rd](x,y) : IEEE-compliant.
__fmaf_[rn,rz,ru,rd](x,y,z) : IEEE-compliant.
__frcp_[rn,rz,ru,rd](x) : IEEE-compliant.
__fsqrt_[rn,rz,ru,rd](x) : IEEE-compliant.
__frsqrt_rn(x) : IEEE-compliant.
__fdiv_[rn,rz,ru,rd](x,y) : IEEE-compliant.
__fdividef(x,y) : For y in [2^-126, 2^126], the maximum ulp error is 2.
__expf(x) : The maximum ulp error is 2 + floor(abs(1.16 * x)).
__exp10f(x) : The maximum ulp error is 2 + floor(abs(2.95 * x)).
__logf(x) : For x in [0.5, 2], the maximum absolute error is 2^-21.41; otherwise, the maximum ulp error is 3.
__log2f(x) : For x in [0.5, 2], the maximum absolute error is 2^-22; otherwise, the maximum ulp error is 2.
__log10f(x) : For x in [0.5, 2], the maximum absolute error is 2^-24; otherwise, the maximum ulp error is 3.
__sinf(x) : For x in [-π, π], the maximum absolute error is 2^-21.41, and larger otherwise.
__cosf(x) : For x in [-π, π], the maximum absolute error is 2^-21.19, and larger otherwise.
__sincosf(x,sptr,cptr) : Same as __sinf(x) and __cosf(x).
__tanf(x) : Derived from its implementation as __sinf(x) * (1/__cosf(x)).
__powf(x,y) : Derived from its implementation as exp2f(y * __log2f(x)).

Double-Precision Floating-Point Functions

__dadd_rn() and __dmul_rn() map to addition and multiplication operations that the compiler never merges into FMADs. By contrast, additions and multiplications generated from the '*' and '+' operators will frequently be combined into FMADs.
Table 10. Double-Precision Floating-Point Intrinsic Functions. (Supported by the CUDA Runtime Library with Respective Error Bounds)
Function : Error bounds
__dadd_[rn,rz,ru,rd](x,y) : IEEE-compliant.
__dsub_[rn,rz,ru,rd](x,y) : IEEE-compliant.
__dmul_[rn,rz,ru,rd](x,y) : IEEE-compliant.
__fma_[rn,rz,ru,rd](x,y,z) : IEEE-compliant.
__ddiv_[rn,rz,ru,rd](x,y) : IEEE-compliant. Requires compute capability > 2.
__drcp_[rn,rz,ru,rd](x) : IEEE-compliant. Requires compute capability > 2.
__dsqrt_[rn,rz,ru,rd](x) : IEEE-compliant. Requires compute capability > 2.

Useful batch files for JCUDA

When working with JCUDA, kernel (.cu) files are not built automatically by Eclipse, so I made a couple of batch files to make the workflow more convenient.

There are two files: cmd.bat and nvcc_batch.bat. Double-clicking cmd.bat opens a command-line window in the project folder with an environment where nvcc can run.
nvcc_batch.bat is used to build a cu file: running nvcc_batch <cu file name> (without the extension) produces a ptx file.



Contents of cmd.bat. Edit the working folder (the part changed by the d: and cd commands) to match your own setup.

call "C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\vcvarsx86_amd64.bat"
d:
cd D:\work\개인\ProgramMake\workspace\CUDAtest\
cmd.exe

Contents of nvcc_batch.bat.

nvcc.exe --use-local-env --cl-version 2012 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64" -maxrregcount=0  --machine 64 -ptx %1.cu -o %1.ptx

JCUDA (CUDA): what are grids and blocks?

Before I encountered CUDA, I vaguely assumed that just installing it and handing code to the GPU would parallelize everything automatically.
Nothing is that easy, of course: you have to write the code so that it can run in parallel and then send it to the GPU.
As mentioned earlier, sending code to the GPU means compiling the cu file to ptx and launching that file through the Driver API.
The ptx code then runs concurrently, the threads start executing, and everything finishes once all threads have terminated.

How many copies of the code in the cu (kernel) file run at once, and in what arrangement, is determined by the concepts of block and grid. I remember wondering what on earth these were when I first saw them; I spent a whole day searching to figure them out, and they are still not easy to grasp.

A grid contains multiple blocks, and a block contains multiple threads. (The maximum number of threads per block depends on the GPU; on a GTX 750 it is 1024.)
Both the grid and the block can be configured from the host as 3-dimensional structures. (The maximum value of each dimension also depends on the GPU.)

If the block dimensions are 4*2*1 and the grid dimensions are 3*2*1, it can be depicted as in the figure, where each curly arrow represents one thread.


Here is the original example.
        int blockSizeX = 256;
        int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
        cuLaunchKernel(function,
            gridSizeX,  1, 1,      // Grid dimension
            blockSizeX, 5, 1,      // Block dimension
            0, null,               // Shared memory size and stream
            kernelParameters, null // Kernel- and extra parameters
        );
        cuCtxSynchronize();

With the settings above, blockSizeX*5*1 = 256*5*1 = 1280, and since a GTX 750 allows only 1024 threads per block, this raises the following error.

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_INVALID_VALUE
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:288)
at jcuda.driver.JCudaDriver.cuLaunchKernel(JCudaDriver.java:14366)
at CUDAtest.main(CUDAtest.java:90)

I changed the example program to modify the grid dimensions instead.
        // Call the kernel function.
        int blockSizeX = 256;
        int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
        System.out.println("gridsize:"+gridSizeX+" "+(gridSizeX*1024*100));
        cuLaunchKernel(function,
            gridSizeX,  1024, 100,      // Grid dimension
            blockSizeX, 1, 1,      // Block dimension
            0, null,               // Shared memory size and stream
            kernelParameters, null // Kernel- and extra parameters
        );
As a result, the grid amounts to no fewer than 40,038,400 blocks, and multiplying that by the block dimensions gives an enormous number of threads, yet the GTX 750 handles it in a flash. The threads within a block presumably run concurrently, but I doubt the whole grid runs at once: the hardware has no room to execute that many threads simultaneously, so I suspect it processes some blocks and then launches the next ones.
gridsize:391 40038400
Test PASSED
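
To sanity-check those numbers, here is a small Python sketch of the arithmetic behind that output (numElements = 100000 is my assumption, taken from the standard JCuda vector-add sample):

import math

numElements = 100000                              # assumed; JCuda vector-add sample value
blockSizeX = 256
gridSizeX = math.ceil(numElements / blockSizeX)   # 391

blocks = gridSizeX * 1024 * 100                   # grid dimensions x * y * z
threads = blocks * blockSizeX * 1 * 1             # times the block dimensions
print(gridSizeX, blocks, threads)                 # 391 40038400 10249830400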

So how do you actually program in parallel?
The answer is in the kernel in the cu file; see the example below.
Using built-in variables such as blockIdx, threadIdx, and blockDim, each thread works out its position within the dimensions and writes its result to a specific memory location. Since this example launches both the block and the grid as one-dimensional, multiplying the block index by the block size (blockDim.x) and adding threadIdx.x yields a unique index i.

extern "C"
__global__ void add(int n, float *a, float *b, float *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i<n)
    {
        sum[i] = a[i] + b[i];
    }

}

Also, the CPU is referred to as the host and the GPU as the device.
Memory is allocated on the device side, and functions such as cuMemcpyHtoD and cuMemcpyDtoH are used to transfer memory between host and device.
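
For comparison, a minimal Python sketch of the same host/device copy flow using PyCUDA (assuming the pycuda package is installed; these are PyCUDA's wrappers around the driver API):

import numpy as np
import pycuda.autoinit          # creates a context on device 0
import pycuda.driver as cuda

a = np.arange(10, dtype=np.float32)   # host (CPU) buffer
d_a = cuda.mem_alloc(a.nbytes)        # device (GPU) buffer
cuda.memcpy_htod(d_a, a)              # host -> device (cuMemcpyHtoD)
out = np.empty_like(a)
cuda.memcpy_dtoh(out, d_a)            # device -> host (cuMemcpyDtoH)
print(out)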

Wednesday, November 5, 2014

JCUDA JCudaDeviceQuery results


JCudaDeviceQuery results

GTX 750

Found 1 devices
Device 0: GeForce GTX 750 with Compute Capability 5.0
    Maximum number of threads per block                  : 1024
    Maximum x-dimension of a block                       : 1024
    Maximum y-dimension of a block                       : 1024
    Maximum z-dimension of a block                       : 64
    Maximum x-dimension of a grid                        : 2147483647
    Maximum y-dimension of a grid                        : 65535
    Maximum z-dimension of a grid                        : 65535
    Maximum shared memory per thread block in bytes      : 49152
    Total constant memory on the device in bytes         : 65536
    Warp size in threads                                 : 32
    Maximum pitch in bytes allowed for memory copies     : 2147483647
    Maximum number of 32-bit registers per thread block  : 65536
    Clock frequency in kilohertz                         : 1084500
    Alignment requirement                                : 512
    Number of multiprocessors on the device              : 4
    Whether there is a run time limit on kernels         : 1
    Device is integrated with host memory                : 0
    Device can map host memory into CUDA address space   : 1
    Compute mode                                         : 0
    Maximum 1D texture width                             : 65536
    Maximum 2D texture width                             : 65536
    Maximum 2D texture height                            : 65536
    Maximum 3D texture width                             : 4096
    Maximum 3D texture height                            : 4096
    Maximum 3D texture depth                             : 4096
    Maximum 2D layered texture width                     : 16384
    Maximum 2D layered texture height                    : 16384
    Maximum layers in a 2D layered texture               : 2048
    Alignment requirement for surfaces                   : 512
    Device can execute multiple kernels concurrently     : 1
    Device has ECC support enabled                       : 0
    PCI bus ID of the device                             : 1
    PCI device ID of the device                          : 0
    Device is using TCC driver model                     : 0
    Peak memory clock frequency in kilohertz             : 2505000
    Global memory bus width in bits                      : 128
    Size of L2 cache in bytes                            : 2097152
    Maximum resident threads per multiprocessor          : 2048
    Number of asynchronous engines                       : 1
    Device shares a unified address space with the host  : 1
    Maximum 1D layered texture width                     : 16384
    Maximum layers in a 1D layered texture               : 2048
    PCI domain ID of the device                          : 0