Fixed fp16 breakage caused by CUDA9 changes #485
Conversation
Also fixed nccl headers for nccl-2.0 (required for CUDA9).

ERROR: test_interface1 (__main__.TestWorker)
Traceback (most recent call last):

Do I need to have MPI installed in order to use collectives? It won't install on Ubuntu 16.04.
CMakeLists.txt (outdated)
```cmake
set(CUDA_VERSION_MAJOR 8)
endif()

set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DCUDA_VERSION_MAJOR=${CUDA_VERSION_MAJOR}")
```
I don't want to depend on an installed version of cuda to compile.
Why? Would you ever build Theano on a CUDA 8 system and run it on CUDA 9?
Actually, you're right - no need to introduce build time dependency here.
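For reference, the runtime alternative is a single query against the driver API. A minimal sketch (the helper name is hypothetical):

```c
#include <cuda.h>

/* cuDriverGetVersion encodes the version as 1000*major + 10*minor,
   e.g. 9000 for CUDA 9.0. Returns -1 on failure. */
static int cuda_major_version(void) {
  int v = 0;
  if (cuDriverGetVersion(&v) != CUDA_SUCCESS)
    return -1;
  return v / 1000;
}
```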
src/gpuarray_buffer_cuda.c (outdated)
```c
"  asm(\"{ cvt.f32.f16 %0, %1;}\\n\" : \"=f\"(val) : \"h\"(__HALF_TO_CUS(h)));\n"
"  return val;\n"
"}\n"
#endif
```
This would have to be detected at runtime by consulting the context.
Yes, runtime is possible, but again: why?
Fixed. The deciding condition is not the context, though, but the RTC version.
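Roughly, the gate would look like this (a sketch only):

```c
#include <nvrtc.h>

/* Major version of the runtime compiler that will build the kernel
   source, e.g. 9 for the CUDA 9 RTC. Returns -1 on failure. */
static int rtc_major_version(void) {
  int major = 0, minor = 0;
  if (nvrtcVersion(&major, &minor) != NVRTC_SUCCESS)
    return -1;
  return major;
}
```

The kernel generator can then emit the cvt.f32.f16 helpers above only when rtc_major_version() >= 9, where half is a struct and needs the explicit conversion.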
```diff
@@ -2,6 +2,7 @@
 #define LIBGPU_EXT_CUDA

 #include <cuda.h>
+#include <cuda_fp16.h>
```
Why do you need this include here?
Well, if we have fp16-related code anywhere, cuda_fp16.h should be thought of as an extension of cuda.h.
Ok
```diff
@@ -208,10 +208,10 @@ static int gen_elemwise_basic_kernel(GpuKernel *k, gpucontext *ctx,
   }
   for (j = 0; j < n; j++) {
     if (is_array(args[j])) {
-      strb_appendf(&sb, "%s %s;", ctype(ISSET(gen_flags, GEN_CONVERT_F16) && args[j].typecode == GA_HALF ?
+      strb_appendf(&sb, "%s %s;", ctype(/* ISSET(gen_flags, GEN_CONVERT_F16) && */ args[j].typecode == GA_HALF ?
```
Why are you disabling the flag here?
I was getting some test errors. Since half is now defined as a struct, there are no cases where explicit conversion is not needed, if I understood this flag correctly.
And yes, the test errors went away since then. I even ran the nose tests with floatX=float16 with very few errors.
The intent of the flag is to choose between
- store float16, compute float32, and
- store float16, compute float16.
In the latter case I'll admit that I just let the compiler figure out how to do that and it never worked properly.
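Concretely, the flag selects the declared type of the kernel-local variable. A hedged sketch of that selection (the GA_FLOAT branch and the second format argument, argname here, are assumptions, since the diff above is truncated):

```c
/* store f16 / compute f32: declare the local as float and convert on
   load/store; otherwise keep the storage type (half). */
int compute_as_f32 = ISSET(gen_flags, GEN_CONVERT_F16) &&
                     args[j].typecode == GA_HALF;
strb_appendf(&sb, "%s %s;",
             ctype(compute_as_f32 ? GA_FLOAT : args[j].typecode),
             argname);
```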
These tests would fail without removing the ISSET check:
ERROR: test_long (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test_inc_and_set_subtensor (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test_ellipsis (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test_advanced1_inc_and_set (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_strided (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_rows_finite (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_range_finite (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_col (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test1_ok_strided (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
Ok, I'll have to take a look at this to figure out the proper solution. I don't have much time to do it this week so this may wait a while.
I'm leaning towards adding a new type for native f16 compute and keeping the existing type for f32 compute.
I have to warn you that using native fp16 is almost never a good idea - we do not do it in most frameworks due to precision issues, and in many cases it is also slower. Implementing the switch via separate types was also tried and proved more trouble than it was worth - better to use a single storage type and have a switch. If you return to this: Volta also has FP16 HMMA (compute f16 with an f32 accumulator). All three maths (pseudo f16, native f16, f16 HMMA) are slightly different, so plan in advance :)
Maybe we will stick with float16 meaning f32 compute for now then and nothing for native float16. In any case that would allow for removing the flag.
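For context, the "store float16, compute float32" pattern in device code is just a widen-on-load, narrow-on-store around float arithmetic. A minimal sketch (the kernel itself is hypothetical; the conversion intrinsics come from cuda_fp16.h):

```c
#include <cuda_fp16.h>

/* y[i] = a * x[i] + y[i], with half storage and float compute. */
__global__ void axpy_half(const __half *x, __half *y, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float xi = __half2float(x[i]);    /* widen on load */
    float yi = __half2float(y[i]);
    y[i] = __float2half(a * xi + yi); /* narrow on store */
  }
}
```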
```c
DEF_PROC(ncclResult_t, ncclReduceScatter, (const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream));
DEF_PROC(ncclResult_t, ncclBcast, (void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream));
DEF_PROC(ncclResult_t, ncclAllGather, (const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream));
```
NCCL got updated to use size_t now? Is there a way to detect that when we load the library? I would like to prevent people from loading the older one.
Yes, those are NCCL 2.0 definitions. I think if you #include nccl.h as well, you will get compiler errors if the definitions are not the same.
These definitions are used to dlopen the library and grab some function pointers. There won't be any compiler double checking our work, so we need to be careful.
That being said, I have no issues with dropping support for nccl 1.0 and blocking the load in that case.
Why? If you #include nccl.h in the .cpp and then include your .fn with the proper expansion, you end up with two sets of extern function declarations. If they don't match, the compile would break.
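The proposed check, sketched out in C (the .fn file name is hypothetical):

```c
/* The real header declares the NCCL 2.0 prototypes... */
#include <nccl.h>

/* ...and expanding the .fn list as extern declarations next to it
   makes any signature drift a compile error, since C forbids
   conflicting prototypes for the same function. */
#define DEF_PROC(ret, name, args) extern ret name args;
#include "gpuarray_collectives.fn"
#undef DEF_PROC
```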
We never include the real nccl.h anywhere.
@abergeron: my sentiment exactly: if you included it as I suggested above, you would be able to detect the API change.
I don't want to include it because I want to be able to build on machines where it is not present and then load it if later installed.
Right, this is important. Could be an optional target only.
One way to do this might be to add one of the new group APIs to the set of required functions. That will make the load fail for version 1.0, which should prevent problems of that sort.
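A sketch of that load-time guard, assuming ncclGroupStart is used as the sentinel (it only exists in the NCCL 2.0 group API):

```c
#include <dlfcn.h>
#include <stddef.h>

/* Returns a handle to NCCL >= 2.0, or NULL. */
static void *load_nccl2(void) {
  void *lib = dlopen("libnccl.so", RTLD_NOW | RTLD_LOCAL);
  if (lib == NULL)
    return NULL;
  if (dlsym(lib, "ncclGroupStart") == NULL) {
    /* NCCL 1.x: the size_t-based 2.0 signatures would not match. */
    dlclose(lib);
    return NULL;
  }
  return lib;
}
```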
I agree we should not spend time on making float16 storage and float16 compute work.
```c
va_start(ap, fmt);
vsnprintf(e->msg, ERROR_MSGBUF_LEN, fmt, ap);
va_end(ap);
#ifdef DEBUG
```
This is bad. We want to set the error string in all cases, but only print it when in DEBUG.
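Something along these lines (a sketch of the requested behavior, based on the snippet above; the fprintf destination is an assumption):

```c
/* Always format the message so gpucontext_error() can report it... */
va_start(ap, fmt);
vsnprintf(e->msg, ERROR_MSGBUF_LEN, fmt, ap);
va_end(ap);
/* ...but only echo it when debugging. */
#ifdef DEBUG
fprintf(stderr, "%s\n", e->msg);
#endif
```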
Is it ever used without DEBUG? The reason I did it is that I noticed all tests print a humongous number of those messages in DEBUG (without failing - the errors are being ignored), meaning you currently spend a lot of CPU time in sprintf() on the non-error path.
Those messages are used by Theano to report the precise cause of the error and are more generally available through gpucontext_error().
It is true that there are a couple of "regular" paths where we encounter an error, so I am open to ways to improve this.
One way might be to spot the paths where the error can't get to the user and drop the message from those.
Great, thanks! I would also suggest looking at the flip side of this observation: in DEBUG mode, useful output is flooded by thousands of messages about wrong dimensions. If those are not real errors, a different check should be used (or a 'silent' parameter added).
I've made some changes to how we handle float16 and some other stuff in preparation for 0.7, since a lot of those changes break the API (#499). This will affect a number of the changes you made here in this PR. I would like to integrate the work you did for CUDA 9 float16 and NCCL in follow-up PRs. Do you want me to cherry-pick/reorganize the commits, or do you want to do it?
@abergeron: you will definitely do a better job on the merge - please go ahead.
Is there something else from this PR that wasn't merged to master? If not, we can merge it.
I mean, if not, we can close it.
I think all of this was merged in other PRs, so closing.
I can run most of my tests now with CUDA9.