
Fixed fp16 breakage caused by CUDA9 changes #485

Closed
wants to merge 10 commits

Conversation

borisfom
Contributor

I can run most of my tests now with CUDA9.

@borisfom
Contributor Author

borisfom commented Jul 22, 2017

Also fixed the nccl headers for nccl 2.0 (required for CUDA 9).
Apparently, there are more issues with nccl; it still fails to initialize:
/usr/local/lib/python2.7/dist-packages/nose_parameterized/__init__.py:7: UserWarning: The 'nose-parameterized' package has been renamed 'parameterized'. For the two step migration instructions, see: https://github.com/wolever/parameterized#migrating-from-nose-parameterized-to-parameterized (set NOSE_PARAMETERIZED_NO_WARN=1 to suppress this warning)
  "The 'nose-parameterized' package has been renamed 'parameterized'. "
/usr/local/lib/python2.7/dist-packages/theano/gpuarray/dnn.py:169: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to a version >= v5 and <= v6.
  warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 7001 on context None
Mapped name None to device cuda0: Graphics Device (0000:01:00.0)
WARNING! Failed to register in a local GPU comm world.
Reason: 'utf8' codec can't decode byte 0xb6 in position 2: invalid start byte
WARNING! Platoon all_reduce interface will not be functional.
FEEEThe following warning is produced by testing procedure:
WARNING! Worker instance has already been initialized.
Args: (123413,), Kwds: {}
.EClosing connections and unlinking memory...

ERROR: test_interface1 (__main__.TestWorker)

Traceback (most recent call last):
  File "test_worker.py", line 40, in test_interface1
    self.worker.all_reduce(sinp, '+', sout)
  File "/usr/local/lib/python2.7/dist-packages/platoon/channel/worker.py", line 452, in all_reduce
    raise PlatoonError("all_reduce interface is not available. Check log.")
PlatoonError: ERROR! all_reduce interface is not available. Check log.

@borisfom
Contributor Author

Do I need to have MPI installed in order to use collectives? It won't install on Ubuntu 16.04.

CMakeLists.txt Outdated
set(CUDA_VERSION_MAJOR 8)
endif()

set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DCUDA_VERSION_MAJOR=${CUDA_VERSION_MAJOR}")
Member

I don't want to depend on an installed version of cuda to compile.

Contributor Author

Why? Would you ever build Theano on a CUDA 8 system and run it on CUDA 9?

Contributor Author

Actually, you're right: there is no need to introduce a build-time dependency here.

" asm(\"{ cvt.f32.f16 %0, %1;}\\n\" : \"=f\"(val) : \"h\"(__HALF_TO_CUS(h)));\n"
" return val;\n"
"}\n"
#endif
Member

This would have to be detected at runtime by consulting the context.

Contributor Author

Yes, runtime detection is possible, but again: why?

Contributor Author

Fixed. The condition is not the context, though, but the RTC version.
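
For context, a minimal sketch of such a runtime check, using the real nvrtcVersion() call; the threshold, the branch taken, and the helper name are illustrative and not the actual patch:

#include <nvrtc.h>

/* Hedged sketch: decide from the runtime compiler (NVRTC) version
 * whether to emit the CUDA 9 style fp16 conversion helper into the
 * generated kernel source. The threshold is illustrative only. */
static int rtc_is_cuda9_or_newer(void) {
  int major = 0, minor = 0;
  if (nvrtcVersion(&major, &minor) != NVRTC_SUCCESS)
    return 0;               /* conservative default if the query fails */
  return major >= 9;
}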

@@ -2,6 +2,7 @@
#define LIBGPU_EXT_CUDA

#include <cuda.h>
#include <cuda_fp16.h>
Member

Why do you need this include here?

Contributor Author

Well, if we have fp16-related code anywhere, cuda_fp16.h should be thought of as an extension of cuda.h.

Member

Ok

@@ -208,10 +208,10 @@ static int gen_elemwise_basic_kernel(GpuKernel *k, gpucontext *ctx,
}
for (j = 0; j < n; j++) {
if (is_array(args[j])) {
strb_appendf(&sb, "%s %s;", ctype(ISSET(gen_flags, GEN_CONVERT_F16) && args[j].typecode == GA_HALF ?
strb_appendf(&sb, "%s %s;", ctype(/* ISSET(gen_flags, GEN_CONVERT_F16) && */ args[j].typecode == GA_HALF ?
Member

Why are you disabling the flag here?

Contributor Author

I was getting some test errors. Since half is now defined as a struct, there are no cases where explicit conversion is not needed, if I understood this flag correctly.

Contributor Author

And yes, the test errors went away after that. I even ran the nose tests with floatX=float16 with very few errors.

Member

The intent of the flag is to choose between:

  • store float16, compute float32, and
  • store float16, compute float16.

In the latter case, I'll admit that I just let the compiler figure out how to do that, and it never worked properly.
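
As a reference point, a minimal sketch of what that flag could mean for the generated compute type, assuming libgpuarray's typecode enum (GA_HALF, GA_FLOAT); the helper name is hypothetical and this is not the library's actual generator code:

#include <gpuarray/types.h>   /* GA_HALF, GA_FLOAT (header path may differ) */

/* Hedged sketch: pick the in-kernel compute type for a stored element.
 *   flag set   -> store float16, compute float32 (convert on load/store)
 *   flag clear -> store float16, compute float16 (native half math)   */
static int compute_typecode(unsigned int gen_flags, unsigned int convert_f16_flag,
                            int storage_typecode) {
  if (storage_typecode == GA_HALF && (gen_flags & convert_f16_flag))
    return GA_FLOAT;
  return storage_typecode;
}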

Contributor Author

Those tests would fail without removing the ISSET check:
ERROR: test_long (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test_inc_and_set_subtensor (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test_ellipsis (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test_advanced1_inc_and_set (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_strided (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_rows_finite (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_range_finite (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_col (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test1_ok_strided (theano.gpuarray.tests.test_subtensor.G_subtensorF16)

Member

Ok, I'll have to take a look at this to figure out the proper solution. I don't have much time to do it this week so this may wait a while.

I'm leaning towards adding a new type for native f16 compute and keeping the existing type for f32 compute.

Contributor Author

I have to warn you: using native fp16 is almost never a good idea. We do not do it in most frameworks due to precision issues, and in many cases it is also slower. Implementing the switch via separate types was also tried and proved more trouble than it was worth; it is better to use a single storage type and have a switch. If you do return to this: Volta also has fp16 HMMA (compute f16 with an f32 accumulator), and all three maths (pseudo f16, native f16, f16 HMMA) are slightly different, so plan in advance :)

Member

Maybe we will stick with float16 meaning f32 compute for now, then, and offer nothing for native float16. In any case, that would allow removing the flag.

DEF_PROC(ncclResult_t, ncclReduceScatter, (const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream));
DEF_PROC(ncclResult_t, ncclBcast, (void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream ));
DEF_PROC(ncclResult_t, ncclAllGather, (const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream));
Member

nccl got updated to use size_t now? Is there a way to detect that when we load the library? I would like to prevent people from loading the older one.

Contributor Author

Yes, those are nccl 2.0 definitions. I think that if you #include nccl.h as well, you will get compiler errors if the definitions are not the same.

Member

These definitions are used to dlopen the library and grab some function pointers. There won't be any compiler double checking our work, so we need to be careful.

That being said, I have no issues with dropping support for nccl 1.0 and blocking the load in that case.

Contributor Author

Why? If you #include nccl.h in the .cpp and then include your .fn with the proper expansion, you would end up with two sets of extern function declarations. If they do not match, the compile would break.
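
A minimal sketch of that cross-check, assuming a hypothetical libnccl.fn file listing the DEF_PROC entries (the real file name and macro layout may differ); in C, a prototype that disagrees with nccl.h's own declaration produces a "conflicting types" compile error:

/* Hedged sketch of an optional compile-time check, not the normal build. */
#include <stddef.h>
#include <nccl.h>                /* the real prototypes */

#define DEF_PROC(ret, name, args) extern ret name args;
#include "libnccl.fn"            /* hypothetical list of DEF_PROC entries */
#undef DEF_PROC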

Member

We never include the real nccl.h anywhere.

Contributor Author

@abergeron: my sentiment exactly. Should you include it as I suggested above, you would be able to detect the API change.

Member

I don't want to include it because I want to be able to build on machines where it is not present and then load it if later installed.

Contributor Author

Right, this is important. Could be an optional target only.

Member

One way to do this might be to add one of the new group API functions to the set of required functions. That would make the load fail for version 1.0, which should prevent problems of that sort.
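
For illustration, a dlopen-based sketch of that idea using only standard dlfcn calls; the loader function and library names here are placeholders, not libgpuarray's actual loading code:

#include <dlfcn.h>
#include <stdio.h>

/* Hedged sketch: refuse to load NCCL 1.x by requiring a symbol that only
 * exists starting with NCCL 2.0 (the group API). */
static void *load_nccl(void) {
  void *h = dlopen("libnccl.so.2", RTLD_NOW | RTLD_LOCAL);
  if (h == NULL)
    h = dlopen("libnccl.so", RTLD_NOW | RTLD_LOCAL);
  if (h == NULL)
    return NULL;
  if (dlsym(h, "ncclGroupStart") == NULL) {  /* added in NCCL 2.0 */
    fprintf(stderr, "NCCL >= 2.0 is required\n");
    dlclose(h);
    return NULL;
  }
  return h;
}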

@lamblin lamblin modified the milestone: 0.6.10 Jul 31, 2017
@nouiz
Member

nouiz commented Aug 1, 2017 via email

va_start(ap, fmt);
vsnprintf(e->msg, ERROR_MSGBUF_LEN, fmt, ap);
va_end(ap);
#ifdef DEBUG
Member

This is bad. We want to set the error string in all cases, but only print it when in DEBUG.
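
A minimal sketch of that behaviour, assuming an error struct with a msg buffer like the one in the diff; the names here are illustrative, not the exact libgpuarray code:

#include <stdarg.h>
#include <stdio.h>

#define ERROR_MSGBUF_LEN 1024

typedef struct { char msg[ERROR_MSGBUF_LEN]; } error_t;

/* Hedged sketch: always record the message so callers (e.g. via
 * gpucontext_error()) can retrieve it, but only echo it to stderr in
 * DEBUG builds. */
static void error_set_msg(error_t *e, const char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(e->msg, ERROR_MSGBUF_LEN, fmt, ap);  /* always set */
  va_end(ap);
#ifdef DEBUG
  fprintf(stderr, "ERROR: %s\n", e->msg);        /* print only in DEBUG */
#endif
}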

Contributor Author

Is it ever used without DEBUG? The reason I did it is that I noticed all the tests print a humongous number of those messages in DEBUG (without failing; the errors are being ignored), meaning you currently spend a lot of CPU time in sprintf() on the non-error path.

Member

Those messages are used by Theano to report the precise cause of the error and are more generally available through gpucontext_error().

It is true that there are a couple of "regular" paths where we encounter an error, so I am open to a way to improve this.

One of the ways to improve this might be to spot the paths where the error can't get to the user and remove the message from those.

Contributor Author

Great, thanks! I would also suggest looking at the flip side of this observation: when in DEBUG mode, useful output is being flooded by thousands of messages about wrong dimensions. If those are not real errors, a different check should be used (or a 'silent' parameter added).

@abergeron
Member

I've made some changes to how we handle float16 and some other stuff in preparation for 0.7, since a lot of those changes break the API (#499).

This will affect a number of the changes you made in this PR. I would like to integrate the work you did for CUDA 9 float16 and NCCL in follow-up PRs. Do you want me to cherry-pick/reorganize the commits, or do you want to do it?

@borisfom
Contributor Author

borisfom commented Aug 18, 2017

@abergeron: you will definitely do a better job on the merge; please go ahead.

@abergeron
Member

I've split the changes into #502 and #503 to separate the CUDA 9 and NCCL 2.0 concerns.

Please tell me if you see anything wrong in these PRs.

@nouiz
Member

nouiz commented Aug 25, 2017

Is there something else from this PR that wasn't merged to master? If not, we can merge it.

@nouiz
Member

nouiz commented Aug 25, 2017

I mean, if not we can close it.

@nouiz
Member

nouiz commented Aug 25, 2017

I think all of this was merged in other PRs, so closing.

@nouiz nouiz closed this Aug 25, 2017