
Fixed fp16 breakage caused by CUDA9 changes #485

Closed
wants to merge 10 commits

Conversation

borisfom
Contributor

I can run most of my tests now with CUDA9.

@borisfom
Contributor Author

borisfom commented Jul 22, 2017

Also fixed the nccl headers for nccl 2.0 (required for CUDA 9).
Apparently, there are more issues with nccl; it still fails to initialize:
/usr/local/lib/python2.7/dist-packages/nose_parameterized/__init__.py:7: UserWarning: The 'nose-parameterized' package has been renamed 'parameterized'. For the two step migration instructions, see: https://github.com/wolever/parameterized#migrating-from-nose-parameterized-to-parameterized (set NOSE_PARAMETERIZED_NO_WARN=1 to suppress this warning)
  "The 'nose-parameterized' package has been renamed 'parameterized'. "
/usr/local/lib/python2.7/dist-packages/theano/gpuarray/dnn.py:169: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to a version >= v5 and <= v6.
  warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 7001 on context None
Mapped name None to device cuda0: Graphics Device (0000:01:00.0)
WARNING! Failed to register in a local GPU comm world.
Reason: 'utf8' codec can't decode byte 0xb6 in position 2: invalid start byte
WARNING! Platoon all_reduce interface will not be functional.
FEEEThe following warning is produced by testing procedure:
WARNING! Worker instance has already been initialized.
Args: (123413,), Kwds: {}
.EClosing connections and unlinking memory...

ERROR: test_interface1 (__main__.TestWorker)

Traceback (most recent call last):
  File "test_worker.py", line 40, in test_interface1
    self.worker.all_reduce(sinp, '+', sout)
  File "/usr/local/lib/python2.7/dist-packages/platoon/channel/worker.py", line 452, in all_reduce
    raise PlatoonError("all_reduce interface is not available. Check log.")
PlatoonError: ERROR! all_reduce interface is not available. Check log.

@borisfom
Contributor Author

Do I need to have MPI installed in order to use collectives? It won't install on Ubuntu 16.04.

CMakeLists.txt Outdated
set(CUDA_VERSION_MAJOR 8)
endif()

set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DCUDA_VERSION_MAJOR=${CUDA_VERSION_MAJOR}")
Member

I don't want to depend on an installed version of cuda to compile.

Contributor Author

Why? Would you ever build Theano on a CUDA 8 system and run it on CUDA 9?

Contributor Author

Actually, you're right: there is no need to introduce a build-time dependency here.

" asm(\"{ cvt.f32.f16 %0, %1;}\\n\" : \"=f\"(val) : \"h\"(__HALF_TO_CUS(h)));\n"
" return val;\n"
"}\n"
#endif
Member

This would have to be detected at runtime by consulting the context.

Contributor Author

Yes, runtime detection is possible, but again: why?

Contributor Author

Fixed. The condition is not the context, though, but the RTC version.
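
For context, a minimal sketch of such a runtime check, using the real nvrtcVersion() call; the threshold, the branch taken, and the helper name are illustrative and not the actual patch:

#include <nvrtc.h>

/* Hedged sketch: decide from the runtime compiler (NVRTC) version
 * whether to emit the CUDA 9 style fp16 conversion helper into the
 * generated kernel source. The threshold is illustrative only. */
static int rtc_is_cuda9_or_newer(void) {
  int major = 0, minor = 0;
  if (nvrtcVersion(&major, &minor) != NVRTC_SUCCESS)
    return 0;               /* conservative default if the query fails */
  return major >= 9;
}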

@@ -2,6 +2,7 @@
#define LIBGPU_EXT_CUDA

#include <cuda.h>
#include <cuda_fp16.h>
Member

Why do you need this include here?

Contributor Author

Well, if we have fp16-related code anywhere, cuda_fp16.h should be thought of as an extension of cuda.h.

Member

Ok

@@ -208,10 +208,10 @@ static int gen_elemwise_basic_kernel(GpuKernel *k, gpucontext *ctx,
}
for (j = 0; j < n; j++) {
if (is_array(args[j])) {
strb_appendf(&sb, "%s %s;", ctype(ISSET(gen_flags, GEN_CONVERT_F16) && args[j].typecode == GA_HALF ?
strb_appendf(&sb, "%s %s;", ctype(/* ISSET(gen_flags, GEN_CONVERT_F16) && */ args[j].typecode == GA_HALF ?
Member

Why are you disabling the flag here?

Contributor Author

I was getting some test errors. Since half is now defined as a struct, there are no cases where explicit conversion is not needed, if I understood this flag correctly.

Contributor Author

And yes, the test errors went away after that. I even ran the nose tests with floatX=float16 with very few errors.

Member

The intent of the flag is to choose between:

  • store float16, compute float32, and
  • store float16, compute float16.

In the latter case, I'll admit that I just let the compiler figure out how to do that, and it never worked properly.
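
As a reference point, a minimal sketch of what that flag could mean for the generated compute type, assuming libgpuarray's typecode enum (GA_HALF, GA_FLOAT); the helper name is hypothetical and this is not the library's actual generator code:

#include <gpuarray/types.h>   /* GA_HALF, GA_FLOAT (header path may differ) */

/* Hedged sketch: pick the in-kernel compute type for a stored element.
 *   flag set   -> store float16, compute float32 (convert on load/store)
 *   flag clear -> store float16, compute float16 (native half math)   */
static int compute_typecode(unsigned int gen_flags, unsigned int convert_f16_flag,
                            int storage_typecode) {
  if (storage_typecode == GA_HALF && (gen_flags & convert_f16_flag))
    return GA_FLOAT;
  return storage_typecode;
}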

Contributor Author

Those tests would fail without removing the ISSET check:
ERROR: test_long (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test_inc_and_set_subtensor (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test_ellipsis (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test_advanced1_inc_and_set (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_strided (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_rows_finite (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_range_finite (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test2_ok_col (theano.gpuarray.tests.test_subtensor.G_subtensorF16)
ERROR: test1_ok_strided (theano.gpuarray.tests.test_subtensor.G_subtensorF16)

Member

Ok, I'll have to take a look at this to figure out the proper solution. I don't have much time to do it this week so this may wait a while.

I'm leaning towards adding a new type for native f16 compute and keeping the existing type for f32 compute.

Contributor Author

I have to warn you: using native fp16 is almost never a good idea. We do not do it in most frameworks due to precision issues, and in many cases it is also slower. Implementing the switch via separate types was also tried and proved more trouble than it was worth; it is better to use a single storage type and have a switch. If you do return to this: Volta also has fp16 HMMA (compute f16 with an f32 accumulator), and all three maths (pseudo f16, native f16, f16 HMMA) are slightly different, so plan in advance :)

Member

Maybe we will stick with float16 meaning f32 compute for now, then, and offer nothing for native float16. In any case, that would allow removing the flag.

DEF_PROC(ncclResult_t, ncclReduceScatter, (const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream));
DEF_PROC(ncclResult_t, ncclBcast, (void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream ));
DEF_PROC(ncclResult_t, ncclAllGather, (const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream));
Member

nccl got updated to use size_t now? Is there a way to detect that when we load the library? I would like to prevent people from loading the older one.

Contributor Author

Yes, those are nccl 2.0 definitions. I think that if you #include nccl.h as well, you will get compiler errors if the definitions are not the same.

Member

These definitions are used to dlopen the library and grab some function pointers. There won't be any compiler double checking our work, so we need to be careful.

That being said, I have no issues with dropping support for nccl 1.0 and blocking the load in that case.

Contributor Author

Why? If you #include nccl.h in the .cpp and then include your .fn with the proper expansion, you would end up with two sets of extern function declarations. If they do not match, the compile would break.
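
A minimal sketch of that cross-check, assuming a hypothetical libnccl.fn file listing the DEF_PROC entries (the real file name and macro layout may differ); in C, a prototype that disagrees with nccl.h's own declaration produces a "conflicting types" compile error:

/* Hedged sketch of an optional compile-time check, not the normal build. */
#include <stddef.h>
#include <nccl.h>                /* the real prototypes */

#define DEF_PROC(ret, name, args) extern ret name args;
#include "libnccl.fn"            /* hypothetical list of DEF_PROC entries */
#undef DEF_PROC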

Member

We never include the real nccl.h anywhere.

Contributor Author

@abergeron: my sentiment exactly. Should you include it as I suggested above, you would be able to detect the API change.

Member

I don't want to include it because I want to be able to build on machines where it is not present and then load it if later installed.

Contributor Author

Right, this is important. Could be an optional target only.

Member

One way to do this might be to add one of the new group API functions to the set of required functions. That would make the load fail for version 1.0, which should prevent problems of that sort.
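
For illustration, a dlopen-based sketch of that idea using only standard dlfcn calls; the loader function and library names here are placeholders, not libgpuarray's actual loading code:

#include <dlfcn.h>
#include <stdio.h>

/* Hedged sketch: refuse to load NCCL 1.x by requiring a symbol that only
 * exists starting with NCCL 2.0 (the group API). */
static void *load_nccl(void) {
  void *h = dlopen("libnccl.so.2", RTLD_NOW | RTLD_LOCAL);
  if (h == NULL)
    h = dlopen("libnccl.so", RTLD_NOW | RTLD_LOCAL);
  if (h == NULL)
    return NULL;
  if (dlsym(h, "ncclGroupStart") == NULL) {  /* added in NCCL 2.0 */
    fprintf(stderr, "NCCL >= 2.0 is required\n");
    dlclose(h);
    return NULL;
  }
  return h;
}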

@lamblin lamblin modified the milestone: 0.6.10 Jul 31, 2017
@nouiz
Member

nouiz commented Aug 1, 2017 via email

va_start(ap, fmt);
vsnprintf(e->msg, ERROR_MSGBUF_LEN, fmt, ap);
va_end(ap);
#ifdef DEBUG
Member

This is bad. We want to set the error string in all cases, but only print it when in DEBUG.
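
A minimal sketch of that behaviour, assuming an error struct with a msg buffer like the one in the diff; the names here are illustrative, not the exact libgpuarray code:

#include <stdarg.h>
#include <stdio.h>

#define ERROR_MSGBUF_LEN 1024

typedef struct { char msg[ERROR_MSGBUF_LEN]; } error_t;

/* Hedged sketch: always record the message so callers (e.g. via
 * gpucontext_error()) can retrieve it, but only echo it to stderr in
 * DEBUG builds. */
static void error_set_msg(error_t *e, const char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(e->msg, ERROR_MSGBUF_LEN, fmt, ap);  /* always set */
  va_end(ap);
#ifdef DEBUG
  fprintf(stderr, "ERROR: %s\n", e->msg);        /* print only in DEBUG */
#endif
}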

Contributor Author

Is it ever used without DEBUG? The reason I did it is that I noticed all the tests print a humongous number of those messages in DEBUG (without failing; the errors are being ignored), meaning you currently spend a lot of CPU time in sprintf() on the non-error path.

Member

Those messages are used by Theano to report the precise cause of the error and are more generally available through gpucontext_error().

It is true that there are a couple of "regular" paths where we encounter an error, so I am open to a way to improve this.

One of the ways to improve this might be to spot the paths where the error can't get to the user and remove the message from those.

Contributor Author

Great, thanks! I would also suggest looking at the flip side of this observation: when in DEBUG mode, useful output is being flooded by thousands of messages about wrong dimensions. If those are not real errors, a different check should be used (or a 'silent' parameter added).

@abergeron
Member

I've made some changes to how we handle float16 and some other stuff in preparation for 0.7, since a lot of those changes break the API (#499).

This will affect a number of the changes you made in this PR. I would like to integrate the work you did for CUDA 9 float16 and NCCL in follow-up PRs. Do you want me to cherry-pick/reorganize the commits, or do you want to do it?

@borisfom
Contributor Author

borisfom commented Aug 18, 2017

@abergeron: you will definitely do a better job on the merge; please go ahead.

@abergeron
Member

I've split the changes into #502 and #503 to separate the CUDA 9 and NCCL 2.0 concerns.

Please tell me if you see anything wrong in these PRs.

@nouiz
Member

nouiz commented Aug 25, 2017

Is there something else from this PR that wasn't merged to master? If not, we can merge it.

@nouiz
Member

nouiz commented Aug 25, 2017

I mean, if not we can close it.

@nouiz
Member

nouiz commented Aug 25, 2017

I think all of this was merged in other PRs, so closing.

@nouiz nouiz closed this Aug 25, 2017