Upgrade bindings to nccl 2.0 #503
src/loaders/libnccl.fn
DEF_PROC(ncclResult_t, ncclAllGather, (const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream));
// We don't need this but we use it as a sentinel to prevent nccl 1.0 from loading.
Can you explain how this works and what error the user gets?
It will try to find the ncclStartGroup symbol in the library, not find it, complain about it not being present and abort the load.
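The sentinel mechanism described here can be modelled in a few lines of C. This is only a sketch under stated assumptions, not the libgpuarray loader itself: `fake_lib`, `fake_sym`, `load_required` and the export tables are hypothetical names invented for illustration, `fake_sym` stands in for a real symbol lookup such as `dlsym`, the export lists are abbreviated, and the sentinel name is used exactly as written in the comment above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a loaded shared library: just its exported names. */
typedef struct {
    const char *const *exports;
    size_t count;
} fake_lib;

/* Plays the role of a dlsym()-style lookup: non-NULL if the name is exported. */
static const void *fake_sym(const fake_lib *lib, const char *name) {
    for (size_t i = 0; i < lib->count; i++) {
        if (strcmp(lib->exports[i], name) == 0)
            return lib->exports[i];
    }
    return NULL;
}

/* Resolve every required symbol and abort the whole load on the first miss.
 * Returns 0 on success, -1 on failure with a message written to err. */
static int load_required(const fake_lib *lib, const char *const *required,
                         size_t n, char *err, size_t errlen) {
    for (size_t i = 0; i < n; i++) {
        if (fake_sym(lib, required[i]) == NULL) {
            snprintf(err, errlen,
                     "symbol %s not found (is the installed nccl older than 2.0?)",
                     required[i]);
            return -1;
        }
    }
    if (errlen > 0)
        err[0] = '\0';
    return 0;
}

/* Abbreviated, assumed export lists: the older library lacks the sentinel. */
static const char *const NCCL1_EXPORTS[] = {"ncclAllGather", "ncclAllReduce"};
static const char *const NCCL2_EXPORTS[] = {"ncclAllGather", "ncclAllReduce",
                                            "ncclStartGroup"};

/* The last entry is never called; it is listed only so that an old library fails to load. */
static const char *const REQUIRED[] = {"ncclAllGather", "ncclStartGroup"};
```

Calling `load_required` with the first table fails and reports the missing sentinel, while the second table loads cleanly. Putting a version hint in the message, as sketched here, is one way to address the request for a more helpful error.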
Can you put that error message somewhere in the docs, and say that the
problem is the nccl version and that the fix is to update?
That way, people will ask less often what the error means
and how to fix it.
…On Tue, Aug 22, 2017 at 12:24 PM abergeron ***@***.***> wrote:
It will try to find the ncclStartGroup symbol in the library, not find it,
complain about it not being present and abort the load.
After an in-person discussion with @abergeron: he will change it to give a clear error when the nccl that is available is not 2.0.
The error isn't passed up to the user: DEVICE=cuda0 nosetests tests/collectives/test_collectives.py -s --pdb-failure --pdb doesn't raise any errors on this branch with nccl 1 installed.
Also, it segfaults:
~/repos/libgpuarray/pygpu$ DEVICE=cuda0 nosetests tests/collectives/test_collectives.py -s
The problem was my environment, which mixed pygpu and libgpuarray versions. So merging.
Extracted from #485.
fix #497