Upgrade bindings to nccl 2.0 #503

Merged
merged 3 commits into from
Aug 23, 2017

@abergeron abergeron commented Aug 21, 2017

Extracted from #485.

fix #497

DEF_PROC(ncclResult_t, ncclAllGather, (const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream));
// We don't need this but we use it as a sentinel to prevent nccl 1.0 from loading.
Member

Can you explain how this works and what error the user would get?

Member Author

The loader will try to find the ncclGroupStart symbol in the library, fail to find it, complain that it is not present, and abort the load.
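The sentinel mechanism described above can be sketched with plain `dlopen`/`dlsym`. This is only an illustration under stated assumptions, not the libgpuarray loader: `load_checked` and its error strings are hypothetical names made up for this example.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (NOT the libgpuarray loader): open `libname` and
 * require every name in `symbols` to resolve. A NULL `libname` refers to
 * the running program itself, which is handy for trying the sketch out.
 * Returns the library handle, or NULL with a message written to `err`.
 */
static void *load_checked(const char *libname, const char *const *symbols,
                          size_t nsyms, char *err, size_t errlen) {
  const char *shown = libname ? libname : "(main program)";
  void *lib = dlopen(libname, RTLD_LAZY | RTLD_LOCAL);
  if (lib == NULL) {
    snprintf(err, errlen, "could not load %s", shown);
    return NULL;
  }
  for (size_t i = 0; i < nsyms; i++) {
    /* A missing symbol (for example a sentinel that only newer
       versions export) aborts the whole load. */
    if (dlsym(lib, symbols[i]) == NULL) {
      snprintf(err, errlen, "symbol %s not found in %s", symbols[i], shown);
      dlclose(lib);
      return NULL;
    }
  }
  return lib;
}
```

Calling it with `"libnccl.so"` and a list such as `{"ncclAllGather", "ncclGroupStart"}` would, assuming the group calls are exported only by NCCL 2.x, reject an NCCL 1.x library even though the group call itself is never used. On older glibc versions the program may need to be linked with `-ldl`.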

nouiz commented Aug 22, 2017 via email

@nouiz nouiz mentioned this pull request Aug 22, 2017
nouiz commented Aug 22, 2017

After an in-person discussion with @abergeron: he will change it to give a clear error when the nccl that is available is not 2.0.

nouiz commented Aug 23, 2017

The error isn't passed up to the user:

DEVICE=cuda0 nosetests tests/collectives/test_collectives.py -s --pdb-failure --pdb

does not raise any error on this branch with nccl 1 installed.

nouiz commented Aug 23, 2017

Also, there is a segfault.

nouiz commented Aug 23, 2017

~/repos/libgpuarray/pygpu$ DEVICE=cuda0 nosetests tests/collectives/test_collectives.py -s
*** Testing for GeForce GTX 750
mpi4py found: True
*** Collectives testing for GeForce GTX 750
F......*** Error in `/Tmp/lisa/os_v5/anaconda/bin/python': free(): invalid pointer: 0x00007f76f5e69bf8 ***
Aborted (core dumped)

nouiz commented Aug 23, 2017

It was my environment, which mixed pygpu and libgpuarray versions. So merging.

@nouiz nouiz merged commit 351f359 into Theano:master Aug 23, 2017
Successfully merging this pull request may close these issues.

nccl 2.0 support