[ENH] Adding base protocol and PC algorithm and example #11
Conversation
Signed-off-by: Adam Li <adam2392@gmail.com>
A note about upstream dependency: Since the
Another high-level note: Currently there does not exist a robust package in Python dedicated to conditional independence (CI) testing. Therefore, for now I think it's best to keep it in-house. Eventually it makes sense to out-source it into another package, since
```python
from .base import BaseConditionalIndependenceTest


class KernelCITest(BaseConditionalIndependenceTest):
```
Note that HSIC and KCI tests are also part of the GCM module: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/kernel.py
Ooo nice! I was hoping to find time for the approx version. In line with the functional API push, would it be desirable to refactor those into class objects with a required API, to make all functionality that relies on CI testing more robust?
Where would be the best place for "CI" tests to go?
I am not sure a class object is really necessary (at least, I don't see the benefit). In the GCM package, we ensure a compatible API by requiring unconditional independence tests to be a function of the form `Callable[[np.ndarray, np.ndarray], float]`, and conditional independence tests of the form `Callable[[np.ndarray, np.ndarray, np.ndarray], float]`. An example would be something like:
```python
def hsic(X: np.ndarray, Y: np.ndarray, **other_parameters) -> float:
    # ... compute p-value ...
    return p_value


def kci(X: np.ndarray, Y: np.ndarray, Z: np.ndarray, **other_parameters) -> float:
    # ... compute p-value ...
    return p_value


def my_pc_algo(
    data: pd.DataFrame,
    independence_test: Callable[[np.ndarray, np.ndarray], float] = hsic,
    conditional_test: Callable[[np.ndarray, np.ndarray, np.ndarray], float] = kci,
    significance_level: float = 0.05,
):
    # ... do PC stuff ...
    if nodeC is None:
        p_value = independence_test(data[nodeA], data[nodeB])
    else:
        p_value = conditional_test(data[nodeA], data[nodeB], data[nodeC])
    reject = p_value < significance_level
    # ... do PC stuff ...
```
If you want to configure the tests further, you can simply do:
```python
my_pc_algo(data, partial(hsic, my_hsic_parameter_a=0.1, kernel="laplace"))
```
Ok, just an idea to emphasize that we don't always need class objects for everything :)
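To make the functional idea concrete, here is a minimal runnable sketch. The `hsic`/`kci` bodies are placeholders that return a fixed p-value (a real implementation would compute kernel-based statistics), and `run_ci_test` is a hypothetical helper name, not an existing API:

```python
from functools import partial
from typing import Callable, Optional

import numpy as np


# Placeholder tests: a real hsic/kci would compute a kernel-based p-value.
def hsic(X: np.ndarray, Y: np.ndarray, kernel: str = "rbf") -> float:
    return 0.5  # pretend p-value for the unconditional test


def kci(X: np.ndarray, Y: np.ndarray, Z: np.ndarray, kernel: str = "rbf") -> float:
    return 0.5  # pretend p-value for the conditional test


def run_ci_test(
    X: np.ndarray,
    Y: np.ndarray,
    Z: Optional[np.ndarray] = None,
    independence_test: Callable[[np.ndarray, np.ndarray], float] = hsic,
    conditional_test: Callable[[np.ndarray, np.ndarray, np.ndarray], float] = kci,
    significance_level: float = 0.05,
) -> bool:
    """Return True iff the (conditional) independence hypothesis is rejected."""
    p_value = independence_test(X, Y) if Z is None else conditional_test(X, Y, Z)
    return p_value < significance_level


rng = np.random.default_rng(0)
X, Y, Z = rng.normal(size=100), rng.normal(size=100), rng.normal(size=100)

# Configure a test without any class machinery: bind extra parameters via partial.
laplace_kci = partial(kci, kernel="laplace")
reject = run_ci_test(X, Y, Z, conditional_test=laplace_kci)
print(reject)  # False, since the placeholder p-value 0.5 exceeds alpha=0.05
```

The point of the sketch is that `partial` plays the role a class constructor would otherwise play: it binds configuration to the callable before the algorithm ever sees it.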
Btw, generally I like the idea of moving the independence tests into a separate package. I think they can also be very helpful outside the 'causality' context and are often required as core functions in several algorithms (e.g. graph validation).
Yeah, agreed. CI testing is perhaps something to discuss next time then?
Re class vs. no class: I started out with a functional API too, but I realized I needed to pass a large number of parameters to the structure learning algorithm, so it became really clunky. That was probably my main motivation, but I agree (right now) that there's not much difference, except ease of reading and typing. I'm happy to switch back to functions if that's desired.
E.g.:
```python
pc = PC(
    # ... PC kwargs ...
    ci_test=kci,
    # ... CI test kwargs ...
)
```
I think right now the PC algorithm as I have it has 7 parameters. If we also pass the CI test parameters (say for KCI, which probably has ~9 parameters), the PC algorithm could have 16+ parameters overall. I think eventually, it might be useful
Yea, but we will typically have default parameters, and then, in practice, only a fraction of users would really change more than one parameter overall :) In the 'average' use case, a user would not change anything and would stick to the defaults. Btw, another option is to create a config object that can be passed (just some ideas :)).
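The config-object idea could be sketched with dataclasses; every name below is hypothetical, just to show how nested configs keep the algorithm signature small while still exposing every knob:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CITestConfig:
    # Bundles CI-test parameters so they don't bloat the algorithm signature.
    method: str = "kci"
    kernel: str = "rbf"


@dataclass
class PCConfig:
    significance_level: float = 0.05
    max_cond_set_size: int = 3
    ci_test: CITestConfig = field(default_factory=CITestConfig)


def my_pc_algo(data, config: Optional[PCConfig] = None) -> PCConfig:
    config = config or PCConfig()
    # ... the algorithm would read everything it needs from `config` ...
    return config


# The 'average' user passes nothing; a power user overrides one nested field.
default_cfg = my_pc_algo(data=None)
custom_cfg = my_pc_algo(data=None, config=PCConfig(ci_test=CITestConfig(kernel="laplace")))
print(custom_cfg.ci_test.kernel)  # laplace
```

A design note on this sketch: the nested structure means adding a new KCI parameter later only touches `CITestConfig`, not the signature of every algorithm that uses it.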
Okay, that makes sense. I'll refactor the CI stuff to just functions then. And maybe eventually a separate repo altogether.
Are you ok w/ making that into a separate issue/PR?
Yea, sure. You can also keep the class structure for now and we can discuss this with the others first. I guess both (having it as a class or as a function) have their advantages and disadvantages. Maybe something for the next meeting?
```python
logger = logging.getLogger()


class PC(BaseConstraintDiscovery):
```
Wouldn't it make more sense to follow up on the effort to bring the 'causal-learn' package into PyWhy, instead of re-implementing the functionality?
These were refactored from my package, causal-networkx. From my experience, the causal-learn package was very hard to refactor into a working, extensible version. Moreover, the package lacked unit tests that assert behavior, so I would be wary of copy-pasting here directly without extensive testing.
Is there any reason this approach is not desired?
WDYT?
Just as background: not trying to be too opinionated btw. The package is great for doing some initial applications on the problem I was working on, but I was also working on extending structure learning and researching modifications, and it was very difficult for me to do so. Hence I re-implemented the abstractions from scratch, with insight from their package of course. I'm in favor of bringing the functionality they implement into the refactored API form that Robert proposed; otherwise we induce large amounts of technical debt.
I was more thinking along the lines that we plan to integrate the causal-learn package in the future anyway (at least, we are talking to the authors of the package), and then it might be better to try to refactor their package and make it compatible (rather than re-implementing methods that already exist in many other packages).
Also keep in mind that we cannot simply copy over code and refactor it; that would violate the license :)
I see! Makes sense if they are on board to try. I also took inspiration from their code. This implementation was to address the action item we discussed last time in the meeting: getting an end-to-end PC example using the graphs to be PRed into networkx.
It was difficult for me to parse https://github.com/cmu-phil/causal-learn/blob/main/causallearn/search/ConstraintBased/PC.py to see if we're missing anything in the core PC algorithm, so let me know. I think the PR's implementation follows the abstraction of the PC algorithm, so it should be very easy to add functionality at any stage of the algorithm if we are missing something. E.g. the abstraction has the following steps for all constraint-based algorithms:
- initialization
- learning a skeleton (this is delegated to a separate algorithm altogether, since there are many proposals in the literature for improving this process)
- orienting unshielded colliders
- orienting edges
For example, here it is demonstrated that one can easily implement Conservative/Max-Vote PC by overriding one method, the orientation of unshielded colliders (~80 LOC). The algorithm then works as expected locally (unit tests to be added).
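The staged abstraction described above suggests a template-method structure. A toy sketch (class and method names here are hypothetical, not the PR's actual code) of how a conservative-PC variant would override only the collider-orientation stage:

```python
class ConstraintDiscoverySketch:
    """Toy template for constraint-based discovery: fit() fixes the stage
    order, and subclasses override individual stages."""

    def fit(self, nodes):
        state = self.initialize(nodes)
        state = self.learn_skeleton(state)
        state = self.orient_unshielded_colliders(state)
        return self.orient_edges(state)

    def initialize(self, nodes):
        return {"nodes": list(nodes), "collider_rule": None, "oriented": False}

    def learn_skeleton(self, state):
        # A real implementation delegates to a dedicated skeleton algorithm.
        return state

    def orient_unshielded_colliders(self, state):
        state["collider_rule"] = "vanilla"  # plain PC behavior
        return state

    def orient_edges(self, state):
        state["oriented"] = True  # e.g. apply Meek's orientation rules here
        return state


class ConservativePCSketch(ConstraintDiscoverySketch):
    # Only one stage differs from vanilla PC.
    def orient_unshielded_colliders(self, state):
        state["collider_rule"] = "conservative"
        return state


result = ConservativePCSketch().fit(["A", "B", "C"])
print(result["collider_rule"])  # conservative
```

The value of this shape is exactly the claim in the comment above: a variant algorithm is a subclass that replaces one stage, while `fit()` keeps the overall stage order fixed.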
https://github.com/cmu-phil/causal-learn/blob/main/causallearn/search/ConstraintBased/PC.py does show that there is one algorithm missing: the mvpc algorithm for missing values. I think that involves modifying the skeleton procedure, IIUC, and then everything else should work as expected.
How does that sound?
Sounds good. It's also nothing crucial; I just want to make sure that you don't do double work if we can leverage other packages :) I think the PC algorithm is a good example to see how the required APIs could look.
I agree, and I'm happy to get this in now. The causal-learn team did indicate they would change the license, but we shouldn't block on them.
I would like to see this PR broken up though (see message below) so we can commit these parts piecemeal.
Thanks for this @adam2392. Can I trouble you to close this PR and break the changes on this branch into multiple PRs? The development features can just go into one PR that we can merge immediately. For the PC algorithm, it would be nice if we could break that up into a few chunks as well; for example, the conditional independence tests can be their own PR. This would make for a more readable commit history. Also, @darthtrevino is going to start working on dodiscover code and I'm going to be working with him; it would be easier for someone new to PC to see this code broken up into chunks.
Closing this for now, starting off new PRs with #15.
Summary

Closes: #10

Assuming one installs `graphs` and `pywhy_graphs` as is from their most up-to-date versions, this now has a working implementation of the PC algorithm that is in line with the API proposal. Then one can run the file `examples/plot_pc_alg.py`. This is a "huge" PR because it adds all the files needed to do basic book-keeping. But I can run through the example next meeting.

Features for development:

Features for API:
- `GraphProtocol` and `EquivalenceClassProtocol` as discussed
- `Context` class

What's next?

Documentation:

API:
- `Context` and `BaseConstraintDiscovery` classes: we'll want to generalize these and see if they still hold. E.g. implement FCI, which uses the Base, and score-based methods, which use the Context, and see if we're missing anything.

Misc.
pinging: @amit-sharma @petergtz @emrekiciman @bloebp @robertness
cc @robertness: any chance you can just give us org privileges for at least the CI portion? I feel bad continuing to ask, but can you activate CircleCI, GitHub Actions, and Codecov?