
[ENH] Adding base protocol and PC algorithm and example #11

Closed
wants to merge 5 commits

Conversation

@adam2392 (Collaborator) commented Aug 9, 2022

Summary

Closes: #10

Assuming one installs graphs and pywhy_graphs as-is from their most up-to-date versions, this now has a working implementation of the PC algorithm that is in line with the API proposal.

One can then run the file examples/plot_pc_alg.py. This is a "huge" PR because it adds all the files needed to do basic book-keeping, but I can run through the example next meeting.

Features for development:

  • black and flake8 for styling
  • sorting of imports (isort)
  • pydocstyle to check doc strings
  • code spell to check spelling
  • mypy to check typing
  • CI just needs someone in the pywhy org to enable it (is it possible to start creating teams to reduce friction here?)
  • unit test coverage will be generated via CI
  • docs will be built and deployed via CI

Features for API:

  • An extensible implementation of constraint-based discovery, skeleton learning, and the PC algorithm that does not require a specific graph class, but instead adds a GraphProtocol and EquivalenceClassProtocol as discussed
  • A basic implementation of the Context class
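As a rough sketch of what such a protocol could look like (the method names below are hypothetical, not the actual dodiscover API), typing.Protocol gives structural typing: any graph backend exposing matching methods is accepted without subclassing.

```python
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class GraphProtocol(Protocol):
    """Illustrative structural type; the real protocol's methods may differ."""

    def nodes(self) -> Iterable: ...
    def add_edge(self, u, v) -> None: ...
    def remove_edge(self, u, v) -> None: ...
    def neighbors(self, node) -> Iterable: ...


class TinyGraph:
    """Minimal adjacency-set graph; satisfies GraphProtocol structurally."""

    def __init__(self):
        self._adj = {}

    def nodes(self):
        return self._adj.keys()

    def add_edge(self, u, v):
        self._adj.setdefault(u, set()).add(v)
        self._adj.setdefault(v, set()).add(u)

    def remove_edge(self, u, v):
        self._adj[u].discard(v)
        self._adj[v].discard(u)

    def neighbors(self, node):
        return self._adj.get(node, set())


# No inheritance needed: the isinstance check is structural.
assert isinstance(TinyGraph(), GraphProtocol)
```

Algorithms written against such a protocol would then work with networkx graphs, pywhy_graphs classes, or any other backend exposing the same methods, which is the decoupling this PR is aiming for.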

What's next?

Documentation:

  • From a developer perspective, if our goal is to bring in more researchers/users/contributors, we need to properly document (i) how to use and (ii) how to extend the algorithms presented here. E.g. https://scikit-learn.org/stable/developers/develop.html is good inspiration. This should probably be a few short, self-contained examples and tutorials.

API:

  • To test the abstractions of the Context and BaseConstraintDiscovery classes, we'll want to generalize them and see if they still hold. E.g. implement FCI, which uses the base class, and score-based methods, which use the Context, and see if we're missing anything.

Misc.

pinging: @amit-sharma @petergtz @emrekiciman @bloebp @robertness

cc: @robertness any chance you can give us org privileges for at least the CI portion? I feel bad continuing to ask, but can you activate CircleCI, GitHub Actions, and codecov?

@adam2392 (Collaborator, Author) commented Aug 11, 2022

A note about the upstream dependency: the MixedEdgeGraph is subject to change, and therefore so are the causal graph classes in pywhy_graphs. Thankfully, the coupling is loose, and thanks to mypy and the graph protocols implemented here, this package should be relatively robust to those changes unless the underlying data structures returned differ.

@adam2392 (Collaborator, Author):

Another high-level note: currently there is no robust Python package dedicated to conditional independence (CI) testing. For now, I think it's best to keep it in-house. Eventually it makes sense to out-source it into another package, since dodiscover should be agnostic to which CI tests are used.

from .base import BaseConditionalIndependenceTest


class KernelCITest(BaseConditionalIndependenceTest):
Member:

Note that HSIC and KCI tests are also part of the GCM module: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/kernel.py

Collaborator Author:

Ooh nice! I was hoping to find time for the approximate version. In line with the functional API push, would it be desirable to refactor those into class objects with a required API, to make all functionality that relies on CI testing more robust?

Where would be the best place for CI tests to go?

Member:

I am not sure a class object is really necessary (at least, I don't see the benefit). In the GCM package, we ensure a compatible API by requiring unconditional independence tests to be functions of the form Callable[[np.ndarray, np.ndarray], float] and CI tests of the form Callable[[np.ndarray, np.ndarray, np.ndarray], float]. An example would be something like:

def hsic(X: np.ndarray, Y: np.ndarray, other_parameters...) -> float:
    # ... compute p-value ...
    return p_value

def kci(X: np.ndarray, Y: np.ndarray, Z: np.ndarray, other_parameters...) -> float:
    # ... compute p-value ...
    return p_value


def my_pc_algo(
    data: pd.DataFrame,
    independence_test: Callable[[np.ndarray, np.ndarray], float] = hsic,
    conditional_test: Callable[[np.ndarray, np.ndarray, np.ndarray], float] = kci,
    significance_level: float = 0.05,
):
    # ... do PC stuff ...
    if nodeC is None:
        p_value = independence_test(data[nodeA], data[nodeB])
    else:
        p_value = conditional_test(data[nodeA], data[nodeB], data[nodeC])

    reject = p_value < significance_level
    # ... do PC stuff ...

If you want to configure the tests further, you can simply do:

my_pc_algo(data, partial(hsic, my_hsic_parameter_a=0.1, kernel='laplace'))

Ok, just an idea to emphasize that we don't always need class objects for everything :)
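To make the callable-based pattern above concrete, here is a runnable miniature that uses a toy correlation-based check as a stand-in for hsic/kci (the function names and the pseudo p-value are purely illustrative, not a statistically valid test):

```python
import numpy as np


def corr_independence_test(x: np.ndarray, y: np.ndarray) -> float:
    """Toy stand-in for a real test such as HSIC: maps |correlation| to a
    pseudo 'p-value'. Illustrative only, not statistically valid."""
    r = abs(np.corrcoef(x, y)[0, 1])
    return float(1.0 - r)


def run_edge_test(x, y,
                  independence_test=corr_independence_test,
                  significance_level=0.05) -> bool:
    """Mirrors how a PC-style loop would call a pluggable test callable."""
    p_value = independence_test(x, y)
    return p_value < significance_level  # True => 'dependent' (edge kept)


rng = np.random.default_rng(0)
x = rng.normal(size=500)
assert run_edge_test(x, 2 * x + 0.01 * rng.normal(size=500))  # dependent
assert not run_edge_test(x, rng.normal(size=500))             # independent
```

Swapping in a different test is then just passing a different callable (or a functools.partial with pre-bound parameters), which is the configurability being discussed.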

Member:

Btw., generally I like the idea of moving the independence tests into a separate package. I think they can also be very helpful outside of the 'causality' context, and they are often required as core functions in several algorithms (e.g. graph validation).

Collaborator Author:

Yeah, agreed. Perhaps CI testing is something to discuss next time, then?

Collaborator Author:

Re: class vs. no class: I started out with a functional approach too, but I realized I needed to pass a large number of parameters to the structure learning algorithm, so it became really clunky. That was probably my main motivation, but I agree (right now) that there's not much difference, except ease of reading and typing. I'm happy to switch back to functions if that's desired.

E.g.

pc = PC(
    ... PC kwargs ...
    ci_test=kci,
    ... ci_test kwargs ...
)

I think right now the PC algorithm as I have it has 7 parameters. If we also pass the CI test parameters, say for KCI, which has probably ~9 parameters, the PC algorithm has 16+ parameters overall. I think eventually, it might be useful

Member:

Yeah, but we will typically have default parameters, and then in practice only a fraction of users would really change more than one parameter overall :) In the 'average' use case, a user would not change anything and would stick to the defaults. Btw., another way is to create a config object that can be passed (just some ideas :)).
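The config-object idea mentioned above could look something like the following dataclass sketch (all class names, fields, and defaults here are hypothetical, purely to illustrate grouping the many PC and CI-test parameters):

```python
from dataclasses import dataclass, field


@dataclass
class KCIConfig:
    # Hypothetical KCI test knobs.
    kernel: str = "rbf"
    n_approx: int = 1000


@dataclass
class PCConfig:
    # Hypothetical PC-level knobs, nesting the CI-test config.
    alpha: float = 0.05
    max_cond_set_size: int = 3
    ci_config: KCIConfig = field(default_factory=KCIConfig)


# A typical user touches one field; everything else stays default.
cfg = PCConfig(alpha=0.01)
assert cfg.alpha == 0.01 and cfg.ci_config.kernel == "rbf"
```

This keeps the algorithm's signature small (one config argument) while still exposing every knob, which is one way to sidestep the 16+-parameter problem raised above.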

Collaborator Author:

Okay, that makes sense. I'll refactor the CI stuff to just functions then. And maybe eventually a separate repo altogether.

Collaborator Author:

Are you okay with making that a separate issue/PR?

Member:

Yeah, sure. You can also keep the class structure for now and we can discuss this with the others first. I guess both (class or function) have their advantages and disadvantages. Maybe something for the next meeting?

logger = logging.getLogger()


class PC(BaseConstraintDiscovery):
Member:

Wouldn't it make more sense to follow up on the effort to bring the 'causal-learn' package into PyWhy instead of re-implementing the functionality?

@adam2392 (Collaborator, Author) commented Aug 11, 2022:

These were refactored from my package, causal-networkx. From my experience, the causal-learn package was very hard to refactor into a working, extensible version. Moreover, the package lacked unit tests that assert behavior, so I would be wary of copy-pasting here directly without extensive testing.

Is there any reason this approach is not desired?

WDYT?

Just as background: I'm not trying to be too opinionated, btw. The package was great for doing some initial applications on the problem I was working on, but when I was extending structure learning and researching modifications, it was very difficult to do so. Hence I re-implemented the abstractions from scratch, with insight from their package of course. I'm in favor of bringing the functionality they implement into the refactored API form that Robert proposed; otherwise we incur large amounts of technical debt.

Member:

I was more thinking along the lines that we plan to integrate the causal-learn package in the future anyway (at least, we are talking to the authors of the package), and then it might be better to try to refactor their package and make it compatible, rather than re-implementing methods that already exist in many other packages.

Also keep in mind that we cannot simply copy over code and refactor it; that would violate the license :)

Collaborator Author:

I see! Makes sense if they are on board to try. I also took inspiration from their code. This implementation was to address the action item we discussed last time in the meeting: getting an end-to-end PC example using the graphs to be PRed into networkx.

It was difficult for me to parse https://github.com/cmu-phil/causal-learn/blob/main/causallearn/search/ConstraintBased/PC.py to see if we're missing anything in the core PC algorithm, so let me know. I think this PR's implementation follows the abstraction of the PC algorithm, so it should be very easy to add functionality at any stage of the algorithm if we are missing something. E.g., the abstraction has the following steps for all constraint-based algorithms:

  1. initialization
  2. learning a skeleton (this is delegated to a separate algorithm altogether, since there are many proposals in the literature for improving this process)
  3. orienting unshielded colliders
  4. orienting edges

(For example, here it is demonstrated that one can easily implement Conservative/MaxVote PC by overriding one method, the orientation of unshielded colliders, in about 80 LOC.) The algorithm then works as expected locally (unit tests to be added).
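The four-step abstraction above can be sketched as a template method, where a variant such as Conservative PC overrides only one stage (class and method names here are illustrative, not the PR's actual API):

```python
class BaseConstraintDiscovery:
    """Illustrative template: fit() fixes the pipeline; variants override stages."""

    def fit(self, data):
        graph = self.learn_skeleton(data)                # step 2
        graph = self.orient_unshielded_colliders(graph)  # step 3
        graph = self.orient_edges(graph)                 # step 4
        return graph

    def learn_skeleton(self, data):
        # Placeholder: a real implementation prunes edges via CI tests.
        return {"edges": set(), "sep_sets": {}}

    def orient_unshielded_colliders(self, graph):
        graph["collider_rule"] = "standard"
        return graph

    def orient_edges(self, graph):
        # Placeholder for Meek-style orientation rules.
        return graph


class ConservativePC(BaseConstraintDiscovery):
    # Conservative PC differs only in how unshielded colliders are
    # oriented, so only that one method is overridden.
    def orient_unshielded_colliders(self, graph):
        graph["collider_rule"] = "conservative"
        return graph


print(ConservativePC().fit(data=None)["collider_rule"])  # -> conservative
```

Because the pipeline lives in the base class, adding a new variant (or a modified skeleton procedure, as MVPC would need) means overriding a single stage rather than re-implementing the whole algorithm.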

https://github.com/cmu-phil/causal-learn/blob/main/causallearn/search/ConstraintBased/PC.py does show that there is one algorithm missing: the MVPC algorithm for missing values. I think that involves modifying the skeleton procedure, IIUC, and then everything else should work as expected.

How does that sound?

Member:

Sounds good. It's also nothing crucial; I just want to make sure that you don't do double work if we can leverage other packages :) I think the PC algorithm is a good example to see what the required APIs could look like.

Collaborator:

I agree, and I'm happy to get this in there now. The causal-learn team did indicate they would change the license, but we shouldn't block on them.

I would like to see this PR broken up, though (see message below), so we can commit these parts piecemeal.

@robertness (Collaborator):

Thanks for this, @adam2392. Can I trouble you to close this PR and break the changes on this branch into multiple PRs? The development features can go into one PR that we can merge immediately.

For the PC algorithm, it would be nice if we could break that up into a few chunks as well; for example, the conditional independence tests can be their own PR. This would make for a more readable commit history. Also, @darthtrevino is going to start working on dodiscover code and I'm going to be working with him; it would be easier for someone new to PC to see this code broken up into chunks.

@robertness (Collaborator) commented Aug 12, 2022

Closing this for now, starting off new PRs with 15

Successfully merging this pull request may close these issues.

Implement PC algorithm following API
3 participants