We welcome new contributions to `scipeds`! Our goal is to make this package a more useful resource for people working with IPEDS data, and we cannot accomplish that without support from contributors. This guide will detail how you can contribute, whether you are requesting a new feature or adding functionality to `scipeds`.
!!! note
    We know that contributing to open-source projects can feel intimidating! We want contributing to be accessible and not scary, so if this guide feels overwhelming we encourage you to reach out by starting a new [discussion](https://github.com/scienceforamerica/scipeds/discussions/new/choose). We are humans on the other end, and will be excited to get back to you and figure out how we can help get you started.
All contributors must follow the code of conduct.
To report a bug, file an issue. Please include as much context as possible (e.g., the line of code where `scipeds` failed, the OS and Python version you are using, and any error message or traceback detailing the failure).
To request a new feature be added to `scipeds`, start a new [discussion](https://github.com/scienceforamerica/scipeds/discussions/new/choose).
This section details how to develop new features for `scipeds`, with some additional information about code architecture and design considerations.
To add new functionality to `scipeds`, it is useful to have the source code working on your own machine. This section contains details for how to get `scipeds` working locally.
This project uses a `Makefile` to streamline development. To make these steps easier to run, make sure that `make` is installed on your system.
- Clone the repo and create a new Python environment. If you have conda, you can use `make create_environment`; otherwise, use your virtual environment creation tool of choice.
- Install the requirements (`make requirements`).
- Run the tests (`make test`).
- Build the documentation (`make docs`).
- Serve the documentation locally (`make docs-serve`).
- Download the data.
    - To download the pre-processed duckdb, run `scipeds download_db`.
    - If you want the raw data files, you can download them from Science for America's cloud storage (`make download-raw`) or directly from IPEDS (`make download-raw-from-ipeds`). You can then process the raw data files into a processed duckdb (`make process`).
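Once you have a database, a quick sanity check is to open it directly with the `duckdb` Python package. This is a minimal sketch; the database path below is a placeholder, so point it at the duckdb file in your `SCIPEDS_CACHE_DIR` (or at the output of `make process`):

```python
import duckdb

# Placeholder path -- substitute the duckdb file in your SCIPEDS_CACHE_DIR
# (for the downloaded database) or the `make process` output location
con = duckdb.connect("path/to/scipeds.duckdb", read_only=True)

# List the tables in the database
print(con.execute("SHOW TABLES").df())
```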
The `scipeds` repo actually contains two different things:

- an application (in `pipeline/`) for downloading and processing raw data from IPEDS into a duckdb database
- a library (in `scipeds/`) to make it easy to query IPEDS data from Python
Contributions to `scipeds` will generally fall into two categories: adding new queries of existing IPEDS data (which currently contains just completions data and institutional directory information) and adding new data sources to the duckdb database.
The basis for `scipeds` queries is the `IPEDSQueryEngine`. This base class connects to the pre-processed duckdb and returns the results of various queries. Sets of queries for specific tables or purposes are factored out into their own subclasses. For example, the `CompletionsQueryEngine` in `scipeds/data/completions.py` offers several built-in queries for aggregations that the authors used frequently to explore completions data. The universe of possible queries is quite large (which is why you can simply write your own SQL query using the `fetch_df_from_query` method of the `IPEDSQueryEngine`), but if that gets tiresome you might want to add a new function to an existing class or build a new class altogether.
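For example, here is a minimal sketch of running your own SQL through `fetch_df_from_query`. We assume the engine constructor takes no required arguments and connects to the default pre-processed database:

```python
from scipeds.data.completions import CompletionsQueryEngine

engine = CompletionsQueryEngine()  # assumes the default database location

# Any valid duckdb SQL against the pre-processed tables works here
df = engine.fetch_df_from_query(
    "SELECT COUNT(*) AS n_institutions FROM ipeds_directory_info"
)
print(df)
```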
The process for adding queries is relatively straightforward:
- If you are creating a new query engine (for example, because you are adding a new IPEDS survey component to `scipeds`), create a new file and create your class as a subclass of the `IPEDSQueryEngine`. If you are adding a query to an existing engine, add a new function for your query to the engine.
- Create a template for your SQL query in the appropriate place. For example, for the completions queries in `CompletionsQueryEngine`, the longer queries are class attributes for neatness and shorter queries live with their corresponding functions. Queries can use a mix of Python string formatting and [duckdb prepared statements](https://duckdb.org/docs/sql/query_syntax/prepared_statements.html) to inject variables into the queries (see the sketch after this list).
- Write a test for your function that correctly asserts that the data returned by your query matches what you expect from running the query on the test database (generated by running `python pipeline/db.py write-test-db`). Your tests should be added in the `tests/` folder located closest to your code changes (create one if it does not exist).
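To make these steps concrete, here is a minimal sketch of a new engine with one query method. The engine name, table, columns, and the import path for `IPEDSQueryEngine` are all illustrative assumptions; only the subclassing pattern and `fetch_df_from_query` come from the existing code.

```python
import pandas as pd

from scipeds.data.engine import IPEDSQueryEngine  # import path assumed


class AdmissionsQueryEngine(IPEDSQueryEngine):
    """Hypothetical engine for a (not-yet-added) admissions table."""

    # Longer query templates live as class attributes, mirroring
    # CompletionsQueryEngine; the table and column names are placeholders.
    YEARLY_TOTALS_QUERY = """
        SELECT year, SUM(applicants) AS total_applicants
        FROM ipeds_admissions
        WHERE year BETWEEN {first_year} AND {last_year}
        GROUP BY year
        ORDER BY year
    """

    def yearly_totals(self, first_year: int, last_year: int) -> pd.DataFrame:
        """Return one row per year with the total number of applicants."""
        query = self.YEARLY_TOTALS_QUERY.format(
            first_year=int(first_year), last_year=int(last_year)
        )
        # fetch_df_from_query comes from the base class
        return self.fetch_df_from_query(query)
```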
Where possible, please use the set of models, conventions, and options that exist for the current set of queries. For example, use the existing `QueryFilters` model (or extend it) to filter data by race/ethnicity, year, etc., and use the existing `FieldTaxonomy` columns and `TaxonomyRollup` model to aggregate across fields within a particular field taxonomy.
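Constructing filters might look like the following sketch; the import path and field names are assumptions, so check the model definition for the actual options:

```python
from scipeds.data.queries import QueryFilters  # import path assumed

# Field names are illustrative; see the QueryFilters model for the
# actual filter options (years, race/ethnicity, etc.)
filters = QueryFilters(first_year=2015, last_year=2020)
```

Built-in query methods accept a filters object like this so that every query slices the data in a consistent way.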
IPEDS has lots of different survey components, and we only include a small fraction of them here. We would very much like to expand the set of data that this package covers!
If you are interested in adding a new data source but aren't sure where to start or find the following instructions confusing, please reach out! We are more than happy to work with you through the process.
To add a new data source, please follow these instructions so the maintainers can easily reproduce your work:
- Identify the data source you'd like to add and come up with an easily identifiable and SQL-table-friendly name to add to `scipeds/constants.py`. For example, the completions table is called `ipeds_completions_a` and the directory info table is called `ipeds_directory_info`.
- Add code to `pipeline/download.py` that downloads your data directly from IPEDS to local disk.
- Create a new script for your source of data in `pipeline/` that converts the data to a series of interim CSV files that can be read into duckdb. See the existing completions (`pipeline/completions.py`) and directory info (`pipeline/institutions.py`) scripts for examples.
- Add code to `pipeline/process.py` that reads your interim CSVs into a duckdb table. Try to minimize the size of your table where possible by using `ENUM`s or smaller data types (see completions and directory info processing for examples, and the sketch after this list).
- Add queries for your new table.
- Add code to `pipeline/db.py` for adding fake data for your new table to the test database.
- Generate the new test database and add tests where appropriate for any new queries or other functionality.
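The processing step might look something like this sketch, which uses plain `duckdb` SQL; the file paths, table name, columns, and `ENUM` values are all placeholders:

```python
import duckdb

# Output path is a placeholder -- `make process` writes to the processed
# data directory in your DATA_DIR
con = duckdb.connect("data/processed/scipeds.duckdb")

# An ENUM keeps a repeated string column much smaller than VARCHAR
con.execute(
    "CREATE TYPE award_level AS ENUM ('certificate', 'associates', 'bachelors')"
)

# Read the interim CSVs into a new table, casting to compact types
con.execute("""
    CREATE OR REPLACE TABLE ipeds_my_new_table AS
    SELECT
        unitid::INTEGER AS unitid,
        year::SMALLINT AS year,
        award::award_level AS award
    FROM read_csv_auto('data/interim/my_new_source_*.csv')
""")
```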
!!! note
    By default, `scipeds` will use the database in the `SCIPEDS_CACHE_DIR` for query engines. As you are developing a new pipeline, `make process` outputs data to the interim and processed data directories in the `DATA_DIR`. Make sure you are querying the correct database.
Once your changes are ready to share:

- Confirm that you are fixing an existing issue or adding a requested feature. If an issue or discussion doesn't exist for your desired contribution, create it!
- Fork the repo, clone it locally, and follow the instructions for local development above.
- Run formatting (`make format`) and linting (`make lint`).
- Submit your PR!