Skip to content

Commit

Permalink
Create gsoc-ideas.md
Browse files Browse the repository at this point in the history
Signed-off-by: Sean P. Goggins <outdoors@acm.org>
  • Loading branch information
sgoggins authored Feb 11, 2025
1 parent 47884ca commit b2f67b0
Showing 1 changed file with 56 additions and 0 deletions.
56 changes: 56 additions & 0 deletions gsoc-ideas.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@

## Idea: Enhance Conversational Topic Modelling Capabilities in CHAOSS Software

**Hours: 350**

[Micro-tasks and place for questions](https://github.com/chaoss/augur/issues/1640)

This project will add GenSIM logic, and other capabilities to the Clustering Worker inside of Augur Software, and be extended into a generalized Open Source Software Conversational Topic Modeling Instrument.

CHOASS/augur has several workers that store machine learning information derived from computational linguistic analysis of data in the `message` table. The message table includes messages from issue, pull request, pull request review, and email messages. They are related to their origin with bridge tables like `pull_request_message_ref`. The ML/CL workers are all run against all the messages, regardless of origin.

1. Clustering Worker (clusters created and topics modeled)
2. message analysis worker (sentiment and novelty analysis)
3. discourse analysis worker (speech act classification (question, answer, approval, etc.)

Clustering Worker Notes:

Clustering Worker: 2 Models.
- Models:
- Topic modeling, but it needs a better way of estimating number of topics.
- Tables
- repo_topic
- topic_words
- Computational linguistic clustering
- Tables
- repo_cluster_messages
- Key Needs
- Add GenSim algorithms to topic modeling section https://github.com/chaoss/augur/issues/1199
- The topics, and associated topic words need to be persisted after each run. At the moment, the topic words get overwritten for each topic modeling run.
- Description/optimization of the parameters used to create the computational linguistic clusters.
- Periodic deletion of models (heuristic: If 3 months pass, OR there’s a 10% increase in the messages, issues, or PRs in a repo, rebuild the models)
- Establish some kind of model archiving with appropriate metadata (lower priority)

Discourse Analysis Worker Notes:

discourse_insights table (select max(data_collection_date) for each msg_id)
- sequence is reassembled from the timestamp in the message table (look at msg_timestamp)
- issues_msg_ref, pull_request_message_ref, pull_request_review_msg_ref

Message Analysis Worker
- message_analysis
- message_analysis_summary

<img width="1159" alt="augur-tech" src="https://user-images.githubusercontent.com/379847/124799236-f440dc80-df19-11eb-84ce-302cf274884f.png">

The aims of the project are as follows:
- Advance topic modeling of open source software conversations captured in GitHub.
- Integrate this information into clearer, more parsimonious CHAOSS metrics.
- Automate the management machine learning insights, and topic models over time.
- (Stretch Goal) Improve the operation of the overall machine learning insights pipeline in CHAOSS/augur, and generalize these capabilities.


* _Difficulty:_ Medium
* _Requirements:_ Interest in software analytics. Python programming. Conceptual understanding of machine learning, and an eagerness to learn maching learning, and SQL knowledge.
* _Recommended:_ Experience with Python
* _Mentors:_ Sean Goggins, Andrew Brain, Isaac Milarsky

0 comments on commit b2f67b0

Please sign in to comment.