Tagline: Intelligent Static, Dynamic & Real-time Malware Analysis with Explainability.
Note: Do not run this directly on your host system. The test_sample files contain live, harmful malware. Use a VM environment for dynamic and hybrid analysis.
- The Problem Cerebrus Solves
- Our Solution: Cerebrus
- Key Features
- Architecture
- Technology Stack
- Installation
- Configuration
- Usage
- Challenges We Ran Into
- Future Work
- Contributing
- License
- Acknowledgements
In the ever-evolving cybersecurity landscape, traditional malware detection methods, often relying solely on known signatures, are struggling to keep pace with increasingly sophisticated threats. Malware is becoming more diverse (ransomware, trojans, spyware, worms, fileless attacks) and targets a wide array of file types (.exe, .dll, .pdf, Office documents, scripts, etc.), and the sheer volume of new strains, including zero-day attacks with no prior signatures, is overwhelming.
Key challenges faced by security teams and developers include:
- Zero-Day Attacks: Signature-based detection is fundamentally ineffective against malware that hasn't been seen before.
- Manual Analysis Bottleneck: Security analysts are bombarded with a huge volume of files requiring investigation. Manual static and dynamic analysis is time-consuming, requires significant expertise, and doesn't scale effectively.
- Lack of Transparency & Trust: Many automated tools flag potential malware but operate as "black boxes," failing to explain why a file is deemed malicious. This makes it difficult to trust the output, differentiate sophisticated threats from false positives, and take confident remediation actions.
Cerebrus aims to bridge these gaps by providing a comprehensive, AI-powered solution that enhances detection capabilities, improves efficiency, and builds trust through explainability.
Cerebrus is an intelligent malware analysis framework designed to provide multi-layered defense against modern cyber threats. It integrates static analysis, machine learning, external threat intelligence, and explainable AI (XAI) to deliver a robust platform for detecting known and unknown malware across various file formats.
It empowers security analysts, incident responders, and developers to:
- Triage files quickly and safely: Gain significant insights through static analysis before execution.
- Detect novel threats: Leverage machine learning models trained on deep file features, moving beyond simple signature matching.
- Understand the verdict: Utilize XAI features to see why the AI classified a file as malicious or benign.
- Automate analysis: Use batch processing and API capabilities for large-scale scanning.
- (Experimental) Monitor proactively: Employ real-time monitoring to detect threats as they appear on the system.
Cerebrus combines several powerful techniques for comprehensive analysis:
- AI-Driven Static Analysis:
  - Deep Feature Extraction: Performs in-depth static analysis without executing the file. Extracts a rich set of features tailored to the file type (currently a strong focus on PE files, .exe/.dll, with a modular design for extension):
    - PE Files: Detailed parsing of the DOS, File, and Optional Headers; section analysis (name patterns, entropy min/max, raw vs. virtual size, RWX permissions); Import Table analysis (suspicious API/DLL counts); Export Table analysis; resource analysis (count, entropy). Features are structured to align with common malware research datasets (e.g., the dataset_malwares.csv structure). (Logic is primarily in the StaticPEAnalyzer class.)
    - General Files: File size, accurate type identification (python-magic), hashes (MD5, SHA1, SHA256, SSDeep fuzzy hash), overall Shannon entropy.
    - String Analysis: Extracts ASCII/Unicode strings; calculates count, average length, and string block entropy; identifies suspicious patterns (URLs, IPs, file paths, registry keys, common shell commands, suspicious API names, long Base64 strings).
  - Benefit: Enables risk assessment and feature gathering before running potentially harmful code, making triage safer and faster.
- Machine Learning Detection:
  - Utilizes a Random Forest Classifier (trained via model_training.ipynb and saved using joblib) on the extracted static features.
  - Learns complex relationships between features to identify malicious patterns missed by traditional signatures.
  - Effective against known malware families and has the potential to flag unknown variants exhibiting similar static characteristics.
  - Provides a Malware/Benign prediction and a Confidence Score (predict_proba).
- Dynamic Analysis (do NOT run on your host system; use a VM):
  - Includes modules (dynamic_analysis.py) demonstrating the capability to run executable files and scripts in a controlled, isolated environment (requires a user-configured sandbox/VM such as VirtualBox).
  - Designed to capture behavioral indicators such as process activity (creation, resource usage via psutil), file system modifications, registry changes (Windows), and network connections.
  - Benefit: Detects malware using obfuscation, packing, or polymorphism that might evade static-only analysis. (Note: Full sandbox integration and log parsing require further development/configuration.)
- Explainable AI (XAI):
  - Integrates SHAP (SHapley Additive exPlanations) via model_explainer.py to interpret the Machine Learning model's predictions.
  - Shows which specific static features most influenced the classification (positive or negative contribution).
  - Generates visualizations (e.g., waterfall plots) for easy understanding.
  - Provides global feature importance analysis to understand the model's overall decision logic.
  - Benefit: Builds trust in the AI detection, aids analysts in verifying findings, helps debug the model, and provides clear justification for actions taken.
- External Threat Intelligence:
  - Uses the VirusTotal API (vt_api.py) to check the file's SHA256 hash against its vast database.
  - Provides immediate context: Is this hash known malware? How many AV engines detect it?
  - Benefit: Quickly identifies known threats and leverages community intelligence.
- Signature Scanning:
  - Integrates with ClamAV (via subprocess calls) to perform traditional signature-based scanning.
  - Benefit: Catches common, known malware efficiently using an established open-source engine and database.
- Real-Time Monitoring (Experimental):
  - Offers a file system (watchdog) and process (psutil) monitoring module (realtime_monitor.py).
  - Detects new/modified files in specified directories and identifies potentially suspicious process behavior (e.g., high resource usage).
  - Queues detected items for analysis using other Cerebrus components.
  - Benefit: Enables proactive detection of threats as they emerge on a monitored system. (Note: Requires careful configuration and runs as a separate process/service.)
- Flexible Interfaces:
  - Command-Line Tools: For prediction (predict_file.py), batch processing (run_batch_test.py, scan_and_report.py), and model evaluation (model_explainer.py, test_model_on_subset.py).
  - Web Application: A Flask-based UI (app.py) for user-friendly file uploads and viewing results.
  - API Endpoints: app.py and run_batch_test.py --api allow programmatic integration with other security tools or workflows.
- Modular Design: Built with distinct Python modules/classes for different functionalities (static analysis, dynamic analysis, ML prediction, explanation, external checks, UI), facilitating maintenance and future extensions (e.g., adding analyzers for PDF, Office, scripts).
Cerebrus employs a multi-stage pipeline to analyze files:
- Core Language: Python 3
- Machine Learning: Scikit-learn (RandomForestClassifier), Joblib, Pandas, NumPy, imbalanced-learn
- Explainability: SHAP
- Static Analysis: PEFile, python-magic, SSDeep, standard libraries (re, hashlib, math)
- External Checks: Requests (VirusTotal API), Subprocess (ClamAV CLI)
- Web Framework: Flask (app.py)
- Real-Time Monitoring: Watchdog, psutil
- System Tools: ClamAV (External Dependency)
- Visualization: Matplotlib, Seaborn (primarily in model_explainer.py)
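As a rough illustration of how the standard-library part of this stack (re, hashlib, math) can produce the general-file and string features listed under Key Features, here is a minimal sketch. The function names, regular expressions, and the 4-byte minimum string length are assumptions made for this example and are not taken from the Cerebrus source; SSDeep and python-magic are left out because they need third-party packages.

```python
import hashlib
import math
import re
from collections import Counter

# Hypothetical patterns for illustration only; the project's real rules may differ.
ASCII_STRING_RE = re.compile(rb"[\x20-\x7e]{4,}")   # printable ASCII runs of 4+ bytes
URL_RE = re.compile(rb"https?://[^\s'\"<>]+")
IPV4_RE = re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for empty input, at most 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def general_file_features(data: bytes) -> dict:
    """Compute size, hashes, entropy, and simple string statistics for raw bytes."""
    strings = ASCII_STRING_RE.findall(data)
    avg_len = sum(len(s) for s in strings) / len(strings) if strings else 0.0
    return {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "entropy": shannon_entropy(data),
        "string_count": len(strings),
        "avg_string_length": avg_len,
        "url_count": len(URL_RE.findall(data)),
        "ipv4_count": len(IPV4_RE.findall(data)),
    }
```

The file is only read as bytes and never executed, which is what makes this kind of triage safe to run before any sandboxing. High entropy (close to 8 bits per byte) is often a hint of packing or encryption, though on its own it is not a verdict.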
- Prerequisites:
  - Python 3.8+
  - pip and venv (usually included with Python)
  - Git
  - ClamAV: Install the ClamAV engine and signature updater.
    - Debian/Ubuntu:
        sudo apt-get update && sudo apt-get install -y clamav clamav-daemon p7zip-full build-essential libmagic1
    - Fedora/CentOS:
        sudo dnf install -y clamav clamav-update p7zip p7zip-plugins gcc-c++ file-devel
    - macOS (Homebrew):
        brew install clamav
    - Windows: Download from the ClamAV website or use via WSL. Ensure clamscan is in PATH.
  - Update ClamAV Database: Crucial! Run sudo freshclam (Linux/macOS) or the equivalent update command. This may need to be run periodically.
- Clone Repository:
    git clone https://github.com/YOUR_USERNAME/Cerebrus.git
    cd Cerebrus
  (Replace with your actual repository URL)
- Set up Virtual Environment: (Recommended)
    python3 -m venv venv
    source venv/bin/activate   # Linux/macOS
    # venv\Scripts\activate    # Windows
- Install Python Dependencies:
    pip install -r requirements.txt
- Download/Train ML Model: Ensure the pre-trained model file (e.g., ML_model/balanced_malwareclassifier.pkl) exists. If not, you may need to run the training script (improved_model_training.py) using an appropriate dataset (like Dataset/dataset_malwares.csv if available).
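Once ClamAV is installed and its database updated, the following sketch shows one way to check that clamscan is reachable on PATH and to map its exit code to a verdict from Python. It relies on clamscan's documented exit codes (0 = no virus found, 1 = virus found, 2 = error); the function names and the 120-second timeout are assumptions for this example, not the wrapper Cerebrus actually uses.

```python
import shutil
import subprocess

# clamscan exit codes per its man page: 0 = no virus found, 1 = virus(es) found, 2 = error.
CLAMSCAN_VERDICTS = {0: "clean", 1: "infected", 2: "error"}


def interpret_clamscan_exit(code: int) -> str:
    """Translate a clamscan exit code into a verdict; unknown codes count as errors."""
    return CLAMSCAN_VERDICTS.get(code, "error")


def clamscan_file(path: str, timeout: int = 120) -> str:
    """Scan one file; returns 'clean', 'infected', 'error', or 'unavailable'."""
    exe = shutil.which("clamscan")
    if exe is None:
        # ClamAV is not installed or not on PATH.
        return "unavailable"
    try:
        proc = subprocess.run(
            [exe, "--no-summary", path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "error"
    return interpret_clamscan_exit(proc.returncode)
```

A result of "unavailable" usually points to a PATH problem, while "error" on every file often means the signature database has not been downloaded yet with freshclam.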
- VirusTotal API Key: (Optional, for enhanced threat intelligence)
  - Create a config.json file in the root directory:
      { "virustotal_api_key": "YOUR_ACTUAL_VIRUSTOTAL_API_KEY" }
  - Replace "YOUR_ACTUAL_VIRUSTOTAL_API_KEY" with your key obtained from VirusTotal.
  - Alternatively, set the environment variable VT_API_KEY.
- Model Path: The scripts generally expect the model at ML_model/balanced_malwareclassifier.pkl. Adjust paths in the scripts (MODEL_PATH variables) if your model is located elsewhere.
- Analysis Preferences: Modify config.json (as used by batch_processor.py) to customize which analysis modules run for different file types.
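The two ways of supplying the VirusTotal key described above can be resolved with a few lines of Python. This is a sketch only: the function name and the precedence (config.json first, then the VT_API_KEY environment variable) are assumptions for illustration, and vt_api.py may resolve the key differently.

```python
import json
import os
from pathlib import Path
from typing import Optional

PLACEHOLDER = "YOUR_ACTUAL_VIRUSTOTAL_API_KEY"


def load_vt_api_key(config_path: str = "config.json") -> Optional[str]:
    """Return the key from config.json if set, else from VT_API_KEY, else None."""
    path = Path(config_path)
    if path.is_file():
        try:
            config = json.loads(path.read_text(encoding="utf-8"))
            key = config.get("virustotal_api_key")
        except (json.JSONDecodeError, AttributeError):
            # Malformed JSON, or the top level is not an object.
            key = None
        # Ignore the untouched placeholder from the template.
        if key and key != PLACEHOLDER:
            return key
    # Assumed fallback order: the environment variable is consulted second.
    return os.environ.get("VT_API_KEY") or None
```

Returning None instead of raising lets the rest of the pipeline skip the VirusTotal lookup when no key is configured, which matches the key being optional.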
Quickly analyze a single PE file (.exe, .dll).
python predict_file.py /path/to/your/file.exe
Output:
Analyzing file: file.exe
Model loaded successfully from ML_model/balanced_malwareclassifier.pkl
Extracted 77 features
Features aligned to match model requirements (77 features)
==================================================
PREDICTION RESULT:
==================================================
File: file.exe
Prediction: MALWARE
Confidence: 98.50%
==================================================
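The "Features aligned to match model requirements" and "Confidence" lines in this output can be understood with the sketch below. It is not the code in predict_file.py: the helper names are hypothetical, missing features are assumed to be filled with 0.0, and the two probabilities are assumed to be ordered as (benign, malware).

```python
from typing import Dict, List, Sequence


def align_features(
    extracted: Dict[str, float],
    expected: Sequence[str],
    fill: float = 0.0,
) -> List[float]:
    """Order features as the model expects; fill missing ones, drop unknown ones."""
    return [float(extracted.get(name, fill)) for name in expected]


def format_verdict(proba_benign: float, proba_malware: float) -> str:
    """Turn two class probabilities into the label and confidence lines."""
    label = "MALWARE" if proba_malware >= proba_benign else "BENIGN"
    confidence = max(proba_benign, proba_malware) * 100
    return f"Prediction: {label}\nConfidence: {confidence:.2f}%"


# With a fitted scikit-learn model, the inputs would typically come from
# something like (assumption, not verified against predict_file.py):
#   expected = list(model.feature_names_in_)
#   proba_benign, proba_malware = model.predict_proba([row])[0]
```

Alignment matters because a tree ensemble indexes features by position, so a row with columns in a different order would be scored silently and incorrectly.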
Analyze multiple files or directories. Creates detailed JSON reports and a summary CSV in batch_results/.
# Analyze all files in a directory (recursive)
python run_batch_test.py -d /path/to/samples --recursive
# Analyze only DLL files in a directory (non-recursive)
python run_batch_test.py -d /path/to/dlls --file_patterns "*.dll" --no-recursive # Assuming --no-recursive flag exists
# Analyze a single file via batch processor logic
python run_batch_test.py -f /path/to/single/file.exe
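To make the effect of the directory, pattern, and recursion options concrete, here is a small sketch of how such a file selection could work. It is an assumption about the behaviour, not the logic in run_batch_test.py; the function name is hypothetical, and matching is made case-insensitive here so that FILE.DLL also matches "*.dll".

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Iterable, Iterator


def iter_sample_files(
    directory: str,
    patterns: Iterable[str] = ("*",),
    recursive: bool = True,
) -> Iterator[Path]:
    """Yield files under `directory` whose names match any of the glob patterns."""
    root = Path(directory)
    lowered = tuple(p.lower() for p in patterns)
    candidates = root.rglob("*") if recursive else root.glob("*")
    for path in sorted(candidates):
        # Skip directories; compare lowercased names for case-insensitive matching.
        if path.is_file() and any(fnmatch(path.name.lower(), p) for p in lowered):
            yield path
```

For example, `list(iter_sample_files("/path/to/dlls", ["*.dll"], recursive=False))` would return only the DLLs directly inside that directory.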
Provides a user-friendly web interface.
- Start Server:
    # Development:
    flask run --host=0.0.0.0
    # Production (example with Waitress):
    waitress-serve --host=0.0.0.0 --port=5000 app:app
- Access: Open http://<your-server-ip>:5000 in a browser.
- Features: Upload single files, view analysis results, potentially view batch results and real-time status (depending on app.py implementation).
Used for evaluating the trained ML model and understanding predictions. Typically run after training or for specific analysis tasks.
# Example (Conceptual - requires data):
# python model_explainer.py --evaluate --X_test features.csv --y_test labels.csv
# Check model_explainer.log for detailed output
# Generates plots (ROC, PR, Confusion Matrix) in current dir or specified output
Monitors directories and processes continuously. Run as a background service or separate process.
# Example: Monitor Downloads and Temp directories
python realtime_monitor.py -w /path/to/Downloads -w /tmp
Check realtime_monitor.log for detected events and analysis triggers.
Implementing Explainable AI (XAI) presented a significant challenge. While our Random Forest model achieved good classification accuracy, it functioned as a "black box." Making its decisions transparent was crucial for user trust, especially for security analysts who need to validate findings.
- Interpreting SHAP Values: The raw numerical outputs from SHAP, while mathematically sound, were initially difficult to translate into actionable insights. Understanding how multiple features interacted within the complex tree ensemble to influence the final prediction required careful processing.
- SHAP Library Integration: Ensuring that shap.TreeExplainer was correctly applied to our specific scikit-learn RandomForestClassifier instance, and handling the multi-output nature of its result (probabilities for both classes), needed specific implementation details within model_explainer.py.
How We Overcame It:
- Value Processing: We developed functions within model_explainer.py to parse the raw SHAP values, calculate the absolute impact of each feature, and rank them to identify the top positive (contributing to malware verdict) and negative (contributing to benign verdict) influencers for any given prediction.
- Visualization: We implemented the generate_explanation_plot function to create clear waterfall or bar plots. These visuals directly map feature names to their SHAP contribution, making the 'why' behind a prediction immediately apparent.
- Integration: The explanation logic was integrated into the analysis workflow (callable via app.py or used directly in reports) to provide context alongside the verdict and confidence score.
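The value-processing step can be sketched as follows. This is an illustration under stated assumptions, not the code in model_explainer.py: the function name is hypothetical, the input is assumed to be the one-dimensional SHAP values of a single sample for the malware class, and the feature names and numbers in the usage example are made up. Depending on the SHAP version, TreeExplainer output for a binary classifier may arrive as one array per class or as a single array with a trailing class axis, so the malware-class slice has to be selected first.

```python
from typing import Dict, List, Sequence, Tuple

Contribution = Tuple[str, float]


def rank_shap_contributions(
    feature_names: Sequence[str],
    shap_values: Sequence[float],
    top_n: int = 5,
) -> Dict[str, List[Contribution]]:
    """Rank features by absolute SHAP value and split them by direction."""
    if len(feature_names) != len(shap_values):
        raise ValueError("feature_names and shap_values must have equal length")
    pairs = sorted(
        zip(feature_names, (float(v) for v in shap_values)),
        key=lambda item: abs(item[1]),
        reverse=True,
    )
    return {
        # Positive values push the prediction toward the malware class.
        "toward_malware": [(n, v) for n, v in pairs if v > 0][:top_n],
        # Negative values push the prediction toward the benign class.
        "toward_benign": [(n, v) for n, v in pairs if v < 0][:top_n],
    }


if __name__ == "__main__":
    # Made-up feature names and values, for demonstration only.
    names = ["SectionMaxEntropy", "SuspiciousImportCount", "SizeOfImage", "ExportCount"]
    values = [0.21, 0.34, -0.05, -0.12]
    for direction, items in rank_shap_contributions(names, values, top_n=2).items():
        print(direction, items)
```

Ranking by absolute value before splitting keeps the strongest influencers at the top of each list, which is the order an analyst would usually want in a report or next to a waterfall plot.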
This focus on processing and visualizing the XAI output transformed the explainability feature from a technical possibility into a practical tool for analysts using Cerebrus.
- Advanced Dynamic Analysis: Implement a robust sandbox environment (e.g., using VirtualBox/QEMU via Python) and integrate the DynamicAnalyzer logic fully, including parsing logs from tools like Procmon/tshark to extract behavioral features.
- Expand File Type Support: Create dedicated static analyzer classes (inheriting from BaseStaticAnalyzer) for:
  - Scripts (PowerShell, VBS, JS, Batch) - using regex, keyword analysis, deobfuscation techniques.
  - Office Documents (OLE/OOXML) - using olefile/zipfile, checking for macros, embedded objects, external links (document_analyzer.py).
  - PDFs - using PyPDF2 or PyMuPDF to check for JavaScript, auto-actions, embedded files, obfuscation.
  - Archives - unpacking various formats and analyzing contents.
  - Android APKs.
- Advanced ML Models: Although our model reports an accuracy of 99%, Deep Learning models (CNNs for binary visualization, RNNs/LSTMs for sequential data like API calls from dynamic analysis) could be explored for potentially higher accuracy, especially on obfuscated samples.
- UI/UX Enhancements: Improve the Flask web interface, add dashboards, historical analysis views, and better visualization of results.
- Performance Optimization: Optimize feature extraction and model prediction for speed, especially for batch processing and real-time monitoring.
Contributions are welcome! Please feel free to submit pull requests, report issues, or suggest new features.
This project is licensed under the MIT License - see the LICENSE file for details (assuming MIT).
- The creators and maintainers of the libraries used (PEFile, python-magic, Scikit-learn, SHAP, etc.).
- Our team: Manjesh Tiwari, Akash Yadav, Mandeep