Hey there! 👋 I'm diving deep into the world of S&P 500 stock market trends, building a time series regression model to predict stock movements. This project is a mix of data exploration, feature engineering, and machine learning to gain valuable financial insights.
I've got three datasets that power this analysis:
This dataset contains details about S&P 500 companies, including their sector, industry, market cap, financials, and stock weights.
🔍 Initial Findings:
- The big tech giants dominate: Apple, NVIDIA, Microsoft, and Alphabet hold the highest market cap.
- NVIDIA is a growth beast, showing 122.4% revenue growth, outpacing others.
- Apple has the highest weight (0.0635) in the index, meaning it significantly impacts overall movements.
This dataset tracks the historical performance of the S&P 500 index.
📌 First few entries:
- The earliest date is October 6, 2014, with the index at 1,964.82 points.
- The daily fluctuations reflect economic trends, earnings reports, and investor sentiment.
💡 Next step: Plot the S&P 500 index over time to see the long-term trend.
This dataset contains historical stock prices for every S&P 500 company.
🔍 Example (3M - MMM on Jan 4, 2010):
- Open: 69.47, Close: 69.41
- High: 69.77, Low: 69.12
- Volume: 3.64M shares traded
📌 Why this matters:
- The "Adj Close" column accounts for dividends/splits, making it crucial for accurate return calculations.
Here's how I'd write this step in GitHub README.md format while keeping the tone like yours:
Before diving into analysis, I cleaned and prepped the data to make it ready for modeling. Here’s what I did:
Since I'm working with time series data, it's crucial to ensure that dates are in the proper format.
sp500_index['Date'] = pd.to_datetime(sp500_index['Date'])
sp500_stocks['Date'] = pd.to_datetime(sp500_stocks['Date'])
Sorting helps in sequential analysis and forecasting.
sp500_index = sp500_index.sort_values('Date')
sp500_stocks = sp500_stocks.sort_values('Date')
Missing values can mess up ML models, so I filled them with zero (0) for now.
sp500_companies.fillna(0, inplace=True)
sp500_stocks.fillna(0, inplace=True)
Setting the 'Date' column as an index makes it easier to visualize and apply time series models.
sp500_index.set_index('Date', inplace=True)
✅ Now the data is clean and ready for exploration!
To understand the historical performance of the S&P 500 Index, I plotted the trend over time.
I also analyzed Apple's (AAPL) stock price movements over time.
Below are the Python code snippets used for visualization:
# S&P 500 Index Trend
plt.figure(figsize=(12, 6))
plt.plot(sp500_index.index, sp500_index['S&P500'], label="S&P 500 Index")
plt.title("S&P 500 Index Trend")
plt.xlabel('Year')
plt.ylabel('Index Value')
plt.legend()
plt.show()
# AAPL Stock Price Trend
aapl_stock = sp500_stocks[sp500_stocks['Symbol'] == 'AAPL']
plt.figure(figsize=(12, 6))
plt.plot(aapl_stock['Date'], aapl_stock['Close'], label='AAPL Close Price')
plt.title('AAPL Stock Price Trend')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.legend()
plt.show()
To predict future trends in the S&P 500 Index, I used the Auto ARIMA model, which automatically selects the best ARIMA parameters based on AIC (Akaike Information Criterion).
Before applying ARIMA, we checked if the time series is stationary using the Augmented Dickey-Fuller (ADF) Test:
result = sm.tsa.adfuller(sp500_index['S&P500'])
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
📊 Results:
ADF Statistic: 0.4040 p-value: 0.9816 (Greater than 0.05 → The time series is non-stationary) Since the p-value is high, we apply differencing to make the series stationary.
⚙️ Auto ARIMA Model The Auto ARIMA function automatically finds the best parameters:
sp500_diff = sp500_index['S&P500'].diff().dropna()
arima_model = auto_arima(sp500_diff, seasonal=False, trace=True, stepwise=True)
📌 Best ARIMA Model Found: ARIMA(2,0,2) with intercept This model was selected based on the lowest AIC score (25188.691).
n_periods = 30
forecast, conf_int = arima_model.predict(n_periods=n_periods, return_conf_int=True)
forecast_dates = pd.date_range(sp500_index.index[-1], periods=n_periods, freq='B')
plt.figure(figsize=(10,6))
plt.plot(sp500_index.index, sp500_index['S&P500'], label='S&P 500')
plt.plot(forecast_dates, forecast.cumsum() + sp500_index['S&P500'].iloc[-1],
label='Forecast', color='red')
plt.fill_between(forecast_dates, conf_int[:, 0].cumsum() + sp500_index['S&P500'].iloc[-1],
conf_int[:, 1].cumsum() + sp500_index['S&P500'].iloc[-1], color='red', alpha=0.3)
plt.title('S&P 500 Forecast')
plt.legend()
plt.show()
📊 Visualization: S&P 500 Forecast Below is the forecasted trend for the next 30 business days:
🔍 Insights:
The forecast shows a projected upward/downward trend for the S&P 500. The red shaded region represents the confidence interval, showing possible fluctuations.