From 7c83e237bf05d2190f0b0c800a7450e61313dbff Mon Sep 17 00:00:00 2001 From: Kevin Tyle Date: Mon, 20 Nov 2023 22:26:37 +0000 Subject: [PATCH 1/2] Add pandas_json notebook --- _toc.yml | 1 + core/pandas/pandas_json.ipynb | 390 ++++++++++++++++++++++++++++++++++ 2 files changed, 391 insertions(+) create mode 100644 core/pandas/pandas_json.ipynb diff --git a/_toc.yml b/_toc.yml index 267b3d7b4..b10c45db8 100644 --- a/_toc.yml +++ b/_toc.yml @@ -56,6 +56,7 @@ parts: - file: core/pandas sections: - file: core/pandas/pandas + - file: core/pandas/pandas_json - file: core/data-formats sections: - file: core/data-formats/netcdf-cf diff --git a/core/pandas/pandas_json.ipynb b/core/pandas/pandas_json.ipynb new file mode 100644 index 000000000..4ffc8bff6 --- /dev/null +++ b/core/pandas/pandas_json.ipynb @@ -0,0 +1,390 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e124235-3846-4fcb-b533-10fa5856b4b4", + "metadata": {}, + "source": [ + "\n", + "# Pandas: Working with a JSON file" + ] + }, + { + "cell_type": "markdown", + "id": "6e75bb80-da84-47a9-ae2d-210eb06d492e", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "9d226e98-85a1-4197-a757-448bd2bf4563", + "metadata": {}, + "source": [ + "## Overview\n", + "In this notebook, we will create a [Pandas DataFrame](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe) from a remotely served [JSON](https://www.json.org/) file. This particular file contains forecasted [solar wind](https://www.swpc.noaa.gov/phenomena/solar-wind) parameters from NOAA's [Space Weather Prediction Center](https://www.swpc.noaa.gov).\n", + "\n", + "1. Read in a JSON file\n", + "1. Reformat the `DataFrame`\n", + "1.
Visualize the dataset" + ] + }, + { + "cell_type": "markdown", + "id": "daeabc7d-d4f6-4f9f-aad7-679123b9d2fb", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "| Concepts | Importance | Notes |\n", + "| --- | --- | --- |\n", + "| [Pandas](https://foundations.projectpythia.org/core/pandas/pandas.html) | Necessary | |\n", + "\n", + "- **Time to learn**: 10 minutes\n" + ] + }, + { + "cell_type": "markdown", + "id": "7b875611-44ef-4b1f-9453-f2fa84bb4d82", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "c732c9d1-0e00-4d9b-8e29-73ce71c99499", + "metadata": {}, + "source": [ + "## Imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d48491f-4332-4eff-af72-0382c6c5794a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "id": "690e69eb-5ec0-458d-962f-a25cdf8983dc", + "metadata": {}, + "source": [ + "## Read in a JSON file" + ] + }, + { + "cell_type": "markdown", + "id": "630ace3c-7e81-45a8-a884-a045d7afbae6", + "metadata": {}, + "source": [ + "### NOAA's SWPC provides a variety of forecast output in JSON format. Here, we create a `DataFrame` from the current 1-day plasma forecast using Pandas' `read_json` function."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3be45f7e-8f1b-4645-8eff-6d7d0c6976a4", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df = pd.read_json(\"https://services.swpc.noaa.gov/products/solar-wind/plasma-1-day.json\")" + ] + }, + { + "cell_type": "markdown", + "id": "2128b21f-7d34-45f6-b08f-23287a7761ea", + "metadata": {}, + "source": [ + "Examine the `DataFrame`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "deea1ad0-78ef-4b47-8ed4-951b2f8e2a5b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "1b34e2ce-19a9-46b6-a346-da9cc814a8a6", + "metadata": {}, + "source": [ + "## Reformat the `DataFrame`" + ] + }, + { + "cell_type": "markdown", + "id": "61ff9856-2a64-4606-8c13-4cc1bbe6f384", + "metadata": {}, + "source": [ + "Notice that the column headers appear in the `DataFrame`'s first row rather than as its column labels. Let's fix that." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5d198bce-c48e-4c77-9785-ab26bfaac669", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Set the columns to be the values of the first row. Then drop that first row.\n", + "df = df.rename(columns=df.iloc[0]).drop(df.index[0])" + ] + }, + { + "cell_type": "markdown", + "id": "e3c8f341-a5f6-46f0-ac1d-2d1feeb4631d", + "metadata": {}, + "source": [ + "Examine the reformatted `DataFrame`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a457cb3-6203-45d5-9467-c8c5ffc07e52", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "43bfed63-0fa2-4d3a-8a0b-162386dace6f", + "metadata": {}, + "source": [ + "### Set the `DataFrame`'s index to the timestamped column."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e1f2933-6193-4f2d-b012-7c5f18a3999b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df.index" + ] + }, + { + "cell_type": "markdown", + "id": "4b3ade9e-eebd-49d0-a321-624939d42d7d", + "metadata": {}, + "source": [ + "Currently, the `DataFrame` has a *default index* (i.e., a range of integers). For time series data (i.e., data in which time is the independent variable), it is [good practice](https://pandas.pydata.org/docs/user_guide/timeseries.html) to use a time-based column as the index." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83cdc00a-2a38-4ef7-ac33-c2f6875c2dc3", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df = df.set_index('time_tag')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9abc6605-9676-48ef-a9c5-57ef13b21eed", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "7d4e6843-638b-4801-af35-9ddbc191378a", + "metadata": {}, + "source": [ + "### Check and edit the `dtypes` of the independent and dependent variables" + ] + }, + { + "cell_type": "markdown", + "id": "7fc15d05-b3cb-4575-a4c5-c25123613bf7", + "metadata": {}, + "source": [ + "In this case, the `DataFrame`'s index corresponds to the independent variable, and the columns correspond to the dependent variables." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8341c2d-7d4b-4960-93e6-b90670aab9a3", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df.index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e165dc3-1a51-406d-9b31-b92a16721a1d", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "markdown", + "id": "90c69195-8111-4c0f-bb6f-17aa036b7147", + "metadata": { + "tags": [] + }, + "source": [ + "They are all `object`s ...
and as a result won't be amenable to typical time-series visualization methods. Change them to more appropriate `dtype`s: `float32` for the dependent variables, and `datetime64` for the time-based one." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32beea01-3390-4eb6-9d62-090ad70ba2a9", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Convert each data column to 32-bit floats.\n", + "for col in df.columns:\n", + "    df[col] = df[col].astype(\"float32\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eba917fd-6167-4726-ae0f-79c064d4ef72", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df.index = pd.to_datetime(df.index)" + ] + }, + { + "cell_type": "markdown", + "id": "38611e12-3f66-4e39-ac83-2021df2bf63e", + "metadata": {}, + "source": [ + "## Visualize the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "634749f1-f613-4e90-aff8-6efdd038d251", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df.temperature.plot(figsize=(10, 8));" + ] + }, + { + "cell_type": "markdown", + "id": "5d085493-0e98-4e6f-a190-a948bd44f53a", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "5a43f367-1bdf-46b6-b253-f40ff965ff02", + "metadata": {}, + "source": [ + "## Summary\n", + "Pandas has several reader functions for differently-formatted tabular datasets. In this notebook, we created a `DataFrame` via Pandas' `read_json` function, and then manipulated the `DataFrame` to allow for a useful time-series visualization." + ] + }, + { + "cell_type": "markdown", + "id": "419af5fe-4cfc-466a-a4b0-1df66ddae8f8", + "metadata": {}, + "source": [ + "
\n", + "Note:\n", + " JSON does not prescribe a single layout for tabular data. The strategy we followed to create and reformat the DataFrame in this notebook will likely need to change for other JSON-formatted datasets you may encounter!\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "62c4be7e-84a8-497d-a799-a74e5201c567", + "metadata": {}, + "source": [ + "### What's next?\n", + "Future [Project Pythia Foundations](https://foundations.projectpythia.org) Pandas notebooks will explore additional file format-specific reader methods." + ] + }, + { + "cell_type": "markdown", + "id": "dd8c3c60-d703-4b22-b796-8c385130cfa2", + "metadata": {}, + "source": [ + "## Resources and references\n", + "1. [pandas](https://pandas.pydata.org)\n", + "1. [JSON](https://json.io)\n", + "1. [NOAA SWPC](https://www.swpc.noaa.gov)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From b65601aa8a19104994331d93e6917094c475c2a6 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon, 20 Nov 2023 22:28:15 +0000 Subject: [PATCH 2/2] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- core/pandas/pandas_json.ipynb | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/core/pandas/pandas_json.ipynb b/core/pandas/pandas_json.ipynb index 4ffc8bff6..2d84c774e 100644 --- a/core/pandas/pandas_json.ipynb +++ b/core/pandas/pandas_json.ipynb @@ -97,7 +97,9 @@ }, "outputs": [], "source": [ - "df = pd.read_json(\"https://services.swpc.noaa.gov/products/solar-wind/plasma-1-day.json\")" + "df = pd.read_json(\n", + " \"https://services.swpc.noaa.gov/products/solar-wind/plasma-1-day.json\"\n", + ")" ] }, {