diff --git a/docs/notebooks/time_varying.ipynb b/docs/notebooks/time_varying.ipynb index 2550a87..70fe456 100644 --- a/docs/notebooks/time_varying.ipynb +++ b/docs/notebooks/time_varying.ipynb @@ -118,7 +118,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Simulating a dataset\n", + "## Simulating a dataset: first approach to test dimensions but doesn't guarantee meaningful results\n", "\n", "We will simulate a dataset of 100 subjects with 10 follow up times where a covariate is observed. The covariates will follow a trigonometric function over time and will be dependant on a random variable to differentiate between subjects.\n", "\n", @@ -126,7 +126,109 @@ "\n", "$$ Z_i(t) = a_i \\cos(2 \\pi t) $$\n", "\n", - "where $a_i \\sim N(5, 2.5)$." + "where $a_i \\sim N(5, 2.5)$.\n", + "\n", + "## Proper simulation guidance: data that can be interpreted\n", + "\n", + "A good approach for simulating data is described in detail by [Ngwa et al 2020](https://pmc.ncbi.nlm.nih.gov/articles/PMC7731987/). If this is not yet implemented, it would be a good way of starting to ensure that both methods work as expected. There are tow parts in simulating such a dataset. First, simulating the longitudina lobservational data and then the survival data. Below we describe methodologies for both.\n", + "\n", + "### Longitudinal data (covariates)\n", + "\n", + "We use $i \\in \\{1, \\dots, n\\}$ to index subjects and $j \\in \\{1, \\dots, m_i\\}$ to index time points where $m_i$ is the final time point for subject $i$.\n", + "We simulate covariates independantly:\n", + "- age at baseline $Age_i \\sim N(35,5)$\n", + "- sex $\\sim Bernoulli(p=0.54)$\n", + "\n", + "Generate expected longitudinal trajectories $\\varphi_{\\beta}(t_{ij})$:\n", + "\n", + "$$ \\varphi_{\\beta}(t_{ij}) = b_{i1} + b_{i2} \\cdot t_{ij} + \\alpha Age_i, $$\n", + "\n", + "where $b_{i1}, b_{i2}$ are random effects\n", + "\n", + "We will generate $b_{i1}, b_{i2}$ from multivariate normal distribution with a covariance matrix $G = [[0.29, -0.00465],[-0.00465, 0.000320]]$. Sample from this multivariate normal distribution (with mean zero) to get the random intercept and slope.\n", + "\n", + "The observed longitudinal measures measures $Y_{ij}(t_{ij})$ from a multivariate normal distribution with mean $ \\varphi_{\\beta}(t_{ij})$ and variance $V$:\n", + "\n", + "$$ V = Z_i GZ_i ^T + R_i, \\text{ where }Z_i = [[1,1,1,1,1,1]^T, [0,5,10,15,20,25]^T]$$\n", + "\n", + "and $R_i = diag(\\sigma^2)$ and $\\sigma^2$ is set to $0.1161$." + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "tensor([[33.7853, 34.1568, 33.7249, 33.9724, 34.4417, 34.4528],\n", + " [33.1087, 33.4781, 32.5054, 33.1090, 32.9212, 33.4908],\n", + " [31.8224, 31.8031, 32.1202, 32.3814, 31.4848, 31.9074],\n", + " [36.1902, 35.9910, 36.4153, 36.2511, 35.8788, 36.4300]])\n" + ] + } + ], + "source": [ + "import torch.distributions as dist\n", + "\n", + "# Set random seed for reproducibility\n", + "torch.manual_seed(123)\n", + "\n", + "n = 100 # Number of subjects\n", + "T = 6 # Number of time points\n", + "\n", + "# Simulation parameters\n", + "age_mean = 35\n", + "age_std = 5\n", + "sex_prob = 0.54\n", + "G = torch.tensor([[0.29, -0.00465],[-0.00465, 0.000320]])\n", + "Z = torch.tensor([[1, 1, 1, 1, 1, 1], [0, 5, 10, 15, 20, 25]], dtype=torch.float32).T\n", + "sigma = torch.tensor([0.1161])\n", + "alpha = 1\n", + "\n", + "# Simulate age at baseline\n", + "age_dist = dist.Normal(age_mean, age_std)\n", + "age = age_dist.sample((n,))\n", + "\n", + "# Simulate sex\n", + "sex_dist = dist.Bernoulli(probs=sex_prob)\n", + "sex = sex_dist.sample((n,))\n", + "\n", + "# Simulate random effects\n", + "random_effects_dist = dist.MultivariateNormal(torch.zeros(2), G)\n", + "random_effects = random_effects_dist.sample((n,))\n", + "\n", + "# Generate expected longitudinal trajectories\n", + "# quite frakly this is useless now - it was based on my bad understanding of the algorithm\n", + "trajectories = random_effects[:, 0].unsqueeze(1) + random_effects[:, 1].unsqueeze(1) * Z[:,1] + alpha * age.unsqueeze(1)\n", + "\n", + "# Simulate observed longitudinal measures\n", + "R = torch.diag_embed(sigma.repeat(T))\n", + "V = torch.matmul(torch.matmul(Z, G), Z.T) + R\n", + "\n", + "#get a mean trajectory\n", + "b1 = torch.tensor([4.250])\n", + "b2 = torch.tensor([0.250])\n", + "mean_trajectory = b1.item() + b2.item() * Z[:,1] + alpha * age_mean\n", + "\n", + "#define the distribution to sample the trajectories from\n", + "observed_data_dist = dist.MultivariateNormal(trajectories, V)\n", + "\n", + "#sample from the distribution to get an n x T matrix of observations/covariates\n", + "observed_data = observed_data_dist.sample((1,)).squeeze()\n", + "\n", + "print(observed_data[1:5, :])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Survival data (outcomes)\n", + "\n", + "here I will describe how to get the survival and censoring for all the subjects from above. then I will code it up in python." ] }, { @@ -160,6 +262,13 @@ "covars = matrix * random_vars[:, None]\n" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {},