
Commit

ADD FIMO/TOMTOM updates
jmschrei committed Sep 6, 2024
1 parent 61b6407 commit 656f699
Showing 12 changed files with 2,292 additions and 1,235 deletions.
25 changes: 24 additions & 1 deletion cmd/tangermeme
@@ -2,8 +2,14 @@
# tangermeme command-line toolkits
# Author: Jacob Schreiber <jmschreiber91@gmail.com>

import sys
import pathlib
sys.path.insert(0, str(pathlib.Path(__file__).resolve().parent))

import argparse

from _tomtom import _run_tomtom
from _fimo import _run_fimo

desc = """tangermeme is a package for genomic sequence-based machine learning.
This command-line tool contains many methods that are useful for
@@ -59,8 +65,25 @@ tomtom_parser.add_argument("-p", "--thresh", type=float, default=0.01,
###


#
fimo_help = """Scan a set of motifs from a MEME file against one or more
sequences in a FASTA file."""

fimo_parser = subparsers.add_parser("fimo", help=fimo_help,
formatter_class=argparse.ArgumentDefaultsHelpFormatter)

fimo_parser.add_argument("-m", "--motif", type=str, required=True,
help="""The filename of a MEME-formatted file containing motifs.""")
fimo_parser.add_argument("-s", "--sequence", type=str,
help="""The filename of a FASTA-formatted file containing the sequences to
scan against.""")
fimo_parser.add_argument("-w", "--bin_size", type=float, default=0.1,
help="""The width of bins to use when discretizing scores.""")
fimo_parser.add_argument("-e", "--epsilon", type=float, default=0.0001,
help="""A pseudocount to add to each PWM.""")
fimo_parser.add_argument("-p", "--threshold", type=float, default=0.0001,
help="""The p-value threshold for returning matches.""")
fimo_parser.add_argument("-r", "--norc", action='store_true', default=False,
help="""Whether to only do the positive strand.""")


##############
2,136 changes: 1,303 additions & 833 deletions docs/tutorials/Tutorial_D1_FIMO.ipynb

Large diffs are not rendered by default.

337 changes: 337 additions & 0 deletions docs/tutorials/Tutorial_D2_TOMTOM.ipynb
@@ -0,0 +1,337 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "8ef3314b-3f86-4532-ac9a-3a8d50a49f8d",
"metadata": {},
"source": [
"### Tutorial D2 (Built-in Tools): TOMTOM\n",
"\n",
"`FIMO` is a tool for taking a set of PWMs and scanning them against a set of one-hot encoded sequences; in contrast, `TOMTOM` is a tool for comparing pairs of PWMs to determine their similarity. Classically, tools like `TOMTOM` are used to identify redundancies in motif databases and to estimate similarities between binding profiles of proteins. However, `TOMTOM` is also frequently used by researchers who see motifs that are highlighted by some method (such as attributions from machine learning models) and need to know what that motif corresponds to. In this manner, `TOMTOM` serves as a useful automatic annotation tool for discoveries made using `tangermeme`.\n",
"\n",
"Although at first glance `FIMO` and `TOMTOM` appear similar because they involve comparing motifs with something to identify statistical similarities, the difference between operating on long one-hot encoded sequences and a second motif require large methodological changes. First, calculating a background distribution for `FIMO` can be done exactly because you know that the sequences being scanned are limited to being one-hot encoded, whereas the background distribution for `TOMTOM` must be empirically calculated from the set of provided motifs. Second, because motifs are much shorter than the sequences usually being scanned by `FIMO`, the best alignment between two motifs likely involve some amount of overhang on either end of the alignment. Calculating scores and p-values in a manner that don't automatically undervalue imperfect alignments is non-trivial. \n",
"\n",
"For more information about `TOMTOM`, I'd suggest reading [the original paper](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2007-8-2-r24) and the [subsequent follow-up work](https://pubmed.ncbi.nlm.nih.gov/21543443/) that improves the calculation of p-values for alignments with overhangs."
]
},
{
"cell_type": "markdown",
"id": "7bf7195e-914c-4e2d-82dc-894b502eac04",
"metadata": {},
"source": [
"#### Using TOMTOM\n",
"\n",
"One can run the `TOMTOM` algorithm by calling the `tomtom` function on a pair of lists, where both lists are PWMs that are each of shape `(len(alphabet), motif_length)`. The background distribution is built primarily using the targets, so make sure that the first list is the queries you want to evaluate and the second list is the target distribution you want to match against. For example, if you want to score a single one-hot encoded sequence, you can do the following:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "29dff7e7-ca13-4967-bf5c-2e9385a73571",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1, 1646)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy\n",
"\n",
"from tangermeme.io import read_meme\n",
"from tangermeme.utils import one_hot_encode\n",
"\n",
"from tangermeme.tools.tomtom import tomtom\n",
"\n",
"q = [one_hot_encode(\"ACGTGT\").double()]\n",
"\n",
"targets = read_meme(\"JASPAR2020_CORE_non-redundant_pfms_meme.txt\")\n",
"target_names = numpy.array([name for name in targets.keys()])\n",
"target_pwms = [pwm for pwm in targets.values()]\n",
"\n",
"\n",
"p, scores, offsets, overlaps, strands = tomtom(q, target_pwms)\n",
"p.shape"
]
},
{
"cell_type": "markdown",
"id": "ed6a712d-71fd-46b6-8f18-957b98b67aac",
"metadata": {},
"source": [
"You will get several properties of the best alignment between the provided queries and each of the targets. Here, because there was only one query, the first dimension of each returned tensor only has one elements.\n",
"\n",
"If we want to get the names of the statistically significant matches we can easily cross-reference the p-values with the names in the target database."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "2a66616d-457c-4d25-8eb0-f831309aa415",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['MA0310.1 HAC1', 'MA0622.1 Mlxip', 'MA0930.1 ABF3',\n",
" 'MA0931.1 ABI5', 'MA1033.1 OJ1058_F05.8', 'MA1331.1 BEH4',\n",
" 'MA1332.1 BEH2', 'MA1333.1 BEH3', 'MA1493.1 HES6'], dtype='<U28')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"target_names[p[0] < 0.0001]"
]
},
{
"cell_type": "markdown",
"id": "e685714d-e25c-4aba-8415-92b652a710e8",
"metadata": {},
"source": [
"The returned scores are the integerized score of the best alignment, the offets indicate where the best alignment starts for the query, the overlaps are the number of positions that overlap between the target and the query, and the strand is whether the best alignment is on the forward (0) strand or the negative (1) strand.\n",
"\n",
"Sometimes, you will be comparing one set of PWMs against a second set of PWMs though, rather than a single value versus a set of PWMs. Doing so is pretty much the same as before, except that the list of query motifs has more than one value in it. We can demonstrate that by comparing the test set of motifs used in the unit tests against the full set of motifs."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "eed69064-ebf5-4786-89fe-c574a224ddc9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(12, 1646)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = list(read_meme(\"../../tests/data/test.meme\").values())\n",
"\n",
"p, scores, offsets, overlaps, strands = tomtom(query, target_pwms)\n",
"p.shape"
]
},
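{
"cell_type": "markdown",
"id": "added-alignment-sketch-md",
"metadata": {},
"source": [
"As a minimal sketch of reading these alignment properties out (assuming, as the shapes above suggest, that each returned array is indexed as `(query, target)`), we can pull the best-matching target for the first query:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-alignment-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Best-matching target for the first query, then the score, offset,\n",
"# overlap, and strand of that alignment.\n",
"best = p[0].argmin()\n",
"target_names[best], scores[0, best], offsets[0, best], overlaps[0, best], strands[0, best]"
]
},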
{
"cell_type": "markdown",
"id": "8a7d21f2-8cf7-4011-9b34-a65ef565606d",
"metadata": {},
"source": [
"The set of motifs used in the unit tests are a few randomly selected motifs from the JASPAR set, so the p-values should indicate perfect matches."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "e26732ac-31c3-411b-a785-611d45cae92e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-3.17523785e-13, -5.60440583e-13, 5.38370792e-09, -9.46798195e-13,\n",
" -2.70006240e-13, 5.71406977e-07, 4.77395901e-14, -2.93542968e-13,\n",
" -2.18491891e-13, 2.02698924e-09, -1.84918747e-12, 1.90786831e-09])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p.min(axis=-1)"
]
},
{
"cell_type": "markdown",
"id": "63206c02-017d-4643-8e15-eff8770c843e",
"metadata": {},
"source": [
"#### Timings\n",
"\n",
"When one uses the MEME suite webserver to run TOMTOM, they may notice that it takes several seconds to run a single query against a target database. Although the command-line tool is faster (as jobs do not need to be submitted, queued, and then wait for available resources), one may still wonder if it is possible to do large-scale comparisons with it. \n",
"\n",
"The implementation of `TOMTOM` in `tangermeme` scales significantly better than the command-line tool (though you might notice it takes a few seconds for numba to load initially due to some issues there). There are several reasons for this but perhaps the most notable are built in multithreading through numba and caching null distributions for each target length, rather than having to recalculate them for each query-target pair. These efficiencies significantly speed up the implementation in `tangerememe` without compromising on the exactness. However, importantly, other speed improvements (approximate medians, binning similar targets) further speed up the algorithm but do slightly alter the results. In practice, there is very little observed difference, but this is why you may not see identical results.\n",
"\n",
"Let's do a quick timing test comparing the 1646 motifs in the JASPAR set against themselves. This involves doing 1646x1646x2 comparisons, with the x2 being because we are also doing reverse complement comparisons."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "80e8d169-bf4f-43ba-9a61-57776bd7ddbe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.22 s ± 71.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit tomtom(target_pwms, target_pwms)"
]
},
{
"cell_type": "markdown",
"id": "842a6a7a-47e2-4722-889d-4b06473bebfb",
"metadata": {},
"source": [
"Just one second to calculate the complete similarity matrix -- including allocating the intermediary memory and spinning up the threadpool.\n",
"\n",
"How long would it take to do a much larger calculation? We can estimate that by copying the array 10 times in both directions."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "4228e04b-03ff-4233-b5e0-1c5be2b61397",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10.1 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit tomtom(target_pwms*10, target_pwms*10)"
]
},
{
"cell_type": "markdown",
"id": "04dfd340-79ca-44f6-ae2f-d7b5fb6b837d",
"metadata": {},
"source": [
"Looks like it only takes ~10s, which is 10x faster than you might expect, since we expected the calculation to be ~100x slower. This performance is mostly explained by the target column binning feature, where repeated columns (or those whose probability vectors are close, but not identical) are merged and calculation does not happen for them. Since each column in the target set now has 10 exact copies, it may not be surprising to see that the target axis does not matter as much for the calculations. However, this is an important point because as target databases get larger, *they get more redundant*. So, while we may not see speed improvements as large as we saw here, we will still see time only increase *sublinearly* with the number of targets."
]
},
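{
"cell_type": "markdown",
"id": "added-column-merge-md",
"metadata": {},
"source": [
"As a toy illustration of the merging idea (a sketch of the concept, not `tangermeme`'s internals), once columns are discretized into bins, only the distinct columns need to be scored and the results can be broadcast back to the full set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-column-merge-code",
"metadata": {},
"outputs": [],
"source": [
"# 1,000 binned target columns over a 4-letter alphabet, drawn from few\n",
"# enough bin values that many columns repeat exactly.\n",
"cols = numpy.random.randint(0, 3, size=(4, 1000))\n",
"\n",
"# Each distinct column is scored once; `inverse` maps every original\n",
"# column back to its representative.\n",
"uniq, inverse = numpy.unique(cols, axis=1, return_inverse=True)\n",
"uniq.shape[1], cols.shape[1]"
]
},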
{
"cell_type": "markdown",
"id": "27b5acb4-400e-42fd-b448-06e05e1812da",
"metadata": {},
"source": [
"#### Nearest Neighbors\n",
"\n",
"In some applications, you may not want the entire dense similarity matrix, either because calculating the full thing would take too much memory or just generally not be valuable for downstream applications. In these cases, you can keep only the top `n` nearest neighbors (according to the p-value). Doing so requires only setting the `n_nearest` parameter."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "398c45b7-6540-48c7-9446-f75ece60d1a7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1646, 100)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p, _, _, _, _, idxs = tomtom(target_pwms, target_pwms, n_nearest=100)\n",
"p.shape"
]
},
{
"cell_type": "markdown",
"id": "ae488488-0335-489c-ab77-b0e87427b1a6",
"metadata": {},
"source": [
"The return signature should be the same except that an additional matrix of indexes is returned. These indexes are the target indexes for the top `n` returned values."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "05d713c5-5226-4f1a-924d-756ac0b4ef2f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0., 188., 481., 900., 1252., 527., 314., 684., 493.,\n",
" 805.])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idxs[0, :10]"
]
},
{
"cell_type": "markdown",
"id": "334944c6-d136-4fbd-b192-6ef3aad23b94",
"metadata": {},
"source": [
"As you might expect, when calculating the similarity between a set of motifs and itself, the motif itself will be the highest similarity match (hence the 0 as the first index)."
]
},
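{
"cell_type": "markdown",
"id": "added-self-match-md",
"metadata": {},
"source": [
"As a small sanity check (a sketch using the arrays from the cells above; note that `idxs` is returned with a float dtype here, so we cast before indexing), looking the first index up for a few motifs should return each motif's own name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-self-match-code",
"metadata": {},
"outputs": [],
"source": [
"# The nearest neighbor of each motif should be the motif itself.\n",
"target_names[idxs[:5, 0].astype(int)]"
]
},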
{
"cell_type": "markdown",
"id": "302e4bfc-5a0c-4d80-9626-923dde9c2a9c",
"metadata": {},
"source": [
"#### Approximation Parameters\n",
"\n",
"The implementation in `tangermeme` uses a few appproximation tricks to speed up the algorithm. In general, these tricks involve bins, where the more bins the better the approximation but the more time they take. The default values have been tried in a few situations involving PWMs and seem to work reasonably well, but if you have a non-standard setting you may need to change them. They are:\n",
"\n",
"`n_score_bins`: This is the parameter `t` from the TOMTOM paper. The 100 number is hardcoded into the TOMTOM command-line algorithm, but can be changed. However, in addition to setting a higher value taking more time, you will need more data to support each of the bins. Only set to a higher value when you have enough target columns to support them.\n",
"\n",
"`n_median_bins`: When calculating medians, `tangermeme` avoids sorting and implements a linear-time approximation by binning the range, counting the instances in each bin, and checking what bin corresponds to a cumulative half of all examples. The average value of the examples in the median bin (rather than the middle of the bin) is returned, for higher accuracy.\n",
"\n",
"`n_target_bins`: Each value in the target column (each of the four nucleotides in standard cases) are binned into this number of bins from the minimum observed value to the maximum observed value. Target columns that are identical across each value are merged to speed up the inner loop."
]
}
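,
{
"cell_type": "markdown",
"id": "added-approx-params-md",
"metadata": {},
"source": [
"As a minimal sketch of tuning these knobs (the keyword names come from the descriptions above; the specific values are illustrative assumptions rather than recommendations, so check the `tomtom` docstring for the defaults):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-approx-params-code",
"metadata": {},
"outputs": [],
"source": [
"# More bins give a finer approximation at the cost of speed; these values\n",
"# are illustrative, not recommendations.\n",
"p, scores, offsets, overlaps, strands = tomtom(query, target_pwms,\n",
"    n_score_bins=100, n_median_bins=1000, n_target_bins=100)\n",
"p.shape"
]
}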
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}