Distancia is a comprehensive Python package that provides a wide range of distance metrics and similarity measures, making it easy to calculate and compare the proximity between various types of data. This documentation provides an in-depth guide to the package, including installation instructions, usage examples, and detailed descriptions of each available metric.
Note
The code examples provided in this documentation are written for Python 3.x. The Python code in this package has been optimized with static typing via Cython.
Distancia is designed to be simple and intuitive, yet powerful and flexible. Whether you are working with numerical data, strings, or other types of data, Distancia provides the tools you need to measure the distance or similarity between objects.
For a quick introduction, check out the quickstart guide. If you want to dive straight into the code, head over to the Euclidean page.
Note
If you find any issues or have suggestions for improvements, feel free to contribute!
You can install the distancia package with pip:
pip install distancia
By default, this will install the core functionality of the package, suitable for users who only need basic distance metrics.
Optional Dependencies
The Distancia package also supports optional modules that enable additional features. You can install these extras depending on your needs:
With pandas support: Install with additional support for working with tabular data:
pip install distancia[pandas]
With all supported extras: Install all optional dependencies for maximum functionality:
pip install distancia[all]
This modular installation allows you to keep your setup lightweight or include everything for full capabilities.
Here are some common examples of how to use Distancia:
from distancia import Euclidean
point1 = [1, 2, 3]
point2 = [4, 5, 6]
# Create an instance of Euclidean
euclidean = Euclidean()
# Calculate the Euclidean distance
distance = euclidean.compute(point1, point2)
print(f"Euclidean Distance: {distance:4f}")
>>>Euclidean Distance: 5.196
from distancia import Levenshtein
string1 = "kitten"
string2 = "sitting"
distance = Levenshtein().compute(string1, string2)
print(f"Levenshtein Distance: {distance:4f}")
>>>Levenshtein Distance: 3
For a complete list and detailed explanations of each metric, see the next section.
Distance measures between vectors are essential in machine learning, classification, and information retrieval. Here are five of the most commonly used:
- Euclidean Distance The square root of the sum of the squared differences between the coordinates of two vectors. It is ideal for measuring similarity in geometric spaces.
- Manhattan Distance Also known as L1 distance, it is defined as the sum of the absolute differences between the coordinates of the vectors. It is well-suited for discrete spaces and grid-based environments.
- Cosine Distance It measures the angle between two vectors rather than their absolute distance. Commonly used in natural language processing and information retrieval (e.g., search engines).
- Jaccard Distance Based on the ratio of the intersection to the union of sets, it is effective for comparing sets of words, tags, or recommended items.
- Hamming Distance It counts the number of differing positions between two character or binary sequences. It is widely used in error detection and bioinformatics.
Note
These distance measures are widely used in various algorithms, including clustering, supervised classification, and search engines.
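To make these measures concrete, here is a minimal pure-Python sketch of two of them. These are illustrative reference implementations, not the distancia classes themselves (the package exposes each metric as a class with a compute method, as shown in the quickstart examples above):
import math

# Manhattan (L1) distance: sum of absolute coordinate differences
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

# Cosine distance: 1 minus the cosine of the angle between the vectors
def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

print(manhattan([1, 2, 3], [4, 5, 6]))  # 9
print(cosine_distance([1, 0], [0, 1]))  # 1.0 (orthogonal vectors)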
Distance measures between matrices are widely used in machine learning, image processing, and numerical analysis. Below are five of the most commonly used:
- Frobenius Norm The Frobenius norm is the square root of the sum of the squared elements of the difference between two matrices. It generalizes the Euclidean distance to matrices and is commonly used in optimization problems.
- Spectral Norm Defined as the largest singular value of the difference between two matrices, the spectral norm is useful for analyzing stability in numerical methods.
- Trace Norm (Nuclear Norm) This norm is the sum of the singular values of the difference between matrices. It is often used in low-rank approximation and compressed sensing.
- Mahalanobis Distance A statistical distance measure that considers correlations between features, making it effective in multivariate anomaly detection and classification.
- Wasserstein Distance (Earth Mover’s Distance) This metric quantifies the optimal transport cost between two probability distributions, making it highly relevant in image processing and deep learning.
Note
These distance measures are widely applied in fields such as computer vision, data clustering, and signal processing.
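As a concrete illustration, the Frobenius distance can be written in a few lines of plain Python. This is a reference sketch, not the package's own implementation:
import math

# Frobenius distance: square root of the sum of squared element-wise differences
def frobenius_distance(A, B):
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 4]]
print(frobenius_distance(A, B))  # sqrt(4 + 9) ≈ 3.606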
Distance measures between texts are crucial in natural language processing (NLP), search engines, and text similarity tasks. Below are five of the most commonly used:
- Levenshtein Distance (Edit Distance) The minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. Used in spell checkers and DNA sequence analysis.
- Jaccard Similarity Measures the overlap between two sets of words or character n-grams, computed as the ratio of their intersection to their union. Useful in document comparison and keyword matching.
- Cosine Similarity Computes the cosine of the angle between two text vectors, often based on TF-IDF or word embeddings. Commonly used in search engines and document ranking.
- Damerau-Levenshtein Distance An extension of Levenshtein distance that also considers transpositions (swapping adjacent characters). More robust for typographical error detection.
- BLEU Score (Bilingual Evaluation Understudy) Measures the similarity between a candidate text and reference texts using n-gram precision. Widely used in machine translation and text summarization.
Note
These text distance measures are extensively used in chatbots, plagiarism detection, and semantic search applications.
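For instance, the Jaccard distance on word sets takes only a few lines. This sketch tokenizes on whitespace for simplicity; real applications often use n-grams or more careful tokenization:
# Jaccard distance: 1 minus the ratio of shared words to all distinct words
def jaccard_distance(text1, text2):
    s1, s2 = set(text1.split()), set(text2.split())
    return 1 - len(s1 & s2) / len(s1 | s2)

print(jaccard_distance("the cat sat", "the cat ran"))  # 1 - 2/4 = 0.5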
Distance measures between time series are essential in forecasting, anomaly detection, and clustering of temporal data. Below are five of the most commonly used:
- Dynamic Time Warping (DTW) Computes the optimal alignment between two time series by allowing non-linear warping along the time axis. Widely used in speech recognition and gesture classification.
- Euclidean Distance The square root of the sum of squared differences between corresponding points in two time series of equal length. Simple but sensitive to time shifts and distortions.
- Pearson Correlation Distance Measures how similar the shapes of two time series are by computing 1 minus the Pearson correlation coefficient. Useful in financial time series and sensor data analysis.
- Fréchet Distance Considers both the location and order of points, making it more robust than Euclidean distance for trajectory analysis and movement comparison.
- Longest Common Subsequence (LCSS) Identifies the longest matching subsequence between two time series while allowing gaps. Effective for pattern recognition in noisy or incomplete data.
Note
These distance measures are widely used in time series classification, similarity search, and predictive analytics.
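Dynamic Time Warping reduces to a short dynamic program. The following is an illustrative implementation with an absolute-difference point cost, not the package's optimized version:
# DTW: cost[i][j] holds the best alignment cost of s[:i] and t[:j]
def dtw(s, t):
    n, m = len(s), len(t)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # shift in s
                                 cost[i][j - 1],      # shift in t
                                 cost[i - 1][j - 1])  # match both
    return cost[n][m]

print(dtw([0, 1, 2, 3], [0, 0, 1, 2, 3]))  # 0.0: identical up to warping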
Loss functions are widely used in machine learning, deep learning, and optimization to quantify the difference between predicted and actual values. Below are five of the most commonly used:
- Mean Squared Error (MSE) Computes the average squared difference between predicted and actual values. Sensitive to large errors, making it effective for regression tasks where large deviations need penalization.
- Mean Absolute Error (MAE) Calculates the average of absolute differences between predicted and actual values. Unlike MSE, it treats all errors equally and is more robust to outliers.
- Huber Loss Combines MSE and MAE by using a quadratic loss for small errors and a linear loss for large errors. Used in robust regression to handle outliers.
- Kullback-Leibler (KL) Divergence Measures the difference between two probability distributions. Essential in variational inference, deep learning, and information theory.
- Cross-Entropy Loss Used in classification tasks, it quantifies the difference between two probability distributions, typically between true labels and predicted probabilities. Crucial in neural networks and logistic regression.
Note
These loss functions are fundamental in supervised learning, deep neural networks, and statistical modeling.
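The first three losses are easy to state directly in code. These are plain reference implementations under the usual definitions (Huber with threshold delta):
# Reference implementations of MSE, MAE, and Huber loss
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for residuals up to delta, linear beyond it
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y_true)

y_true, y_pred = [1.0, 2.0, 3.0], [1.5, 2.0, 5.0]
print(mse(y_true, y_pred))    # ≈ 1.417
print(mae(y_true, y_pred))    # ≈ 0.833
print(huber(y_true, y_pred))  # ≈ 0.542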
Distance measures between graphs are crucial in network analysis, bioinformatics, computer vision, and graph-based machine learning. Below are five of the most commonly used:
- Graph Edit Distance (GED) Computes the minimum number of edit operations (node/edge insertions, deletions, or substitutions) required to transform one graph into another. Used in pattern recognition and structural comparison.
- Wasserstein Distance (Gromov-Wasserstein) Measures the optimal transport cost between two graph structures by aligning their node distributions. Widely applied in graph matching and deep learning on graphs.
- Spectral Distance Compares the eigenvalues of graph Laplacians or adjacency matrices to quantify structural differences. Effective for comparing network topology and community structures.
- Jaccard Graph Similarity Computes the ratio of common edges to total edges between two graphs. Useful in social network analysis and recommendation systems.
- Random Walk Betweenness Measures node centrality based on random walk processes; comparing these centrality profiles offers one way to quantify structural similarity between graphs.
Note
These distance measures are widely used in graph classification, anomaly detection, and network embedding.
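As a small illustration, the Jaccard similarity between two graphs viewed as undirected edge sets can be sketched as follows (a simplified reference version that ignores node attributes and edge weights):
# Jaccard distance between two graphs represented as edge lists
def jaccard_graph_distance(edges1, edges2):
    # frozensets make (u, v) and (v, u) compare equal
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    return 1 - len(e1 & e2) / len(e1 | e2)

g1 = [("a", "b"), ("b", "c"), ("c", "a")]
g2 = [("a", "b"), ("b", "c"), ("c", "d")]
print(jaccard_graph_distance(g1, g2))  # 1 - 2/4 = 0.5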
Distance measures between Markov chains are essential in stochastic processes, reinforcement learning, and model comparison. Below are five of the most commonly used:
- Kullback-Leibler (KL) Divergence Measures how one probability distribution differs from another. In Markov chains, it quantifies the difference between stationary distributions. Used in model selection and statistical inference.
- Total Variation Distance Computes the largest possible difference between the probabilities assigned by two Markov chains. It is useful in bounding convergence rates and stability analysis.
- Wasserstein Distance Also known as the Earth Mover’s Distance, it measures the minimal cost of transforming one stationary distribution into another. Applied in optimal transport and generative modeling.
- Jensen-Shannon Divergence A symmetrized and smoothed version of KL divergence, often used to compare Markov processes. Frequently applied in text clustering and reinforcement learning.
- Hellinger Distance Measures the similarity between two probability distributions, particularly useful when comparing transition matrices or steady-state distributions.
Note
These distance measures are widely used in hidden Markov models (HMMs), reinforcement learning, and stochastic modeling.
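For example, the total variation distance between two stationary distributions is half the L1 distance between the probability vectors. A minimal sketch, with made-up distribution values:
# Total variation distance between two discrete probability distributions
def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

p = [0.5, 0.3, 0.2]  # stationary distribution of chain 1 (illustrative values)
q = [0.4, 0.4, 0.2]  # stationary distribution of chain 2
print(total_variation(p, q))  # 0.1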
Distance measures between images are crucial in computer vision, image retrieval, and deep learning. Below are five of the most commonly used:
- Mean Squared Error (MSE) Computes the average squared difference between corresponding pixel values of two images. Simple but sensitive to intensity variations and noise.
- Structural Similarity Index (SSIM) Measures the perceptual similarity between two images by considering luminance, contrast, and structure. Widely used in image quality assessment.
- Peak Signal-to-Noise Ratio (PSNR) Evaluates the ratio between the maximum possible pixel value and the mean squared error. Commonly used in image compression and denoising.
- Earth Mover’s Distance (Wasserstein Distance) Computes the minimal cost of transforming one image histogram into another. Used in image retrieval and generative modeling.
- Feature-Based Distance (SIFT, ORB, or Deep Learning Embeddings) Compares high-level feature representations extracted from images, often using deep learning models. Effective in image recognition and object detection.
Note
These distance measures are widely applied in image classification, object detection, and content-based image retrieval (CBIR).
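MSE and PSNR are straightforward to sketch with numpy. This assumes 8-bit grayscale images (maximum pixel value 255) and is a reference version, not the package's own code:
import numpy as np

# Mean squared error between two images of the same shape
def mse(img1, img2):
    return float(np.mean((img1.astype(np.float64) - img2.astype(np.float64)) ** 2))

# PSNR in decibels: higher means the images are closer
def psnr(img1, img2, max_value=255.0):
    m = mse(img1, img2)
    return float("inf") if m == 0 else 10 * np.log10(max_value ** 2 / m)

img1 = np.zeros((4, 4), dtype=np.uint8)
img2 = img1.copy()
img2[0, 0] = 16  # change a single pixel
print(mse(img1, img2))   # 16.0
print(psnr(img1, img2))  # ≈ 36.1 dB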
Distance measures between audio signals are crucial in speech recognition, music analysis, and sound classification. Below are five of the most commonly used:
- Dynamic Time Warping (DTW) Measures the similarity between two time-series signals by allowing non-linear time distortions. Used in speech recognition and audio matching.
- Mel-Frequency Cepstral Coefficient (MFCC) Distance Computes the Euclidean or cosine distance between MFCC feature vectors, capturing perceptual characteristics of sound. Widely applied in voice recognition and speaker identification.
- Cross-Correlation Distance Measures the alignment between two audio signals by computing their cross-correlation. Useful for audio synchronization and time-delay estimation.
- Spectral Distance (KL Divergence on Spectrograms) Compares spectrograms or power spectra of two signals using Kullback-Leibler divergence. Applied in music genre classification and environmental sound analysis.
- Perceptual Evaluation of Speech Quality (PESQ) Score Quantifies the perceptual difference between two speech signals, often used for speech enhancement and telecommunication quality assessment.
Note
These distance measures are widely used in sound classification, music similarity analysis, and audio fingerprinting.
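As an example of the cross-correlation idea, the time offset between two signals can be estimated as the lag at which their cross-correlation peaks. A short numpy sketch with a synthetic pulse:
import numpy as np

# Lag (in samples) at which the full cross-correlation of x and y peaks
def best_lag(x, y):
    corr = np.correlate(x, y, mode="full")
    return int(np.argmax(corr)) - (len(y) - 1)

x = np.array([0.0, 0.0, 1.0, 0.5, 0.0])
y = np.array([1.0, 0.5, 0.0, 0.0, 0.0])  # same pulse, two samples earlier
print(best_lag(x, y))  # 2: the pulse in x arrives two samples after y's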
Distance measures between files are essential in data deduplication, plagiarism detection, and digital forensics. Below are five of the most commonly used:
- Hash-Based Distance (Hamming Distance on Hashes) Compares hash values (e.g., MD5, SHA-256) of two files and counts the number of differing bits. Used in integrity verification and duplicate detection.
- Byte-Level Edit Distance (Levenshtein Distance) Measures the number of insertions, deletions, or substitutions required to transform one file’s binary content into another. Useful for binary diffing and file versioning.
- Jaccard Similarity on Shingled Content Splits files into overlapping chunks (shingles) and compares their sets to determine similarity. Common in plagiarism detection and near-duplicate file detection.
- Kolmogorov Complexity-Based Distance Approximates the minimum amount of information needed to transform one file into another, often using compression-based methods. Applied in data compression and anomaly detection.
- Structural Distance (Tree Edit Distance for XML/JSON Files) Measures differences in hierarchical file structures by computing edit distances on tree representations. Used in configuration file comparison and web scraping.
Note
These distance measures are widely used in file integrity checks, malware detection, and version control systems.
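The hash-based approach is easy to sketch with the standard library. Note that for a cryptographic hash such as SHA-256, equal digests mean identical content while any change flips roughly half the bits, so this is an integrity/duplicate check rather than a graded similarity (similarity-preserving hashes are used for the latter):
import hashlib

# SHA-256 digest of a file's raw bytes
def file_digest(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).digest()

# Number of differing bits between two equal-length byte strings
def hamming_bits(d1, d2):
    return sum(bin(a ^ b).count("1") for a, b in zip(d1, d2))

# Example usage (paths are illustrative):
# hamming_bits(file_digest("a.bin"), file_digest("b.bin"))  # 0 => identical content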
And many more...
The distancia package offers a comprehensive set of tools for computing and analyzing distances and similarities between data points. This package is particularly useful for tasks in data analysis, machine learning, and pattern recognition. Below is an overview of the key classes included in the package, each designed to address specific types of distance or similarity calculations.
Purpose: Facilitates batch processing of distance computations, enabling users to compute distances for large sets of pairs in a single operation.
Use Case: Essential in real-time systems or when working with large datasets where efficiency is critical. Batch processing saves time and computational resources by handling multiple distance computations in one go.
Purpose: Provides tools for benchmarking the performance of various distance metrics on different types of data.
Use Case: Useful in performance-sensitive applications where choosing the optimal metric can greatly impact computational efficiency and accuracy. This class helps users make informed decisions about which distance metric to use for their specific task.
Purpose: Allows users to define custom distance functions by specifying a mathematical formula or providing a custom Python function.
Use Case: Useful for researchers or practitioners who need a specific metric that isn’t commonly used or already implemented.
Purpose: Automatically generates a distance matrix for a set of data points using a specified distance metric.
Use Case: Useful in clustering algorithms like k-means, hierarchical clustering, or in generating heatmaps for visualizing similarity/dissimilarity in datasets.
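For instance, a small distance matrix can be built by hand with the Euclidean class from the quickstart; the dedicated class automates exactly this kind of loop for a chosen metric:
from distancia import Euclidean

points = [[0, 0], [3, 4], [6, 8]]
euclidean = Euclidean()
matrix = [[euclidean.compute(p, q) for q in points] for p in points]
for row in matrix:
    print(row)  # [0.0, 5.0, 10.0], [5.0, 0.0, 5.0], [10.0, 5.0, 0.0]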
Purpose: Implements algorithms for learning an optimal distance metric from data based on a specific task, such as classification or clustering.
Use Case: Critical in machine learning tasks where the goal is to optimize a distance metric for maximum task-specific performance, improving the accuracy of models.
Purpose: Enables seamless integration of distance computations with popular data science libraries like pandas, scikit-learn, and numpy.
Use Case: This class enhances the usability of the distancia package, allowing users to incorporate distance calculations directly into their existing data analysis workflows.
Purpose: Identifies the most appropriate distance metric for two given data points based on their structure.
Use Case: When dealing with varied types of data, this class helps users automatically determine the best distance metric to apply, ensuring the chosen metric suits the data's characteristics.
Purpose: Implements methods for detecting outliers in datasets by using distance metrics to identify points that deviate significantly from others.
Use Case: Essential in fields such as fraud detection, quality control, and data cleaning, where identifying and managing outliers is crucial for maintaining data integrity.
Purpose: Adds support for parallel or distributed computation of distances, particularly useful for large datasets.
Use Case: In big data scenarios, calculating distances between millions of data points can be computationally expensive. This class significantly reduces computation time by parallelizing these calculations across multiple processors or machines.
Purpose: Provides tools for visualizing distance matrices, dendrograms (for hierarchical clustering), and 2D/3D representations of data points based on distance metrics.
Use Case: Visualization is a powerful tool in exploratory data analysis (EDA), helping users understand the relationships between data points. This class is particularly useful for creating visual aids like heatmaps or dendrograms to better interpret the data.
The APICompatibility class in the distancia package bridges the gap between powerful distance computation tools and modern API-based architectures. By enabling the creation of REST endpoints for distance metrics, it facilitates the integration of distancia into a wide range of applications, from web services to distributed computing environments. This not only enhances the usability of the package but also ensures that it can be effectively deployed in real-world, production-grade systems.
The AutomatedDistanceMetricSelection feature in the distancia package represents a significant advancement in the ease of use and accessibility of distance metric selection. By automating the process of metric recommendation, it helps users, especially those less familiar with the intricacies of different metrics, to achieve better results in their analyses. This feature not only saves time but also improves the accuracy of data-driven decisions, making distancia a more powerful and user-friendly tool for the data science community.
The ReportingAndDocumentation class is a powerful tool for automating the analysis and documentation of distance metrics. By integrating report generation, matrix export, and property documentation, it provides users with a streamlined way to evaluate and present the results of their distance-based models. This class is especially valuable for machine learning practitioners who require a deeper understanding of the behavior of the metrics they employ.
The AdvancedAnalysis class provides essential tools for evaluating the performance, robustness, and sensitivity of distance metrics. These advanced analyses ensure that a metric is not only theoretically sound but also practical and reliable in diverse applications. By offering deep insights into the behavior of distance metrics under perturbations, noise, and dataset divisions, this class is crucial for building resilient models in real-world environments.
The DimensionalityReductionAndScaling class offers powerful methods for simplifying and scaling datasets. By providing tools for dimensionality reduction such as Multi-Dimensional Scaling (MDS), it allows users to project high-dimensional data into lower dimensions while retaining its key characteristics.
The ComparisonAndValidation class offers tools to analyze and validate the performance of a distance or similarity metric by comparing it with other metrics and using established benchmarks. This class is essential for evaluating the effectiveness of a metric in various tasks, such as clustering, classification, or retrieval. By providing cross-validation techniques and benchmarking methods, it allows users to gain a deeper understanding of the metric's strengths and weaknesses.
The StatisticalAnalysis class provides essential tools to analyze and interpret the statistical properties of distances or similarities within a dataset. Through the computation of means, variances, and distance distributions, it gives users insight into the overall spread and structure of the distances in their data.
We welcome contributions! If you would like to contribute to Distancia, please read the contributing guide to get started. We appreciate your help in making this project better.
The Distancia package offers a versatile toolkit for handling a wide range of distance and similarity calculations. Whether you're working with numeric data, categorical data, strings, or time series, the package's classes provide the necessary tools to accurately measure distances and similarities. By understanding and utilizing these classes, you can enhance your data analysis workflows and improve the performance of your machine learning models.