Title: | Partitioned Local Depth for Community Structure in Data |
---|---|
Description: | Implementation of the Partitioned Local Depth (PaLD) approach which provides a measure of local depth and the cohesion of a point to another which (together with a universal threshold for distinguishing strong and weak ties) may be used to reveal local and global structure in data, based on methods described in Berenhaut, Moore, and Melvin (2022) <doi:10.1073/pnas.2003634119>. No extraneous inputs, distributional assumptions, iterative procedures nor optimization criteria are employed. This package includes functions for computing local depths and cohesion as well as flexible functions for plotting community networks and displays of cohesion against distance. |
Authors: | Katherine Moore [aut], Kenneth Berenhaut [aut], Lucy D'Agostino McGowan [aut, cre] |
Maintainer: | Lucy D'Agostino McGowan <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.4 |
Built: | 2024-11-14 03:59:00 UTC |
Source: | https://github.com/lucymcgowan/pald |
A synthetic data set of two-dimensional points created by Gionis et al. to demonstrate clustering aggregation.
aggregation
A data frame with 788 rows and 2 columns, x1 and x2.
A. Gionis, H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.
Checks for isolated points.
any_isolated(c)
c | A cohesion_matrix object, a matrix of cohesion values (see cohesion_matrix). |
Logical, indicating whether any points are isolated.
d <- data.frame(
  x1 = c(1, 2, 3, 6),
  x2 = c(2, 1, 3, 10)
)
D <- dist(d)
C <- cohesion_matrix(D)
any_isolated(C)
as_cohesion_matrix() converts an existing matrix into an object of class cohesion_matrix.
as_cohesion_matrix(c)
c | A matrix of cohesion values (see cohesion_matrix). |
Object of class cohesion_matrix
C <- matrix(
  c(0.25, 0.125, 0.125, 0,
    0.125, 0.25, 0, 0.125,
    0.125, 0, 0.25, 0.125,
    0, 0.125, 0.125, 0.25),
  nrow = 4, byrow = TRUE
)
class(C)
C <- as_cohesion_matrix(C)
class(C)
A dist object describing distances between 87 Indo-European languages from the perspective of cognates.
cognate_dist
A dist object for 87 Indo-European languages.
Cognate relationships for a collection of essential words were taken from Dyen et al. and encoded in an 87x2665 binary matrix, from which this distance matrix was derived (using Euclidean distance).
I. Dyen, J. B. Kruskal, P. Black, An Indoeuropean classification: A lexicostatistical experiment. Trans. Am. Phil. Soc. 82, iii-132 (1992).
Creates a matrix of (pairwise) cohesion values from a matrix of pairwise distances or a dist object.
cohesion_matrix(d)
d | A matrix of pairwise distances or a dist object. |
Computes the matrix of (pairwise) cohesion values, C_xw, from a matrix of pairwise distances or a dist object. Cohesion is an interpretable probability that reflects the strength of alignment of a point, w, to another point, x.
The rows of the cohesion matrix can be seen as providing neighborhood
weights. These values may be used for defining associated weighted graphs
(for the purpose of community analysis) as in Berenhaut, Moore, and
Melvin (2022).
Given an n x n distance matrix, the sum of the entries in the resulting
cohesion matrix is always equal to n/2.
Cohesion is partitioned local depth (see local_depths) and thus the row sums of the cohesion matrix provide a measure of local depth centrality.
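The n/2 property can be checked with a short base-R sketch. It assumes the definition of cohesion in Berenhaut, Moore, and Melvin (2022): for each pair (x, y), the local focus is the set of points z with d(z, x) <= d(x, y) or d(z, y) <= d(x, y), every focus point is equally likely, and ties in distance are split evenly. The name cohesion_sketch is hypothetical and this is not the package's implementation; use cohesion_matrix() in practice.

```r
# Hypothetical base-R sketch of the cohesion definition (not the package code).
cohesion_sketch <- function(D) {
  D <- as.matrix(D)                 # full symmetric distance matrix
  n <- nrow(D)
  C <- matrix(0, n, n)
  for (x in seq_len(n)) {
    for (y in seq_len(n)) {
      if (x == y) next
      # local focus of the pair (x, y): points at least as close to x or y
      # as x and y are to each other
      in_focus <- which(D[, x] <= D[x, y] | D[, y] <= D[x, y])
      closer_x <- D[in_focus, x] < D[in_focus, y]
      tied <- D[in_focus, x] == D[in_focus, y]
      # each focus point closer to x supports x; ties count one half
      C[x, in_focus] <- C[x, in_focus] +
        (closer_x + 0.5 * tied) / length(in_focus)
    }
  }
  C / (n - 1)                       # average over the n - 1 choices of y
}

pts <- data.frame(x1 = c(1, 2, 3, 6), x2 = c(2, 1, 3, 10))
C_sketch <- cohesion_sketch(dist(pts))
sum(C_sketch)                       # compare with n / 2
all.equal(sum(C_sketch), nrow(C_sketch) / 2)
```

Each unordered pair contributes a total weight of one before the division by n - 1, which is why the entries sum to n/2 regardless of the data.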
If you have a matrix that is already a cohesion matrix and you would like to add the class, see as_cohesion_matrix().
The matrix of cohesion values. An object of class cohesion_matrix.
K. S. Berenhaut, K. E. Moore, R. L. Melvin, A social perspective on perceived distances reveals deep community structure. Proc. Natl. Acad. Sci., 119(4), 2022.
plot(exdata1)
text(exdata1 + .08, lab = 1:8)

D <- dist(exdata1)
C <- cohesion_matrix(D)
C

## neighbor weights (provided by cohesion) for the 8th point in exdata1
C[8, ]

localdepths <- rowSums(C)
Provides the symmetrized and thresholded matrix of cohesion values.
cohesion_strong(c, symmetric = TRUE)
c | A cohesion_matrix object, a matrix of cohesion values (see cohesion_matrix). |
symmetric | Logical. Whether the returned matrix should be made symmetric (using the minimum); the default is TRUE. |
The threshold is that provided by strong_threshold (and is equal to half of the average of the diagonal of c).
Values of the cohesion matrix which are less than the threshold are set to zero.
The symmetrization, if desired, is computed using the entry-wise (parallel) minimum of C and its transpose (i.e., min(C_ij, C_ji)).
The matrix provided by cohesion_strong (with default symmetric = TRUE) is the adjacency matrix for the graph of strong ties (the cluster graph); see community_graphs and pald.
The symmetrized cohesion matrix in which all entries corresponding to weak ties are set to zero.
C <- cohesion_matrix(dist(exdata2))
strong_threshold(C)
cohesion_strong(C)

## To illustrate the calculation performed
C_strong <- C
## C_strong is equal to cohesion_strong(C, symmetric = FALSE)
C_strong[C < strong_threshold(C)] <- 0
## C_strong_sym is equal to cohesion_strong(C)
C_strong_sym <- pmin(C_strong, t(C_strong))

## The (cluster) graph whose adjacency matrix, CS,
## is the matrix of strong ties
CS <- cohesion_strong(C)

if (requireNamespace("igraph", quietly = TRUE)) {
  ## graph_from_adjacency_matrix() replaces the deprecated graph.adjacency()
  G_strong <- igraph::simplify(
    igraph::graph_from_adjacency_matrix(CS, weighted = TRUE, mode = "undirected")
  )
  plot(G_strong)
}
Provides the community cluster label for each point.
community_clusters(c)
c | A cohesion_matrix object, a matrix of cohesion values (see cohesion_matrix). |
A data frame with two columns:
point: The points from cohesion matrix c
community: The community cluster labels
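A minimal sketch of how such labels could arise, assuming (as in the description of pald) that communities are the connected components of the graph of strong ties. The matrix S holds hypothetical values standing in for a thresholded, symmetrized cohesion matrix, and label_components is a hypothetical helper; the package's own implementation and label ordering may differ.

```r
# Hypothetical strong-tie adjacency matrix: points 1-2 and 3-4 are tied.
S <- matrix(
  c(0.3, 0.2, 0,   0,
    0.2, 0.3, 0,   0,
    0,   0,   0.3, 0.2,
    0,   0,   0.2, 0.3),
  nrow = 4, byrow = TRUE
)

# Label connected components with a breadth-first search (base R only).
label_components <- function(S) {
  n <- nrow(S)
  labels <- rep(NA_integer_, n)
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(labels[start])) next   # already assigned to a component
    current <- current + 1L
    labels[start] <- current
    queue <- start
    while (length(queue) > 0) {
      v <- queue[1]
      queue <- queue[-1]
      # unlabeled points sharing a strong tie with v
      nbrs <- which(S[v, ] > 0 & is.na(labels))
      labels[nbrs] <- current
      queue <- c(queue, nbrs)
    }
  }
  labels
}

data.frame(point = seq_len(nrow(S)), community = label_components(S))
```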
D <- dist(exdata2)
C <- cohesion_matrix(D)
community_clusters(C)
Provides the graphs whose edge weights are (mutual) cohesion, together with a graph layout.
community_graphs(c)
c | A cohesion_matrix object, a matrix of cohesion values (see cohesion_matrix). |
Constructs the graphs whose edge weights are (mutual) cohesion (see cohesion_matrix); self-loops are removed.
The graph G has adjacency matrix equal to the symmetrized cohesion matrix (using the entry-wise parallel minimum of C and its transpose).
The graph G_strong has adjacency matrix equal to the thresholded and symmetrized cohesion matrix (see cohesion_strong). The threshold is equal to half of the average of the diagonal of the cohesion matrix (see strong_threshold).
A layout is also computed using the Fruchterman-Reingold (FR) force-directed graph drawing algorithm. As a result, it may provide a somewhat different layout each time it is run.
A list consisting of:
G: the weighted (community) graph whose edge weights are mutual cohesion
G_strong: the weighted (community) graph consisting of edges for which mutual cohesion is greater than the threshold for strong ties (see strong_threshold)
layout: the layout, using the Fruchterman-Reingold (FR) force-directed graph drawing for the graph G
C <- cohesion_matrix(dist(exdata2))
plot(community_graphs(C)$G_strong)
plot(community_graphs(C)$G_strong, layout = community_graphs(C)$layout)
Pairwise dissimilarities are given by the cultural fixation index obtained from World Values Survey responses.
cultures
A 59x59 matrix of dissimilarities.
M. Muthukrishna, et al., Beyond western, educated, industrial, rich, and democratic (WEIRD) psychology: measuring and mapping scales of cultural and psychological distance. Psychol. Sci. 1, 24 (2020).
R. Inglehart et al, World Values Survey: All Rounds-Country-Pooled Datafile 1981-2014, (JD Systems Institute, Madrid 2014).
Provides a plot of cohesion against distance, with the threshold indicated by a horizontal line.
dist_cohesion_plot( d, mutual = FALSE, xlim_max = NULL, cex = 1, colors = NULL, weak_gray = FALSE )
d | A matrix of pairwise distances or a dist object. |
mutual | Set to TRUE to consider only mutual cohesion (i.e., cohesion made symmetric via the minimum); the default is FALSE. |
xlim_max |
If desired, set the maximum value of distance which is displayed on the x-axis. |
cex |
Factor by which points should be scaled relative to the default. |
colors |
A vector of color names, if none is given a default is provided. |
weak_gray |
Set to |
The plot of cohesion against distance provides a visualization for the manner in which distance is transformed. The threshold distinguishing strong and weak ties is indicated by a horizontal line.
When there are separated regions with different density, one can often observe vertical bands of color; see the example below and Berenhaut, Moore, and Melvin (2022). For each distance pair in d, the corresponding value of cohesion is computed. If the pair is within a single cluster, the point is colored (with the same color provided by the pald and plot_community_graphs functions). Weak ties appear below the threshold.
Note that cohesion is not symmetric, and so all n^2 points are plotted. A gray point above the threshold corresponds to a pair in which the value of cohesion is greater than the threshold in only one direction. If one only wants to observe mutual cohesion (i.e., cohesion made symmetric via the minimum), set mutual = TRUE.
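The distinction between one-directional and mutual strong cohesion can be sketched in base R. The 3 x 3 matrix below holds hypothetical values chosen for illustration (it was not produced by cohesion_matrix()); the threshold is computed as half the average of the diagonal, as described for strong_threshold.

```r
# Hypothetical asymmetric cohesion values (rows sum to 1/2).
C <- matrix(
  c(0.25,   0.1875, 0.0625,
    0.1875, 0.25,   0.0625,
    0.1875, 0.0625, 0.25),
  nrow = 3, byrow = TRUE
)

threshold <- mean(diag(C)) / 2      # half the average of the diagonal

# Off-diagonal entries above the threshold in the stated direction
strong <- C > threshold
diag(strong) <- FALSE

# Strong in one direction only: these pairs would sit above the line
# in the default plot but drop below it once mutual cohesion is used
one_way <- strong & !t(strong)
which(one_way, arr.ind = TRUE)

# Mutual cohesion: entry-wise minimum of C and its transpose
mutual <- pmin(C, t(C))
mutual > threshold
```

Here the pair (3, 1) is strong only from point 3 toward point 1, while points 1 and 2 are strongly tied in both directions.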
A plot of cohesion against distance with threshold indicated by a horizontal line.
D <- dist(exdata2)
dist_cohesion_plot(D)
dist_cohesion_plot(D, mutual = TRUE)

C <- cohesion_matrix(D)
threshold <- strong_threshold(C) # the horizontal line
dist_cohesion_plot(D, mutual = TRUE, weak_gray = TRUE)
A data set consisting of 8 points (in 2-dimensional Euclidean space) to provide a simple illustrative example. This data is displayed in Figure 1 in Berenhaut, Moore, and Melvin (2022).
exdata1
A data frame with 8 rows and 2 columns, x1 and x2.
K. S. Berenhaut, K. E. Moore, R. L. Melvin, A social perspective on perceived distances reveals deep community structure. Proc. Natl. Acad. Sci., 119(4), 2022.
A data set consisting of 16 points (in 2-dimensional Euclidean space) to provide an illustrative example. This data is displayed in Figure 2 in Berenhaut, Moore, and Melvin (2022).
exdata2
A data frame with 16 rows and 2 columns, x1 and x2.
K. S. Berenhaut, K. E. Moore, R. L. Melvin, A social perspective on perceived distances reveals deep community structure. Proc. Natl. Acad. Sci., 119(4), 2022.
A data set consisting of 240 points (in 2-dimensional Euclidean space) to provide an illustrative example. Points were generated from bivariate normal distributions with varying mean and variance (with covariance matrix cI). This data is displayed in Figure 4D in Berenhaut, Moore, and Melvin (2022).
exdata3
A data frame with 240 rows and 2 columns, x1 and x2.
K. S. Berenhaut, K. E. Moore, R. L. Melvin, A social perspective on perceived distances reveals deep community structure. Proc. Natl. Acad. Sci., 119(4), 2022.
Creates a vector of local depths from a matrix of distances (or dist object).
local_depths(d)
d | A matrix of pairwise distances, a dist object, or a cohesion_matrix object. |
Local depth is an interpretable probability which reflects aspects of relative position and centrality via distance comparisons (i.e., d(z, x) < d(z, y)).
The average of the local depth values is always 1/2. Cohesion is partitioned local depth (see cohesion_matrix); the row sums of the cohesion matrix are the values of local depth.
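As a small base-R illustration of these two facts, the sketch below reuses the hand-specified matrix from the as_cohesion_matrix() example, treated here as a cohesion matrix purely for illustration. No package functions are called.

```r
# Hand-specified 4 x 4 matrix from the as_cohesion_matrix() example
C <- matrix(
  c(0.25, 0.125, 0.125, 0,
    0.125, 0.25, 0, 0.125,
    0.125, 0, 0.25, 0.125,
    0, 0.125, 0.125, 0.25),
  nrow = 4, byrow = TRUE
)

depths <- rowSums(C)   # local depth of each point: 0.5 for every row here
depths
mean(depths)           # average local depth: 0.5
```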
A vector of local depths.
D <- dist(exdata1)
local_depths(D)
C <- cohesion_matrix(D)
local_depths(C)

## local depths are the row sums of the cohesion matrix
rowSums(C)

## cognate distance data
ld_lang <- sort(local_depths(cognate_dist))
Noisy circles data generated from scikit-learn
noisy_circles
A data frame with 500 rows and 2 columns, x1 and x2.
https://scikit-learn.org/stable/modules/clustering.html#clustering
Noisy moons data generated from scikit-learn
noisy_moons
A data frame with 500 rows and 2 columns, x1 and x2.
https://scikit-learn.org/stable/modules/clustering.html#clustering
A wrapper function which computes the cohesion matrix, local depths, community graphs and provides a plot of the community graphs with connected components of the graph of strong ties colored by connected component.
pald( d, show_plot = TRUE, show_labels = TRUE, only_strong = FALSE, emph_strong = 2, edge_width_factor = 50, colors = NULL, layout = NULL, ... )
d | A matrix of pairwise distances or a dist object. |
show_plot |
Set to |
show_labels |
Set to |
only_strong |
Set to |
emph_strong | Numeric. The numeric factor by which the edge widths of strong ties are emphasized in the display; the default is 2. |
edge_width_factor | Numeric. Modify to change displayed edge widths. Default: 50. |
colors |
A vector of display colors, if none is given a default list (of length 24) is provided. |
layout |
A layout for the graph. If none is specified, the Fruchterman-Reingold (FR) graph drawing algorithm is used. |
... |
Optional parameters to pass to the
|
This function re-computes the cohesion matrix each time it is run. To avoid unnecessary computation when creating visualizations, use the function cohesion_matrix to compute the cohesion matrix, which may then be taken as input for local_depths, strong_threshold, cohesion_strong, community_graphs, and plot_community_graphs. For further details regarding each component, see the documentation for each of the above functions.
A list consisting of:
C: the matrix of cohesion values
local_depths: a vector of local depths
clusters: a vector of (community) cluster labels
threshold: the threshold above which cohesion is considered particularly strong
C_strong: the thresholded matrix of cohesion values
G: the graph whose edge weights are mutual cohesion
G_strong: the weighted graph whose edges are those for which cohesion is particularly strong
layout: an FR force-directed layout associated with G
K. S. Berenhaut, K. E. Moore, R. L. Melvin, A social perspective on perceived distances reveals deep community structure. Proc. Natl. Acad. Sci., 119(4), 2022.
D <- dist(exdata2)
pald_results <- pald(D)
pald_results$local_depths
pald(D, layout = as.matrix(exdata2), show_labels = FALSE)

C <- cohesion_matrix(D)
local_depths(C)
plot_community_graphs(C, layout = as.matrix(exdata2), show_labels = FALSE)

pald_languages <- pald(cognate_dist)
head(pald_languages$local_depths)
A vector of colors to use if comparing other clustering methods. These are the default colors used in the plotting functions.
pald_colors
A vector of 24 colors.
Provides a plot of the community graphs, with connected components of the graph of strong ties colored by connected component.
plot_community_graphs( c, show_labels = TRUE, only_strong = FALSE, emph_strong = 2, edge_width_factor = 50, colors = NULL, ... )
c | A cohesion_matrix object, a matrix of cohesion values (see cohesion_matrix). |
show_labels |
Set to |
only_strong |
Set to |
emph_strong | Numeric. The numeric factor by which the edge widths of strong ties are emphasized in the display; the default is 2. |
edge_width_factor | Numeric. Modify to change displayed edge widths. Default: 50. |
colors |
A vector of display colors, if none is given a default list (of length 24) is provided. |
... |
Optional parameters to pass to the
|
Plots the community graph, G, with the sub-graph of strong ties emphasized and colored by connected component. If no layout is provided, the Fruchterman-Reingold (FR) graph drawing algorithm is used.
Note that the FR graph drawing algorithm may provide a somewhat different layout each time it is run. You can also access and save a given graph layout using community_graphs(C)$layout.
The example below shows how to display only a subset of vertex labels.
Note that the parameter emph_strong is for visualization purposes only and does not influence the network layout.
A plot of the community graphs.
C <- cohesion_matrix(dist(exdata1))
plot_community_graphs(C, emph_strong = 1, layout = as.matrix(exdata1))
plot_community_graphs(C, only_strong = TRUE)

C2 <- cohesion_matrix(cognate_dist)
subset_lang_names <- rownames(C2)
subset_lang_names[sample(1:87, 60)] <- ""
plot_community_graphs(C2, vertex.label = subset_lang_names, vertex.size = 3)
Given a cohesion matrix, provides the value of the threshold above which values of cohesion are considered "particularly strong".
strong_threshold(c)
c | A cohesion_matrix object, a matrix of cohesion values (see cohesion_matrix). |
The threshold considered in Berenhaut, Moore, and Melvin (2022), which may be used for distinguishing between strong and weak ties. It is equal to half the average of the diagonal of the cohesion matrix.
The value of the threshold.
K. S. Berenhaut, K. E. Moore, R. L. Melvin, A social perspective on perceived distances reveals deep community structure. Proc. Natl. Acad. Sci., 119(4), 2022.
C <- cohesion_matrix(dist(exdata1))
strong_threshold(C)
mean(diag(C)) / 2

## points whose cohesion is greater than the threshold may be considered
## (strong) neighbors
which(C[3, ] > strong_threshold(C))

## note that the number of (strongly-cohesive) neighbors varies across the
## space
which(C[4, ] > strong_threshold(C))

C[4, c(2, 3, 4, 6)] # cohesion values can provide neighbor weights
A dist object describing distances from a subset of tissue gene expression data from the following papers:
http://www.ncbi.nlm.nih.gov/pubmed/17906632
http://www.ncbi.nlm.nih.gov/pubmed/21177656
http://www.ncbi.nlm.nih.gov/pubmed/24271388
The data were obtained from the tissuesGeneExpression Bioconductor package.
tissue_dist
A dist object of 189 tissue types.
The original data frame had 189 rows, each with a corresponding tissue, such as colon, kidney or cerebellum. There were 22,215 columns corresponding to gene expression data from each of these rows. This was then converted into a distance matrix.
M. Love and R. Irizarry. tissuesGeneExpression. Bioconductor Package.