To prepare for the vignette, we load the necessary R packages and associated datasets.
# load R packages (both on CRAN)
library(ICON)
library(network) # for network analysis
library(ggplot2) # for network visualization
library(ggnetwork) # for network visualization
# load ICON dataset
data(ICON_data)
The ICON R package provides complex networks in edge list format that can be easily incorporated into network analysis pipelines. The package is based on the Index of COmplex Networks (ICON) website that curates complex networks and corresponding summary information (e.g. source, number of nodes & edges, discipline). Many publications cite the ICON website, however, since the complex networks in the database exist in distinct formats, the authors of each publication likely had to standardize a large set of networks. The ICON R package aims to decrease redundant scut work (data formatting) by providing a large number of complex networks from the ICON website in a standard format (network format = edge list; file format = binary R Data with extension .rda). Currently, the ICON R package provides 1075 complex networks. As can be reasonably inferred by reading this paragraph, referring to the package as “the ICON R package” can become quite tedious with multiple mentions; as such, the R package will be referred to simply as “ICON” for the remainder of this vignette and the website will be referred to as “the ICON website”.
To learn more about ICON, you can visit the associated website, GitHub repo, and CRAN page.
The ICON_data
dataset (loaded in the Setup section) provides a summary
of all the datasets that are available through ICON. Using head
, we
can take a look at the first 6.
head(ICON_data)
#> Var_name Edges Directed Weighted
#> 1 aishihik_intensity 78 FALSE TRUE
#> 2 aishihik_prevalence 78 FALSE TRUE
#> 3 amazon_copurchase 3387388 TRUE FALSE
#> 4 arxiv_astroph 396160 FALSE FALSE
#> 5 arxiv_condmat 186936 FALSE FALSE
#> 6 arxiv_grqc 28980 FALSE FALSE
#> Name Domain Year
#> 1 Host-parasite web BioChem 1955-1983
#> 2 Host-parasite web BioChem 1955-1983
#> 3 Amazon co-purchasing network Economic 2003
#> 4 arXiv Astrophysics Collaboration Social 2007
#> 5 arXiv Condensed Matter Collaboration Social 2007
#> 6 arXiv General Relativity Quantum Cosmology Collaboration Social 2007
#> Source
#> 1 https://www.nceas.ucsb.edu/interactionweb/html/canadian_fish.html
#> 2 https://www.nceas.ucsb.edu/interactionweb/html/canadian_fish.html
#> 3 http://snap.stanford.edu/data/amazon0601.html
#> 4 http://snap.stanford.edu/data/ca-AstroPh.html
#> 5 http://snap.stanford.edu/data/ca-CondMat.html
#> 6 http://snap.stanford.edu/data/ca-GrQc.html
#> Description
#> 1 Undirected, weighted, bipartite adjacency network of Canadian freshwater fish (hosts) and their metazoan parasites (parasites). The name of the lake to which this dataset corresponds is the first word of the dataset (text prior to the underscore). Network is formatted as an edgelist. Node labels (words) were provided in the raw dataset, which can be found at the source URL (below). On the webpage at the source URL (below), select the appropriate named link under the "Data files" heading.
#> 2 Undirected, weighted, bipartite adjacency network of Canadian freshwater fish (hosts) and their metazoan parasites (parasites). The name of the lake to which this dataset corresponds is the first word of the dataset (text prior to the underscore). Network is formatted as an edgelist. Node labels (words) were provided in the raw dataset, which can be found at the source URL (below). On the webpage at the source URL (below), select the appropriate named link under the "Data files" heading.
#> 3 Directed, unweighted network of items for sale on amazon.com in June 2003 and the items they recommend (via the Customers Who Bought This Item Also Bought feature). If one item is frequently co-purchased with another, then the first item recommends the second.
#> 4 Undirected, unweighted author collaboration network in arXiv astrophysics section.
#> 5 Undirected, unweighted author collaboration network in arXiv condensed matter section.
#> 6 Undirected, unweighted author collaboration network in arXiv general relativity and quantum cosmology section.
#> Citation
#> 1 Arai HP, Mudry DR. Protozoan and metazoan parasites of fishes from the headwaters of the Parsip and McGregor Rivers, British Columbia: a study of possible parasite transfaunations. Canadian Journal of Fisheries and Aquatic Sciences. 1983; 40: 1676-1684. Arthur JR, Margolic L, Arai HP. Parasites of fishes of Aishihik and Stevens Lakes, Yukon Territory, and potential consequences of their interlake transfer through a proposed water diversion for hydroelectrical purposes. Journal of the Fisheries Research Board of Canada. 1976; 33: 2489-2499. Bangham RV. Studies on fish parasites of Lake Huron and Manitoulin Island. American Midland Naturalist. 1955; 53: 184-194. Chinniah VC, Threlfall W. Metazoan parasites of fish from the Smallwood Reservoir, Labrador, Canada. Journal of Fish Biology. 1978; 13: 203-213. Dechtiar AO. Parasites of fish from Lake of the Woods, Ontario. Journal of the Fisheries Research Board of Canada. 1972; 29: 275-283. Leong TS, Holmes JC. Communities of metazoan parasites in open water fishes of Cold Lake, Alberta. Journal of Fish Biology. 1981; 18: 693-713.
#> 2 Arai HP, Mudry DR. Protozoan and metazoan parasites of fishes from the headwaters of the Parsip and McGregor Rivers, British Columbia: a study of possible parasite transfaunations. Canadian Journal of Fisheries and Aquatic Sciences. 1983; 40: 1676-1684. Arthur JR, Margolic L, Arai HP. Parasites of fishes of Aishihik and Stevens Lakes, Yukon Territory, and potential consequences of their interlake transfer through a proposed water diversion for hydroelectrical purposes. Journal of the Fisheries Research Board of Canada. 1976; 33: 2489-2499. Bangham RV. Studies on fish parasites of Lake Huron and Manitoulin Island. American Midland Naturalist. 1955; 53: 184-194. Chinniah VC, Threlfall W. Metazoan parasites of fish from the Smallwood Reservoir, Labrador, Canada. Journal of Fish Biology. 1978; 13: 203-213. Dechtiar AO. Parasites of fish from Lake of the Woods, Ontario. Journal of the Fisheries Research Board of Canada. 1972; 29: 275-283. Leong TS, Holmes JC. Communities of metazoan parasites in open water fishes of Cold Lake, Alberta. Journal of Fish Biology. 1981; 18: 693-713.
#> 3 Leskovec J, Adamic L, Adamic B. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB). 2007; 1.
#> 4 Leskovec J, Kleinberg J, Faloutsos C. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data. 2007; 1(1).
#> 5 Leskovec J, Kleinberg J, Faloutsos C. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data. 2007; 1(1).
#> 6 Leskovec J, Kleinberg J, Faloutsos C. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data. 2007; 1(1).
Clearly, this is not aesthetic due to the number of columns that
ICON_data
contains. However, it should provide some comfort that the
amount of metadata available within ICON_data
will likely enable
tracking down the original dataset. To get a nicer view of the available
datasets, let’s take a look at only a subset of the available metadata.
head(ICON_data[, c("Var_name",
"Edges",
"Directed",
"Name")],
n = 5)
#> Var_name Edges Directed Name
#> 1 aishihik_intensity 78 FALSE Host-parasite web
#> 2 aishihik_prevalence 78 FALSE Host-parasite web
#> 3 amazon_copurchase 3387388 TRUE Amazon co-purchasing network
#> 4 arxiv_astroph 396160 FALSE arXiv Astrophysics Collaboration
#> 5 arxiv_condmat 186936 FALSE arXiv Condensed Matter Collaboration
A key difference to note is that the Var_name
column refers to the
dataset name that should be used when accessing it through ICON whereas
the Name
column lists a more descriptive dataset name with little
programmatic relevance. Two salient points should be made here. First,
descriptions of the information contained in each column of ICON_data
can be accessed in the package documentation via ?ICON_data
or
equivalents. Second, package metadata is generally available in R data
packages via package documentation, which avoids the need for a dataset
like ICON_data
. However, since the networks that ICON provides can be
quite large, even in a compressed binary format, they are not hosted
within the package on the Comprehensive R Archive Network
(CRAN). Instead, the desired packages are
only downloaded locally when the user instructs ICON to do so via the
get_data
function (explored in the next section). Although this comes
with the disadvantage of not having all datasets immediately available
upon installation of ICON, it does save ICON users considerable space if
they only wish to use a small subset of the available datasets.
Once the list of available datasets has been explored, the complex
networks can be downloaded using the get_data
function and the
Var_name
column in ICON_data
. For example, the first network in
ICON_data
has Var_name
set to aishihik_intensity
, so we download
it and peek as follows.
get_data("aishihik_intensity")
#> DATASET(S) aishihik_intensity LOADED
head(aishihik_intensity)
#> Fish Parasite Intensity
#> 1 1 V1 5.8
#> 2 1 V9 7.0
#> 3 1 V16 3.0
#> 4 1 V22 1.0
#> 5 2 V3 7.2
#> 6 2 V8 65.8
Every complex network in ICON is provided as an edgelist stored in a
data frame. Each row of the data frame corresponds to a single edge and
the first two columns contain the nodes that define the edge; for
directed networks, the first column will always be the source (from) and
the second column will always be the sink (to). Note that since only an
edgelist is provided, nodes of degree zero will not be included.
Weighted (and some unweighted) networks will contain more than two
columns; the additional columns represent either edge weights or other
edge attributes. The get_data
function can also download multiple
datasets at once. Note that the get_data
’s envir
parameter can be
modified to select a different environment; by default, the objects will
load on the global environment (.GlobalEnv
).
get_data(c("coldlake_intensity", "fullerene_c60"))
#> DATASET(S) coldlake_intensity fullerene_c60 LOADED
head(coldlake_intensity)
#> Fish Parasite Intensity
#> 1 1 V2 2.1
#> 2 1 V5 NA
#> 3 1 V8 1.7
#> 4 1 V12 3.5
#> 5 1 V16 1.7
#> 6 1 V19 1.2
head(fullerene_c60)
#> Node1 Node2
#> 1 0 1
#> 2 0 2
#> 3 0 3
#> 4 1 4
#> 5 1 5
#> 6 2 10
get_data
also lends itself to a simple solution (following code chunk)
to download all the networks available through ICON, however, this
should be used with caution as there are a large number of them. The
following chunk is not actually run in the vignette.
# download all available complex networks
get_data(ICON_data$Var_name)
Once downloaded, the complex networks can be stored locally in a binary (e.g. RDA/RData, RDS) or plain-text (CSV, TXT) format; storing it locally removes the reliance on an internet connection for future use.
Although ICON provides the complex networks, it does not provide
functionality to analyze or visualize them. However, the S3 generic
as.network.ICON
is provided to permit use of the
network and
ggnetwork R packages for
analysis and visualization, respectively. The following code shows how
to generate an object of class network
using a previously downloaded
complex network.
# make sure that the downloaded network has class `ICON`
class(aishihik_intensity)
#> [1] "ICON" "data.frame"
# coerce to class `network` (`as.network.ICON` is called)
coerced_network <- as.network(aishihik_intensity)
# check if the coerced network has the correct class
class(coerced_network)
#> [1] "network"
# peek at the coerced network (pay attention to number of edges)
coerced_network
#> Network attributes:
#> vertices = 36
#> directed = FALSE
#> hyper = FALSE
#> loops = FALSE
#> multiple = FALSE
#> bipartite = FALSE
#> total edges= 78
#> missing edges= 0
#> non-missing edges= 78
#>
#> Vertex attribute names:
#> vertex.names
#>
#> Edge attribute names:
#> Intensity
# check number of vertices in initial ICON object
length(unique(c(aishihik_intensity[, 1], aishihik_intensity[, 2])))
#> [1] 36
# check number of edges in initial ICON object
nrow(aishihik_intensity)
#> [1] 78
The initial network (named aishihik_intensity
) and the coerced network
(named coerced_network
) both contain 36 vertices and 78 edges. Once we
have an object of class network
, we can visualize it easily using
ggnetwork (a ggplot2 extension).
# ggnetwork fortifies objects of class `network` without additional code
# the aes parameters should be used as-is
ggplot(coerced_network, aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges() +
geom_nodes() +
theme_blank()
coerced_network
also has an edge attribute named Intensity
(see the
name of the third column name in aishihik_intensity
). This edge
attribute can be used to color the edges as follows.
ggplot(coerced_network, aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(aes(color = Intensity)) + # this line changed
geom_nodes() +
theme_blank()
Implementing an S3 generic for the as.network
method and ICON
class
thus makes ICON datasets compatible with the network and ggnetwork
packages, considerably simplifying the processes for network analysis
and visualization. ICON’s GitHub
README provides
sample code to analyze and visualize ICON complex networks using
igraph as an alternative.