@rrrlw
Last active October 28, 2020 22:04
Introduction to the ICON R package, allowing access to 1,075 complex networks.

Setup

To prepare for the vignette, we load the necessary R packages and associated datasets.

# load R packages (all on CRAN)
library(ICON)
library(network)   # for network analysis
library(ggplot2)   # for network visualization
library(ggnetwork) # for network visualization

# load ICON dataset
data(ICON_data)

Purpose

The ICON R package provides complex networks in edge list format that can be easily incorporated into network analysis pipelines. The package is based on the Index of COmplex Networks (ICON) website, which curates complex networks and corresponding summary information (e.g. source, number of nodes and edges, discipline). Many publications cite the ICON website; however, because the complex networks in the database exist in distinct formats, the authors of each publication likely had to standardize a large set of networks themselves. The ICON R package aims to reduce this redundant data formatting by providing a large number of complex networks from the ICON website in a standard format (network format = edge list; file format = binary R Data with extension .rda). Currently, the ICON R package provides 1,075 complex networks. Since repeating “the ICON R package” quickly becomes tedious, the R package will be referred to simply as “ICON” for the remainder of this vignette and the website will be referred to as “the ICON website”.

To learn more about ICON, you can visit the associated website, GitHub repo, and CRAN page.

Which datasets are included in ICON?

The ICON_data dataset (loaded in the Setup section) provides a summary of all the datasets that are available through ICON. Using head, we can take a look at the first six rows.

head(ICON_data)
#>              Var_name   Edges Directed Weighted
#> 1  aishihik_intensity      78    FALSE     TRUE
#> 2 aishihik_prevalence      78    FALSE     TRUE
#> 3   amazon_copurchase 3387388     TRUE    FALSE
#> 4       arxiv_astroph  396160    FALSE    FALSE
#> 5       arxiv_condmat  186936    FALSE    FALSE
#> 6          arxiv_grqc   28980    FALSE    FALSE
#>                                                       Name   Domain      Year
#> 1                                        Host-parasite web  BioChem 1955-1983
#> 2                                        Host-parasite web  BioChem 1955-1983
#> 3                             Amazon co-purchasing network Economic      2003
#> 4                         arXiv Astrophysics Collaboration   Social      2007
#> 5                     arXiv Condensed Matter Collaboration   Social      2007
#> 6 arXiv General Relativity Quantum Cosmology Collaboration   Social      2007
#>                                                              Source
#> 1 https://www.nceas.ucsb.edu/interactionweb/html/canadian_fish.html
#> 2 https://www.nceas.ucsb.edu/interactionweb/html/canadian_fish.html
#> 3                     http://snap.stanford.edu/data/amazon0601.html
#> 4                     http://snap.stanford.edu/data/ca-AstroPh.html
#> 5                     http://snap.stanford.edu/data/ca-CondMat.html
#> 6                        http://snap.stanford.edu/data/ca-GrQc.html
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Description
#> 1 Undirected, weighted, bipartite adjacency network of Canadian freshwater fish (hosts) and their metazoan parasites (parasites). The name of the lake to which this dataset corresponds is the first word of the dataset (text prior to the underscore). Network is formatted as an edgelist. Node labels (words) were provided in the raw dataset, which can be found at the source URL (below). On the webpage at the source URL (below), select the appropriate named link under the "Data files" heading.
#> 2 Undirected, weighted, bipartite adjacency network of Canadian freshwater fish (hosts) and their metazoan parasites (parasites). The name of the lake to which this dataset corresponds is the first word of the dataset (text prior to the underscore). Network is formatted as an edgelist. Node labels (words) were provided in the raw dataset, which can be found at the source URL (below). On the webpage at the source URL (below), select the appropriate named link under the "Data files" heading.
#> 3                                                                                                                                                                                                                                         Directed, unweighted network of items for sale on amazon.com in June 2003 and the items they recommend (via the Customers Who Bought This Item Also Bought feature). If one item is frequently co-purchased with another, then the first item recommends the second.
#> 4                                                                                                                                                                                                                                                                                                                                                                                                                           Undirected, unweighted author collaboration network in arXiv astrophysics section.
#> 5                                                                                                                                                                                                                                                                                                                                                                                                                       Undirected, unweighted author collaboration network in arXiv condensed matter section.
#> 6                                                                                                                                                                                                                                                                                                                                                                                               Undirected, unweighted author collaboration network in arXiv general relativity and quantum cosmology section.
#>   Citation
#> 1 Arai HP, Mudry DR. Protozoan and metazoan parasites of fishes from the headwaters of the Parsip and McGregor Rivers, British Columbia: a study of possible parasite transfaunations. Canadian Journal of Fisheries and Aquatic Sciences. 1983; 40: 1676-1684.    Arthur JR, Margolic L, Arai HP. Parasites of fishes of Aishihik and Stevens Lakes, Yukon Territory, and potential consequences of their interlake transfer through a proposed water diversion for hydroelectrical purposes. Journal of the Fisheries Research Board of Canada. 1976; 33: 2489-2499.    Bangham RV. Studies on fish parasites of Lake Huron and Manitoulin Island. American Midland Naturalist. 1955; 53: 184-194.    Chinniah VC, Threlfall W. Metazoan parasites of fish from the Smallwood Reservoir, Labrador, Canada. Journal of Fish Biology. 1978; 13: 203-213.    Dechtiar AO. Parasites of fish from Lake of the Woods, Ontario. Journal of the Fisheries Research Board of Canada. 1972; 29: 275-283.    Leong TS, Holmes JC. Communities of metazoan parasites in open water fishes of Cold Lake, Alberta. Journal of Fish Biology. 1981; 18: 693-713.
#> 2 Arai HP, Mudry DR. Protozoan and metazoan parasites of fishes from the headwaters of the Parsip and McGregor Rivers, British Columbia: a study of possible parasite transfaunations. Canadian Journal of Fisheries and Aquatic Sciences. 1983; 40: 1676-1684.    Arthur JR, Margolic L, Arai HP. Parasites of fishes of Aishihik and Stevens Lakes, Yukon Territory, and potential consequences of their interlake transfer through a proposed water diversion for hydroelectrical purposes. Journal of the Fisheries Research Board of Canada. 1976; 33: 2489-2499.    Bangham RV. Studies on fish parasites of Lake Huron and Manitoulin Island. American Midland Naturalist. 1955; 53: 184-194.    Chinniah VC, Threlfall W. Metazoan parasites of fish from the Smallwood Reservoir, Labrador, Canada. Journal of Fish Biology. 1978; 13: 203-213.    Dechtiar AO. Parasites of fish from Lake of the Woods, Ontario. Journal of the Fisheries Research Board of Canada. 1972; 29: 275-283.    Leong TS, Holmes JC. Communities of metazoan parasites in open water fishes of Cold Lake, Alberta. Journal of Fish Biology. 1981; 18: 693-713.
#> 3 Leskovec J, Adamic L, Adamic B. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB). 2007; 1.
#> 4 Leskovec J, Kleinberg J, Faloutsos C. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data. 2007; 1(1).
#> 5 Leskovec J, Kleinberg J, Faloutsos C. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data. 2007; 1(1).
#> 6 Leskovec J, Kleinberg J, Faloutsos C. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data. 2007; 1(1).

This output is hard to read because of the number of columns that ICON_data contains. However, the amount of metadata available within ICON_data should make it possible to track down the original dataset. To get a cleaner view of the available datasets, let’s look at only a subset of the metadata columns.

head(ICON_data[, c("Var_name",
                   "Edges",
                   "Directed",
                   "Name")],
     n = 5)
#>              Var_name   Edges Directed                                 Name
#> 1  aishihik_intensity      78    FALSE                    Host-parasite web
#> 2 aishihik_prevalence      78    FALSE                    Host-parasite web
#> 3   amazon_copurchase 3387388     TRUE         Amazon co-purchasing network
#> 4       arxiv_astroph  396160    FALSE     arXiv Astrophysics Collaboration
#> 5       arxiv_condmat  186936    FALSE arXiv Condensed Matter Collaboration

A key difference to note is that the Var_name column holds the dataset name that should be used when accessing a dataset through ICON, whereas the Name column lists a more descriptive dataset name with little programmatic relevance. Two further points are worth making here. First, descriptions of the information contained in each column of ICON_data can be accessed in the package documentation via ?ICON_data or equivalents. Second, R data packages generally expose dataset metadata through package documentation, which avoids the need for a dataset like ICON_data. However, since the networks that ICON provides can be quite large, even in a compressed binary format, they are not hosted within the package on the Comprehensive R Archive Network (CRAN). Instead, the desired datasets are downloaded locally only when the user instructs ICON to do so via the get_data function (explored in the next section). Although this comes with the disadvantage of not having all datasets immediately available upon installation of ICON, it does save ICON users considerable space if they only wish to use a small subset of the available datasets.
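Because Var_name is the key that get_data expects, a natural first step is to filter the metadata down to the networks of interest. The following is a minimal sketch of that idea, not part of the original vignette: icon_meta is a hypothetical hand-built stand-in for ICON_data containing only the five rows and three columns printed above, so that the example runs without the package installed.

```r
# Stand-in for ICON_data: five rows copied from the output printed above
# (the name icon_meta is hypothetical; the real object is ICON_data).
icon_meta <- data.frame(
  Var_name = c("aishihik_intensity", "aishihik_prevalence",
               "amazon_copurchase", "arxiv_astroph", "arxiv_condmat"),
  Edges    = c(78, 78, 3387388, 396160, 186936),
  Directed = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep undirected networks with fewer than 1,000 edges.
small <- subset(icon_meta, !Directed & Edges < 1000)
small_names <- small$Var_name
print(small_names)

# With ICON loaded, the same filter applied to ICON_data would give a
# character vector that could be handed to get_data(), e.g.
# get_data(small_names).
```

Filtering on Edges before downloading is one way to avoid pulling in very large networks such as amazon_copurchase by accident.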

How to load a specific dataset in ICON?

Once the list of available datasets has been explored, the complex networks can be downloaded using the get_data function and the Var_name column in ICON_data. For example, the first network in ICON_data has Var_name set to aishihik_intensity, so we download it and peek as follows.

get_data("aishihik_intensity")
#> DATASET(S) aishihik_intensity LOADED

head(aishihik_intensity)
#>   Fish Parasite Intensity
#> 1    1       V1       5.8
#> 2    1       V9       7.0
#> 3    1      V16       3.0
#> 4    1      V22       1.0
#> 5    2       V3       7.2
#> 6    2       V8      65.8

Every complex network in ICON is provided as an edgelist stored in a data frame. Each row of the data frame corresponds to a single edge and the first two columns contain the nodes that define the edge; for directed networks, the first column will always be the source (from) and the second column will always be the sink (to). Note that since only an edgelist is provided, nodes of degree zero will not be included. Weighted (and some unweighted) networks will contain more than two columns; the additional columns represent either edge weights or other edge attributes. Note that get_data’s envir parameter can be set to load the objects into a different environment; by default, they are loaded into the global environment (.GlobalEnv). The get_data function can also download multiple datasets at once.

get_data(c("coldlake_intensity", "fullerene_c60"))
#> DATASET(S) coldlake_intensity fullerene_c60 LOADED

head(coldlake_intensity)
#>   Fish Parasite Intensity
#> 1    1       V2       2.1
#> 2    1       V5        NA
#> 3    1       V8       1.7
#> 4    1      V12       3.5
#> 5    1      V16       1.7
#> 6    1      V19       1.2

head(fullerene_c60)
#>   Node1 Node2
#> 1     0     1
#> 2     0     2
#> 3     0     3
#> 4     1     4
#> 5     1     5
#> 6     2    10
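The edgelist layout described above can be sketched with a small hand-built data frame. This is only an illustration under stated assumptions: the object name edges is hypothetical, and its six rows are copied from the head(aishihik_intensity) output shown earlier rather than downloaded through get_data.

```r
# Six edges copied from the head(aishihik_intensity) output above.
edges <- data.frame(
  Fish      = c(1L, 1L, 1L, 1L, 2L, 2L),
  Parasite  = c("V1", "V9", "V16", "V22", "V3", "V8"),
  Intensity = c(5.8, 7.0, 3.0, 1.0, 7.2, 65.8),
  stringsAsFactors = FALSE
)

# Each row is one edge, so the edge count is the row count.
n_edges <- nrow(edges)

# Nodes are whatever appears in the first two columns; labels are
# compared as character strings so fish 1 and parasite "V1" stay distinct.
endpoints <- c(as.character(edges[[1]]), edges[[2]])
n_nodes <- length(unique(endpoints))

# Degree of each node = number of edge endpoints that mention it.
# Every node listed here has degree >= 1, because isolated nodes
# cannot appear in an edgelist.
degree <- table(endpoints)
print(degree)
```

Any remaining columns (here Intensity) travel with the edge as attributes and play no role in identifying nodes.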

get_data also lends itself to a simple way (following code chunk) of downloading all the networks available through ICON; however, this should be used with caution because there are a large number of them. The following chunk is not actually run in the vignette.

# download all available complex networks
get_data(ICON_data$Var_name)

Once downloaded, the complex networks can be stored locally in a binary (e.g. RDA/RData, RDS) or plain-text (CSV, TXT) format; storing them locally removes the reliance on an internet connection for future use.
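As a rough sketch of the two storage options just mentioned, the chunk below round-trips a small edgelist through RDS and CSV using base R only. The object edges is a hypothetical stand-in built by hand (three rows taken from the aishihik_intensity output above) and given the class vector that the vignette reports for downloaded networks; temporary files are used so nothing is written to the working directory.

```r
# Hand-built stand-in for a downloaded network (hypothetical object).
edges <- data.frame(
  Fish      = c(1L, 1L, 2L),
  Parasite  = c("V1", "V9", "V3"),
  Intensity = c(5.8, 7.0, 7.2),
  stringsAsFactors = FALSE
)
class(edges) <- c("ICON", "data.frame")

# Binary option: RDS stores the R object itself, attributes included.
rds_path <- tempfile(fileext = ".rds")
saveRDS(edges, rds_path)
from_rds <- readRDS(rds_path)

# Plain-text option: CSV stores only the table contents.
csv_path <- tempfile(fileext = ".csv")
write.csv(edges, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

class(from_rds)  # class vector survives the RDS round trip
class(from_csv)  # read.csv returns a plain data frame
```

One consequence worth keeping in mind: a network reloaded from CSV no longer carries the ICON class, so it would need that class restored before a class-specific method such as as.network.ICON is dispatched on it.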

How to analyze or visualize networks acquired through ICON?

Although ICON provides the complex networks, it does not provide functionality to analyze or visualize them. However, the S3 method as.network.ICON is provided to permit use of the network and ggnetwork R packages for analysis and visualization, respectively. The following code shows how to generate an object of class network using a previously downloaded complex network.

# make sure that the downloaded network has class `ICON`
class(aishihik_intensity)
#> [1] "ICON"       "data.frame"

# coerce to class `network` (`as.network.ICON` is called)
coerced_network <- as.network(aishihik_intensity)

# check if the coerced network has the correct class
class(coerced_network)
#> [1] "network"

# peek at the coerced network (pay attention to number of edges)
coerced_network
#>  Network attributes:
#>   vertices = 36 
#>   directed = FALSE 
#>   hyper = FALSE 
#>   loops = FALSE 
#>   multiple = FALSE 
#>   bipartite = FALSE 
#>   total edges= 78 
#>     missing edges= 0 
#>     non-missing edges= 78 
#> 
#>  Vertex attribute names: 
#>     vertex.names 
#> 
#>  Edge attribute names: 
#>     Intensity

# check number of vertices in initial ICON object
length(unique(c(aishihik_intensity[, 1], aishihik_intensity[, 2])))
#> [1] 36

# check number of edges in initial ICON object
nrow(aishihik_intensity)
#> [1] 78

The initial network (named aishihik_intensity) and the coerced network (named coerced_network) both contain 36 vertices and 78 edges. Once we have an object of class network, we can visualize it easily using ggnetwork (a ggplot2 extension).

# ggnetwork fortifies objects of class `network` without additional code
# the aes parameters should be used as-is
ggplot(coerced_network, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges() +
  geom_nodes() +
  theme_blank()

coerced_network also has an edge attribute named Intensity (taken from the name of the third column in aishihik_intensity). This edge attribute can be used to color the edges as follows.

ggplot(coerced_network, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(aes(color = Intensity)) + # this line changed
  geom_nodes() +
  theme_blank()

Implementing an S3 method for the as.network generic and the ICON class thus makes ICON datasets compatible with the network and ggnetwork packages, considerably simplifying network analysis and visualization. ICON’s GitHub README provides sample code to analyze and visualize ICON complex networks using igraph as an alternative.
