
@mingjiphd
Created January 29, 2026 01:43
Machine Learning using R: Dimensionality Reduction using SOM

This R script is a step-by-step demonstration of how to perform a SOM (self-organizing map) analysis in R. It begins with a brief overview of SOMs. A simulated data set with nonlinear relationships among the features is created for the analysis, which uses the R package kohonen. Extensive visualizations from the fitted SOM model are produced and explained. The accompanying video covers only the som function in the kohonen package; a follow-up video will demonstrate the supersom function.

A step-by-step video demo can be found at: https://youtu.be/nqHnKWr1GI4?si=jkeF99u4GKAmPdkd
########################################################################
# Machine Learning using R Dimensionality Reduction by SOM #
########################################################################
###### A Brief Overview of SOM (Self Organizing Maps)
### Self-organizing maps (SOMs) are unsupervised neural networks that
## project high-dimensional data onto a low-dimensional (usually
## 2D) grid while preserving neighborhood structure.
### Each grid node stores a prototype (codebook vector), and similar data
## points map to nearby nodes on the grid.
### Learning uses competitive updates: the Best Matching Unit (BMU) and
## its neighbors move their weights toward each input, with learning
## rate and neighborhood size decreasing over time.
### SOMs are widely used for clustering, visualization, anomaly detection,
## and exploratory analysis of unlabeled, high-dimensional data.
### They combine nonlinear dimensionality reduction, clustering, and
## visualization in a single framework, without needing labeled data
## or backpropagation.
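###### The competitive-update rule above can be sketched in a few lines of
## base R. This is a toy illustration only: the prototype matrix, input
## vector, and learning rate below are invented for the sketch, and a real
## SOM also updates the BMU's grid neighbors, with a neighborhood radius
## that shrinks over time.
toy_codes <- matrix(rnorm(4 * 3), nrow = 4) # 4 units x 3 features
x_in <- rnorm(3)                            # one input vector
# Best Matching Unit: the prototype closest to the input
d_units <- apply(toy_codes, 1, function(w) sum((x_in - w)^2))
bmu <- which.min(d_units)
# Move the BMU's prototype toward the input by learning rate alpha
alpha_toy <- 0.05
toy_codes[bmu, ] <- toy_codes[bmu, ] + alpha_toy * (x_in - toy_codes[bmu, ])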
# Load required library for SOM analysis (install it first if needed)
if (!requireNamespace("kohonen", quietly = TRUE)) install.packages("kohonen")
library(kohonen)
# Step 1: Generate simulated multivariate dataset
set.seed(1132026) # For reproducibility
n_samples <- 500
n_features <- 8
# Simulate data from 3 clusters with different means and spreads
# (500 is not divisible by 3, so split the rows as 167 + 167 + 166)
n_per <- round(n_samples / 3)
cluster1 <- matrix(rnorm(n_per * n_features, mean = 2, sd = 1), ncol = n_features)
cluster2 <- matrix(rnorm(n_per * n_features, mean = -1, sd = 1.5), ncol = n_features)
cluster3 <- matrix(rnorm((n_samples - 2 * n_per) * n_features, mean = 0, sd = 0.8), ncol = n_features)
# Combine the three clusters into one data set
sim_data <- rbind(cluster1, cluster2, cluster3)
colnames(sim_data) <- paste0("Feature_", 1:n_features)
# Add some non-linear relationships
sim_data[,1] <- sim_data[,2]^2 + rnorm(n_samples, 0, 0.5)
sim_data[,3] <- sin(sim_data[,4]) + rnorm(n_samples, 0, 0.3)
print("Simulated dataset summary:")
print(summary(sim_data))
print(paste("Dataset dimensions:", nrow(sim_data), "x", ncol(sim_data)))
# Step 2: Scale the data (required for SOM)
sim_data_scaled <- scale(sim_data)
# Step 3: Create SOM grid (5x5 hexagonal grid)
som_grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")
# Step 4: Train SOM
# mode = "batch" selects batch training; alpha is only used in online
# mode and is ignored here. user.weights weights the data layers
# (there is a single layer here, so it is left at 1).
som_model <- som(
  sim_data_scaled,
  grid = som_grid,
  rlen = 1000,
  alpha = c(0.05, 0.01),
  radius = c(2, 1),
  mode = "batch",
  user.weights = 1,
  whatmap = 1
)
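## Sidebar: a trained SOM can assign new observations to units via
## kohonen::map(). A minimal sketch, assuming new data are scaled with the
## training set's center and scale (new_obs is simulated here purely for
## illustration).
new_obs <- matrix(rnorm(3 * n_features), ncol = n_features)
new_obs_scaled <- scale(new_obs,
                        center = attr(sim_data_scaled, "scaled:center"),
                        scale = attr(sim_data_scaled, "scaled:scale"))
mapped <- map(som_model, newdata = new_obs_scaled)
mapped$unit.classif # winning unit for each new observation
mapped$distances    # distance of each new observation to its BMU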
# Step 5: Comprehensive visualization using all kohonen package features
## 5.1: Basic property plots
par(mfrow = c(2, 3))
# Training progress (mean distance to closest unit per iteration) == convergence check
plot(som_model, type = "changes")
# Property plot == visualize any numeric vector with one value per grid unit
## here: the codebook values of Feature_1 across the grid
## smooth color gradients indicate topology preservation
plot(som_model, type = "property", property = som_model$codes[[1]][, 1],
     main = "Feature_1 codebook values")
# Counts per unit == color intensity proportional to the number of assigned observations
plot(som_model, type = "counts")
# Quality == mean distance of mapped samples to their unit's codebook vector
## cool colors (blue) denote well-fitting units; warm colors (yellow/red) mark poor fits
plot(som_model, type = "quality")
# Mapping of data onto the SOM == all 500 samples projected onto the 25-unit grid
## unit backgrounds colored by a rainbow palette; dots are individual samples
## dense regions contain many dots
plot(som_model, type = "mapping",
     bgcol = rainbow(nrow(som_model$grid$pts), end = 0.8),
     main = "Data Mapping on SOM")
# Codebook vectors (prototypes) == fan diagram of the scaled prototype values for all 8 features per unit
## longer segments indicate higher feature values
## gradually changing fan patterns across neighboring units indicate topology preservation
plot(som_model, type = "codes", palette.name = rainbow)
## 5.2: Component planes (one per feature) == show how each of the 8 features
## varies across the 25-unit hexagonal SOM grid.
## Feature_1: higher values form a connected zone
## Feature_2 and Feature_5 show similar patterns, indicating correlation
par(mfrow = c(2, 4))
for (i in 1:n_features) {
  plot(som_model, type = "property", property = som_model$codes[[1]][, i],
       main = colnames(sim_data)[i],
       palette.name = heat.colors)
}
par(mfrow=c(1,1))
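## Sidebar: quantify what the component planes suggest visually -- a high
## absolute correlation between two columns of the codebook matrix means
## those two features have similar-looking component planes.
print(round(cor(som_model$codes[[1]]), 2))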
## 5.3: Distances between neighboring SOM units (U-matrix style):
## small distances = cluster interiors; large distances = cluster boundaries
## clear separation of the three simulated clusters indicates good training
plot(som_model, type = "dist.neighbours",
     palette.name = topo.colors)
## 5.4: Hierarchical clustering of SOM units
# Cluster the units via distances between their codebook vectors:
codebook_dist <- dist(som_model$codes[[1]])
som_hclust <- hclust(codebook_dist)
plot(som_hclust, main = "SOM Units Clustering", hang = -1)
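## Sidebar: cut the dendrogram into 3 clusters (matching the 3 simulated
## groups) and draw the cluster boundaries on the grid; a sketch using
## kohonen::add.cluster.boundaries().
som_clusters <- cutree(som_hclust, k = 3)
plot(som_model, type = "mapping",
     bgcol = terrain.colors(3)[som_clusters],
     main = "SOM units grouped into 3 clusters")
add.cluster.boundaries(som_model, som_clusters)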
# Step 6: Extract results and quality metrics
# Unit classification (which unit each sample belongs to)
unit_classif <- som_model$unit.classif
# Quantization error: mean distance of each sample to its best-matching unit
qerror <- mean(som_model$distances)
# Print summary statistics
cat("\n=== SOM ANALYSIS SUMMARY ===\n")
cat("Grid dimensions:", som_model$grid$xdim, "x", som_model$grid$ydim, "\n")
cat("Training iterations:", nrow(som_model$changes), "\n")
cat("Quantization error:", round(qerror, 4), "\n")
cat("Number of empty units:", sum(tabulate(unit_classif, nbins = nrow(som_model$codes[[1]])) == 0), "\n")
cat("Unit classification distribution:\n")
print(table(unit_classif))
# Step 7: Advanced features - SuperSOM preparation (for multiple datasets)
# This demonstrates how to extend to multiple data layers
cat("\n=== READY FOR SUPER SOM (MULTI-LAYER) ===\n")
cat("Current model can be extended with additional datasets using:\n")
cat("supersom(data = list(sim_data_scaled, new_dataset), ...)\n")
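## Sidebar: a minimal supersom() sketch with a hypothetical second data
## layer (simulated here; in practice this would be a second set of
## measurements on the same observations, scaled like the first layer).
second_layer <- scale(matrix(rnorm(n_samples * 4), ncol = 4))
ssom <- supersom(list(measurements = sim_data_scaled, extra = second_layer),
                 grid = som_grid, rlen = 500, mode = "batch")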
# Step 8: Save results
save(som_model, sim_data, sim_data_scaled, file = "som_analysis_complete.RData")
cat("\n=== ANALYSIS COMPLETE ===\n")
cat("All kohonen package features demonstrated:\n")
cat("- som() training with batch mode\n")
cat("- Multiple plot types: learning, property, counts, quality, mapping, codes\n")
cat("- Component planes for all features\n")
cat("- Distance matrices\n")
cat("- Hierarchical clustering of units\n")
cat("- Quality metrics extraction\n")
cat("- SuperSOM ready structure\n")