```{r eval=FALSE, include=FALSE}
# Remember that the report must include:
# * Title and the author's full name.
# * Table of contents.
# * Source of the data and a description of the data with the corresponding coding.
# * A general descriptive study of the dataset (univariate and bivariate descriptive statistics, histograms, boxplots, scatter plots, coplots...).
# * Application of at least two of the course techniques to the dataset, including the motivation for applying each technique and comments on the results presented.
# * Conclusions: a summary of the study's results.
# * Bibliography used.
```
\newpage
# Introduction
The chosen dataset is **Breast Cancer Wisconsin (Diagnostic)**. Its `features` were computed from a digitized image of a breast [fine needle aspirate (FNA)](https://en.wikipedia.org/wiki/Fine-needle_aspiration). The database can be obtained from [@uci] or [@kaggle01].
The main references used for this work have been the lecturers' notes for the `Datuen Esplorazioa eta Analisia` course of the KISA Master's programme and the following Kaggle `kernels`: [@kaggle02], [@kaggle03] and [@kaggle04], as well as [@Kalshtein].
# Dataset overview
Attribute information:
[1] ID number
[2] Diagnosis (M = malignant, B = benign)
[3-32] In three groups (mean, se, worst), ten features each, computed for each cell nucleus:
- a) radius (mean of distances from center to points on the perimeter)
- b) texture (standard deviation of gray-scale values)
- c) perimeter
- d) area
- e) smoothness (local variation in radius lengths)
- f) compactness (perimeter^2 / area - 1.0)
- g) concavity (severity of concave portions of the contour)
- h) concave points (number of concave portions of the contour)
- i) symmetry
- j) fractal dimension ("coastline approximation" - 1)
\hfill\break
First we download the dataset, which comes in `csv` format, from the [kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/#data.csv) website and read it in. The tables below show the information contained in the file.
\newpage
```{r include=FALSE}
library(caret)
library(corrplot)
library(gridExtra)
library(grid)
library(ggplot2)
rm(list=ls())
#
data <- read.csv("breast-cancer-wisconsin.csv", header=T, stringsAsFactors=F)
```
```{r DatasetSummary, echo=FALSE, tab.cap="DatasetSummary"}
# install.packages("kableExtra")
library("kableExtra")
n <- nrow(data)
p <- ncol(data)
# a few random predictor columns, just for display
# (note the parentheses: sample(3:p-1) would mean sample(2:(p-1)))
d <- data[,c(1,2)]
d <- data.frame(d, data[sample(1:n), sample(3:(p-1))[1:5]])
d["..."] <- "..."
kable(d[1:10,], "latex", booktabs = T, caption = "Dataset Summary") %>%
  kable_styling(latex_options = c("striped", "scale_down", "hold_position"))
rm(d)
#kable(head(data), "latex", booktabs = T) %>% kable_styling(latex_options =c("striped", "scale_down"))
#str(data)
# https://stackoverflow.com/questions/44200394/show-str-as-table-in-r-markdown
kable(paste("'data.frame': ", dim(data)[1], " obs. of ", dim(data)[2], " variables:"), "latex", booktabs = T, col.names = NULL, caption = "Dataset Summary II", longtable = T) %>%
  kable_styling(full_width = T, latex_options = c("striped", "hold_position"))
data.frame(Variable = names(data),
           Type = sapply(data, typeof),
           First_values = sapply(data, function(x) paste0(head(x), collapse = ", ")),
           row.names = NULL) %>%
  kable("latex", booktabs = T, longtable = T) %>%
  kable_styling(latex_options = c("striped", "hold_position"))
```
All the *predictor* variables are **continuous real numbers**. In total we have 33 variables. Some of these are of no interest for the analysis, so we will remove them (`'id'` and `'X'`), and we will convert the class to be *predicted*, `diagnosis`, into a `factor`.
\hfill\break
```{r echo=FALSE}
data$diagnosis <- as.factor(data$diagnosis)
# column 33 ('X') is not valid; drop it
data[,33] <- NULL
# drop the id column
data[,1] <- NULL
# scale
#data.scaled <- data.frame(scale(data[,-1]), diagnosis=data[,1])
# split the variables into predictors and the predicted one (diagnosis)
#bc.data <- data[,-1]
#bc.diag <- data[,1]
```
The predictor variables are grouped into 3 sets: **Mean, Worst and Standard Error (se)** (see the following table). In the analysis that follows, instead of working with all the groups, **only the variables of the Mean group will be taken into account**.
```{r echo=FALSE}
data.mean <- data[,c(2,3,4,5,6,7,8,9,10,11)]
data.se <- data[,c(12,13,14,15,16,17,18,19,20,21)]
data.worst <- data[,c(22,23,24,25,26,27,28,29,30,31)]
# rename: strip the "_mean", "_worst" and "_se" suffixes
library(stringr)
library(data.table)
setnames(data.mean, old = names(data.mean), new = str_replace_all(names(data.mean), "_mean", ""))
setnames(data.worst, old = names(data.worst), new = str_replace_all(names(data.worst), "_worst", ""))
setnames(data.se, old = names(data.se), new = str_replace_all(names(data.se), "_se", ""))
#
d <- rbind(data.mean[1:5,], data.worst[1:5,], data.se[1:5,])
kable(d, "latex", booktabs = T, caption = "Cancer Dataset") %>%
  kable_styling(latex_options = c("striped", "scale_down", "hold_position")) %>%
  group_rows("Mean", 1, 5) %>%
  group_rows("Worst", 6, 10) %>%
  group_rows("Standard Error", 11, 15)
```
## Missing values
Are there any NAs?
```{r}
# total NA count over the whole dataset
sum(sapply(data, FUN = function(col) sum(is.na(col))))
```
No, there are no NA values.
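A per-variable breakdown (a one-line sketch) confirms the same:
```{r eval=FALSE}
colSums(is.na(data))  # NA count for each column
```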
```{r eval=FALSE, include=FALSE}
summary(data)
```
## The *diagnosis* (predicted) variable
```{r include=FALSE}
benign_prob = round(prop.table(table(data$diagnosis))[[1]]*100)
malign_prob = round(prop.table(table(data$diagnosis))[[2]]*100)
```
```{r diagnosis_banaketa, echo=FALSE, fig.cap="\\label{fig:diagnosis_banaketa}Diagnosis. Benign 63% and Malign 37%", fig.height=4, fig.width=2}
my_cols = c("#F8766D","#00BFC4")
bp <- barplot(table(data$diagnosis),
              legend=FALSE,
              axes=TRUE,
              col=my_cols
              #beside=TRUE  # this option gives each group its own bar
              )
```
# Univariate description of the variables. Relationship of the quantitative variables with the qualitative variable
In this section we give a univariate description of the variables and examine the relationship the quantitative variables have with the qualitative one (the diagnosis). As stated in the previous section, we analyse the quantitative predictor variables of the `Mean` group and the qualitative (factor) variable `diagnosis`.
```{r include=FALSE}
# from here on, our working `data` will be data.mean + diagnosis
rm(data.worst)
rm(data.se)
data.old <- data
diagnosis <- data$diagnosis
d <- data.mean
d["diagnosis"] <- diagnosis
n <- nrow(d)
p <- ncol(d)
data <- d
data.mean <- data[,-p]
rm(d)
```
```{r mean_table, include=FALSE, tab.cap="Mean"}
kable(head(data), "latex", booktabs = T, caption = "Cancer Dataset (Mean)") %>%
  kable_styling(latex_options = c("striped", "scale_down", "hold_position")) %>%
  add_header_above(c("Mean" = 10, " " = 1))  # 10 mean columns plus diagnosis
```
```{r include=FALSE}
kable(paste("'data.frame': ", dim(data)[1], " obs. of ", dim(data)[2], " variables:"), "latex", booktabs = T, col.names = NULL, caption = "Dataset Summary", longtable = T) %>%
  kable_styling(full_width = T, latex_options = c("striped", "hold_position"))
data.frame(Variable = names(data),
           Type = sapply(data, class),
           First_values = sapply(data, function(x) paste0(head(x), collapse = ", ")),
           row.names = NULL) %>%
  kable("latex", booktabs = T) %>%
  kable_styling(latex_options = c("striped", "hold_position"))
```
## Summary statistics
```{r}
summary(data)
```
```{r eval=FALSE, include=FALSE}
#fivenum(data[,2])
```
<!--
Build a table with all the variables and their statistics.
https://www.statmethods.net/stats/descriptives.html
### Quantitative:
#### 1. Order statistics
- Size of the dataset
- Ordering
- Extremes of the set: 'max()', 'min()'
- summary(data)
- fivenum(x)
- Dispersion of a variable: 'mad()', 'IQR()' (boxplot?)
#### 2. Statistics based on the numeric nature
- mean()
- Dispersion of a variable: 'var()', 'sd()'
- Mean deviation (error) with respect to the mean
- Mean absolute deviation
- Mean squared deviation, or variance: var()
- standard deviation = sqrt(variance)
- Standard (typical) deviation.
### Qualitative
class(x)
levels(x)
- Tendency of a variable: the mode
```{r}
# sample mode: returns the mode label(s) and the modal frequency
smoda <- function(x){
  xtab <- table(x)
  modas <- xtab[max(xtab)==xtab]
  modaprim <- as.numeric(modas[1])
  lasmodas <- names(modas)
  modasal <- list(lasmodas=lasmodas, valormoda=modaprim)
  return(modasal)
}
#
xmoda.n <- smoda(data$diagnosis); xmoda.n
```
-->
## Histograms
```{r histogram_plot, fig.cap="\\label{fig:histogram_plot}Mean group: histograms", echo=FALSE, fig.height=10, fig.width=12}
par(mfrow=c(3, 4), oma=c(0,0,3,0))
# one histogram per predictor, with a kernel density overlay
for(j in 1:(p-1)) {
  hist(data[,j], col="red", xlab="", ylab="", main = names(data)[j], las=1, prob=TRUE)
  lines(density(data[,j]))
}
```
## Densities
Next we plot the `density` of each variable, taking the predicted class `diagnosis` into account.
```{r feature_plot_density, echo=FALSE, fig.cap="\\label{fig:feature_plot_density}Density of the predictor variables", fig.height=10, fig.width=10}
# density of each mean-group predictor, split by diagnosis
scales <- list(x=list(relation="free"), y=list(relation="free"), cex=0.8)
featurePlot(x=data.mean, y=data$diagnosis, plot="density", scales=scales,
            layout = c(4,3), auto.key = list(columns = 2), pch = "|")
```
As figure [\ref{fig:feature_plot_density}] shows, no variable separates the classes perfectly. Some express the separation fairly well (`concavity`, `area`, ...) while others show a large overlap (`fractal_dimension`, `symmetry`, ...).
## Stripchart
```{r stripchart_plot, fig.cap="\\label{fig:stripchart_plot}Stripchart", echo=FALSE, fig.height=10, fig.width=12}
par(mfrow=c(3, 4), oma=c(0,0,3,0))
for (j in 1:(p-1))
  stripchart(data[,j] ~ data$diagnosis,
             pch=19, method="stack", col=my_cols,
             xlab=names(data)[j], ylab=names(data)[p])
```
## Box Plots
<!--
Via the Box-Plot figures [\ref{fig:feature_box_plot_all2}] and [\ref{fig:feature_box_plot}],
[insert the boxplot svg image here]
-->
```{r feature_box_plot_all2, echo=FALSE, fig.cap="\\label{fig:feature_box_plot_all2}Box-Plot. All variables", fig.height=10, fig.width=12}
par(mfrow=c(3, 4), oma=c(0,0,3,0))
for (j in 1:(p-1))
  boxplot(data[,j],
          main=names(data)[j],
          ylab="",
          las=2)
```
```{r feature_box_plot, echo=FALSE, fig.cap="\\label{fig:feature_box_plot}Box-Plot, split into `Benign` and `Malign`.", fig.height=12, fig.width=10}
featurePlot(x = data.mean,
            y = data$diagnosis,
            plot = "box",
            ## Pass in options to bwplot()
            scales = list(y = list(relation="free"),
                          x = list(relation="free"),
                          cex=0.8),
            layout = c(4,4),
            auto.key = list(columns = 2))
```
# Transformations. Scaling the data. Outlier analysis
## Transformations. Scaling the data
*The data will be standardized.* Even though the variables are homogeneous, the differences between them are large (above all for `area`, compared with the rest), so the data have been *scaled* (`scale`: mean = 0 and standard deviation = 1).
```{r feature_box_plot_all_scaled, echo=FALSE, fig.cap="Box-Plot of the predictor variables before and after scaling", fig.height=6, fig.width=12}
boxplot(data.mean,
        main="Unscaled data",
        ylab="",
        names=names(data.mean),
        las=2)
data <- data.frame(scale(data.mean), diagnosis=data$diagnosis)
data.mean <- data[,-p]
boxplot(data.mean,
        main="Scaled data",
        ylab="",
        names=names(data.mean),
        las=2)
```
## Outliers
<!--
We will find the outliers. To do so we can use the box-plots to spot them visually, or the [`Outlier`](https://www.rdocumentation.org/packages/DescTools/versions/0.99.19/topics/Outlier) function from the DescTools package, or `boxplot.stats(var_name)$out`.
-->
To decide whether or not to remove outliers, we can examine our variables using Tukey's method, which flags values lying more than 1.5*IQR above the third quartile or below the first. The figure below [\ref{fig:outlier_plot}] shows the effect of removing all the outliers of the `radius` variable.
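For reference, the Tukey fences themselves can be computed directly; a minimal sketch, using the scaled `radius` column:
```{r eval=FALSE}
# Tukey fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
x <- data$radius
q <- quantile(x, c(0.25, 0.75))
fences <- c(q[[1]] - 1.5*IQR(x), q[[2]] + 1.5*IQR(x))
sum(x < fences[1] | x > fences[2])  # number of flagged observations
```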
```{r outlier_plot, echo=FALSE, fig.cap="\\label{fig:outlier_plot}Effect of removing the outliers of the `radius` variable using Tukey's method", fig.width=8, fig.height=6}
par(mfrow=c(2, 2))
boxplot(data$radius, outcol="red", main="With outliers")
hist(data$radius, main="With outliers", xlab=NA, ylab=NA)
x <- data$radius
x <- x[!x %in% boxplot.stats(x)$out]
boxplot(x, main="Without outliers")
hist(x, main="Without outliers", xlab=NA, ylab=NA)
par(mfrow=c(1, 1))
```
```{r outlier_plot2, eval=FALSE, fig.cap="\\label{fig:outlier_plot2}Analysis of removing the outliers", fig.height=10, fig.width=12, include=FALSE}
# https://www.rdocumentation.org/packages/DescTools/versions/0.99.19/topics/Outlier
# https://www.r-bloggers.com/identify-describe-plot-and-remove-the-outliers-from-the-dataset/
# https://en.wikipedia.org/wiki/Box_plot#/media/File:Boxplot_vs_PDF.svg
outlierKD <- function(dt, var) {
  var_name <- eval(substitute(var), eval(dt))
  na1 <- sum(is.na(var_name))
  m1 <- mean(var_name, na.rm = T)
  par(mfrow=c(2, 2), oma=c(0,0,3,0))
  boxplot(var_name, main="With outliers")
  hist(var_name, main="With outliers", xlab=NA, ylab=NA)
  outlier <- boxplot.stats(var_name)$out
  mo <- mean(outlier)
  var_name <- ifelse(var_name %in% outlier, NA, var_name)
  boxplot(var_name, main="Without outliers")
  hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
  title("Outlier Check", outer=TRUE)
  na2 <- sum(is.na(var_name))
  cat("Outliers identified:", na2 - na1, "\n")
  cat("Proportion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name))*100, 1), "\n")
  cat("Mean of the outliers:", round(mo, 2), "\n")
  m2 <- mean(var_name, na.rm = T)
  cat("Mean without removing outliers:", round(m1, 2), "\n")
  cat("Mean if we remove outliers:", round(m2, 2), "\n")
  #response <- readline(prompt="Do you want to remove outliers and to replace with NA? [yes/no]: ")
  #if(response == "y" | response == "yes"){
  #  dt[as.character(substitute(var))] <- invisible(var_name)
  #  assign(as.character(as.list(match.call())$dt), dt, envir = .GlobalEnv)
  #  cat("Outliers successfully removed", "\n")
  #  return(invisible(dt))
  #} else{
  #  cat("Nothing changed", "\n")
  return(invisible(var_name))
  #}
}
outlierKD(data, radius)
```
# Description of relationships among the variables
After analysing the variables one by one, we describe their characteristics jointly (in pairs).
## Feature pairs
After the brief univariate analysis, the variables are now examined in pairs.
```{r feature_pairs_plot, echo=FALSE, fig.cap="\\label{fig:feature_pairs_plot}Feature pairs", fig.height=12, fig.width=12}
scales <- list(x=list(relation="free"), y=list(relation="free"), cex=0.4)
featurePlot(x=data[,-p], y=data$diagnosis, plot="pairs", scales=scales,
            auto.key = list(columns = 2), pch=".")
```
## Correlation
### Correlation with the diagnosis class
```{r echo=FALSE}
# Correlation ratio (indicator)
# http://fr.wikipedia.org/wiki/Rapport_de_corr%C3%A9lation
#
# A function that computes the correlation ratio, eta2
#
eta2 <- function(x, factor)
{
  niv <- levels(factor)
  numniv <- length(niv)
  SSB <- 0
  for(i in 1:numniv)
  {
    xx <- x[factor==niv[i]]
    nxx <- length(xx)
    SSB <- SSB + nxx*(mean(xx)-mean(x))^2
  }
  SST <- (length(x)-1)*var(x)
  #
  eta2 <- SSB/SST
  #
  return(eta2)
}
#
# 0 <= eta2 <= 1
# If eta2=0, there is no correlation between 'x' and 'y':
#   the group means are all equal
# If eta2=1, there is a functional dependence between 'x' and 'y':
#   no variability within the categories
#
etados <- vector()
for(j in 1:(p-1)) etados[j] <- eta2(data[,j], data[,p])
etados <- round(etados, digits = 3)
names(etados) <- names(data)[1:(p-1)]
# build the table, sorted from most to least related
eta2_table <- as.matrix(sort(etados, decreasing = TRUE))
kable(eta2_table, caption = "Correlation ratio (eta2)")
```
The variables most related to the **`diagnosis`** factor, according to the [correlation ratio](http://fr.wikipedia.org/wiki/Rapport_de_corr%C3%A9lation), are plotted below.
```{r echo=FALSE, fig.height=4, fig.width=10}
par(mfrow=c(1, 2), oma=c(0,0,3,0))
stripchart(data$concave.points ~ data$diagnosis,
           pch=19, method="stack", col=my_cols,
           xlab="concave.points", ylab=names(data)[p], main=" ")
stripchart(data$perimeter ~ data$diagnosis,
           pch=19, method="stack", col=my_cols,
           xlab="perimeter", ylab=names(data)[p], main=" ")
```
```{r eval=FALSE, fig.height=8, fig.width=8, include=FALSE}
### Pearson correlation
nc = ncol(data)
df <- data[, 1:(nc-1)]  # drop the diagnosis factor before cor()
correlations <- cor(df, method="pearson")
corrplot(correlations, number.cex = .9, method = "square",
         hclust.method = "ward", order = "FPC",
         type = "full", tl.cex=0.8, tl.col = "black")
```
\newpage
### Correlation among all the variables
Continuing the analysis, we examine the correlation among all the variables:
```{r echo=FALSE}
d <- round(cor(data[,-p]), digits=3)
kable(d, "latex", caption = "Correlation table") %>%
  kable_styling(latex_options = c("hold_position", "scale_down"))
```
```{r feature_correlation_plot, echo=FALSE, fig.cap="\\label{fig:feature_correlation_plot}Correlation", fig.height=8, fig.width=8}
corr_mat <- cor(data[,1:(p-1)])
corrplot(corr_mat, order = "hclust", tl.cex = 0.8, addrect = 8)
```
### Korrelazioa handiko pareak | |
Korrelazio irudian [\ref{fig:feature_correlation_plot}] ikusten den bezala, korrelazio handia agertzen da zenbait aldagaien artean, besteak beste: | |
- `area`, `radius` eta `perimeter` (%99) Azken finean erradioa, azalera eta perimetroak erlazio geometriko zuzena dute | |
- `concavity` eta `concave.points` | |
Ikus ditzagun hauetako batzuk modu zehatzago batean: | |
```{r feature_plot_corr2, echo=FALSE, fig.cap="\\label{fig:feature_plot_corr2}Some correlations", fig.height=6, fig.width=8, message=FALSE}
b1 <- ggplot(data, aes(x=radius, y=perimeter, col=diagnosis)) + geom_point(alpha=0.5)
b2 <- ggplot(data, aes(x=area, y=radius, col=diagnosis)) + geom_point(alpha=0.5)
b4 <- ggplot(data, aes(x=concavity, y=concave.points, col=diagnosis)) + geom_point(alpha=0.5)
grid.arrange(b1, b2, b4, ncol=2)
```
It can be seen that for some of the highly correlated pairs there is a good separation between the `B` and `M` diagnoses, and for others not so much.
### Inversely correlated pairs
```{r feature_plot_corr3, echo=FALSE, fig.cap="\\label{fig:feature_plot_corr3}Inverse correlations", fig.height=4, fig.width=8, message=FALSE}
b5 <- ggplot(data, aes(x=radius, y=fractal_dimension, col=diagnosis)) + geom_point(alpha=0.5)
b6 <- ggplot(data, aes(x=area, y=fractal_dimension, col=diagnosis)) + geom_point(alpha=0.5)
grid.arrange(b5, b6, ncol=2)
```
### Pairs with low correlation
```{r feature_plot_corr4, echo=FALSE, fig.cap="\\label{fig:feature_plot_corr4}Pairs with low correlation", fig.height=6, fig.width=8, message=FALSE}
b9 <- ggplot(data, aes(x=fractal_dimension, y=area, col=diagnosis)) + geom_point(alpha=0.5)
b10 <- ggplot(data, aes(x=fractal_dimension, y=radius, col=diagnosis)) + geom_point(alpha=0.5)
b11 <- ggplot(data, aes(x=texture, y=smoothness, col=diagnosis)) + geom_point(alpha=0.5)
b12 <- ggplot(data, aes(x=perimeter, y=fractal_dimension, col=diagnosis)) + geom_point(alpha=0.5)
grid.arrange(b9, b10, b11, b12, ncol=2)
```
As in the previous case, the pairs with low correlation [\ref{fig:feature_plot_corr4}] also show a good separation between the `B` and `M` diagnoses in some cases, and not in others.
# Data splitting {#data_splitting}
For the analysis, the data will be split into `training` and `testing` parts. The `createDataPartition` function performs a balanced (stratified) split; that is, it keeps the same diagnosis percentages in each part. For example, to make an 80/20 split of our dataset:
```{r}
set.seed(2019)
inTrain <- createDataPartition(data$diagnosis, p=0.80, list = FALSE, times = 1)
#head(inTrain)
```
Once the split is made, we can use it like this:
```{r}
training <- data[inTrain,]
testing <- data[-inTrain,]
prop.table(table(training$diagnosis))
prop.table(table(testing$diagnosis))
```
Both the `training` and `testing` sets have kept the same `diagnosis` factor proportions.
# Unbalanced Data {#data_splitting_unbalanced}
As we have seen, our dataset is unbalanced: the percentages of the variable to be predicted, `M` (Malign) and `B` (Benign), are very different. Among the methods that can be used to correct this imbalance are downsampling, oversampling and SMOTE. From the caret documentation:
`downSample will randomly sample a data set so that all classes have the same frequency as the minority class. upSample samples with replacement to make the class distributions equal`
\hfill\break
\hfill\break
## Downsampling
```{r}
table(data$diagnosis)
prop.table(table(data$diagnosis))
set.seed(2019)
down_train <- downSample(x = data[, -p],
                         y = data$diagnosis)
table(down_train$Class)
prop.table(table(down_train$Class))
```
As can be seen, the classes are now balanced: the counts have been reduced to match those of `M`.
## Oversampling
```{r}
up_train <- upSample(x = data[, -p], y = data$diagnosis)
table(up_train$Class)
prop.table(table(up_train$Class))
```
The data are balanced: the number of `M` diagnoses now matches that of `B`.
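For completeness, SMOTE generates synthetic minority cases instead of duplicating existing ones. A sketch, assuming the (now archived) `DMwR` package is installed; the `perc.over`/`perc.under` values are illustrative:
```{r eval=FALSE}
library(DMwR)
set.seed(2019)
# +100% synthetic M cases; keep 2 B cases per synthetic M -> roughly balanced
smote_train <- SMOTE(diagnosis ~ ., data = data, perc.over = 100, perc.under = 200)
table(smote_train$diagnosis)
```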
# PCA
The dataset has 30 *predictor* variables in total, but recall that our analysis only considers those of the **mean** group, which leaves 10 variables. Using PCA we can reduce the number of these variables while losing as little information as possible.
\hfill\break
```{r feature_plot_pca, echo=FALSE, fig.cap="\\label{fig:feature_plot_pca}PCA", fig.height=6, fig.width=8, message=FALSE}
pca_res <- prcomp(data[,1:(p-1)], center = TRUE, scale = TRUE)
plot(pca_res, type="l", main="PCA principal components")
```
```{r feature_plot_pca2, echo=FALSE, fig.cap="\\label{fig:feature_plot_pca2}PC1-PC2: 80% of the data variance", fig.height=6, fig.width=8}
# https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html
library(ggfortify)
autoplot(pca_res, data = data, colour = 'diagnosis', alpha = 0.5)
```
```{r feature_pca_summary, echo=FALSE, fig.cap="\\label{fig:feature_pca_summary}PCA"}
summary(pca_res)
```
\hfill\break
Taking two components (PC1 and PC2) captures 80% of the data variance. To reach 99% of the variance, components PC1 through PC6 are needed.
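Those thresholds can be read directly off the variance explained by each component; a quick sketch using the `pca_res` object fitted above:
```{r eval=FALSE}
ve <- pca_res$sdev^2 / sum(pca_res$sdev^2)  # proportion of variance per PC
round(cumsum(ve), 3)                        # cumulative proportion
which(cumsum(ve) >= 0.80)[1]                # components needed for 80%
which(cumsum(ve) >= 0.99)[1]                # components needed for 99%
```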
```{r eval=FALSE, include=FALSE}
pca_df <- as.data.frame(pca_res$x)
ggplot(pca_df, aes(x=PC1, y=PC2, col=data$diagnosis)) + geom_point(alpha=0.5)
```
<!--
**Prediction:**
We can use the predict function if we observe new data and want to predict their PC values. Just for illustration, pretend the last two rows of the Cancer data have just arrived and we want to see what their PC values are:
-->
```{r eval=FALSE, include=FALSE}
# predict the PCs of new observations (on the prcomp object, predictors only)
predict(pca_res, newdata=tail(data[,1:(p-1)], 2))
```
```{r eval=FALSE, include=FALSE}
# Logistic Regression
# https://www.machinelearningplus.com/machine-learning/logistic-regression-tutorial-examples-r/
```
# Application of different techniques to the dataset
The techniques used below are `K-Means`, `Decision Tree` and `KNN`. The intention has been to use both supervised and unsupervised techniques, even though the dataset we are working with is labelled (Benign/Malign).
## K-Means
[K-Means](https://en.wikipedia.org/wiki/K-means_clustering) is an unsupervised clustering technique. It forms `K` groups based on the `mean` (or median). In the following analysis, even though we know we have *two groups (diagnosis)*, we will pretend not to and set **`K=3`**. Before applying the K-Means method we *standardized the data*: although the variables are homogeneous, the differences between them are large, so the data were *scaled*.
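Before fixing `K`, it can help to see how the goodness index (between-groups sum of squares over the total) evolves with `K`; a minimal sketch (the range of `K` and `nstart` are arbitrary choices):
```{r eval=FALSE}
set.seed(2019)
ib <- sapply(1:8, function(k)
  with(kmeans(data[, -p], centers = k, nstart = 10), betweenss/totss))
plot(1:8, ib, type = "b", xlab = "k", ylab = "between-SS / total-SS")
```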
**k = 3**
```{r echo=FALSE}
# K-Means
#data0 <- scale(data[-1])  # scale and drop the 'diagnosis' factor
data0 <- data[,-p]
set.seed(2019)
k <- 3
km <- kmeans(data0, k)
#
IB <- km$betweenss/km$totss  # goodness index
# squared Euclidean distances
distancias <- dist(data0, method="euclidean", upper=TRUE, diag=TRUE)
distancias2 <- distancias^2
#
```
```{r k-means, echo=FALSE, fig.cap="\\label{fig:k-means}k-means. IB=0.5: class homogeneity is not very good.", fig.height=16, fig.width=16}
pairs(data0, col=km$cluster+1, las=1,
      main=paste("Cancer dataset\nk-means, k=", k, ", IB=", round(IB,4)),
      font.main=4,
      pch=".", cex=0.75)
```
```{r k-means2, echo=FALSE, fig.cap="\\label{fig:k-means2}k-means. Three classes: location of the class centres and their sizes. IB=0.5: class homogeneity is not very good. In the `radius`~`texture` plot the centres of groups 1 and 2 are very close.", fig.height=10, fig.width=12}
kmeansplot <- function(km, elecx, elecy) {
  plot(data0[,c(elecx,elecy)], pch=20, cex=0.75, col=km$cluster+1, las=1,
       main=paste("Cancer data set\nk-means, k=", k, ", IB=", round(IB,4)),
       font.main=4, xlab=names(data0)[elecx], ylab=names(data0)[elecy])
  #
  # location of the class centres
  #
  kmcenters <- matrix(rep(0, k*ncol(data0)), ncol=ncol(data0))
  for(j in 1:ncol(data0))
    kmcenters[,j] <- tapply(data0[,j], km$cluster, mean)
  rownames(kmcenters) <- 1:k; colnames(kmcenters) <- colnames(data0)
  #
  etclases <- unique(km$cluster)  # class labels
  #
  points(kmcenters[etclases,c(elecx,elecy)], col="#000000", pch=1, cex=2, lwd=2)
  #
  legend("bottomright", bty="n", x.intersp=1, y.intersp=1,
         legend=paste(etclases, ' (', km$size[etclases], ')', sep=""),
         pch=20, col=etclases, text.col=etclases, ncol=1, cex=0.75)
}
par(mfrow=c(2, 2))
kmeansplot(km, 1,2)
kmeansplot(km, 7,5)
kmeansplot(km, 10,3)
kmeansplot(km, 6,4)
```
**k = 2**
```{r k-means3, echo=FALSE, fig.cap="\\label{fig:k-means3}k-means. Two classes: location of the class centres and their sizes. IB=0.4: class homogeneity is worse than with k=3.", fig.height=10, fig.width=12}
# K-Means
data0 <- data[,-p]
set.seed(2019)
k <- 2
km2 <- kmeans(data0, k)
#
IB <- km2$betweenss/km2$totss  # goodness index
#
par(mfrow=c(2, 2))
kmeansplot(km2, 1,2)
kmeansplot(km2, 7,5)
kmeansplot(km2, 10,3)
kmeansplot(km2, 6,4)
```
## Decision Tree
A [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree) is a *supervised* classification technique. The result obtained (see the figure below) is graphically very expressive: the reason behind each prediction is easy to see, since it is based on a few comparisons of variable values.
```{r decision_tree, echo=FALSE, fig.cap="\\label{fig:decision_tree}Decision tree.", fig.height=10, fig.width=12}
library(rpart)
library(rpart.plot)
#
# build a 'decision tree' classifier
#
set.seed(2019)
tree <- rpart(diagnosis ~ ., data=training)
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)
```
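As a complement to the plot, the fitted tree can also be printed as plain-text rules; a sketch, assuming `rpart.plot` version 2.0 or later (which provides `rpart.rules`):
```{r eval=FALSE}
# one row per leaf: the fitted class, its probability and the rule's coverage
rpart.rules(tree, cover = TRUE)
```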
```{r echo=FALSE}
predictdatosdt <- predict(tree, testing[,-p], type="class")
( tabledatos <- table(testing$diagnosis, predictdatosdt, dnn=c("CLASS", "Predict")) )
diagtabledatos <- 0
for(j in 1:nlevels(testing$diagnosis)) diagtabledatos <- diagtabledatos + tabledatos[j,j]
err <- 1-diagtabledatos/sum(tabledatos)
```
```{r echo=FALSE}
# build the cross-validation blocks (folds)
#
n <- nrow(data)
nlevel <- nlevels(testing$diagnosis)
#
B <- 7
#
tamanno <- n%/%B
set.seed(2019)
alea <- runif(n)
rang <- rank(alea)
bloque <- (rang-1)%/%tamanno + 1
bloque <- as.factor(bloque)
#
#summary(bloque)
# cross-validation
#
err.t <- numeric(0)
for(b in 1:B)
{
  #
  training2 <- which(bloque!=b)
  test <- which(bloque==b)
  #
  datosdt <- rpart(diagnosis ~ . , data=data[training2, ], cp=0.001)
  #
  # predict the held-out block with the tree fitted on the other blocks
  predictdatosdt <- predict(datosdt, data[test,-p], type="class")
  #
  tabledatos <- table(data[test,p], predictdatosdt)
  #
  diagtabledatos <- 0
  for(j in 1:nlevel) diagtabledatos <- diagtabledatos + tabledatos[j,j]
  err.b <- 1-diagtabledatos/sum(tabledatos)  # fold error; keep `err` as the apparent error
  #
  err.t <- rbind(err.t, err.b)
}
#
# vector of the collected fold errors
#
#kable(err.t, caption = "Vector of the errors obtained")
err.cv <- mean(err.t)
```
The *apparent error rate* was **`r err`** and the *true error rate*, estimated by cross-validation, **`r err.cv`**. We will now test the tree (**predict**) on one particular case (the 13th) and see whether it gets its diagnosis right:
```{r echo=FALSE}
new <- testing[13,-p]
(pr <- predict(tree, new, type="class"))
(testing[13,p] == pr)
```
It got the diagnosis right.
## KNN
[KNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) is a *supervised classification* technique. We will try it on our dataset. We will compute the error rates for different values of `K` (up to Kmax = 30) using cross-validation (B = 6 blocks), in order to find the most suitable `K`.
```{r knn_plot, echo=FALSE, fig.cap="\\label{fig:knn_plot}KNN error rates. Minimum at `K = 7`.", fig.height=10, fig.width=12}
library(class)
nlevel <- length(levels(data[,p]))
#
# build the cross-validation blocks
#
B <- 6
tamanno <- n%/%B
set.seed(2019)
alea <- runif(n)
rang <- rank(alea)
bloque <- (rang-1)%/%tamanno + 1
bloque <- as.factor(bloque)
#
# KNN for different values of k
#
kmax <- 30
err.cv <- rep(NA, kmax)
#
for(k in 1:kmax){
  #
  err.t <- vector()
  #
  for(b in 1:B){
    #
    training <- which(bloque!=b)
    test <- which(bloque==b)
    #
    datosknn <- knn(data[training, -p], data[test, -p], cl=data[training, p], k)
    tabledatos <- table(data[test,p], datosknn)
    diagtabledatos <- 0
    for(j in 1:nlevel) diagtabledatos <- diagtabledatos + tabledatos[j,j]
    err.b <- 1-diagtabledatos/sum(tabledatos)
    #
    err.t <- rbind(err.t, err.b)
  }
  #
  err.cv[k] <- mean(err.t)
  #
}
#
#round(err.cv, digits=4)
#
plot(1:kmax, err.cv, xlab="k", ylab="", main="Error rates. Minimum at k = 7", las=1, type="b", pch=19, col="red")
#
```
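As a cross-check on the hand-rolled loop above, `caret` can tune `k` over the same grid with 6-fold cross-validation; a sketch (caret reports accuracy rather than the error rate by default):
```{r eval=FALSE}
set.seed(2019)
knn_fit <- train(diagnosis ~ ., data = data, method = "knn",
                 trControl = trainControl(method = "cv", number = 6),
                 tuneGrid = data.frame(k = 1:30))
knn_fit$bestTune  # the k with the best cross-validated accuracy
```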
```{r echo=FALSE}
# the minimum error rate is at k=7
# err.cv is the estimate of the true error rate (cross-validation)
#
# computation of the apparent error rate
#
datosknn <- knn(data[, -p], data[, -p], cl=data[, p], k=7)
tabledatos <- table(data[,p], datosknn)  # confusion table on the full data
diagtabledatos <- 0
for(j in 1:nlevel) diagtabledatos <- diagtabledatos + tabledatos[j,j]
err <- 1-diagtabledatos/sum(tabledatos)
#
```
The smallest error rate obtained, with `K = 7`, was **`r min(round(err.cv, digits=4))`**. The *apparent* error rate is **`r err`**.
<!--
## Naive Bayes
```{r}
# build the classifier
#
library(e1071)  # alternatively library(klaR), function 'NaiveBayes()'
#
datosnb <- naiveBayes(diagnosis ~ ., data=data)
#
err.t <- numeric(0)
for(b in 1:B)
{
  #
  training2 <- which(bloque!=b)
  test <- which(bloque==b)
  #
  predictdatosdt <- predict(datosnb, data[test,-p], type="class")
  #
  tabledatos <- table(data[test,p], predictdatosdt)
  #
  diagtabledatos <- 0
  for(j in 1:nlevel) diagtabledatos <- diagtabledatos + tabledatos[j,j]
  err <- 1-diagtabledatos/sum(tabledatos)
  #
  err.t <- rbind(err.t, err)
}
#
#err.t
( err.cv <- mean(err.t) )
```
-->
<!--\newpage-->
# Conclusions
As a conclusion of the analysis of the **Breast Cancer Wisconsin** data, it can be said that no special preprocessing was needed: the original data come already *tidy*, which eased the *cleaning* work that usually precedes an analysis. As mentioned, although the dataset has 30 predictor variables in total, the 10 of the *mean* group were used for the study.
Three techniques were used: **K-Means**, **Decision Tree** and **KNN**. The intention was to use both *supervised* and *unsupervised* techniques. No comparison between them was made.
# R Code
The project's R code is available [here]().
# References
.
<!--
# R Code
```{r code=readLines(knitr::purl('EAD_Proiektua.Rmd', documentation = 2)), eval = FALSE}
```
-->
<!--
```{r ref.label=knitr::all_labels(), echo = T, eval = F}
```
-->