Created
June 15, 2013 13:47
-
-
Save cristiano74/5788196 to your computer and use it in GitHub Desktop.
Chi square - python.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#Chi square - python | |
[toc] | |
##Definizione del processo | |
1. Generazione di una popolazione a partire da una tupla generica | |
2. estrazione casuale di un campione di n_elementi dalla pop creata | |
3. calcolo della frequenze attese e ossevarte in base alla tupla estratta | |
4. calcolo del chi-quadrato con p_value associato e definizione dei livelli di accuratezza del risulato. | |
+ State your conclusion in terms of your hypothesis. | |
+ **If the p value for the calculated chi-square is p > 0.05, accept your hypothesis. 'The deviation is small enough that chance alone accounts for it**. | |
+ A p value of **0.6**, for example, means that **there is a 60% probability that any deviation from expected is *due to chance only***. This is within the range of acceptable deviation. | |
+ If the p value for the calculated chi-square is ***p < 0.05*, REJECT your hypothesis, and conclude that some factor other than chance** is operating for the deviation to be so great. | |
+ For example, a p value of 0.01 means that there is only a 1% chance that this deviation is due to chance alone. Therefore, other factors must be involved. | |
+ The chi-square test will be used to test for the "goodness to fit" between observed and expected data from several laboratory investigations in this lab manual. | |
LINK (http://www2.lv.psu.edu/jxm57/irp/chisquar.html) | |
##Esempio | |
**La tupla generica estratta è:** | |
` px [0.0, 0.05, 0.15, 0.1, 0.35, 0.15, 0.2]` | |
Da una popolazione di 100,000 elementi , estraggo 1000 elementi: | |
Quindi la freq. **attesa** assoluta (FA) è la seguente (totale = 1000 elementi): | |
`fa_pre [50, 150, 100, 350, 150, 200]` | |
*Notare che il **valore 0** è stato eliminato dal vettore, per [bug con il sistema di calcolo python][6] | |
Quindi la freq. **osservata** assoluta (FO) è : | |
` fo_pre [53, 155, 96, 332, 162, 202]` | |
Il confronto fra FA e FO: | |
`[ 50 150 100 350 150 200] --> FA` | |
`[ 53 155 96 332 162 202] --> FO` | |
Calcolo del pvalue: **`Chi square test: 0.789628937062`** | |
A ciscun valore di p_value corrisponde un livello di accuratezza, come nella tabella sotto: | |
Dalla elaborazione della [v1.1](#v11), appare come risultato quanto segue: | |
###Risultato ottenuto con [CSTP](#cstp) v1.1 | |
BASH line$ in linux: `python test4.py 500 100000 1000 10 ` | |
` ('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 149, ' Med:', 201, ' Low:', 150, 'nullo', 0) | |
` | |
name | label | |
------------------------| | |
'pop=', 100000, | 100k casi considerati come pop | |
'campione=', 1000 | 1000 casi estratti (sample) | |
'Tuple estratte=' 500, | 500 combinazioni di freschezza su 230k totali utilizzate nel processo | |
' High:', 149, | 149 su 500 combinazioni hanno dato p value alto (vedere [tabella](#ptable)) | |
' Med:', 201, | 201 su 500 combinazioni hanno dato p value medio | |
' Low:', 150, | 150 su 500 combinazioni hanno dato p value medio | |
'nullo', 0 | da trascurare - solo per controllo | |
Il processo è ripetuto n volte, con la stessa numerosità della poopolazione e del campione, ma con differenti tuple estratte. | |
I risultati sono memorizzati nel file [**SAMPLERESULTS_CHISQUARE.txt**](https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/SAMPLERESULTS_CHISQUARE.txt?edit]). | |
Un estratto del file : | |
``` | |
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 139, ' Med:', 221, ' Low:', 140, 'nullo', 0) | |
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 163, ' Med:', 177, ' Low:', 160, 'nullo', 0) | |
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 170, ' Med:', 179, ' Low:', 151, 'nullo', 0) | |
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 142, ' Med:', 209, ' Low:', 149, 'nullo', 0) | |
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 146, ' Med:', 195, ' Low:', 159, 'nullo', 0) | |
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 146, ' Med:', 204, ' Low:', 150, 'nullo', 0) | |
``` | |
###<a id="ptable">Tabella di referimento per il p_value</a> | |
p value | level | |
--------- | ----- | |
<0.30 | low accuracy | |
0.30-0.70 | medium accuracy | |
\>0.70 | High accuracy | |
---------- | |
## CODE | |
### Create chi square test process [CSTP] <a id="cstp"></a>- test4.py | |
> note: Chi Square Test Process = CSTP | |
**`CODICE per la generazione del pvalue CHI SQUARE`** | |
[source code ] [7] | |
####<a id="v11">V1.1</a> | |
> - introdotto il ciclo sulle ripetizioni del processo, param 4 | |
> - la lista delle tuple viene ri-estratta ogni ciclo | |
> - save dei risultati in: [**SAMPLERESULTS_CHISQUARE.txt**](https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/SAMPLERESULTS_CHISQUARE.txt?edit]) | |
> - dipende dalle tuple create con [Cretate tuples v1.0](#createtuplesv1.0) | |
``` | |
############################################ | |
# USAGE: | |
# python test4.py 100 100000 5000 | |
# param1= numero estrazioni casuali di tuple | |
# param2= num pop | |
# param3= num sample | |
# param4= num di ripetizioni del processo | |
############################################ | |
import sys | |
import random | |
import collections | |
import numpy | |
#QUI SI APRE IL FILE CON LE COMBINAZIONI | |
################################# | |
lista=[] | |
import pickle | |
#f = file("test.txt", "r") | |
f = file("outfile.txt", "r") | |
lista=pickle.load(f) | |
############################# | |
#from scipy.stats import mannwhitneyu | |
#from scipy.stats import chisquare | |
#from scipy.stats import kruskal | |
from scipy.stats import rv_discrete | |
import scipy.stats.mstats as mst | |
tuple_tot = 0 | |
acc=0 | |
high = 0 | |
low = 0 | |
med = 0 | |
nullo = 0 | |
p_value=0 | |
# a= random.sample(range(100), 100) | |
n_extract = int(sys.argv[1]) #, indare numero di estrazioni casuali di tuple dalle permutazioni | |
fa = [] | |
fa_pre = [] | |
fo_pre = [] | |
fo =[] | |
tot=int(sys.argv[2]) | |
Nsample = int(sys.argv[3]) | |
for i in xrange(int(sys.argv[4])): | |
lista = random.sample(lista,n_extract) | |
acc=0 | |
high = 0 | |
low = 0 | |
med = 0 | |
nullo = 0 | |
p_value=0 | |
for e in lista: | |
print e | |
#px=[float(e[0]/100),float(e[1]/100),float(e[2]/100),float(e[3]/100),float(e[4]/100),float(e[5]/100),float(e[6]/100)] # 7 rep] | |
e1= ((e[0]/float(100))) | |
e2= ((e[1]/float(100))) | |
e3= ((e[2]/float(100))) | |
e4= ((e[3]/float(100))) | |
e5= ((e[4]/float(100))) | |
e6= ((e[5]/float(100))) | |
e7= ((e[6]/float(100))) | |
px=[e1,e2,e3,e4,e5,e6,e7] # 7 rep] | |
print "px", px | |
x =[1,2,3,4,5,6,7] | |
fa_pre=([int(e1*Nsample),int(e2*Nsample),int(e3*Nsample),int(e4*Nsample),int(e5*Nsample),int(e6*Nsample),int(e7*Nsample)]) | |
fa_pre[:] = (value for value in fa_pre if value != 0.0) | |
print "fa_pre",fa_pre | |
fa= numpy.array(fa_pre) | |
#fa = numpy.array([int(e1*Nsample),int(e2*Nsample),int(e3*Nsample),int(e4*Nsample),int(e5*Nsample),int(e6*Nsample),int(e7*Nsample)]) | |
pop = rv_discrete(values=(x, px)).rvs(size=tot) | |
random.shuffle(pop) | |
sample1 = random.sample(pop,Nsample) | |
counter_sample=collections.Counter(sample1) | |
fo_pre = [counter_sample[1],counter_sample[2],counter_sample[3],counter_sample[4],counter_sample[5],counter_sample[6],counter_sample[7]] | |
fo_pre[:] = (value for value in fo_pre if value != 0.0) | |
print "fo_pre",fo_pre | |
fo = numpy.array(fo_pre) | |
#fo = numpy.array([counter_sample[1],counter_sample[2],counter_sample[3],counter_sample[4],counter_sample[5],counter_sample[6],counter_sample[7]]) | |
print fa | |
print fo | |
#print mst.chisquare(fo, fa) | |
#print(samples) | |
try: | |
u, p_value = mst.chisquare(fo, fa) | |
#u, p_value = 2*mannwhitneyu(pop, sample1) | |
#u, p_value = kruskal(pop, sample1) | |
if p_value > 0.70: | |
acc += 1 | |
test = "HIGH precision" | |
high += 1 | |
else: | |
if p_value < 0.30: | |
test = "Low precision" | |
low += 1 | |
else: | |
test = "Medium precision" | |
med += 1 | |
except: | |
nullo +=1 | |
p_value = 0 | |
pass | |
print "Chi square test:", p_value | |
final = "<<<Risultato>>>",'pop=', tot, "; campione=", Nsample ,"Estrazioni=",n_extract, "High:",high," Med:",med," Low:",low, "nullo", nullo | |
print final | |
log = open("SAMPLERESULTS_CHISQUARE.txt","a") | |
print >> log,final | |
log.close() | |
``` | |
####v 1.0 | |
############################################ | |
# USAGE: | |
# python test4.py 100 100000 5000 | |
# param1= replicazioni test, casuali tuple | |
# param2= num pop | |
# param3= num sample | |
############################################ | |
import sys | |
import random | |
import collections | |
import numpy | |
#QUI SI APRE IL FILE CON LE COMBINAZIONI | |
################################# | |
lista=[] | |
import pickle | |
#f = file("test.txt", "r") | |
f = file("outfile.txt", "r") | |
lista=pickle.load(f) | |
############################# | |
#from scipy.stats import mannwhitneyu | |
#from scipy.stats import chisquare | |
#from scipy.stats import kruskal | |
from scipy.stats import rv_discrete | |
import scipy.stats.mstats as mst | |
tuple_tot = 0 | |
acc=0 | |
high = 0 | |
low = 0 | |
med = 0 | |
nullo = 0 | |
p_value=0 | |
# a= random.sample(range(100), 100) | |
n_extract = int(sys.argv[1]) #, indare numero di estrazioni casuali di tuple dalle permutazioni | |
lista = random.sample(lista,n_extract) | |
fa = [] | |
fa_pre = [] | |
fo_pre = [] | |
fo =[] | |
for e in lista: | |
tot=int(sys.argv[2]) | |
Nsample = int(sys.argv[3]) | |
print e | |
#px=[float(e[0]/100),float(e[1]/100),float(e[2]/100),float(e[3]/100),float(e[4]/100),float(e[5]/100),float(e[6]/100)] # 7 rep] | |
e1= ((e[0]/float(100))) | |
e2= ((e[1]/float(100))) | |
e3= ((e[2]/float(100))) | |
e4= ((e[3]/float(100))) | |
e5= ((e[4]/float(100))) | |
e6= ((e[5]/float(100))) | |
e7= ((e[6]/float(100))) | |
px=[e1,e2,e3,e4,e5,e6,e7] # 7 rep] | |
print "px", px | |
x =[1,2,3,4,5,6,7] | |
fa_pre=([int(e1*Nsample),int(e2*Nsample),int(e3*Nsample),int(e4*Nsample),int(e5*Nsample),int(e6*Nsample),int(e7*Nsample)]) | |
fa_pre[:] = (value for value in fa_pre if value != 0.0) | |
print "fa_pre",fa_pre | |
fa= numpy.array(fa_pre) | |
#fa = numpy.array([int(e1*Nsample),int(e2*Nsample),int(e3*Nsample),int(e4*Nsample),int(e5*Nsample),int(e6*Nsample),int(e7*Nsample)]) | |
pop = rv_discrete(values=(x, px)).rvs(size=tot) | |
random.shuffle(pop) | |
sample1 = random.sample(pop,Nsample) | |
counter_sample=collections.Counter(sample1) | |
fo_pre = [counter_sample[1],counter_sample[2],counter_sample[3],counter_sample[4],counter_sample[5],counter_sample[6],counter_sample[7]] | |
fo_pre[:] = (value for value in fo_pre if value != 0.0) | |
print "fo_pre",fo_pre | |
fo = numpy.array(fo_pre) | |
#fo = numpy.array([counter_sample[1],counter_sample[2],counter_sample[3],counter_sample[4],counter_sample[5],counter_sample[6],counter_sample[7]]) | |
print fa | |
print fo | |
#print mst.chisquare(fo, fa) | |
#print(samples) | |
try: | |
u, p_value = mst.chisquare(fo, fa) | |
#u, p_value = 2*mannwhitneyu(pop, sample1) | |
#u, p_value = kruskal(pop, sample1) | |
if p_value > 0.70: | |
acc += 1 | |
test = "HIGH precision" | |
high += 1 | |
else: | |
if p_value < 0.30: | |
test = "Low precision" | |
low += 1 | |
else: | |
test = "Medium precision" | |
med += 1 | |
except: | |
nullo +=1 | |
p_value = 0 | |
pass | |
print "Chi square test:", p_value | |
final = "<<<Risultato>>>",'pop=', tot, "; campione=", Nsample ,"Estrazioni=",n_extract, "High:",high," Med:",med," Low:",low, "nullo", nullo | |
print final | |
###<a id="createtuples">Create tuples-test1.py</a> | |
> file per la generazione delle tuple con permutazione , con i seguenti valori di percentuale: | |
`v=[0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100];` | |
>[Source code for test1.py][8] | |
Il risultato delle 230230 tuple è memorizzato nel file OUTFILE.TXT | |
####<a id="createtuplesv1.0">V1.0</a> | |
> esempio BASH : **python filename.py 1** | |
> il valore 1 indica che il testmode=1, cioe' falso. | |
``` | |
################################################## | |
#QUESTO FILE GENERA LE TUPLE DA UTILIZZARE NEL FILE | |
#DA ESEGUIRE UNA SOLA VOLTA | |
#IL FILE : OUTFILE.TXT | |
################################################## | |
import itertools; | |
v=[0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]; | |
import sys | |
# esempio BASH : python filename.py 1 | |
#testmode=0 | |
testmode=int(sys.argv[1]) | |
target=100 #somma da raggiungere - test | |
if testmode==1: | |
n_rep=3 | |
else: | |
n_rep=7 | |
############### CREAZIONE TUPLA | |
combos = [a for a in itertools.product(v,repeat=n_rep) if sum(a)==target] | |
############################# | |
lista = [] | |
pop = [] | |
############### AGGIUNGE TUPLA ALLA LISTA | |
for e in combos: | |
# print e #indica la tupla | |
lista.append(e) | |
######################################### | |
############### SCRIVE LISTA SU FILE OUTFILE.TXT | |
import pickle | |
f = file("outfile.txt", "w") | |
pickle.dump(lista, f) | |
#print lista | |
################################## | |
``` | |
##Notes | |
####Chi square Goodness of fit: - python | |
1. [Articolo di un modulo python per chi square GOF][4] | |
2. [**How to Test Your Discrete Distribution- Minitab**] [5] | |
3. [Difference between scipy.stats.mstats.chisquare and | |
scipy.stats.mstats in Python][6] | |
1. [p VALUE - INTERPRETATION][9] | |
1. [How to generate numbers based on an arbitrary **discrete distribution**?][2] | |
2. [ANOVA, Nonparametric Tests, and More Added to SVS via New **Python** Capabilities][3] | |
####PYTHON CODE. SUM and permutation | |
1. http://mathoverflow.net/questions/9477/uniquely-generate-all-permutations-of-three-digits-that-sum-to-a-particular-value | |
2. http://stackoverflow.com/questions/4632322/finding-all-possible-combinations-of-numbers-to-reach-a-given-sum | |
4. Utilizzo *LIST* in python: http://effbot.org/zone/python-list.htm | |
5. [Sample size for pilot studies](http://www.statstodo.com/SSiz2Props_Pgm.php) | |
####Algorithm | |
1. http://web.eecs.utk.edu/~vose/Publications/random.pdf | |
2. Bootstrap Python : https://github.com/cgevans/scikits-bootstrap | |
3. [wilcoxon](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wicoxon.html#scipy.stats.wilcoxon) | |
5. [mann withney](http://www.alglib.net/hypothesistesting/mannwhitneyu.php) | |
---------- | |
[1]: https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/6_example.py?edit | |
[2]: http://stats.stackexchange.com/questions/26858/how-to-generate-numbers-based-on-an-arbitrary-discrete-distribution | |
[3]:[http://blog.goldenhelix.com/?p=595] | |
[4]: (http://statsmodels.sourceforge.net/stable/stats.html#goodness-of-fit-tests-and-measures) | |
[5]: http://blog.minitab.com/blog/adventures-in-statistics/how-to-test-your-discrete-distribution | |
[6]:(http://stackoverflow.com/questions/13913572/difference-between-scipy-stats-mstats-chisquare-and-scipy-stats-mstats-in-python) | |
[7]:https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/test4.py?edit | |
[8]:https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/test1.py?edit | |
[9]: (http://www2.lv.psu.edu/jxm57/irp/chisquar.html) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment