Stata log of the entire SRQM course, produced with -srqm_demo using demo.log, s(1/12) test wipe- on August 17, 2013.
------------------------------------------------------------------------------------ | |
name: srqm_demo | |
log: /Users/fr/Documents/Teaching/SRQM/demo.log | |
log type: text | |
opened on: 17 Aug 2013, 18:28:28 | |
. | |
. * Check setup. This line appears in every course do-file. It makes sure that | |
. * you have the appropriate files and packages to successfully run the code. | |
. run setup/require fre lookfor_all | |
. | |
. /* ------------------------------------------ SRQM Session 1 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - Hi! Welcome to your first SRQM do-file. | |
> | |
> - You are probably viewing this file from the Stata do-file editor, after | |
> opening it with the -doedit code/week1- command. If so, you are | |
> doing it right: congratulations. | |
> | |
> - You will be reading through your first do-file in just a minute. It is | |
> essential that you read through each week's do-file to become familiar | |
> with Stata commands. | |
> | |
> - We will start exploring do-files in class, and you get to finish them on your | |
> own as homework, along with reading one chapter from the course handbook and | |
> a few sections from the Stata Guide. These tasks complement each other. | |
> | |
> - Everything that you learn from the course do-files will be put to use in your | |
> research project. Practice with Stata by trying out commands as you learn | |
> them. If things do not work out, try again after checking the command syntax. | |
> | |
> Last updated 2013-05-29. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. | |
. * Comments | |
. * -------- | |
. | |
. * This line is a comment due to the '*' symbol at its beginning. It takes a | |
. * green colour in the Stata do-file editor. This do-file is fully commented | |
. * to guide you through the basics. In your own code, you should also use | |
. * comments to document and section your operations. | |
. | |
. // note: lines or chunks of code that start with '//' are also comments, ... | |
. | |
. /* and blocks of code that start with that symbol | |
> and end with the reverse one are also comments */ | |
. | |
. // ... and Stata helps you to detect comments by coloring them in green. | |
. | |
. * When you see the words 'uncomment to run', it means 'remove the comment to run | |
. * the code'. Remove the asterisk and trailing space on the next line, then run | |
. * it by copy-pasting it into the Command window and pressing Enter: | |
. | |
. * di "Hello world." | |
. | |
. * When I cite a Stata command in the comments, I cite it -between dashes-, but | |
. * the dashes are not part of the command. They are just here to delimit where | |
. * the command starts and where it stops. | |
. | |
. | |
. * Practice | |
. * -------- | |
. | |
. * Your mission for next week is to replicate this do-file. That means running | |
. * it in full, reading the comments along as you execute its commands. Use the | |
. * course slides to learn about running do-files and read from the Stata Guide | |
. * to understand the commands used. | |
. | |
. * There is no substitute for practice when learning statistical software. Code
. * is like music: you will recognize the tune and notation if you listen to it.
. * When you learn to code, you learn to play, either for yourself or for the | |
. * audience of your programming language. For Stata, the audience is a pretty | |
. * wide range of people and institutions. | |
. | |
. | |
. * Interface | |
. * --------- | |
. | |
. * Quickly review the Stata windows. The Command window is where you will enter | |
. * all commands, the results of which will show in the Results window. Your | |
. * past commands will also show in the Review window. Finally, the Variables | |
. * window should be empty at this stage, because no dataset is currently loaded
. * in Stata. More windows will be opened as we go on. | |
. | |
. * Note that we will use windows but not, as you are used to, menus. The menu | |
. * interface in Stata offers point-and-click accessibility but is not suited | |
. * for programming purposes. Instead, everything we do will be command-based. | |
. | |
. | |
. * ==================== | |
. * = WARM-UP EXERCISE = | |
. * ==================== | |
. | |
. | |
. * Type or copy and paste the following line into the Command window:
. pwd | |
/Users/fr/Documents/Teaching/SRQM | |
. | |
. * The previous command returns the path to your working directory. It prints | |
. * its output to the Results window, and the command is stored in history as | |
. * shown in the Review window. | |
. | |
. * Now, load a sample Stata dataset that is included with the software: | |
. sysuse lifeexp, clear | |
(Life expectancy, 1998) | |
. | |
. * The previous command loads data in the background. You can access the data | |
. * with the following command. Close the window after taking a look. | |
. browse | |
. | |
. * Back to the main window, the Variables window shows the list of variables. | |
. * We are going to use two of them to build a plot. Type the following: | |
. scatter lexp safewater | |
. | |
. * This command creates your first Stata graph. Close the Graph window when | |
. * you are done inspecting the graph. Finally, type the following command after | |
. * uncommenting it (remove the asterisk and trailing space): | |
. | |
. * doedit example | |
. | |
. * The previous command creates an empty do-file called 'example.do'. The file | |
. * is located in your working directory, which should be the SRQM folder. | |
. * Stata has also opened the file in the Do-file Editor so that you can edit it | |
. * from its programming interface. Copy and paste the following four lines into
. * that empty do-file window: | |
. | |
. // Example do-file. | |
. sysuse lifeexp, clear | |
(Life expectancy, 1998) | |
. sc lexp safewater | |
. clear | |
. | |
. * Notice that the syntax used for the -scatter- command is different because | |
. * it has been abbreviated to -sc-. The first line is a comment that uses an | |
. * alternative way to tell Stata that the line is a comment. Save and close | |
. * the do-file window when you have copied the full code to it. | |
. | |
. * The do-file can now be run with the following command (uncomment to run): | |
. * do example | |
. | |
. * The do-file can now be erased with the following command (uncomment to run): | |
. * rm example.do | |
. | |
. * These commands quickly show you how we are going to use the software: by | |
. * running (executing) code from Stata do-files, so that you can write your | |
. * own do-file for your research projects. | |
. | |
. | |
. * ============ | |
. * = COMMANDS = | |
. * ============ | |
. | |
. | |
. * Tip (1): Get to learn some syntax | |
. * --------------------------------- | |
. | |
. * Most Stata commands share an identical syntax that calls one or several | |
. * variables as the main argument: | |
. | |
. * command variable | |
. | |
. * Most Stata commands will also allow one or more options after a comma. | |
. * Optional arguments are shown in brackets in the Stata help pages: | |
. | |
. * command variable [, options] | |
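.
. * As a minimal sketch of this syntax, assuming that the -lifeexp- sample dataset
. * from the warm-up exercise is available: the first -summarize- command takes a
. * variable as its argument, and the second one adds the -detail- option after
. * the comma. Uncomment to run:
.
. * sysuse lifeexp, clear
. * summarize lexp
. * summarize lexp, detail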
. | |
. | |
. * Tip (2): Run all lines in sequential order | |
. * ------------------------------------------ | |
. | |
. * You need to execute the lines of a do-file in sequential order to avoid errors.
. * The example below illustrates the point: | |
. | |
. clear | |
. set obs 100 | |
obs was 0, now 100 | |
. gen test = 1 | |
. ren test x // This line will not run if you do not run the previous ones first. | |
. // The command intends to rename the 'test' variable, but 'test' does | |
. // not exist unless you create it first by running the previous line. | |
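.
. * One possible safeguard, shown here only as a sketch, is to check that the
. * variable exists before renaming it. -confirm variable- returns an error if
. * 'test' is missing, and -capture- stores the return code in _rc, where 0 means
. * that no error occurred. Uncomment the two lines and run them together:
.
. * capture confirm variable test
. * if _rc == 0 rename test x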
. | |
. | |
. * Tip (3): Keyboard shortcuts for Mac / Win | |
. * ----------------------------------------- | |
. | |
. /* Mac: | |
> | |
> - Cmd-L (Ctrl-L) selects a whole line | |
> - Shift + Up/Down arrows selects or deselects neighbouring lines | |
> - Cmd-Shift-D (Ctrl-D) executes the selection | |
> - Cmd-` (Alt-Tab) switches between application windows | |
> | |
> Cmd is the 'Command' key. The ` ('backtick') key might be hard to
> find on non-QWERTY keyboards, so check if you see it on your system. | |
> | |
> Win: | |
> | |
> - Ctrl-L selects a whole line | |
> - Shift + Up/Down arrows selects or deselects neighbouring lines | |
> - Ctrl-D executes the selection | |
> - Alt-Tab switches between application windows */ | |
. | |
. * Do not confuse Mac and Win keyboard shortcuts, or you might execute the whole | |
. * do-file by mistake! If that happens, or if you get lost while replicating a | |
. * do-file, the safest option is to run it again from the top. To do that, make | |
. * your life easier with keyboard shortcuts: select the line where you want to | |
. * start again by pressing Cmd-L (Win: Ctrl-L), then press Cmd-Shift-UpArrow | |
. * (Win: Ctrl-Shift-UpArrow), and finally press Cmd-Shift-D (Win: Ctrl-D) to run | |
. * the code again down to your initial line. | |
. | |
. * Yes, all this takes a bit of practice. Think of it as music: learning to read | |
. * and write code is like learning to read and write music sheets, and learning | |
. * to type and run code is like learning a bit of piano. | |
. | |
. | |
. * Tip (4): Command navigation | |
. * --------------------------- | |
. | |
. * You can navigate through past commands from the Command window by using the | |
. * PageUp and PageDown keys. Try running the following command after taking out | |
. * the asterisk at the beginning of the line: | |
. | |
. * memory6 | |
. | |
. * You should get an error: the right command is -memory- without the final '6'. | |
. * To quickly correct your mistake, press PageUp and Stata will print the command | |
. * again to your Command window, allowing you to quickly correct the syntax of | |
. * your command and try it again without the final '6'. | |
. | |
. | |
. * Tip (5): Run multiple lines together | |
. * ------------------------------------ | |
. | |
. * When you see '///' at the end of a line, you have to select the next line too | |
. * and execute the lines together from the do-file: copy-pasting to the Command | |
. * window will not work. Use Ctrl-L (Win) or Cmd-L (Mac) and Shift+DownArrow to | |
. * select the lines, then run them with Ctrl-D (Win) or Cmd-Shift-D (Mac). | |
. | |
. di "This is a test. Select this line, " /// | |
> "and this line too, " _n /// | |
> "and this line too. Now, execute from the keyboard. Well done :)" | |
This is a test. Select this line, and this line too, | |
and this line too. Now, execute from the keyboard. Well done :) | |
. | |
. * You will have to do the same for code loops, such as 'foreach {}' loops. | |
. * You will usually be warned before in the comments. Finally, note that these | |
. * multiple-line commands do *not* work if you copy-paste from the do-file to | |
. * the Command window. This is why I recommend that you learn keyboard shortcuts | |
. * quickly, so as to minimize issues with code execution and focus on the rest. | |
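.
. * An alternative to '///', shown as a sketch, is the -#delimit- command, which
. * makes Stata read each command up to the next semicolon instead of the end of
. * the line. It only works when run from a do-file, so uncomment the four lines
. * below and execute them together from the do-file editor:
.
. * #delimit ;
. * di "This command is written "
. *    "over two lines." ;
. * #delimit cr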
. | |
. | |
. * ========= | |
. * = SETUP = | |
. * ========= | |
. | |
. | |
. * The following steps teach you about setting up Stata on any computer. Start | |
. * by making sure that you have nothing stored in Stata memory by wiping off | |
. * any data in memory with the -clear- command: | |
. clear | |
. | |
. * The settings covered in this section of the do-file can be taken care of by | |
. * a setup utility written for the course. Please turn to the README file of the | |
. * SRQM folder for instructions, or follow the procedure in our first classes. | |
. | |
. | |
. * (1) Memory | |
. * ---------- | |
. | |
. * Skip this section if you are running Stata 12+. | |
. | |
. * Your first step with Stata consists of allocating enough memory to it. The
. * default amount of memory that Stata loads at startup is too small to open
. * large datasets: if you forget to set memory, Stata will reply with an error
. * message. The basic command to allocate 500 MB of memory follows:
. set mem 500m | |
set memory ignored. | |
Memory no longer needs to be set in modern Statas; memory adjustments are | |
performed on the fly automatically. | |
. | |
. * You need to repeat that command every time you run Stata. The command works | |
. * only if Stata has no data in storage: if you already have a dataset opened, | |
. * then Stata will reply with an error message. Fortunately, if you are running | |
. * Stata from your own computer, you can set memory permanently: | |
. set mem 500m, perm | |
set memory ignored. | |
Memory no longer needs to be set in modern Statas; memory adjustments are | |
performed on the fly automatically. | |
. | |
. * There is more to learn about memory size and default settings in Stata, but
. * for the purpose of this course, this will largely suffice. Furthermore, if
. * you are running Stata 12 or later, you are spared from setting memory
. * yourself: Stata will do it automatically.
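.
. * As a hedged sketch, the memory command can be made conditional on the Stata
. * version number stored in c(stata_version), so that it only runs on versions
. * older than Stata 12. The -query memory- command then lists the current memory
. * settings. Uncomment to run:
.
. * if c(stata_version) < 12 set mem 500m
. * query memory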
. | |
. | |
. * (2) Screen breaks | |
. * ----------------- | |
. | |
. * By default, Stata uses screen breaks. If you forget to disable those, the | |
. * 'Results' window will nag you with useless 'more' prompts and you will have | |
. * to scroll results manually. Save yourself the hassle by disabling them: | |
. set more off | |
. | |
. * In fact, let's try to disable them permanently on your computer: | |
. set more off, perm | |
(set more preference recorded) | |
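.
. * To check the current setting, you can display the c(more) system value, which
. * reads either 'on' or 'off'. Uncomment to run:
.
. * di c(more)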
. | |
. | |
. * (3) Additional commands | |
. * ----------------------- | |
. | |
. * Stata can be extended by installing packages, just like you would install a | |
. * plugin or an extension for other software. The packages add new commands or
. * graph schemes to Stata. | |
. | |
. * Make sure that you are connected to the Internet before continuing, so that | |
. * Stata can connect to the SSC archive and to other online sources. If you are | |
. * using a Sciences Po workstation, you will also need to uncomment and run the | |
. * following command to avoid an issue with admin privileges: | |
. | |
. * sysdir set PLUS "c:\temp" | |
. | |
. * This course makes heavy use of the -fre- command to view frequencies. The | |
. * course setup should have installed it for you, but let's practice installing | |
. * additional Stata commands. Install the -fre- command (again) by uncommenting | |
. * and running this command while online: | |
. | |
. * ssc install fre | |
. | |
. * Now read the package description: | |
. ado de fre | |
------------------------------------------------------------------------------------ | |
[1] package fre from http://fmwww.bc.edu/repec/bocode/f | |
------------------------------------------------------------------------------------ | |
TITLE | |
'FRE': module to display one-way frequency table | |
DESCRIPTION/AUTHOR(S) | |
fre displays for each specified variable a univariate frequency | |
table containing counts, percent, and cumulative percent. | |
Variables may be string or numeric. Labels, in full length, and | |
values are printed. By default, fre only tabulates the smallest | |
and largest 10 values (along with all missing values), but this | |
can be changed. Furthermore, values with zero observed frequency | |
may be included in the tables. The default for fre is to display | |
the frequency tables in the results window. Alternatively, the | |
tables may be written to a file on disk, either tab-delimited or | |
LaTeX-formatted. | |
KW: data management | |
KW: frequencies | |
KW: frequency table | |
KW: tabulation | |
Requires: Stata version 9.2 | |
Distribution-Date: 20120618 | |
Author: Ben Jann, University of Bern | |
Support: email [email protected] | |
INSTALLATION FILES | |
f/fre.ado | |
f/fre.hlp | |
INSTALLED ON | |
17 Aug 2013 | |
------------------------------------------------------------------------------------ | |
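.
. * A few related commands, shown as a sketch: -which- reports where an installed
. * command is stored, -ssc describe- prints the package description without
. * installing anything (it requires an Internet connection), and the -replace-
. * option of -ssc install- overwrites an existing copy of the package.
. * Uncomment to run:
.
. * which fre
. * ssc describe fre
. * ssc install fre, replace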
. | |
. | |
. * (4) Working directory | |
. * --------------------- | |
. | |
. * The working directory is where Stata will look to open and save stuff like | |
. * datasets or logs. Use the -pwd- command to see where Stata is looking now. | |
. pwd | |
/Users/fr/Documents/Teaching/SRQM | |
. | |
. * Use the -ls- command to list the files where Stata is looking. The -w- option
. * causes the command to print only the filenames without system information.
. ls, w | |
README.md backup.log course/ demo.log setup/ | |
admin/ code/ data/ profile.do | |
. | |
. * For this course, you need to set the working directory to the SRQM folder. | |
. * Use the 'File :: Change Working Directory...' menu item in the Stata graphical | |
. * user interface to select the SRQM folder. The path to that folder will show in | |
. * the Results window. It might look like this: | |
. | |
. * cd ~/Documents/Teaching/SRQM/ | |
. | |
. * I use Mac OS X, which is why my file path takes that form. Ivaylo uses a PC, | |
. * and his own working directory might be set like this: | |
. | |
. * cd C:\Users\Ivo\Desktop\SRQM | |
. | |
. * You will need to identify that file path on your own computer. Choose a simple | |
. * location for the SRQM folder and then keep it there without renaming it or any | |
. * of the folders that lead to it. Be careful with that, or you will get errors | |
. * when trying to study for the course. | |
. | |
. * The -cd- command shown above navigates through your folders. The next example | |
. * assumes that you are now in the SRQM folder. It will select the folder that | |
. * contains the course do-files. Note that if the path contained spaces, you | |
. * would need to add quotes around it. | |
. | |
. * cd code | |
. | |
. * Uncomment and run the line above, then uncomment and run the next command to | |
. * go back one level and return to the SRQM folder: | |
. | |
. * cd .. | |
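.
. * As a minimal sketch, you can also store the current working directory in a
. * local macro before moving, and return to it later. The macro name 'here' is
. * arbitrary, and the quotes protect paths that contain spaces. Uncomment the
. * three lines and run them together:
.
. * local here "`c(pwd)'"
. * cd code
. * cd "`here'"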
. | |
. * Finally, you can list the files without moving to a directory. The following | |
. * command shows the contents of the data/ folder: | |
. ls data/, w | |
ess2008.dta gss0012_variables.txt qog2013_variables.txt | |
ess2008.zip nhis2009.dta world-c.dta | |
ess2008_codebook.pdf nhis2009.zip world-d.dta | |
ess2008_variables.txt nhis2009_variables.txt wvs2000.dta | |
gss0012.dta qog2013.dta wvs2000.zip | |
gss0012.zip qog2013.zip wvs2000_codebook.pdf | |
gss0012_codebook.pdf qog2013_codebook.pdf wvs2000_variables.txt | |
. | |
. | |
. * (5) Log | |
. * ------- | |
. | |
. * You can save the commands and results from this do-file to a log file, which | |
. * will serve as a backup of your work. To log this session, type: | |
. log using code/week1.log, replace | |
(note: file /Users/fr/Documents/Teaching/SRQM/code/week1.log not found) | |
------------------------------------------------------------------------------------ | |
name: <unnamed> | |
log: /Users/fr/Documents/Teaching/SRQM/code/week1.log | |
log type: text | |
opened on: 17 Aug 2013, 18:28:33 | |
. | |
. * The log command will now create a history of your work on this do-file. You | |
. * should keep it for replication purposes. It will log all your commands and | |
. * their results, including commands that returned an error. Refer to the Stata | |
. * Guide for further guidance on log files, and do not forget to produce logs in | |
. * the .log plain text format rather than in the less handy SMCL default format. | |
. * Also make sure that you specify the -replace- option to overwrite any previous
. * log file that might have been created by running this do-file in the past. | |
. * The -name- option can be omitted. | |
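.
. * A sketch of a named plain text log follows. The file name 'code/example.log'
. * and the log name 'example' are placeholders. The -text- option forces the
. * plain text format, -log query _all- lists the logs that are currently open,
. * and naming the log lets you close it without touching any other open log:
.
. * log using code/example.log, replace text name(example)
. * log query _all
. * log close example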
. | |
. * Now run these example commands (do not worry about the comments, you can leave | |
. * them where they are and 'execute' them too, Stata will just ignore them): | |
. | |
. * Loading data from the U.S. National Health Interview Survey (2009). | |
. use data/nhis2009, clear | |
(U.S. National Health Interview Survey 2009) | |
. | |
. * The -clear- option gets rid of any data previously loaded into memory, since | |
. * Stata can only open one dataset at once. | |
. | |
. * Describe a few variables. | |
. d year sex weight raceb | |
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
year            int     %8.0g      year_lbl   Survey year
sex             byte    %8.0g      sex_lbl    Sex
weight          int     %36.0g     weight_lbl
                                              Weight in pounds without clothes or
                                                shoes
raceb           float   %9.0g      raceb      Race
. | |
. * Keep observations only for year 2009. | |
. keep if year == 2009 | |
(227298 observations deleted) | |
. | |
. * Calculate the frequencies for each racial-ethnic group. | |
. fre raceb | |
raceb -- Race
----------------------------------------------------------------
                   |      Freq.    Percent      Valid       Cum.
-------------------+--------------------------------------------
Valid 1 White      |      14269      58.74      58.74      58.74
      2 Black      |       3893      16.03      16.03      74.77
      3 Hispanic   |       4758      19.59      19.59      94.36
      4 Asian      |       1371       5.64       5.64     100.00
      Total        |      24291     100.00     100.00
----------------------------------------------------------------
. | |
. * Obtain summary statistics for the weight variable. | |
. su weight | |
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      weight |     24291    172.5895    37.12779        100        285
. | |
. * List gender groups from the sex variable. | |
. tab sex | |
        Sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |     10,978       45.19       45.19
     Female |     13,313       54.81      100.00
------------+-----------------------------------
      Total |     24,291      100.00
. | |
. * Crosstabulate sex and race. | |
. tab sex raceb | |
           |                    Race
       Sex |     White      Black   Hispanic      Asian |     Total
-----------+--------------------------------------------+----------
      Male |     6,676      1,532      2,150        620 |    10,978
    Female |     7,593      2,361      2,608        751 |    13,313
-----------+--------------------------------------------+----------
     Total |    14,269      3,893      4,758      1,371 |    24,291
. | |
. * Plot average weight by sex and race. You must run both lines below together. | |
. gr dot weight, over(raceb) over(sex) /// | |
> name(weight_race_sex, replace) | |
. | |
. * To close the log file previously opened, type the following command: | |
. cap log close | |
. | |
. * The -log close- command returns an error if no log is open. The -cap- prefix
. * lets you run the command and continue even if it returns an error.
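.
. * As a short sketch of what the prefix does: -capture- stores the return code
. * of the command in _rc. A value of 0 means that the command ran without error,
. * and any other value is the code of the error that was suppressed. Uncomment
. * the two lines and run them together:
.
. * cap log close
. * di _rc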
. | |
. * If you now go to your code/ folder and open the week1.log file with | |
. * any plain text editor, you will find a copy of everything that was entered | |
. * between the -log using- and -log close- commands, including comments, the | |
. * example above and its output for each command. You can view the file in Stata: | |
. view code/week1.log | |
. | |
. * The dot graph will need to be saved separately: this can be done in several | |
. * ways that are documented in the course slides and in the Stata Guide. The | |
. * Stata help pages also cover each graph command. Have a look at them: | |
. help graph | |
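.
. * One possible way to save the dot graph created above, shown as a sketch, is
. * to export it as an image or to save it in Stata graph format. Both file names
. * below are placeholders. Uncomment to run:
.
. * gr export code/week1_weight.png, name(weight_race_sex) replace
. * gr save weight_race_sex code/week1_weight.gph, replace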
. | |
. * Similarly, there is more about logs in the Stata Guide and in several of
. * the tutorials included in the course material, but we also recommend that you | |
. * use the Stata help pages, as explained below. | |
. | |
. | |
. * ============ | |
. * = DATASETS = | |
. * ============ | |
. | |
. | |
. * (1) List datasets | |
. * ----------------- | |
. | |
. * Show all datasets for this course. The asterisk in the command is a wildcard
. * that matches any file name, so the command returns all .dta files in data/.
. * The -w- option makes the output less verbose.
. ls "data/*.dta", w | |
data/ess2008.dta data/qog2013.dta data/wvs2000.dta | |
data/gss0012.dta data/world-c.dta | |
data/nhis2009.dta data/world-d.dta | |
. | |
. * Note: the quotes in the command above are optional. Quotes are only required | |
. * when the path contains spaces. For example, if the data/ folder were called | |
. * 'Course datasets', quotes would be necessary to run -ls "Course datasets"-. | |
. * This means that, if the path to your working directory contains spaces, you
. * must enclose it in quotes if you use -cd- to set your working directory.
. | |
. * Typical example. | |
. * cd "/Users/somestudent/Documents/Sciences Po/4A/Semester 1/Stats stuff/SRQM" | |
. | |
. * Now back to the datasets. | |
. | |
. * All datasets are in the data/ folder of the SRQM Teaching Pack. The commands | |
. * used to load them in the course do-files will work only if you have correctly | |
. * set your working directory to the SRQM folder first. The course setup does it | |
. * for you, unless you move the SRQM folder, in which case it will stop working. | |
. | |
. * The README file of the data/ folder holds links to essential documents for you | |
. * to read if you want to use the data for your research project. You can start | |
. * looking for variables of interest by using the -lookfor- command after loading | |
. * one of the course datasets. | |
. | |
. | |
. * (2) European Social Survey Round 4, 2008
. * ---------------------------------------- | |
. | |
. * Load. | |
. use data/ess2008, clear | |
(European Social Survey 2008) | |
. | |
. * Example search. | |
. lookfor health immig | |
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
stfhlth         byte    %2.0f      stfhlth    State of health services in country
                                                nowadays
imsmetn         byte    %1.0f      imsmetn    Allow many/few immigrants of same
                                                race/ethnic group as majority
imdfetn         byte    %1.0f      imdfetn    Allow many/few immigrants of different
                                                race/ethnic group from majority
impcntr         byte    %1.0f      impcntr    Allow many/few immigrants from poorer
                                                countries outside Europe
imbgeco         byte    %2.0f      imbgeco    Immigration bad or good for country's
                                                economy
imueclt         byte    %2.0f      imueclt    Country's cultural life undermined or
                                                enriched by immigrants
imwbcnt         byte    %2.0f      imwbcnt    Immigrants make country worse or
                                                better place to live
health          byte    %1.0f      health     Subjective general health
gvhlthc         byte    %2.0f      gvhlthc    Health care for the sick, governments'
                                                responsibility
hlthcef         byte    %2.0f      hlthcef    Provision of health care, how
                                                efficient
imsclbn         byte    %1.0f      imsclbn    When should immigrants obtain rights
                                                to social benefits/services
imrccon         byte    %2.0f      imrccon    Immigrants receive more or less than
                                                they contribute
lvpbhlt         byte    %1.0f      lvpbhlt    Level of public health care affordable
                                                10 years from now
lknhlcn         byte    %1.0f      lknhlcn    How likely not receive health care
                                                needed if become ill next 12 months
p70hltb         byte    %2.0f      p70hltb    People over 70 a burden on health
                                                service these days
. | |
. | |
. * (3) Quality of Government, 2013 | |
. * ------------------------------- | |
. | |
. * Load. | |
. use data/qog2013, clear | |
(Quality of Government 2013) | |
. | |
. * Example search. | |
. lookfor devel orig | |
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
gr_cso          int     %8.0g                 Development Civil Society
                                                Organizations
ht_colonial     byte    %55.0g     ht_colonial
                                              Colonial Origin
iag_hd          double  %10.0g                Human Development
lp_legor        float   %31.0g     lp_legorlabel
                                              Legal origin
undp_hdi        double  %10.0g                Human Development Index
wdi_aid         double  %10.0g                Net Development Assistance and Aid
                                                (Constant USD)
wdi_aidcu       double  %10.0g                Net Development Assistance and Aid
                                                (Current USD)
. | |
. | |
. * (4) World Values Survey, 2000 | |
. * ----------------------------- | |
. | |
. * Load. | |
. use data/wvs2000, clear | |
(World Values Survey 2000) | |
. | |
. * Example search. | |
. lookfor army homo | |
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
v76             byte    %8.0g      v76        neighbors: homosexuals
v1660           byte    %8.0g      v1660      having the army rule
v208            byte    %8.0g      v208       justifiable: homosexuality
. | |
. | |
. * (5) General Social Survey, 2012 | |
. * ------------------------------- | |
. | |
. * Load. | |
. use data/gss0012, clear | |
(U.S. General Social Survey 2000-2012) | |
. | |
. * Example search. | |
. lookfor army homo | |
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
spkhomo         byte    %8.0g      LABAE      allow homosexual to speak
colhomo         byte    %8.0g      LABAF      allow homosexual to teach
libhomo         byte    %8.0g      LABAG      allow homosexuals book in library
conarmy         byte    %8.0g      LABBJ      confidence in military
homosex         byte    %8.0g      LABCC      homosexual sex relations
marhomo         byte    %8.0g      LABJK      homosexuals should have right to marry
homosex1        byte    %8.0g      LABPU      is homosexual sex wrong?
. | |
. * Note that this dataset holds more than one year of data. | |
. tab year | |
   gss year |
   for this |
 respondent |      Freq.     Percent        Cum.
------------+-----------------------------------
       2000 |      2,817       14.87       14.87
       2002 |      2,765       14.59       29.46
       2004 |      2,812       14.84       44.31
       2006 |      4,510       23.81       68.11
       2008 |      2,023       10.68       78.79
       2010 |      2,044       10.79       89.58
       2012 |      1,974       10.42      100.00
------------+-----------------------------------
      Total |     18,945      100.00
. | |
. * This means that you will have to reduce it to one year of observations before | |
. * analyzing it. More on that next week. For now, back to looking for variables. | |
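.
. * A minimal sketch of that step, assuming that -year- is stored as a number as
. * in the NHIS example: keep only the most recent survey year, then count the
. * remaining observations. According to the table above, this should leave the
. * 1,974 observations collected in 2012. Uncomment to run:
.
. * keep if year == 2012
. * count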
. | |
. | |
. * (6) Search across datasets | |
. * -------------------------- | |
. | |
. * Tip: an additional package can help you search for variables across datasets. | |
. * It should have been installed by the course setup utility. If not, install it | |
. * yourself with -ssc install lookfor_all- (requires an Internet connection). | |
. lookfor_all health, dir(data) | |
Variables in: | |
use "/Users/fr/Documents/Teaching/SRQM/data/ess2008.dta" | |
variables: stfhlth health gvhlthc hlthcef lvpbhlt lknhlcn p70hltb | |
Variables in: | |
use "/Users/fr/Documents/Teaching/SRQM/data/gss0012.dta" | |
variables: natheal nathealy health abhlth mentloth health30 health12 hlthinfo hlthp | |
> apr hlthmag1 hlthmag2 hlthdoc hlthfrel hlthtv hlthwww didlessp limitedp treat11 do | |
> ccosts safehlth health1 physhlth mntlhlth outsider medsavtx medsymps medaddct medu | |
> nacc mhhlpmhp mhgvthlt mhtrtot2 mhclsoth mhseroth mhhlpoth mhreloth mhtrtslf mhsee | |
> pub emphlth sphlth richhlth hrdshp6 askmentl | |
Variables in: | |
use "/Users/fr/Documents/Teaching/SRQM/data/nhis2009.dta" | |
variables: health uninsured | |
Variables in: | |
use "/Users/fr/Documents/Teaching/SRQM/data/qog2013.dta" | |
variables: wdi_hec wdi_prhe wdi_puhe wdi_the wvs_a009 | |
File "/Users/fr/Documents/Teaching/SRQM/data/world-c.dta" cannot be open in current | |
> version of Stata | |
File "/Users/fr/Documents/Teaching/SRQM/data/world-d.dta" cannot be open in current | |
> version of Stata | |
Variables in: | |
use "/Users/fr/Documents/Teaching/SRQM/data/wvs2000.dta" | |
variables: v12 v52 v67 | |
Total 7 out of 7 files checked in "/Users/fr/Documents/Teaching/SRQM/data/" | |
. | |
. * The command above, like all commands that call datasets or do-files,
. * requires that the SRQM folder has been set as the working directory. | |
. | |
. * Because some commands like -lookfor_all- must be installed before you run | |
. * the course do-files, the course setup utility installed them in our first | |
. * session together. As a safeguard, however, all course do-files also include | |
. * a small loop that automatically detects uninstalled commands and fetches | |
. * them online if needed. These loops look like the one below and require | |
. * that you select all four lines together and then execute them. | |
. foreach p in lookfor_all { | |
2. cap which `p' | |
3. if _rc == 111 cap noi ssc install `p' | |
4. } | |
. | |
. * The syntax of these loops is typically more complex than anything that you | |
. * will have to read or write for this course, so do not panic if they do not | |
. * make sense to you. Focus on getting the rest of the code straight. | |
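. | |
. * For the curious, a line-by-line reading of such a loop follows. It is a | |
. * sketch for illustration, shown as comments and not run here; listing -fre- | |
. * next to -lookfor_all- is just an example of checking two packages at once. | |
. * | |
. *     foreach p in fre lookfor_all { | |
. *         cap which `p'                          // is the command installed? | |
. *         if _rc == 111 cap noi ssc install `p'  // if not found, fetch from SSC | |
. *     } | |
. * | |
. * -cap- (-capture-) hides any error and stores its return code in _rc, which | |
. * -which- sets to 111 when it cannot find the command. | |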
. | |
. | |
. * ======== | |
. * = HELP = | |
. * ======== | |
. | |
. | |
. * It is essential to the methods covered by this course that you learn to use | |
. * help extensively. The course material includes a lot of help with Stata, but | |
. * you should also learn to use internal Stata help pages, accessible with the | |
. * -help- command. If you want to understand the following command: | |
. * | |
. * su weight if raceb == 1, d | |
. | |
. * To understand what -su- means and does, type -help- followed by -su-: | |
. help su | |
. | |
. * The underline tells you that -su- is shorthand for -summarize-, which returns | |
. * a few summary statistics for one or more variables. The -help- command itself | |
. * can be abbreviated to simply -h-. The -if- component of the command is also | |
. * documented in Stata: | |
. h if | |
. | |
. * Finally, the -d- option shown in the example is documented on the help page | |
. * for -summarize-. It produces more statistics: -d- is shorthand for -detail-. | |
. * Do not confuse it with the -d- shorthand for the -describe- command, which | |
. * lists the variables in the current dataset. | |
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * The course will teach you to write commands like the ones featured in this | |
. * do-file. If you combine practice, documentation and a bit of intuition, you | |
. * can learn most of the Stata syntax in a few weeks through trial-and-error. | |
. * Get ready by practicing as soon as possible! Programming works that way. | |
. * Oh, and congratulations on reaching this line. | |
. | |
. * Last words: when you leave Stata, DO NOT SAVE YOUR DATASET. Keep it intact as | |
. * originally downloaded. Instead, save the do-file that contains the commands | |
. * you used to perform your analysis. Stata will automatically save the log file | |
. * for you when you shut it down, so this requires no action on your side. For | |
. * additional help, please turn again to the Stata Guide. | |
. | |
. * Close log (if still open, which it should not be). | |
. cap log close | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require fre | |
. | |
. * Log results. | |
. cap log using code/week2.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 2 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - WHAT: Body Mass Index in the United States | |
> | |
> - DATA: U.S. National Health Interview Survey (2009) | |
> | |
> - Hi! Welcome to your second SRQM do-file. | |
> | |
> - All the do-files for this course assume that you have set up Stata first by | |
> adjusting some parameters, most importantly setting the working directory to | |
> your SRQM folder. Please refer to the do-file from Session 1 for guidance. | |
> | |
> - Welcome again to Stata. Read the comment lines as you go along, and run the | |
> code by executing command lines sequentially. Select lines with Cmd-L (Mac) | |
> or Ctrl-L (Win), and execute them with Cmd-Shift-D (Mac) or Ctrl-D (Win). | |
> | |
> - We will explore the National Health Interview Survey with a few basic Stata | |
> commands. This is to show you how to explore a dataset and its variables. You | |
> need to make a choice of dataset for your project by the end of the week. | |
> | |
> - If you want to study one country or compare two of them, turn to survey data | |
> from the European Social Survey (ESS), U.S. General Social Survey (GSS) or | |
> World Values Survey (WVS). | |
> | |
> - If you want to study country-level data, use the Quality of Government (QOG) | |
> dataset. Your sample should be all world countries: do not restrict the | |
> sample further by subsetting to fewer observations. | |
> | |
> Last updated 2013-02-17. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load NHIS dataset. | |
. use data/nhis2009, clear | |
(U.S. National Health Interview Survey 2009) | |
. | |
. * Once the dataset is loaded, the Variables window will fill up, and you will | |
. * be able to look at the actual dataset from the Data Editor. Read from the | |
. * course material to make sure that you know how to read through a dataset: | |
. * its data structure shows observations in rows and variables in columns. | |
. | |
. * List all variables in the dataset. | |
. describe | |
Contains data from data/nhis2009.dta | |
obs: 251,589 U.S. National Health Interview | |
Survey 2009 | |
vars: 32 16 Aug 2013 05:22 | |
size: 20,630,298 (_dta has notes) | |
------------------------------------------------------------------------------------ | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
year int %8.0g year_lbl Survey year | |
serial double %8.0f Sequential Serial Number, Household | |
Record | |
strata int %8.0g strata_lbl | |
Stratum for variance estimation | |
psu int %8.0g psu_lbl Primary sampling unit (PSU) for | |
variance estimation | |
hhweight long %12.0g Household weight, final annual | |
pernum byte %8.0g Person number in household | |
perweight double %9.0f Final basic annual weight | |
sampweight double %9.0f Sample Person Weight | |
nhispid str16 %16s NHIS Unique Identifier, person | |
age byte %31.0g age_lbl Age | |
marstat byte %37.0g marstat_lbl | |
Legal marital status | |
sex byte %8.0g sex_lbl Sex | |
hispeth byte %45.0g hispeth_lbl | |
Hispanic ethnicity | |
racea int %43.0g racea_lbl | |
Main Racial Background (Pre-1997 | |
Revised OMB Standards), | |
self-reported or interv | |
regionbr byte %42.0g regionbr_lbl | |
Global region of birth | |
yrsinus byte %30.0g yrsinus_lbl | |
Number of years spent in the U.S. | |
educrec1 byte %36.0g educrec1_lbl | |
Educational attainment recode, | |
nonintervalled | |
earnings byte %23.0g earnings_lbl | |
Person's total earnings, previous | |
calendar year | |
incimp1 byte %17.0g incimp1_lbl | |
Imputed total combined family income | |
(1997+ grouping) | |
health byte %23.0g health_lbl | |
Health status | |
height byte %30.0g height_lbl | |
Height in inches without shoes | |
weight int %36.0g weight_lbl | |
Weight in pounds without clothes or | |
shoes | |
visityrno byte %19.0g visityrno_lbl | |
Total office visits in past 12 months | |
ybarcare byte %23.0g ybarcare_lbl | |
Needed but couldn't afford medical | |
care, past 12 months | |
uninsured byte %23.0g uninsured_lbl | |
Health Insurance coverage status | |
diayrsago byte %34.0g diayrsago_lbl | |
Years since first diagnosed with | |
diabetes | |
strongfwk byte %35.0g strongfwk_lbl | |
Frequency of strengthening activity: | |
Times per week | |
vig10fwk byte %30.0g vig10fwk_lbl | |
Frequency of vigorous activity 10+ | |
minutes: Times per week | |
rsweight float %9.0g Adjusted to original size Sample | |
Person Weight | |
raceb float %9.0g raceb Race | |
vigor byte %9.0g Frequenciy of vigorous activity 10+ | |
minutes: times per week | |
strength byte %9.0g Frequenciy of strengthening activity: | |
times per week | |
------------------------------------------------------------------------------------ | |
Sorted by: | |
. | |
. | |
. * Finding variables | |
. * ----------------- | |
. | |
. * Locate some variables of interest by looking for particular keywords in the | |
. * variable names and labels. This is particularly useful when your dataset | |
. * comes with variable names that are hard or impossible to understand by | |
. * themselves, such as 'v1' or 'epi_epi'. The example below will identify | |
. * several variables with either 'height' or 'weight' in their descriptors. | |
. lookfor height weight | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
hhweight long %12.0g Household weight, final annual | |
perweight double %9.0f Final basic annual weight | |
sampweight double %9.0f Sample Person Weight | |
height byte %30.0g height_lbl | |
Height in inches without shoes | |
weight int %36.0g weight_lbl | |
Weight in pounds without clothes or | |
shoes | |
rsweight float %9.0g Adjusted to original size Sample | |
Person Weight | |
. | |
. * List their values for the first ten observations. | |
. list height weight in 1/10 | |
+-----------------+ | |
| height weight | | |
|-----------------| | |
1. | 67 185 | | |
2. | 68 125 | | |
3. | 67 132 | | |
4. | 69 150 | | |
5. | 62 143 | | |
|-----------------| | |
6. | 70 160 | | |
7. | 71 183 | | |
8. | 75 200 | | |
9. | 67 125 | | |
10. | 69 140 | | |
+-----------------+ | |
. | |
. | |
. * Subsetting to cross-sectional format | |
. * ------------------------------------ | |
. | |
. * Our first step verifies whether the survey is cross-sectional. As we find | |
. * that the data contain more than one survey wave and span several years, we | |
. * keep only the most recent observations. This step applies only to datasets | |
. * that contain multiple survey years, which is uncommon in this course. | |
. | |
. * Check whether the survey is cross-sectional. | |
. lookfor year | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
year int %8.0g year_lbl Survey year | |
yrsinus byte %30.0g yrsinus_lbl | |
Number of years spent in the U.S. | |
earnings byte %23.0g earnings_lbl | |
Person's total earnings, previous | |
calendar year | |
diayrsago byte %34.0g diayrsago_lbl | |
Years since first diagnosed with | |
diabetes | |
. tab year | |
Survey year | Freq. Percent Cum. | |
------------+----------------------------------- | |
2000 | 28,712 11.41 11.41 | |
2001 | 29,459 11.71 23.12 | |
2002 | 27,087 10.77 33.89 | |
2003 | 26,998 10.73 44.62 | |
2004 | 27,462 10.92 55.53 | |
2005 | 27,484 10.92 66.46 | |
2006 | 21,010 8.35 74.81 | |
2007 | 20,173 8.02 82.83 | |
2008 | 18,913 7.52 90.34 | |
2009 | 24,291 9.66 100.00 | |
------------+----------------------------------- | |
Total | 251,589 100.00 | |
. | |
. * The data should be cross-sectional for the purpose of this course. However, | |
. * the dataset contains observations for more than one year. We will solve that | |
. * issue by keeping observations for the 2009 survey year only. | |
. | |
. * Delete all observations except for 2009. | |
. drop if year != 2009 | |
(227298 observations deleted) | |
. | |
. * The -drop- command deleted all observations for which the variable 'year' is | |
. * different (!=) from 2009. An equivalent command would be: | |
. * | |
. * keep if year == 2009 | |
. * | |
. * This command keeps only observations for which the 'year' variable is equal | |
. * (==) to 2009. Notice that the 'equal to' operator in Stata is a double equal | |
. * sign (==). Logical operators apply to many commands: read on to find out. | |
. * Also note that the spaces around logical operators are optional. | |
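. * | |
. * Other comparisons follow the same pattern. As a sketch (shown as comments | |
. * and not run here, since the 2000-2008 observations are already deleted), | |
. * either line below would have kept the five most recent survey years: | |
. * | |
. *     keep if year >= 2005 & year <= 2009 | |
. *     keep if inrange(year, 2005, 2009) | |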
. | |
. * Make sure that you fully understand how cross-sectional data are arranged by | |
. * opening the Data Editor or using the -browse- command to take a quick look. | |
. | |
. | |
. * Survey weights | |
. * -------------- | |
. | |
. * The command below sets survey weights, which can be used to obtain weighted | |
. * estimates at later stages of the analysis. We will not require them much. | |
. | |
. * Survey weights (see NHIS documentation). | |
. svyset psu [pw = perweight], strata(strata) | |
pweight: perweight | |
VCE: linearized | |
Single unit: missing | |
Strata 1: strata | |
SU 1: psu | |
FPC 1: <zero> | |
. | |
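. * As a sketch of what weighted estimates look like (shown as a comment and not | |
. * run here), the -svy:- prefix applies the settings above to a command. The | |
. * line below would estimate the weighted mean of a variable such as 'age': | |
. * | |
. *     svy: mean age | |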
. | |
. * ========================= | |
. * = VARIABLE MANIPULATION = | |
. * ========================= | |
. | |
. | |
. * Dependent variable: Body Mass Index | |
. * ----------------------------------- | |
. | |
. * Our next step is to compute the Body Mass Index for each observation in the | |
. * dataset (i.e. for each respondent to the survey), by applying the formula | |
. * for BMI to the 'height' and 'weight' variables. | |
. | |
. * Create the Body Mass Index from height and weight. We can write the -generate- | |
. * command as its -gen- shorthand. We will later call BMI our dependent variable, | |
. * and we will use other (independent) variables to try to predict its values. | |
. gen bmi = weight * 703 / height^2 | |
. | |
. * If something looks wrong later on in your analysis, check your BMI equation. | |
. * Also note that Stata is case-sensitive: we will write 'BMI' in the comments, | |
. * but the variable itself is called 'bmi' and should be written in lowercase. | |
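. * | |
. * The constant 703 converts pounds and inches to the metric definition of BMI | |
. * (kilograms divided by squared metres). To check the formula by hand, use | |
. * -display- (shorthand -di-) as a calculator. A sketch, shown as a comment and | |
. * not run here, for a person of 67 inches and 185 pounds (BMI of about 29): | |
. * | |
. *     di 185 * 703 / 67^2 | |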
. | |
. | |
. * Labelling a variable | |
. * -------------------- | |
. | |
. * Add a description label to the variable. All label commands start with -label- | |
. * (shorthand -la-). The one below labels a variable (shorthand -var-). | |
. la var bmi "Body Mass Index" | |
. | |
. * List BMI among the variables included in the current dataset. | |
. d bmi | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
bmi float %9.0g Body Mass Index | |
. | |
. * The -describe- command (shorthand -d-) shows that the BMI variable is now | |
. * part of the NHIS dataset. However, DO NOT SAVE your dataset, even when you | |
. * perform a useful operation like this one. Instead, you will run the do-file | |
. * to generate the variable again, hence making your calculation of BMI fully | |
. * understandable and replicable by an outside observer, like us, or anyone. | |
. | |
. * Take a look at the BMI of a few respondents. Values between 15 and 40 are | |
. * expected for human beings as we know them on this planet. We also list a | |
. * few other variables to start thinking about possible relationships. | |
. li sex age health bmi in 50/60 | |
+-------------------------------------+ | |
| sex age health bmi | | |
|-------------------------------------| | |
50. | Male 28 Poor 29.11834 | | |
51. | Male 29 Excellent 26.62286 | | |
52. | Male 21 Very Good 35.86735 | | |
53. | Male 40 Good 29.64641 | | |
54. | Female 63 Fair 37.58521 | | |
|-------------------------------------| | |
55. | Female 38 Very Good 23.40106 | | |
56. | Male 54 Good 33.71531 | | |
57. | Male 47 Very Good 23.49076 | | |
58. | Female 38 Excellent 20.89472 | | |
59. | Female 81 Good 24.68913 | | |
|-------------------------------------| | |
60. | Male 32 Very Good 33.08425 | | |
+-------------------------------------+ | |
. li sex age health bmi in -10/l | |
+-------------------------------------+ | |
| sex age health bmi | | |
|-------------------------------------| | |
24282. | Female 26 Excellent 24.12663 | | |
24283. | Male 70 Good 33.77728 | | |
24284. | Female 19 Very Good 23.29467 | | |
24285. | Female 24 Good 37.49089 | | |
24286. | Female 77 Poor 29.85058 | | |
|-------------------------------------| | |
24287. | Female 57 Very Good 24.20799 | | |
24288. | Male 20 Very Good 24.40488 | | |
24289. | Female 67 Good 28.49072 | | |
24290. | Male 62 Poor 33.27811 | | |
24291. | Female 55 Good 19.76427 | | |
+-------------------------------------+ | |
. | |
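. * A note on the -in- ranges used above: 50/60 selects observations 50 to 60, | |
. * and -10/l selects the last ten, because negative numbers count from the end | |
. * and 'l' stands for the last observation ('f' stands for the first). As a | |
. * sketch (shown as a comment, not run here), the first five would be: | |
. * | |
. *     li sex age health bmi in f/5 | |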
. | |
. * Summary statistics | |
. * ------------------ | |
. | |
. * We now turn to analysing the newly created 'bmi' variable, using the | |
. * -summarize- command (shorthand -su-) to obtain its mean, min and max values, | |
. * as well as standard deviation, which we will cover later on. | |
. su bmi | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 24291 27.27 5.134197 15.20329 50.48837 | |
. | |
. * Add the -detail- option (shorthand -d-) for precise statistics that cover | |
. * its mean, minimum and maximum values, as well as its percentile distribution. | |
. su bmi, d | |
Body Mass Index | |
------------------------------------------------------------- | |
Percentiles Smallest | |
1% 18.30729 15.20329 | |
5% 20.11707 15.20329 | |
10% 21.26276 15.20329 Obs 24291 | |
25% 23.51343 15.5041 Sum of Wgt. 24291 | |
50% 26.57845 Mean 27.27 | |
Largest Std. Dev. 5.134197 | |
75% 30.22843 49.60056 | |
90% 34.32617 50.38167 Variance 26.35998 | |
95% 36.91451 50.48837 Skewness .7207431 | |
99% 41.59763 50.48837 Kurtosis 3.463278 | |
. | |
. * Further sessions will gradually explain how to read each statistic displayed. | |
. * For now, just note that the median respondent in the dataset, which is meant | |
. * to be representative of the United States adult population in 2009, has a | |
. * BMI of about 26.6, which indicates overweight. The average (mean) BMI is | |
. * higher than the median, which suggests that high BMI values are more extreme | |
. * than low ones. You can also note that the top 1% of respondents have a BMI | |
. * between 41 and 50, which indicates morbid obesity. | |
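. * | |
. * The statistics shown by -su- are also stored in memory as r() results, which | |
. * you can list with -return list-. As a sketch (shown as comments, not run | |
. * here), the lines below would print the gap between the mean and the median: | |
. * | |
. *     su bmi, d | |
. *     di r(mean) - r(p50) | |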
. | |
. | |
. * Visualization | |
. * ------------- | |
. | |
. * Visualizing the distribution of BMI values among the observations contained | |
. * in the dataset will make these first insights clearer and more complete. | |
. * Create a histogram (shorthand -hist-) for the distribution of BMI. | |
. hist bmi, freq normal /// | |
> name(bmi, replace) | |
(bin=43, start=15.203287, width=.82058321) | |
. | |
. * A histogram describes the distribution of the variable in the sample, i.e. | |
. * the distribution of different values of BMI among the respondents to the | |
. * survey. The -freq- option scales the vertical axis in frequencies (counts of | |
. * observations) rather than densities, and the -normal- option overlays a | |
. * normal distribution on the histogram, a curve to which we will soon come | |
. * back when we cover essential statistical theory. The -name- option saves | |
. * the graph under that name in Stata temporary memory. | |
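. * | |
. * As a sketch (shown as comments, not run here), the -percent- option would | |
. * scale the vertical axis in percentages instead, and -graph display- would | |
. * bring a named graph back to the front. The name 'bmi_pct' is only an | |
. * example: | |
. * | |
. *     hist bmi, percent normal name(bmi_pct, replace) | |
. *     graph display bmi | |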
. | |
. * Another visualization is the boxplot, which uses different criteria to shape | |
. * the distribution of the variable. Refer to the course material to understand | |
. * how quartiles and outliers are constructed to form each element of the plot. | |
. * Also note that a boxplot is pretty uninformative if, as in this example, you | |
. * decide not to split the visualization over any number of categories. | |
. gr hbox bmi, /// | |
> name(bmi_boxplot, replace) | |
. | |
. * The next example uses the -over() asyvars- options to produce boxplots of BMI | |
. * over gender groups, and then again over insurance status. This method creates | |
. * several box plots, one for each category -- a method called 'visualizing over | |
. * small multiples'. The result will stay in memory under the name given by the | |
. * -name()- option. Note, finally, that you need to select both lines to run the | |
. * command properly: if you do not include the final line, nothing will happen. | |
. gr hbox bmi if uninsured != 9, over(sex) asyvars over(uninsured) /// | |
> name(bmi_sex_ins, replace) | |
. | |
. | |
. * Logical expressions | |
. * ------------------- | |
. | |
. * Note how the 'DK' category for insurance status was removed by using a call | |
. * to the conditional operator -if-, to exclude observations with an insurance | |
. * status equal to 9 when drawing the plot. This part of the command reads as: | |
. * draw a boxplot of all observations with an insurance status not equal to 9. | |
. | |
. * Here are more examples of logical expressions. | |
. | |
. su bmi if age >= 20 & age < 25 | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 1923 25.3919 4.839532 15.96345 46.63502 | |
. * This command reads as: 'run the -summarize- command on the 'bmi' variable, | |
. * but only for observations for which the 'age' variable takes a value greater | |
. * than or equal to 20 and ('&') less than 25.' | |
. | |
. su bmi if sex == 1 & age >= 65 | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 1831 27.71574 4.401945 17.62924 45.18825 | |
. * This command reads as: 'summarize BMI for observations of sex equal to 1 | |
. * (i.e. males in this dataset) and of age greater than or equal to 65.' | |
. | |
. su bmi if raceb == 2 | raceb == 3 | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 8651 28.17926 5.218596 15.5041 50.48837 | |
. * This command uses the 'raceb' variable, which codes Blacks and Hispanics | |
. * with values 2 and 3. This command therefore summarises BMI only for these | |
. * two ethnic groups: the '|' symbol is the logical operator for 'or'. It | |
. * reads as: 'summarize BMI if the respondent is Black or Hispanic.' | |
. | |
. * If you have many categories to select, then using the -inlist()- function | |
. * might be much quicker. The example below selects a series of income | |
. * categories that fall either below the minimum wage in 2009 (15,000 | |
. * dollars/year) or five times over that or more (i.e. earnings == 11, the | |
. * highest income category in the dataset). | |
. tab earnings if inlist(earnings, 1, 2, 3, 11) | |
Person's total | | |
earnings, previous | | |
calendar year | Freq. Percent Cum. | |
------------------------+----------------------------------- | |
$01 to $4999 | 1,081 21.63 21.63 | |
$5000 to $9999 | 923 18.47 40.10 | |
$10000 to $14999 | 1,252 25.06 65.16 | |
$75000 and over | 1,741 34.84 100.00 | |
------------------------+----------------------------------- | |
Total | 4,997 100.00 | |
. | |
. * This function is also practical to select countries, regions and other | |
. * categories of nominal variables in country-level data, and it accepts | |
. * strings, i.e. text values. Examples to follow later. For the moment, simply | |
. * note that the example above uses a tabulation command because the earnings | |
. * variable is categorical. This difference in the type of variable is crucial, | |
. * and is illustrated further below. | |
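. * | |
. * As a sketch (shown as comments, not run here), two of the commands above can | |
. * be rewritten with -inlist()- and its sibling -inrange()-, which selects a | |
. * range of values (20 to 24 matches 'age < 25' because age is measured in | |
. * whole years). The last line shows string values; the variable name | |
. * 'country' is hypothetical: | |
. * | |
. *     su bmi if inlist(raceb, 2, 3) | |
. *     su bmi if inrange(age, 20, 24) | |
. *     tab country if inlist(country, "France", "Germany") | |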
. | |
. | |
. * ========================= | |
. * = INDEPENDENT VARIABLES = | |
. * ========================= | |
. | |
. | |
. * Body Mass Index is our 'dependent variable', i.e. the one that we want to | |
. * explain. We have reason to believe that some 'independent' variables like | |
. * gender, health status and race could be influencing BMI. In other words, | |
. * we assume that BMI can be partially 'predicted' by sex, health and race. | |
. lookfor sex health race | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
sex byte %8.0g sex_lbl Sex | |
racea int %43.0g racea_lbl | |
Main Racial Background (Pre-1997 | |
Revised OMB Standards), | |
self-reported or interv | |
health byte %23.0g health_lbl | |
Health status | |
uninsured byte %23.0g uninsured_lbl | |
Health Insurance coverage status | |
raceb float %9.0g raceb Race | |
. | |
. | |
. * Summarizing over categories | |
. * --------------------------- | |
. | |
. * Summarize BMI (as well as age and weight) for each value of 'sex'. The | |
. * -su- command assumes that you are describing a variable that can take any | |
. * numeric value, and shows summary statistics for it. The -bysort- prefix | |
. * (shorthand -bys-) takes one categorical variable and repeats the command | |
. * over its categories. The entire command thus reads: for each value of the | |
. * 'sex' variable, summarize the continuous variables 'bmi', 'age' and 'weight'. | |
. bysort sex: su bmi age weight | |
------------------------------------------------------------------------------------ | |
-> sex = Male | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 10978 27.57415 4.430363 17.14956 48.70874 | |
age | 10978 46.47404 16.93469 18 84 | |
weight | 10978 190.5036 33.0331 126 285 | |
------------------------------------------------------------------------------------ | |
-> sex = Female | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 13313 27.01919 5.636827 15.20329 50.48837 | |
age | 13313 47.09419 17.35074 18 84 | |
weight | 13313 157.8174 33.65398 100 259 | |
. | |
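. * As a sketch (shown as a comment, not run here), -tabstat- would produce a | |
. * more compact table of the same statistics, with one block per category: | |
. * | |
. *     tabstat bmi age weight, by(sex) s(n mean sd min max) | |
. | |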
. * Read the Stata codebook for the 'health' variable. | |
. codebook health | |
------------------------------------------------------------------------------------ | |
health Health status | |
------------------------------------------------------------------------------------ | |
type: numeric (byte) | |
label: health_lbl | |
range: [1,5] units: 1 | |
unique values: 5 missing .: 7/24291 | |
tabulation: Freq. Numeric Label | |
6750 1 Excellent | |
7833 2 Very Good | |
6423 3 Good | |
2496 4 Fair | |
782 5 Poor | |
7 . | |
. | |
. * The codebook shows that the health variable comes in ordered categories. | |
. * In that case, the -su- command will not inspect the variable properly. You | |
. * will instead need to use either the -tab- or the -fre- command to describe | |
. * the variable properly, by viewing its frequencies: | |
. fre health | |
health -- Health status | |
----------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------------+-------------------------------------------- | |
Valid 1 Excellent | 6750 27.79 27.80 27.80 | |
2 Very Good | 7833 32.25 32.26 60.05 | |
3 Good | 6423 26.44 26.45 86.50 | |
4 Fair | 2496 10.28 10.28 96.78 | |
5 Poor | 782 3.22 3.22 100.00 | |
Total | 24284 99.97 100.00 | |
Missing . | 7 0.03 | |
Total | 24291 100.00 | |
----------------------------------------------------------------- | |
. | |
. * Note that health is measured on five levels that come as values (1-5), and | |
. * labels attached to them (from 'Excellent' to 'Poor'). We will discuss this | |
. * structure in depth when we introduce variable types and value labels. For | |
. * the moment, simply note that the health variable holds an ordinal scale | |
. * of self-reported health status, and that the values attached to its labels | |
. * are merely a way to create an ordinal scale: 'poor' health is not worth 5 | |
. * points of anything. Refer later to the course material to make sure that | |
. * you are familiar with the terminology and notions of variable description. | |
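. * | |
. * To see how values and labels are paired (a sketch, shown as comments and not | |
. * run here), tabulate the variable without its labels, or list the value label | |
. * named 'health_lbl' in the codebook above: | |
. * | |
. *     tab health, nolabel | |
. *     label list health_lbl | |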
. | |
. * Summarize BMI (as well as weight) for each value of the health | |
. * variable. Note that -bys- is shorthand for the -bysort- prefix. | |
. bys health: su bmi weight | |
------------------------------------------------------------------------------------ | |
-> health = Excellent | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 6750 25.78944 4.399813 16.13866 49.60056 | |
weight | 6750 165.2935 34.52845 100 285 | |
------------------------------------------------------------------------------------ | |
-> health = Very Good | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 7833 27.06963 4.864313 15.20329 50.48837 | |
weight | 7833 172.0412 36.43623 100 285 | |
------------------------------------------------------------------------------------ | |
-> health = Good | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 6423 28.21763 5.380641 15.5041 50.48837 | |
weight | 6423 177.2219 38.11962 100 285 | |
------------------------------------------------------------------------------------ | |
-> health = Fair | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 2496 28.89986 5.636523 15.20329 48.81944 | |
weight | 2496 179.5897 38.78594 100 285 | |
------------------------------------------------------------------------------------ | |
-> health = Poor | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 782 29.08097 6.087322 15.6605 48.70874 | |
weight | 782 180.7225 40.37895 100 283 | |
------------------------------------------------------------------------------------ | |
-> health = . | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 7 26.17216 4.125204 19.15614 31.95455 | |
weight | 7 166.4286 33.44576 126 205 | |
. | |
. | |
. * Visualization over categories | |
. * ----------------------------- | |
. | |
. * Graph the mean BMI of each ethnic group, using a dot plot. | |
. gr dot bmi, over(raceb) ytitle("Average Body Mass Index") /// | |
> name(bmi_race, replace) | |
. | |
. * Add a new categorical division between men and women to the dot plot. | |
. gr dot bmi, over(sex) over(raceb) ytitle("Average Body Mass Index") /// | |
> name(bmi_race, replace) | |
. | |
. * Each independent variable might influence BMI, but can also interact with | |
. * another independent variable, making the explanation of BMI more complex | |
. * and detailed because its predictors might also significantly interact with | |
. * each other. Visualization allows you to explore that intuition in the same | |
. * way that it helped you think about predictors of the dependent variable. | |
. | |
. * The graph below explores a relationship between three independent variables. | |
. * An additional trick in this graph is that its command runs over three lines. | |
. * The '///' indicates that you have to select all three lines to properly run | |
. * the graph command. This trick helps formatting do-files in short lines. | |
. gr dot health, exclude0 yreverse over(sex) over(raceb) /// | |
> ylabel(1 "Excellent" 3 "Good" 5 "Poor") ytitle("Average health status") /// | |
> name(health_sex_race, replace) | |
. | |
. * The graph uses several options: due to the numerical coding of the 'health' | |
. * variable, we had to remove 0 from the dot plot, and reverse the axis. We also | |
. * made the horizontal (y) axis more legible by adding (y)labels and a (y)title. | |
. * Note that visual differences are naturally not sufficient to establish that | |
. * mean BMI or health status differs significantly across racial/ethnic groups. | |
. | |
. | |
. * ========================== | |
. * = FINALIZING THE DATASET = | |
. * ========================== | |
. | |
. | |
. * Patterns of missing values | |
. * -------------------------- | |
. | |
. * Finally, let's see how many observations have all variables measured for our | |
. * selection of variables. The -misstable- command produces a pattern that shows | |
. * the number of observations with no missing values across all listed variables. | |
. misstable pat bmi age sex health raceb earnings uninsured, freq | |
Missing-value patterns | |
(1 means complete) | |
| Pattern | |
Frequency | 1 | |
------------+------------- | |
24,284 | 1 | |
| | |
7 | 0 | |
------------+------------- | |
24,291 | | |
Variables are (1) health | |
. | |
. * There are only 7 missing values in the selection of variables above. Let's see | |
. * what happens if we also want to analyze the 'strength' and 'vigor' variables, | |
. * which measure physical activity. We remove the -freq- option to read the | |
. * share of complete observations as a percentage. The loss is still trivial. | |
. misstable pat bmi age sex health raceb earnings uninsured strength vigor | |
Missing-value patterns | |
(1 means complete) | |
| Pattern | |
Percent | 1 2 3 | |
------------+------------- | |
99% | 1 1 1 | |
| | |
<1 | 1 1 0 | |
<1 | 1 0 1 | |
<1 | 1 0 0 | |
<1 | 0 1 1 | |
<1 | 0 0 0 | |
------------+------------- | |
100% | | |
Variables are (1) health (2) strength (3) vigor | |
. | |
. | |
. * Subsetting | |
. * ---------- | |
. | |
. * We can now finalize the dataset by deleting observations with missing data in | |
. * our selection of variables. The final count is the actual sample size that we | |
. * will analyze at later stages of the course. | |
. drop if mi(bmi, age, sex, health, raceb, earnings, uninsured, strength, vigor) | |
(228 observations deleted) | |
. | |
. * Final count. | |
. count | |
24063 | |
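. | |
. * A minimal sketch, not part of the original run: count the missing values of | |
. * each observation with -egen, rowmiss()- and check that none remain after the | |
. * -drop- above. The variable name 'n_miss' is an assumption for illustration. | |
. egen n_miss = rowmiss(bmi age sex health raceb earnings uninsured strength vigor) | |
. assert n_miss == 0 | |
. drop n_miss | |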
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * The command above closes the log that we opened when we started this do-file. | |
. * Logs are essential to keep records of your analysis. They complement do-files, | |
. * which are records of your commands and comments only. Now that you have closed | |
. * the log, have a quick look at it with the command below. | |
. view code/week2.log | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require fre scheme-burd | |
. | |
. * Log results. | |
. cap log using code/week3.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 3 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Support for Sharia Law in Nine Countries | |
> | |
> - DATA: World Values Survey Wave 4 (2000) | |
> | |
> - Welcome again to Stata. This do-file contains the commands used in our third | |
> session. For coursework, practice with Stata code by running the code again, | |
> and read the full comments on the way. | |
> | |
> - This do-file explores the World Values Survey (WVS) dataset and focuses on | |
> support for sharia law among respondents in Arab-speaking countries, several | |
> of which have been in political turmoil over the past few years. | |
> | |
> - The dependent variable (DV) is a 5-point agreement scale with the statement: | |
> "[The government] should implement only the laws of the sharia". The variable | |
> was measured during WVS Wave 4 (1999-2004). | |
> | |
> - Make sure that you understand how to distinguish continuous and categorical | |
> types of variables by the end of this training session. Also make sure that | |
> you know how to encode variables and missing values for analysis in Stata. | |
> | |
> - Select a dataset for analysis. Use the -lookfor- and -lookfor_all- commands | |
> to identify which dataset has variables that match your interests, and use | |
> the -d-, -fre- and -su- commands to describe and inspect the variables. | |
> | |
> - Start writing a draft do-file in which you prepare your dataset for analysis. | |
> Use the course do-files for inspiration: start with a short header, then load | |
> the data and describe the variables, recoding them if needed. | |
> | |
> - When selecting variables, make sure that the dependent variable is continuous | |
> or pseudo-continuous. The dependent variable (DV) is the one that you want to | |
> explain using your selection of independent variables (IVs). | |
> | |
> - Write a draft paragraph that describes the dependent variable in sufficient | |
> detail, and another draft paragraph that lists your independent variables and | |
> offers a general theory on the articulation between your variables. | |
> | |
> Last updated 2013-02-18. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load WVS dataset. | |
. use data/wvs2000, clear | |
(World Values Survey 2000) | |
. | |
. * Survey weights (see WVS documentation). | |
. svyset [pw = v245] | |
pweight: v245 | |
VCE: linearized | |
Single unit: missing | |
Strata 1: <one> | |
SU 1: <observations> | |
FPC 1: <zero> | |
. | |
. * Inspect the list of included countries. | |
. fre v2 | |
v2 -- country/region | |
--------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------+-------------------------------------------- | |
Valid 8 Spain | 1209 1.98 1.98 1.98 | |
11 Usa | 1200 1.97 1.97 3.95 | |
12 Canada | 1931 3.16 3.16 7.11 | |
13 Japan | 1362 2.23 2.23 9.34 | |
14 Mexico | 1535 2.51 2.51 11.85 | |
15 S Africa | 3000 4.91 4.91 16.76 | |
19 Sweden | 1015 1.66 1.66 18.43 | |
22 Argentina | 1280 2.10 2.10 20.52 | |
24 S Korea | 1200 1.97 1.97 22.49 | |
27 Puerto Rico | 720 1.18 1.18 23.67 | |
29 Nigeria | 2022 3.31 3.31 26.98 | |
30 Chile | 1200 1.97 1.97 28.94 | |
32 India | 2002 3.28 3.28 32.22 | |
38 Pakistan | 2000 3.28 3.28 35.50 | |
39 China | 1000 1.64 1.64 37.14 | |
44 Turkey | 3401 5.57 5.57 42.71 | |
51 Peru | 1501 2.46 2.46 45.16 | |
53 Venezuela | 1200 1.97 1.97 47.13 | |
57 Zimbabwe | 1002 1.64 1.64 48.77 | |
58 Philippines | 1200 1.97 1.97 50.74 | |
59 Israel | 1199 1.96 1.96 52.70 | |
60 Tanzania | 1171 1.92 1.92 54.62 | |
61 Moldova | 1008 1.65 1.65 56.27 | |
67 Saudi Arabia | 1502 2.46 2.46 58.73 | |
69 Bangladesh | 1500 2.46 2.46 61.18 | |
70 Indonesia | 1004 1.64 1.64 62.83 | |
71 Vietnam | 1000 1.64 1.64 64.47 | |
72 Albania | 1000 1.64 1.64 66.10 | |
74 Uganda | 1002 1.64 1.64 67.74 | |
77 Singapore | 1512 2.48 2.48 70.22 | |
81 Serbia | 1200 1.97 1.97 72.19 | |
82 Montenegro | 1060 1.74 1.74 73.92 | |
83 Macedonia | 1055 1.73 1.73 75.65 | |
89 Egypt | 3000 4.91 4.91 80.56 | |
90 Morocco | 2264 3.71 3.71 84.27 | |
91 Iran | 2532 4.15 4.15 88.42 | |
92 Jordan | 1223 2.00 2.00 90.42 | |
93 Bosnia | 1200 1.97 1.97 92.38 | |
96 Algeria | 1282 2.10 2.10 94.48 | |
97 Iraq | 2325 3.81 3.81 98.29 | |
99 Kyrgyzstan | 1043 1.71 1.71 100.00 | |
Total | 61062 100.00 100.00 | |
--------------------------------------------------------------------- | |
. | |
. * Rename the variable to something understandable. | |
. ren v2 country | |
. | |
. * Survey years. | |
. table country, c(min s020 max s020) | |
------------------------------------- | |
country/regi | | |
on | min(s020) max(s020) | |
-------------+----------------------- | |
Spain | 2000 2000 | |
Usa | 1999 1999 | |
Canada | 2000 2000 | |
Japan | 2000 2000 | |
Mexico | 2000 2000 | |
S Africa | 2001 2001 | |
Sweden | 1999 1999 | |
Argentina | 1999 1999 | |
S Korea | 2001 2001 | |
Puerto Rico | 2001 2001 | |
Nigeria | 2000 2000 | |
Chile | 2000 2000 | |
India | 2001 2001 | |
Pakistan | 2001 2001 | |
China | 2001 2001 | |
Turkey | 2001 2001 | |
Peru | 2001 2001 | |
Venezuela | 2000 2000 | |
Zimbabwe | 2001 2001 | |
Philippines | 2001 2001 | |
Israel | 2001 2001 | |
Tanzania | 2001 2001 | |
Moldova | 2002 2002 | |
Saudi Arabia | 2003 2003 | |
Bangladesh | 2002 2002 | |
Indonesia | 2001 2001 | |
Vietnam | 2001 2001 | |
Albania | 2002 2002 | |
Uganda | 2001 2001 | |
Singapore | 2002 2002 | |
Serbia | 2001 2001 | |
Montenegro | 2001 2001 | |
Macedonia | 2001 2001 | |
Egypt | 2000 2000 | |
Morocco | 2001 2001 | |
Iran | 2000 2000 | |
Jordan | 2001 2001 | |
Bosnia | 2001 2001 | |
Algeria | 2002 2002 | |
Iraq | 2004 2004 | |
Kyrgyzstan | 2003 2003 | |
------------------------------------- | |
. | |
. | |
. * Dependent variable: Support for sharia law | |
. * ------------------------------------------ | |
. | |
. * Inspect the overall dependent variable. | |
. fre iv166 | |
iv166 -- laws of the shari'a | |
---------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-------------------------------------+-------------------------------------------- | |
Valid -4 not asked | 45204 74.03 74.03 74.03 | |
1 agree strongly | 5499 9.01 9.01 83.04 | |
2 agree | 3572 5.85 5.85 88.89 | |
3 neither agree or disagree | 2364 3.87 3.87 92.76 | |
4 disagree | 1335 2.19 2.19 94.94 | |
5 strongly disagree | 771 1.26 1.26 96.21 | |
8 na | 1476 2.42 2.42 98.62 | |
9 dk | 841 1.38 1.38 100.00 | |
Total | 61062 100.00 100.00 | |
---------------------------------------------------------------------------------- | |
. | |
. * Clone the nonmissing values of the dependent variable (exclude 'DK/NA' codes). | |
. clonevar sharia = iv166 if iv166 > 0 & iv166 < 8 | |
(47521 missing values generated) | |
. | |
. * We use -clonevar- to create a variable with the same coding and labels as the | |
. * original one, and exclude the missing-value codes from the clone with the -if- | |
. * qualifier. The first argument is the name of the new variable. | |
. | |
. * This approach to data preparation lets you rename and recode while preserving | |
. * the original variable. The new variable will appear at the end of the dataset, | |
. * as the -d- command (for -describe-) would show. | |
. | |
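. * A minimal sketch, not part of the original run: an alternative is to keep the | |
. * reasons for missingness as extended missing values with -mvdecode-. The name | |
. * 'sharia_alt' and the codes .a/.b/.c are assumptions made for illustration. | |
. clonevar sharia_alt = iv166 | |
. qui mvdecode sharia_alt, mv(-4 = .a \ 8 = .b \ 9 = .c) | |
. * Both versions should flag the same observations as missing. | |
. assert mi(sharia_alt) == mi(sharia) | |
. drop sharia_alt | |
. | |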
. * Inspect the clean version of the variable. | |
. fre sharia | |
sharia -- laws of the shari'a | |
--------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------------------+-------------------------------------------- | |
Valid 1 agree strongly | 5499 9.01 40.61 40.61 | |
2 agree | 3572 5.85 26.38 66.99 | |
3 neither agree or disagree | 2364 3.87 17.46 84.45 | |
4 disagree | 1335 2.19 9.86 94.31 | |
5 strongly disagree | 771 1.26 5.69 100.00 | |
Total | 13541 22.18 100.00 | |
Missing . | 47521 77.82 | |
Total | 61062 100.00 | |
--------------------------------------------------------------------------------- | |
. | |
. * Find in which countries the variable was measured. | |
. fre country if !mi(sharia) | |
country -- country/region | |
--------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------+-------------------------------------------- | |
Valid 29 Nigeria | 626 4.62 4.62 4.62 | |
38 Pakistan | 1949 14.39 14.39 19.02 | |
67 Saudi Arabia | 1413 10.43 10.43 29.45 | |
69 Bangladesh | 1217 8.99 8.99 38.44 | |
70 Indonesia | 929 6.86 6.86 45.30 | |
89 Egypt | 2970 21.93 21.93 67.23 | |
92 Jordan | 1176 8.68 8.68 75.92 | |
96 Algeria | 1177 8.69 8.69 84.61 | |
97 Iraq | 2084 15.39 15.39 100.00 | |
Total | 13541 100.00 100.00 | |
--------------------------------------------------------------------- | |
. | |
. * Remove other countries. | |
. drop if mi(sharia) | |
(47521 observations deleted) | |
. | |
. * In the first command, -!mi(sharia)- means 'sharia is not missing' (the '!' | |
. * negates the -mi()- function) and therefore produces the list of countries for | |
. * which the DV is available. In the second command, -drop- removes all | |
. * observations for which the DV is missing. | |
. | |
. | |
. * Recoding to dummies | |
. * ------------------- | |
. | |
. * Recall the DV frequencies. | |
. fre sharia | |
sharia -- laws of the shari'a | |
--------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------------------+-------------------------------------------- | |
Valid 1 agree strongly | 5499 40.61 40.61 40.61 | |
2 agree | 3572 26.38 26.38 66.99 | |
3 neither agree or disagree | 2364 17.46 17.46 84.45 | |
4 disagree | 1335 9.86 9.86 94.31 | |
5 strongly disagree | 771 5.69 5.69 100.00 | |
Total | 13541 100.00 100.00 | |
--------------------------------------------------------------------------------- | |
. | |
. * Recode the variable to a simpler form: pro-sharia respondents vs others. | |
. * The recoded variable is binary: it takes only two values, either 0 or 1. | |
. * These variables are affectionately called 'dummies'. | |
. recode sharia /// | |
> (1/2 = 1 "Support") /// | |
> (4/5 = 0 "Oppose") /// | |
> (else = .), gen(prosharia) | |
(8042 differences between sharia and prosharia) | |
. la var prosharia "Legislative enforcement of sharia (0/1)" | |
. fre prosharia | |
prosharia -- Legislative enforcement of sharia (0/1) | |
--------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------+-------------------------------------------- | |
Valid 0 Oppose | 2106 15.55 18.84 18.84 | |
1 Support | 9071 66.99 81.16 100.00 | |
Total | 11177 82.54 100.00 | |
Missing . | 2364 17.46 | |
Total | 13541 100.00 | |
--------------------------------------------------------------- | |
. | |
. * Another way to understand a binary variable is to look at its mean: because | |
. * the values of that variable are equal to either 0 or 1, its mean reads as the | |
. * proportion of positive cases (1) within the total number of observations. | |
. su prosharia | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
prosharia | 11177 .8115773 .3910668 0 1 | |
. | |
. * Same thing, different command (more flexible; used later). | |
. tabstat prosharia, s(n mean) c(s) | |
variable | N mean | |
-------------+-------------------- | |
prosharia | 11177 .8115773 | |
---------------------------------- | |
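. | |
. * A minimal sketch, not part of the original run: the mean above is simply the | |
. * number of supporters divided by the number of nonmissing observations, i.e. | |
. * 9071 / 11177 = .8115773. The -assert- below checks this from the saved result | |
. * r(mean), up to a small numerical tolerance (the tolerance is an assumption). | |
. qui su prosharia | |
. assert reldif(r(mean), 9071 / 11177) < 1e-6 | |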
. | |
. * Finally, you can generate dummies for each value of a variable, which here | |
. * means generating five dummies starting with the 'sharia_' prefix: | |
. tab sharia, gen(sharia_) | |
laws of the shari'a | Freq. Percent Cum. | |
--------------------------+----------------------------------- | |
agree strongly | 5,499 40.61 40.61 | |
agree | 3,572 26.38 66.99 | |
neither agree or disagree | 2,364 17.46 84.45 | |
disagree | 1,335 9.86 94.31 | |
strongly disagree | 771 5.69 100.00 | |
--------------------------+----------------------------------- | |
Total | 13,541 100.00 | |
. | |
. * Show all variables named 'sharia_[whatever]'. | |
. codebook sharia_*, c | |
Variable Obs Unique Mean Min Max Label | |
------------------------------------------------------------------------------------ | |
sharia_1 13541 2 .4061 0 1 sharia==agree strongly | |
sharia_2 13541 2 .2637914 0 1 sharia==agree | |
sharia_3 13541 2 .1745809 0 1 sharia==neither agree or disagree | |
sharia_4 13541 2 .0985895 0 1 sharia==disagree | |
sharia_5 13541 2 .0569382 0 1 sharia==strongly disagree | |
------------------------------------------------------------------------------------ | |
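. | |
. * A minimal sketch, not part of the original run: since every respondent falls | |
. * in exactly one category of 'sharia', the five dummies sum to 1 for each | |
. * observation, and their means (shown above) sum to 1 across categories. | |
. assert sharia_1 + sharia_2 + sharia_3 + sharia_4 + sharia_5 == 1 | |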
. | |
. | |
. * Stacked plots with dummies | |
. * -------------------------- | |
. | |
. * One reason to recode is to have a look at simplified versions of the DV in | |
. * graphs. Here's a dot plot showing the mean value of the binary DV (that is, | |
. * the proportion of supporters) in each country, sorted in descending order: | |
. gr dot prosharia, over(country, sort(1)des) /// | |
> name(dv_dot, replace) | |
. | |
. * Recode the DV to three groups. | |
. recode sharia /// | |
> (1/2 = 1 "Agree") /// | |
> (3 = 2 "Neither") /// | |
> (4/5 = 3 "Disagree") /// | |
> (else = .), gen(sharia3) | |
(8042 differences between sharia and sharia3) | |
. la var sharia3 "Legislative enforcement of sharia (3 groups)" | |
. | |
. * Recode each category to a dummy. | |
. tab sharia3, gen(sharia3_) | |
Legislative | | |
enforcement | | |
of sharia | | |
(3 groups) | Freq. Percent Cum. | |
------------+----------------------------------- | |
Agree | 9,071 66.99 66.99 | |
Neither | 2,364 17.46 84.45 | |
Disagree | 2,106 15.55 100.00 | |
------------+----------------------------------- | |
Total | 13,541 100.00 | |
. d sharia3_* | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
sharia3_1 byte %8.0g sharia3==Agree | |
sharia3_2 byte %8.0g sharia3==Neither | |
sharia3_3 byte %8.0g sharia3==Disagree | |
. | |
. * Comparative plot at the country level, shown with tons of graphical options | |
. * to illustrate a limitation of Stata: it requires some work to produce decent | |
. * visualizations, especially with categorical variables. | |
. gr bar sharia3_*, over(country, sort(1)des lab(angle(45))) stack percent /// | |
> ti("Support for sharia legislation") yti("% respondents") /// | |
> legend(row(1) order(1 "For" 2 "Neutral" 3 "Against")) /// | |
> note("World Values Survey 1999-2004. {it:N} = 13,541") /// | |
> scheme(burd3) name(dv_bar, replace) | |
. | |
. * Identical plot, shown with horizontal bars and fewer options. Some settings | |
. * that show up on my end are provided by the burd3 scheme, which is part of | |
. * the course material; it will look different with other graph schemes. | |
. gr hbar sharia3_*, over(country, sort(1)des) stack percent /// | |
> ti("Support for sharia legislation") yti("% respondents") /// | |
> legend(pos(1) row(1) order(1 "For" 2 "Neutral" 3 "Against")) /// | |
> note("World Values Survey 1999-2004. {it:N} = 13,541") /// | |
> scheme(burd3) name(dv_hbar, replace) | |
. | |
. | |
. * ========================= | |
. * = INDEPENDENT VARIABLES = | |
. * ========================= | |
. | |
. | |
. * Describe independent variables. | |
. d v223 v225 v226 v241 | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
v223 byte %8.0g v223 sex | |
v225 byte %8.0g v225 age | |
v226 byte %8.0g v226 highest educational level attained | |
v241 byte %8.0g v241 size of town | |
. | |
. * Overview of variable codes. | |
. fre v223 v225 v226 v241 | |
v223 -- sex | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 1 male | 6961 51.41 51.41 51.41 | |
2 female | 6580 48.59 48.59 100.00 | |
Total | 13541 100.00 100.00 | |
-------------------------------------------------------------- | |
v225 -- age | |
----------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------+-------------------------------------------- | |
Valid 15 | 20 0.15 0.15 0.15 | |
16 | 67 0.49 0.49 0.64 | |
17 | 100 0.74 0.74 1.38 | |
18 18 | 286 2.11 2.11 3.49 | |
19 | 320 2.36 2.36 5.86 | |
20 | 375 2.77 2.77 8.63 | |
21 | 331 2.44 2.44 11.07 | |
22 | 491 3.63 3.63 14.70 | |
23 | 468 3.46 3.46 18.15 | |
24 | 503 3.71 3.71 21.87 | |
25 | 458 3.38 3.38 25.25 | |
26 | 468 3.46 3.46 28.71 | |
27 | 383 2.83 2.83 31.53 | |
28 | 413 3.05 3.05 34.58 | |
29 | 321 2.37 2.37 36.95 | |
30 | 492 3.63 3.63 40.59 | |
31 | 349 2.58 2.58 43.17 | |
32 | 450 3.32 3.32 46.49 | |
33 | 342 2.53 2.53 49.01 | |
34 | 346 2.56 2.56 51.57 | |
: | : : : : | |
72 | 32 0.24 0.24 98.71 | |
73 | 26 0.19 0.19 98.90 | |
74 | 18 0.13 0.13 99.03 | |
75 | 29 0.21 0.21 99.25 | |
76 | 20 0.15 0.15 99.39 | |
77 | 15 0.11 0.11 99.51 | |
78 | 11 0.08 0.08 99.59 | |
79 | 7 0.05 0.05 99.64 | |
80 | 11 0.08 0.08 99.72 | |
81 | 6 0.04 0.04 99.76 | |
82 | 10 0.07 0.07 99.84 | |
83 | 4 0.03 0.03 99.87 | |
85 | 2 0.01 0.01 99.88 | |
86 | 3 0.02 0.02 99.90 | |
87 | 3 0.02 0.02 99.93 | |
88 | 1 0.01 0.01 99.93 | |
90 90 | 1 0.01 0.01 99.94 | |
92 | 1 0.01 0.01 99.95 | |
93 | 2 0.01 0.01 99.96 | |
99 dk | 5 0.04 0.04 100.00 | |
Total | 13541 100.00 100.00 | |
----------------------------------------------------------- | |
v226 -- highest educational level attained | |
----------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------------------------------+-------------------------------------------- | |
Valid 1 no formal education | 2169 16.02 16.02 16.02 | |
2 incomplete primary school | 1218 8.99 8.99 25.01 | |
3 complete primary school | 1791 13.23 13.23 38.24 | |
4 incomplete secondary | 764 5.64 5.64 43.88 | |
school: | | |
technical/vocational type | | |
5 complete secondary school: | 1886 13.93 13.93 57.81 | |
technical/vocational type | | |
6 incomplete secondary: | 805 5.94 5.94 63.75 | |
university-preparatory | | |
type | | |
7 complete secondary: | 1974 14.58 14.58 78.33 | |
university-preparatory | | |
type | | |
8 some university without | 1004 7.41 7.41 85.75 | |
degree | | |
9 university with degree | 1881 13.89 13.89 99.64 | |
98 na | 14 0.10 0.10 99.74 | |
99 dk | 35 0.26 0.26 100.00 | |
Total | 13541 100.00 100.00 | |
----------------------------------------------------------------------------------- | |
v241 -- size of town | |
------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
----------------------------+-------------------------------------------- | |
Valid -4 not asked | 2084 15.39 15.39 15.39 | |
1 2,000 and less | 597 4.41 4.41 19.80 | |
2 2,000-5,000 | 1550 11.45 11.45 31.25 | |
3 5,000-10,000 | 1777 13.12 13.12 44.37 | |
4 10,000-20,000 | 1006 7.43 7.43 51.80 | |
5 20,000-50,000 | 1374 10.15 10.15 61.95 | |
6 50,000-100,000 | 753 5.56 5.56 67.51 | |
7 100,000-500,000 | 958 7.07 7.07 74.58 | |
8 500,000 and more | 3416 25.23 25.23 99.81 | |
9 dk | 26 0.19 0.19 100.00 | |
Total | 13541 100.00 100.00 | |
------------------------------------------------------------------------- | |
. | |
. | |
. * IV: Gender | |
. * ---------- | |
. | |
. fre v223 | |
v223 -- sex | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 1 male | 6961 51.41 51.41 51.41 | |
2 female | 6580 48.59 48.59 100.00 | |
Total | 13541 100.00 100.00 | |
-------------------------------------------------------------- | |
. | |
. * Recode gender as a meaningful binary (either female or not) using a logical | |
. * test (in parentheses), excluding missing observations from the operation and | |
. * applying the 'female' label to the new 'female' dummy variable. Females are | |
. * coded 2 in the original variable, so the test is 'v223 == 2': | |
. gen female:female = (v223 == 2) if !mi(v223) | |
. | |
. * Label the values. | |
. la def female 0 "Male" 1 "Female", replace | |
. | |
. * Label the variable. | |
. la var female "Gender" | |
. | |
. * Final result. | |
. fre female | |
female -- Gender | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 0 Male | 6961 51.41 51.41 51.41 | |
1 Female | 6580 48.59 48.59 100.00 | |
Total | 13541 100.00 100.00 | |
-------------------------------------------------------------- | |
. | |
. * Compute the average support for sharia law among each gender group. Since the | |
. * recoded DV only takes values of 0 or 1, its mean indicates the proportion of | |
. * sharia supporters in each gender group. | |
. bys female: su prosharia | |
------------------------------------------------------------------------------------ | |
-> female = Male | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
prosharia | 5762 .8075321 .3942727 0 1 | |
------------------------------------------------------------------------------------ | |
-> female = Female | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
prosharia | 5415 .8158818 .3876163 0 1 | |
. | |
. * The same result can be viewed as column percentages by crosstabulating the | |
. * variables. | |
. tab prosharia female, col nof | |
Legislativ | | |
e | | |
enforcemen | | |
t of | | |
sharia | Gender | |
(0/1) | Male Female | Total | |
-----------+----------------------+---------- | |
Oppose | 19.25 18.41 | 18.84 | |
Support | 80.75 81.59 | 81.16 | |
-----------+----------------------+---------- | |
Total | 100.00 100.00 | 100.00 | |
. | |
. | |
. * IV: Age | |
. * ------- | |
. | |
. fre v225 | |
v225 -- age | |
----------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------+-------------------------------------------- | |
Valid 15 | 20 0.15 0.15 0.15 | |
16 | 67 0.49 0.49 0.64 | |
17 | 100 0.74 0.74 1.38 | |
18 18 | 286 2.11 2.11 3.49 | |
19 | 320 2.36 2.36 5.86 | |
20 | 375 2.77 2.77 8.63 | |
21 | 331 2.44 2.44 11.07 | |
22 | 491 3.63 3.63 14.70 | |
23 | 468 3.46 3.46 18.15 | |
24 | 503 3.71 3.71 21.87 | |
25 | 458 3.38 3.38 25.25 | |
26 | 468 3.46 3.46 28.71 | |
27 | 383 2.83 2.83 31.53 | |
28 | 413 3.05 3.05 34.58 | |
29 | 321 2.37 2.37 36.95 | |
30 | 492 3.63 3.63 40.59 | |
31 | 349 2.58 2.58 43.17 | |
32 | 450 3.32 3.32 46.49 | |
33 | 342 2.53 2.53 49.01 | |
34 | 346 2.56 2.56 51.57 | |
: | : : : : | |
72 | 32 0.24 0.24 98.71 | |
73 | 26 0.19 0.19 98.90 | |
74 | 18 0.13 0.13 99.03 | |
75 | 29 0.21 0.21 99.25 | |
76 | 20 0.15 0.15 99.39 | |
77 | 15 0.11 0.11 99.51 | |
78 | 11 0.08 0.08 99.59 | |
79 | 7 0.05 0.05 99.64 | |
80 | 11 0.08 0.08 99.72 | |
81 | 6 0.04 0.04 99.76 | |
82 | 10 0.07 0.07 99.84 | |
83 | 4 0.03 0.03 99.87 | |
85 | 2 0.01 0.01 99.88 | |
86 | 3 0.02 0.02 99.90 | |
87 | 3 0.02 0.02 99.93 | |
88 | 1 0.01 0.01 99.93 | |
90 90 | 1 0.01 0.01 99.94 | |
92 | 1 0.01 0.01 99.95 | |
93 | 2 0.01 0.01 99.96 | |
99 dk | 5 0.04 0.04 100.00 | |
Total | 13541 100.00 100.00 | |
----------------------------------------------------------- | |
. | |
. * Strangely enough, '99' is a missing value here, so we replace '99' values with | |
. * a missing value code. The -replace- command is the quickest way to do that. | |
. replace v225 = . if v225 == 99 | |
(5 real changes made, 5 to missing) | |
. | |
. * We can now clone the variable. | |
. clonevar age = v225 | |
(5 missing values generated) | |
. | |
. * Use -summarize- (or simply -su-) to get the summary statistics, as appropriate | |
. * for continuous variables where the mean and standard deviation are meaningful. | |
. * Do -not- use either -fre- or -tab- to summarize a continuous variable! | |
. su age | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
age | 13536 36.32159 13.53227 15 93 | |
. | |
. * Histograms showing the distribution of age in each country. | |
. hist age, by(country, note("")) bin(9) percent /// | |
> xti("Age distribution") /// | |
> name(age,replace) | |
. | |
. * Recode to quartiles -- shown for demonstration purposes: recoding to groups | |
. * makes much more sense here, but recoding to n-quantiles like percentiles or | |
. * quartiles is useful in many exploratory situations. | |
. xtile age_q4 = age, nq(4) | |
. | |
. * Check that the quartiles each capture roughly a quarter of the distribution. | |
. fre age_q4 | |
age_q4 -- 4 quantiles of age | |
----------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------+-------------------------------------------- | |
Valid 1 | 3419 25.25 25.26 25.26 | |
2 | 3564 26.32 26.33 51.59 | |
3 | 3197 23.61 23.62 75.21 | |
4 | 3356 24.78 24.79 100.00 | |
Total | 13536 99.96 100.00 | |
Missing . | 5 0.04 | |
Total | 13541 100.00 | |
----------------------------------------------------------- | |
. | |
. * Inspect how age varies within each quartile (e.g. compare top and bottom 25%). | |
. tab age_q4, sum(age) | |
4 quantiles | Summary of age | |
of age | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
1 | 21.596666 2.4919297 3419 | |
2 | 29.857183 2.5735494 3564 | |
3 | 39.049734 2.8584182 3197 | |
4 | 55.589094 8.5925202 3356 | |
------------+------------------------------------ | |
Total | 36.321587 13.532266 13536 | |
. | |
. * As expected, there is more variance in the last, older group. Let's finally | |
. * get the range, or lower (min) and upper (max) bounds, of each age quartile. | |
. table age_q4, c(min age max age) | |
---------------------------------- | |
4 | | |
quantiles | | |
of age | min(age) max(age) | |
----------+----------------------- | |
1 | 15 25 | |
2 | 26 34 | |
3 | 35 44 | |
4 | 45 93 | |
---------------------------------- | |
. | |
. * Recode to four age groups. The -irecode()- function creates categories based | |
. * on continuous intervals: category 0 of age4 will contain observations of age | |
. * up to 33, category 1 will contain those from 34 to 49, and so on. | |
. gen age4:age4 = irecode(age, 33, 49, 64, .) | |
(5 missing values generated) | |
. | |
. * Check the results. This is a different -table- command than the -tab- one used | |
. * previously, which we will get to use for more flexible crosstabulations. | |
. table age4, c(min age max age) | |
---------------------------------- | |
age4 | min(age) max(age) | |
----------+----------------------- | |
0 | 15 33 | |
1 | 34 49 | |
2 | 50 64 | |
3 | 65 93 | |
---------------------------------- | |
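. | |
. * A minimal sketch, not part of the original run: the same groups can be built | |
. * with -egen, cut()-, where at() lists the lower bound of each interval plus an | |
. * upper limit (94 here, just above the oldest respondent). The name 'age4_alt' | |
. * is an assumption made for illustration. | |
. qui egen age4_alt = cut(age), at(15 34 50 65 94) icodes | |
. assert age4_alt == age4 | |
. drop age4_alt | |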
. | |
. * And here's yet another way to crosstabulate: the -tab- command with the -sum- | |
. * option returns the average age in each age group, along with the SD and count. | |
. * More on the SD (standard deviation) next week. | |
. tab age4, sum(age) | |
| Summary of age | |
age4 | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
0 | 25.385867 4.5848243 6637 | |
1 | 40.285682 4.3926977 4484 | |
2 | 55.555377 4.016129 1869 | |
3 | 70.858974 5.3419134 546 | |
------------+------------------------------------ | |
Total | 36.321587 13.532266 13536 | |
. | |
. * Write the value and variable labels. | |
. la def age4 0 "15-33" 1 "34-49" 2 "50-64" 3 "65+", replace | |
. la var age4 "Age groups" | |
. fre age4 | |
age4 -- Age groups | |
------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
----------------+-------------------------------------------- | |
Valid 0 15-33 | 6637 49.01 49.03 49.03 | |
1 34-49 | 4484 33.11 33.13 82.16 | |
2 50-64 | 1869 13.80 13.81 95.97 | |
3 65+ | 546 4.03 4.03 100.00 | |
Total | 13536 99.96 100.00 | |
Missing . | 5 0.04 | |
Total | 13541 100.00 | |
------------------------------------------------------------- | |
. | |
. * Average support for sharia law by gender and age group in each country. | |
. gr dot prosharia, over(female) asyvars over(age4) by(country) /// | |
> name(dv_sex_age2, replace) | |
. | |
. | |
. * IV: Education | |
. * ------------- | |
. | |
. fre v226 | |
v226 -- highest educational level attained | |
----------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------------------------------+-------------------------------------------- | |
Valid 1 no formal education | 2169 16.02 16.02 16.02 | |
2 incomplete primary school | 1218 8.99 8.99 25.01 | |
3 complete primary school | 1791 13.23 13.23 38.24 | |
4 incomplete secondary | 764 5.64 5.64 43.88 | |
school: | | |
technical/vocational type | | |
5 complete secondary school: | 1886 13.93 13.93 57.81 | |
technical/vocational type | | |
6 incomplete secondary: | 805 5.94 5.94 63.75 | |
university-preparatory | | |
type | | |
7 complete secondary: | 1974 14.58 14.58 78.33 | |
university-preparatory | | |
type | | |
8 some university without | 1004 7.41 7.41 85.75 | |
degree | | |
9 university with degree | 1881 13.89 13.89 99.64 | |
98 na | 14 0.10 0.10 99.74 | |
99 dk | 35 0.26 0.26 100.00 | |
Total | 13541 100.00 100.00 | |
----------------------------------------------------------------------------------- | |
. | |
. * Recode to simpler educational attainment levels. | |
. recode v226 /// | |
> (1/2 = 0 "None") /// | |
> (3/4 = 1 "Primary") /// | |
> (5/8 = 2 "Secondary") /// | |
> (9 = 3 "University") /// | |
> (else = .), gen(edu4) | |
(13541 differences between v226 and edu4) | |
. la var edu4 "Education" | |
. fre edu4 | |
edu4 -- Education | |
------------------------------------------------------------------ | |
| Freq. Percent Valid Cum. | |
---------------------+-------------------------------------------- | |
Valid 0 None | 3387 25.01 25.10 25.10 | |
1 Primary | 2555 18.87 18.94 44.04 | |
2 Secondary | 5669 41.87 42.02 86.06 | |
3 University | 1881 13.89 13.94 100.00 | |
Total | 13492 99.64 100.00 | |
Missing . | 49 0.36 | |
Total | 13541 100.00 | |
------------------------------------------------------------------ | |
. | |
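. * Sketch (not part of the original run): cross-tabulating the source variable
. * against its recode is a quick way to check that each original category ended
. * up in the intended group, and that only the "na" and "dk" codes went missing.
. /*
>     tab v226 edu4, missing
> */
.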
. * Histograms showing the distribution of education in each country. Because the | |
. * variable is categorical, the histograms require the -discrete- option, which | |
. * draws one bar per category instead of binning the values. | |
. hist edu4, by(country, note("")) percent discrete xla(0(1)3) /// | |
> name(edu,replace) | |
. | |
. | |
. * IV: Employment status | |
. * --------------------- | |
. | |
. fre v229 | |
v229 -- are you employed now | |
--------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------+-------------------------------------------- | |
Valid 1 full time | 3628 26.79 26.79 26.79 | |
2 part time | 864 6.38 6.38 33.17 | |
3 self employed | 1661 12.27 12.27 45.44 | |
4 retired | 540 3.99 3.99 49.43 | |
5 housewife | 4127 30.48 30.48 79.91 | |
6 students | 1260 9.31 9.31 89.21 | |
7 unemployed | 1137 8.40 8.40 97.61 | |
8 other | 232 1.71 1.71 99.32 | |
9 dk,na | 92 0.68 0.68 100.00 | |
Total | 13541 100.00 100.00 | |
--------------------------------------------------------------------- | |
. | |
. * Clone the variable, setting "other" (8) and "dk,na" (9) to missing. | |
. clonevar empl = v229 if v229 < 8 | |
(324 missing values generated) | |
. fre empl | |
empl -- are you employed now | |
--------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------+-------------------------------------------- | |
Valid 1 full time | 3628 26.79 27.45 27.45 | |
2 part time | 864 6.38 6.54 33.99 | |
3 self employed | 1661 12.27 12.57 46.55 | |
4 retired | 540 3.99 4.09 50.64 | |
5 housewife | 4127 30.48 31.22 81.86 | |
6 students | 1260 9.31 9.53 91.40 | |
7 unemployed | 1137 8.40 8.60 100.00 | |
Total | 13217 97.61 100.00 | |
Missing . | 324 2.39 | |
Total | 13541 100.00 | |
--------------------------------------------------------------------- | |
. | |
. | |
. * IV: Household composition | |
. * ------------------------- | |
. | |
. fre v106 v107 | |
v106 -- marital status | |
---------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-------------------------------------+-------------------------------------------- | |
Valid 1 married | 8309 61.36 61.36 61.36 | |
2 living together as married | 695 5.13 5.13 66.49 | |
3 divorced | 148 1.09 1.09 67.59 | |
4 separated | 59 0.44 0.44 68.02 | |
5 widowed | 507 3.74 3.74 71.77 | |
6 single | 3804 28.09 28.09 99.86 | |
8 na | 3 0.02 0.02 99.88 | |
9 dk | 16 0.12 0.12 100.00 | |
Total | 13541 100.00 100.00 | |
---------------------------------------------------------------------------------- | |
v107 -- have you had any children | |
-------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------------------+-------------------------------------------- | |
Valid 0 no child | 4281 31.62 31.62 31.62 | |
1 1 child | 1198 8.85 8.85 40.46 | |
2 2 children | 1771 13.08 13.08 53.54 | |
3 3 children | 1880 13.88 13.88 67.42 | |
4 4 children | 1430 10.56 10.56 77.99 | |
5 5 children | 975 7.20 7.20 85.19 | |
6 6 children | 701 5.18 5.18 90.36 | |
7 7 children | 431 3.18 3.18 93.55 | |
8 8 or more children | 533 3.94 3.94 97.48 | |
9 na | 341 2.52 2.52 100.00 | |
Total | 13541 100.00 100.00 | |
-------------------------------------------------------------------------- | |
. | |
. * Married dummy. | |
. gen married = (v106 == 1) if v106 < 8 | |
(19 missing values generated) | |
. tab v106 married | |
| married | |
marital status | 0 1 | Total | |
----------------------+----------------------+---------- | |
married | 0 8,309 | 8,309 | |
living together as ma | 695 0 | 695 | |
divorced | 148 0 | 148 | |
separated | 59 0 | 59 | |
widowed | 507 0 | 507 | |
single | 3,804 0 | 3,804 | |
----------------------+----------------------+---------- | |
Total | 5,213 8,309 | 13,522 | |
. | |
. * Children dummy. | |
. gen haskids = (v107 > 0) if v107 < 9 | |
(341 missing values generated) | |
. tab v107 haskids | |
have you had any | haskids | |
children | 0 1 | Total | |
-------------------+----------------------+---------- | |
no child | 4,281 0 | 4,281 | |
1 child | 0 1,198 | 1,198 | |
2 children | 0 1,771 | 1,771 | |
3 children | 0 1,880 | 1,880 | |
4 children | 0 1,430 | 1,430 | |
5 children | 0 975 | 975 | |
6 children | 0 701 | 701 | |
7 children | 0 431 | 431 | |
8 or more children | 0 533 | 533 | |
-------------------+----------------------+---------- | |
Total | 4,281 8,919 | 13,200 | |
. | |
. | |
. * IV: City size | |
. * ------------- | |
. | |
. * Recode to simpler categories. | |
. recode v241 /// | |
> (1/3 = 1 "< 10k") /// | |
> (4/6 = 2 "< 100k") /// | |
> (7 = 3 "< 500k") /// | |
> (8 = 4 "> 500k") /// | |
> (else = .), gen(city4) | |
(12944 differences between v241 and city4) | |
. la var city4 "City size" | |
. fre city4 | |
city4 -- City size | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 1 < 10k | 3924 28.98 34.33 34.33 | |
2 < 100k | 3133 23.14 27.41 61.74 | |
3 < 500k | 958 7.07 8.38 70.12 | |
4 > 500k | 3416 25.23 29.88 100.00 | |
Total | 11431 84.42 100.00 | |
Missing . | 2110 15.58 | |
Total | 13541 100.00 | |
-------------------------------------------------------------- | |
. | |
. | |
. * ===================== | |
. * = FINALIZED DATASET = | |
. * ===================== | |
. | |
. | |
. * Finalizing a dataset before analysis involves doing two things. The first | |
. * one consists of subsetting to fully measured data, which means dropping all | |
. * observations with missing values in the variables selected for analysis. | |
. * This restriction is required by the kind of models that we will run later on. | |
. * Before doing so, we need to subset the data to the countries of interest. | |
. | |
. * Recall how the country variable is coded. | |
. fre country | |
country -- country/region | |
--------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------+-------------------------------------------- | |
Valid 29 Nigeria | 626 4.62 4.62 4.62 | |
38 Pakistan | 1949 14.39 14.39 19.02 | |
67 Saudi Arabia | 1413 10.43 10.43 29.45 | |
69 Bangladesh | 1217 8.99 8.99 38.44 | |
70 Indonesia | 929 6.86 6.86 45.30 | |
89 Egypt | 2970 21.93 21.93 67.23 | |
92 Jordan | 1176 8.68 8.68 75.92 | |
96 Algeria | 1177 8.69 8.69 84.61 | |
97 Iraq | 2084 15.39 15.39 100.00 | |
Total | 13541 100.00 100.00 | |
--------------------------------------------------------------------- | |
. | |
. * Subset to two countries of interest. | |
. keep if inlist(country, 89, 96) | |
(9394 observations deleted) | |
. | |
. * Pattern of missing values. | |
. misstable pat sharia age female edu4 empl married haskids city4 | |
Missing-value patterns | |
(1 means complete) | |
| Pattern | |
Percent | 1 2 3 4 5 6 | |
------------+--------------------- | |
94% | 1 1 1 1 1 1 | |
| | |
4 | 1 1 1 1 1 0 | |
<1 | 1 1 1 1 0 1 | |
<1 | 1 1 1 0 1 0 | |
<1 | 1 1 0 1 1 1 | |
<1 | 1 1 1 1 0 0 | |
<1 | 0 1 1 1 1 1 | |
<1 | 1 0 1 1 1 1 | |
<1 | 0 0 0 1 0 1 | |
<1 | 1 0 1 1 0 0 | |
<1 | 1 0 1 1 1 0 | |
<1 | 1 1 0 1 0 1 | |
------------+--------------------- | |
100% | | |
Variables are (1) age (2) edu4 (3) city4 (4) married (5) empl (6) haskids | |
. | |
. * Studying the pattern of missing values is a crucial requirement: dropping | |
. * observations with missing values might affect the representativeness of the | |
. * data, or even reduce the number of observations so much that statistical | |
. * power (the capacity of your sample to detect relationships that actually | |
. * exist) will be at risk. Adopt a reasonable strategy at that stage: find | |
. * substitutes for variables that damage your sample, and adjust your research | |
. * questions to the available data. Whatever choice you end up making, ensure | |
. * that you understand how your finalized dataset relates to the original data | |
. * with regard to representativeness. | |
. | |
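. * Sketch (not part of the original run; -nmiss- and -complete- are hypothetical
. * variable names): counting missing values per observation shows how many
. * respondents each country would lose before any observation is dropped.
. /*
>     egen nmiss = rowmiss(sharia age female edu4 empl married haskids city4)
>     gen byte complete = (nmiss == 0)
>     tab country complete, row
>     drop nmiss complete
> */
.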
. * Subset to nonmissing observations. | |
. drop if mi(sharia, age, female, edu4, empl, married, haskids, city4) | |
(260 observations deleted) | |
. | |
. * The second and last task is to get the final sample size (in each country). | |
. bys country: count | |
------------------------------------------------------------------------------------ | |
-> country = Egypt | |
2918 | |
------------------------------------------------------------------------------------ | |
-> country = Algeria | |
969 | |
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require fre | |
. | |
. * Log results. | |
. cap log using code/week4.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 4 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Social Determinants of Adult Obesity in the United States | |
> | |
> - DATA: U.S. National Health Interview Survey (2009) | |
> | |
> - By now, you should know what dataset and variables you plan to use for your | |
> research project. Please register your project online by adding your names, | |
> keywords, data source and class ID to the student projects table. | |
> | |
> - This week focuses on inspecting the normality of your dependent variable. The | |
> DV should be continuous for best results, or at least pseudo-continuous like | |
> a 10-point scale measurement. | |
> | |
> - Avoid selecting variables with four categories or fewer as your DV, unless you | |
> can learn to interpret logistic regression in just a few weeks at the end of | |
> the course. This requires some math and is for the most adventurous only. | |
> | |
> - Assessing the normality of a variable is first and foremost a visual process. | |
> You will need to visualize your DV a lot at that stage of your work. There is | |
> no systematic way to assess normality, but your decision should take skewness | |
> and kurtosis into account. | |
> | |
> Last updated 2013-02-21. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load NHIS dataset. | |
. use data/nhis2009, clear | |
(U.S. National Health Interview Survey 2009) | |
. | |
. * Subset to most recent year. | |
. drop if year != 2009 | |
(227298 observations deleted) | |
. | |
. | |
. * Dependent variable: Body Mass Index | |
. * ----------------------------------- | |
. | |
. * Compute the Body Mass Index. | |
. gen bmi = weight * 703 / height^2 | |
. la var bmi "Body Mass Index" | |
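.
. * Sketch (not part of the original run): the 703 factor suggests that weight is
. * recorded in pounds and height in inches, which is an assumption about the
. * units of this dataset. Under that assumption, a hypothetical respondent who
. * weighs 150 lb and measures 65 in has a BMI of about 25.
. /*
>     di 150 * 703 / 65^2
> */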
. | |
. * Declare the survey design (PSU, strata and NHIS individual weights), which | |
. * applies to commands run with the -svy- prefix. | |
. svyset psu [pw = perweight], strata(strata) | |
pweight: perweight | |
VCE: linearized | |
Single unit: missing | |
Strata 1: strata | |
SU 1: psu | |
FPC 1: <zero> | |
. | |
. | |
. * Independent variables | |
. * --------------------- | |
. | |
. * Inspect some of the variables. | |
. d sex raceb earnings | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
sex byte %8.0g sex_lbl Sex | |
raceb float %9.0g raceb Race | |
earnings byte %23.0g earnings_lbl | |
Person's total earnings, previous | |
calendar year | |
. | |
. * Low-dimensional, categorical variables. | |
. fre sex | |
sex -- Sex | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 1 Male | 10978 45.19 45.19 45.19 | |
2 Female | 13313 54.81 54.81 100.00 | |
Total | 24291 100.00 100.00 | |
-------------------------------------------------------------- | |
. fre raceb | |
raceb -- Race | |
---------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-------------------+-------------------------------------------- | |
Valid 1 White | 14269 58.74 58.74 58.74 | |
2 Black | 3893 16.03 16.03 74.77 | |
3 Hispanic | 4758 19.59 19.59 94.36 | |
4 Asian | 1371 5.64 5.64 100.00 | |
Total | 24291 100.00 100.00 | |
---------------------------------------------------------------- | |
. fre earnings | |
earnings -- Person's total earnings, previous calendar year | |
--------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------------+-------------------------------------------- | |
Valid 0 NIU | 7683 31.63 31.63 31.63 | |
1 $01 to $4999 | 1081 4.45 4.45 36.08 | |
2 $5000 to $9999 | 923 3.80 3.80 39.88 | |
3 $10000 to $14999 | 1252 5.15 5.15 45.03 | |
4 $15000 to $19999 | 1100 4.53 4.53 49.56 | |
5 $20000 to $24999 | 1235 5.08 5.08 54.65 | |
6 $25000 to $34999 | 2132 8.78 8.78 63.42 | |
7 $35000 to $44999 | 1777 7.32 7.32 70.74 | |
8 $45000 to $54999 | 1397 5.75 5.75 76.49 | |
9 $55000 to $64999 | 885 3.64 3.64 80.13 | |
10 $65000 to $74999 | 603 2.48 2.48 82.61 | |
11 $75000 and over | 1741 7.17 7.17 89.78 | |
97 Unknown-refused | 1292 5.32 5.32 95.10 | |
99 Unknown-don't know | 1190 4.90 4.90 100.00 | |
Total | 24291 100.00 100.00 | |
--------------------------------------------------------------------------- | |
. | |
. * The default -tab- command returns similar results, but shows only the value | |
. * labels, not the numeric codes behind them. | |
. tab sex | |
Sex | Freq. Percent Cum. | |
------------+----------------------------------- | |
Male | 10,978 45.19 45.19 | |
Female | 13,313 54.81 100.00 | |
------------+----------------------------------- | |
Total | 24,291 100.00 | |
. | |
. * High-dimensional, continuous variables. | |
. fre bmi, rows(30) | |
bmi -- Body Mass Index | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 15.20329 | 3 0.01 0.01 0.01 | |
15.5041 | 1 0.00 0.00 0.02 | |
15.6605 | 1 0.00 0.00 0.02 | |
15.96345 | 1 0.00 0.00 0.02 | |
16.13866 | 4 0.02 0.02 0.04 | |
16.44353 | 1 0.00 0.00 0.05 | |
16.46143 | 1 0.00 0.00 0.05 | |
16.60013 | 1 0.00 0.00 0.05 | |
16.62282 | 2 0.01 0.01 0.06 | |
16.63905 | 3 0.01 0.01 0.07 | |
16.72362 | 1 0.00 0.00 0.08 | |
16.80544 | 1 0.00 0.00 0.08 | |
16.91334 | 3 0.01 0.01 0.09 | |
16.94559 | 2 0.01 0.01 0.10 | |
16.97183 | 2 0.01 0.01 0.11 | |
: | : : : : | |
47.4525 | 1 0.00 0.00 99.92 | |
47.66102 | 1 0.00 0.00 99.92 | |
47.79871 | 4 0.02 0.02 99.94 | |
47.84306 | 1 0.00 0.00 99.94 | |
47.86297 | 1 0.00 0.00 99.95 | |
47.98764 | 1 0.00 0.00 99.95 | |
48.42889 | 2 0.01 0.01 99.96 | |
48.46883 | 1 0.00 0.00 99.96 | |
48.55442 | 1 0.00 0.00 99.97 | |
48.70874 | 1 0.00 0.00 99.97 | |
48.81944 | 2 0.01 0.01 99.98 | |
49.40528 | 1 0.00 0.00 99.98 | |
49.60056 | 1 0.00 0.00 99.99 | |
50.38167 | 1 0.00 0.00 99.99 | |
50.48837 | 2 0.01 0.01 100.00 | |
Total | 24291 100.00 100.00 | |
-------------------------------------------------------------- | |
. | |
. | |
. * ================ | |
. * = DISTRIBUTION = | |
. * ================ | |
. | |
. | |
. * Obtain summary statistics: | |
. su bmi | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 24291 27.27 5.134197 15.20329 50.48837 | |
. tabstat bmi, s(n mean sd min max) | |
variable | N mean sd min max | |
-------------+-------------------------------------------------- | |
bmi | 24291 27.27 5.134197 15.20329 50.48837 | |
---------------------------------------------------------------- | |
. | |
. su bmi, d | |
Body Mass Index | |
------------------------------------------------------------- | |
Percentiles Smallest | |
1% 18.30729 15.20329 | |
5% 20.11707 15.20329 | |
10% 21.26276 15.20329 Obs 24291 | |
25% 23.51343 15.5041 Sum of Wgt. 24291 | |
50% 26.57845 Mean 27.27 | |
Largest Std. Dev. 5.134197 | |
75% 30.22843 49.60056 | |
90% 34.32617 50.38167 Variance 26.35998 | |
95% 36.91451 50.48837 Skewness .7207431 | |
99% 41.59763 50.48837 Kurtosis 3.463278 | |
. tabstat bmi, s(p25 median p75 iqr) | |
variable | p25 p50 p75 iqr | |
-------------+---------------------------------------- | |
bmi | 23.51343 26.57845 30.22843 6.715006 | |
------------------------------------------------------ | |
. | |
. * Visualize the distribution: | |
. hist bmi, percent bin(10) | |
(bin=10, start=15.203287, width=3.5285078) | |
. hist bmi, kdensity | |
(bin=43, start=15.203287, width=.82058321) | |
. | |
. * Histogram with normal distribution superimposed. | |
. hist bmi, percent normal /// | |
> name(hist, replace) | |
(bin=43, start=15.203287, width=.82058321) | |
. | |
. * Kernel density. | |
. kdensity bmi, normal legend(row(1)) title("") note("") /// | |
> name(kdens, replace) | |
. | |
. * Box plots. | |
. gr hbox bmi, over(raceb) /// | |
> name(bmi_race, replace) | |
. | |
. gr hbox bmi, over(sex) asyvars over(raceb) /// | |
> name(bmi_race_sex, replace) | |
. | |
. * The next commands use scalars to describe a distribution through its standard | |
. * deviation and outliers. This is a teaching example, not a course requirement. | |
. | |
. * Obtain summary statistics. | |
. su bmi, d | |
Body Mass Index | |
------------------------------------------------------------- | |
Percentiles Smallest | |
1% 18.30729 15.20329 | |
5% 20.11707 15.20329 | |
10% 21.26276 15.20329 Obs 24291 | |
25% 23.51343 15.5041 Sum of Wgt. 24291 | |
50% 26.57845 Mean 27.27 | |
Largest Std. Dev. 5.134197 | |
75% 30.22843 49.60056 | |
90% 34.32617 50.38167 Variance 26.35998 | |
95% 36.91451 50.48837 Skewness .7207431 | |
99% 41.59763 50.48837 Kurtosis 3.463278 | |
. | |
. * To show the results of a command, Stata saves them first to a temporary space | |
. * in its memory, r(). The results of the last command are readable from there: | |
. ret li | |
scalars: | |
r(N) = 24291 | |
r(sum_w) = 24291 | |
r(mean) = 27.26999735254105 | |
r(Var) = 26.35997953797713 | |
r(sd) = 5.134197068478881 | |
r(skewness) = .7207431027835997 | |
r(kurtosis) = 3.46327812408999 | |
r(sum) = 662415.5056905746 | |
r(min) = 15.20328712463379 | |
r(max) = 50.48836517333984 | |
r(p1) = 18.30729103088379 | |
r(p5) = 20.1170654296875 | |
r(p10) = 21.26276016235352 | |
r(p25) = 23.513427734375 | |
r(p50) = 26.57844924926758 | |
r(p75) = 30.22843360900879 | |
r(p90) = 34.326171875 | |
r(p95) = 36.91451263427734 | |
r(p99) = 41.59763336181641 | |
. | |
. * Let's save some of these statistics to scalars, in order to access them later. | |
. * Scalars and macros are programming tools that you do not need to learn in | |
. * order to use Stata as a regular user. However, they are useful for coding | |
. * some teaching examples and demonstrations, as shown below. | |
. | |
. * Save the mean and standard deviation of the summarized variable. | |
. sca de mean = r(mean) | |
. sca de sd = r(sd) | |
. | |
. * Save the 25th and 75th percentiles and compute the interquartile range (IQR), | |
. * which is the range from the first quartile (Q1) to the third quartile (Q3). | |
. sca de q1 = r(p25) | |
. sca de q3 = r(p75) | |
. sca de iqr = q3 - q1 | |
. | |
. * List all saved scalars, which are used in the next sections in combination | |
. * with the -di- command for quick verifications about the distribution of | |
. * the dependent variable (BMI) in the sample. | |
. sca li | |
iqr = 6.7150059 | |
q3 = 30.228434 | |
q1 = 23.513428 | |
sd = 5.1341971 | |
mean = 27.269997 | |
. | |
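. * Sketch (not part of the original run): scalars and variables share the same
. * namespace, and a variable name takes precedence in expressions. The
. * -scalar()- pseudofunction makes it explicit that a scalar is meant, which
. * matters for generic names like -mean- or -sd-. Both lines below should
. * display the same interquartile range.
. /*
>     di scalar(q3) - scalar(q1)
>     di scalar(iqr)
> */
.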
. | |
. * Standard deviation | |
. * ------------------ | |
. | |
. * We can verify what we learnt about the standard deviation by counting the | |
. * number of BMI observations that fall between (mean - 1sd) and (mean + 1sd), | |
. * and then by checking if this number comes close to 68% of all observations. | |
. count if bmi > mean - sd & bmi < mean + sd | |
16847 | |
. di r(N), "observations out of", _N, "(" 100 * round(r(N) / _N, .01) /// | |
> "% of the sample) are within one standard deviation from the mean." | |
16847 observations out of 24291 (69% of the sample) are within one standard deviatio | |
> n from the mean. | |
. | |
. * The corresponding result is indeed close to 68% of all observations, and the | |
. * same verification with the [mean - 2sd, mean + 2sd] range of BMI values is | |
. * also satisfactorily close to including 95% of all observations. | |
. count if bmi > mean - 2 * sd & bmi < mean + 2 * sd | |
23219 | |
. di r(N), "observations out of", _N, "(" 100 * round(r(N) / _N, .01) /// | |
> "% of the sample) are within 2 standard deviations from the mean." | |
23219 observations out of 24291 (96% of the sample) are within 2 standard deviations | |
> from the mean. | |
. | |
. * The properties shown here hold for continuous variables that approach a | |
. * normal distribution, as discussed below. We could go further and compute | |
. * the [mean - 3sd, mean + 3sd] range, but the most extreme values of a | |
. * distribution are more conveniently captured by the notion of outliers, | |
. * i.e. observations that fall far from the median. | |
. | |
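. * Sketch (not part of the original run): the same check for three standard
. * deviations, reusing the -mean- and -sd- scalars saved earlier. Under a normal
. * distribution, roughly 99.7% of observations fall within that range.
. /*
>     count if bmi > mean - 3 * sd & bmi < mean + 3 * sd
>     di 100 * r(N) / _N
> */
.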
. | |
. * Outliers | |
. * -------- | |
. | |
. * Summarize mild (1.5 IQR) or extreme (3 IQR) outliers below Q1 and above Q3: | |
. su bmi if bmi < q1 - 1.5 * iqr | bmi > q3 + 1.5 * iqr | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 421 42.63485 2.153885 40.31065 50.48837 | |
. su bmi if bmi < q1 - 3 * iqr | bmi > q3 + 3 * iqr | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 3 50.4528 .0616016 50.38167 50.48837 | |
. | |
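. * Sketch (not part of the original run): displaying the outlier fences makes
. * the thresholds used above explicit. With the scalars listed earlier, the
. * lower mild fence is about 13.4, below the smallest BMI in the sample (15.2),
. * so all mild outliers found here are high values.
. /*
>     di "Mild fences:    " q1 - 1.5 * iqr " to " q3 + 1.5 * iqr
>     di "Extreme fences: " q1 - 3 * iqr " to " q3 + 3 * iqr
>     count if bmi > q3 + 1.5 * iqr
> */
.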
. | |
. * ============= | |
. * = NORMALITY = | |
. * ============= | |
. | |
. | |
. * Many continuous variables approach a normal distribution, and the shape of | |
. * a distribution is easier to assess in large samples. Let's check if the | |
. * distribution of BMI values approaches normality, and if not, let's transform | |
. * the variable to bring it closer to normality. We start with visual inspection | |
. * and complete the assessment with two statistical measures. | |
. | |
. | |
. * Visual assessment | |
. * ----------------- | |
. | |
. * We draw a histogram with three different elements: the actual bins (bars) | |
. * of the BMI variable, a superimposed normal curve, and its kernel density, | |
. * which we draw as a dashed black line using a few graph options. | |
. hist bmi, bin(15) normal kdensity kdenopts(lp(dash) lc(black) bw(1.5)) /// | |
> note("Normal distribution (solid red) and kernel density (dashed black).") | |
> /// | |
> name(bmi, replace) | |
(bin=15, start=15.203287, width=2.3523385) | |
. | |
. * The histogram shows what we knew from comparing the mean and median of the | |
. * BMI values: the distribution is skewed to the right (the mean exceeds the | |
. * median), implying that there are more observations below the mean of the | |
. * distribution than above it. | |
. | |
. * As a result, the distribution is asymmetrical, which we can verify using a | |
. * particular graphical technique that emphasizes deviations from symmetry. | |
. * Perfect symmetry corresponds to the straight red line. | |
. symplot bmi, ti("Symmetry plot") /// | |
> name(bmi_sym, replace) | |
. | |
. * Another visualization plots the quantiles of the variable against those of the | |
. * normal distribution. Perfect correspondence between the two distributions is | |
. * observed at the straight red line. | |
. qnorm bmi, ti("Normal quantile plot") /// | |
> name(bmi_qnorm, replace) | |
. | |
. * The departures observed here are situated at the tails of the distribution, | |
. * which means that there is an excess of observations at these values. | |
. | |
. | |
. * Formal assessment | |
. * ----------------- | |
. | |
. * Moving to statistical measures of normality, we can measure skewness, which | |
. * measures symmetry and approaches 0 in quasi-normal distributions, along with | |
. * kurtosis, which measures the size of the distribution tails and approaches 3 | |
. * in quasi-normal distributions. Use the -summarize- command with the -detail- | |
. * option, respectively abbreviated as -su- and -d-. | |
. su bmi, d | |
Body Mass Index | |
------------------------------------------------------------- | |
Percentiles Smallest | |
1% 18.30729 15.20329 | |
5% 20.11707 15.20329 | |
10% 21.26276 15.20329 Obs 24291 | |
25% 23.51343 15.5041 Sum of Wgt. 24291 | |
50% 26.57845 Mean 27.27 | |
Largest Std. Dev. 5.134197 | |
75% 30.22843 49.60056 | |
90% 34.32617 50.38167 Variance 26.35998 | |
95% 36.91451 50.48837 Skewness .7207431 | |
99% 41.59763 50.48837 Kurtosis 3.463278 | |
. | |
. * There are formal tests of normality, but the measures above are sufficient | |
. * to observe that we cannot assume the BMI variable to be normally distributed | |
. * (i.e. we reject our distributional assumption). | |
. | |
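. * Sketch (not part of the original run): -sktest- is Stata's skewness and
. * kurtosis test for normality. In a sample of this size, even small departures
. * from normality are likely to come out as significant, so the result is best
. * read alongside the graphs rather than on its own.
. /*
>     sktest bmi
> */
.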
. | |
. * Variable transformation | |
. * ----------------------- | |
. | |
. * A technique used to approach normality with a continuous variable consists | |
. * in 'transforming' the variable with a mathematical operator that modifies | |
. * its basic unit of measurement. We learnt that the distribution of BMI for | |
. * its standard unit measurement is not normal, but perhaps the distribution | |
. * of the same values is closer to normality if we take a different measure. | |
. | |
. * The -gladder- command visualizes several common transformations all at once. | |
. gladder bmi, /// | |
> name(gladder, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
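. * Sketch (not part of the original run): -ladder- is the numeric counterpart
. * of -gladder-. It reports a normality test for each transformation on the
. * ladder of powers, which helps when several histograms look equally plausible.
. /*
>     ladder bmi
> */
.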
. * The logarithm transformation appears to approximate a normal distribution. | |
. * We transform the variable accordingly. | |
. gen logbmi = ln(bmi) | |
. la var logbmi "Body Mass Index (log units)" | |
. | |
. * Looking at skewness and kurtosis for the logged variable. | |
. tabstat bmi logbmi, s(n sk kurtosis min max) c(s) | |
variable | N skewness kurtosis min max | |
-------------+-------------------------------------------------- | |
bmi | 24291 .7207431 3.463278 15.20329 50.48837 | |
logbmi | 24291 .2346392 2.762445 2.721512 3.921743 | |
---------------------------------------------------------------- | |
. | |
. * The log-BMI histogram shows some improvement towards normality. | |
. hist logbmi, normal /// | |
> name(logbmi, replace) | |
(bin=43, start=2.7215116, width=.02791236) | |
(note: scheme burd not found, using s2color) | |
. | |
. | |
. * Comparison plot | |
. * --------------- | |
. | |
. * Running the same graphs with a few options to combine them allows a quick | |
. * visual comparison of the transformation. | |
. | |
. * Part 1/4. | |
. hist bmi, norm xti("") ysc(off) ti("Untransformed (metric)") bin(21) /// | |
> name(bmi1, replace) | |
(bin=21, start=15.203287, width=1.6802418) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Part 2/4. | |
. gr hbox bmi, fysize(25) /// | |
> name(bmi2, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Part 3/4. | |
. hist logbmi, norm xti("") ysc(off) ti("Transformed (logged)") bin(21) /// | |
> name(bmi3, replace) | |
(bin=21, start=2.7215116, width=.05715387) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Part 4/4. | |
. gr hbox logbmi, fysize(25) /// | |
> name(bmi4, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Final combined graph. | |
. gr combine bmi1 bmi3 bmi2 bmi4, imargin(small) ysize(3) col(2) /// | |
> name(bmi_comparison, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Drop individual pieces. | |
. gr drop bmi1 bmi2 bmi3 bmi4 | |
. gr di bmi_comparison | |
. | |
. | |
. * ================== | |
. * = SAMPLING ERROR = | |
. * ================== | |
. | |
. | |
. * Sort the data by order of survey collection. | |
. sort serial | |
. | |
. * Now here's a simple issue: if we subsample our data, the average BMI will not | |
. * necessarily reflect the sample mean. | |
. su bmi in 1/10 | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 10 29.10093 3.84851 21.78719 35.42366 | |
. | |
. * The problem applies to our entire sample: how can we confirm that it reflects | |
. * the true population mean? We cannot, but we can take a precautionary measure, | |
. * following the assumption that the data follow a somewhat normal distribution. | |
. | |
. | |
. * Confidence intervals with means | |
. * ------------------------------- | |
. | |
. * The confidence interval reflects the standard error of the mean (SEM), which | |
. * depends on the standard deviation and the sample size. We will come back to | |
. * the SEM equation next week. | |
. | |
. * Mean BMI for the full sample with a 95% CI. | |
. ci bmi | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 24291 27.27 .032942 27.20543 27.33457 | |
. | |
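. * Sketch (not part of the original run; -se- is a hypothetical scalar name):
. * the interval above can be rebuilt by hand from the -mean- and -sd- scalars
. * saved earlier, taking the SEM as sd / sqrt(n) and using a t-distribution
. * critical value. This relies on BMI having no missing values, so that _N is
. * the number of observations. The results should match the standard error and
. * bounds reported by -ci-.
. /*
>     sca de se = sd / sqrt(_N)
>     di se
>     di mean - invttail(_N - 1, .025) * se
>     di mean + invttail(_N - 1, .025) * se
> */
.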
. * Mean BMI for the full sample with a 99% CI (more confidence, less precision). | |
. ci bmi, level(99) | |
Variable | Obs Mean Std. Err. [99% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 24291 27.27 .032942 27.18514 27.35486 | |
. | |
. * Mean BMI for the full sample with survey weights (better representativeness). | |
. svy: mean bmi | |
(running mean on estimation sample) | |
Survey: Mean estimation | |
Number of strata = 300 Number of obs = 24291 | |
Number of PSUs = 600 Population size = 88553487 | |
Design df = 300 | |
-------------------------------------------------------------- | |
| Linearized | |
| Mean Std. Err. [95% Conf. Interval] | |
-------------+------------------------------------------------ | |
bmi | 27.17653 .0430405 27.09183 27.26123 | |
-------------------------------------------------------------- | |
. | |
. * The confidence intervals for the full sample show a high precision, both at | |
. * the 95% (alpha = 0.05) and 99% (alpha = 0.01) levels. This is due to the high | |
. * number of observations provided for the BMI variable. | |
. | |
. * If we compute the average BMI for subsamples of the population, such as one | |
. * category of the population, the total number of observations will drop and | |
. * the confidence interval will widen, as shown here with smaller subsamples: | |
. ci bmi in 1/10 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 10 29.10093 1.217006 26.34787 31.85399 | |
. ci bmi in 1/100 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 100 28.42492 .5618381 27.31011 29.53973 | |
. ci bmi in 1/1000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1000 27.28449 .1658582 26.95902 27.60996 | |
. ci bmi in 1/10000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 10000 27.24431 .0513402 27.14367 27.34495 | |
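.
. * A minimal sketch, not part of the original run: since the standard error is
. * sd / sqrt(n), multiplying n by 100 should divide it by roughly 10 as long as
. * the standard deviation stays stable. The loop recomputes it for each subsample.

```stata
* Standard error of mean BMI in nested subsamples of growing size.
foreach n of numlist 10 100 1000 10000 {
    quietly summarize bmi in 1/`n'
    display "n = " %5.0f r(N) "   SE = " %6.4f r(sd) / sqrt(r(N))
}
```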
. | |
. * Confidence bands can become useful to detect spurious relationships. Let's | |
. * take a look, for instance, at the number of years spent in the U.S. | |
. fre yrsinus | |
yrsinus -- Number of years spent in the U.S. | |
----------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------------------------------+-------------------------------------------- | |
Valid 0 NIU | 19456 80.10 80.10 80.10 | |
1 Less than 1 year | 63 0.26 0.26 80.35 | |
2 1 year to less than 5 years | 480 1.98 1.98 82.33 | |
3 5 years to less than 10 | 723 2.98 2.98 85.31 | |
years | | |
4 10 years to less than 15 | 657 2.70 2.70 88.01 | |
years | | |
5 15 years or more | 2912 11.99 11.99 100.00 | |
Total | 24291 100.00 100.00 | |
----------------------------------------------------------------------------------- | |
. replace yrsinus = . if yrsinus == 0 | |
(19456 real changes made, 19456 to missing) | |
. | |
. * We know from previous analysis that BMI varies by gender and ethnicity. | |
. * We now look for the effect of the number of years spent in the U.S. within | |
. * each gender and ethnic category. | |
. gr dot bmi, over(sex) over(yrsinus) over(raceb) asyvars scale(.7) /// | |
> ti("Body Mass Index by age, sex, race and number of years in the U.S.") /// | |
> yti("Mean BMI") /// | |
> name(bmi_sex_yrs, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * The average BMI of Blacks who spent less than one year in the U.S. shows | |
. * a striking difference between males and females, but this category holds | |
. * so few observations that the difference should not be interpreted. | |
. bys sex: ci bmi if raceb == 2 & yrsinus == 1 | |
------------------------------------------------------------------------------------ | |
-> sex = Male | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 2 21.01948 2.712192 -13.44218 55.48114 | |
------------------------------------------------------------------------------------ | |
-> sex = Female | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1 30.34068 . . . | |
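.
. * A minimal sketch, not part of the original run: with only two observations,
. * the interval relies on a t distribution with one degree of freedom, whose
. * critical value is very large. This is why the male interval above spans
. * negative values. The sample sizes below are illustrative.

```stata
* Two-sided 95% critical values of the t distribution (run from a do-file).
display invttail(1, .025)     // df = 1, i.e. n = 2: about 12.71
display invttail(9, .025)     // df = 9, i.e. n = 10: about 2.26
display invttail(999, .025)   // df = 999, i.e. n = 1000: about 1.96
```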
. | |
. * Similarly, the seemingly clean pattern among male and female Asians is | |
. * calculated on a low number of observations and requires verification of | |
. * the confidence intervals. The pattern appears to be rather robust. | |
. bys yrsinus: ci bmi if raceb == 4 | |
------------------------------------------------------------------------------------ | |
-> yrsinus = Less than 1 year | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 18 22.80192 .5404272 21.66172 23.94212 | |
------------------------------------------------------------------------------------ | |
-> yrsinus = 1 year to less than 5 years | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 161 23.10644 .2474861 22.61768 23.5952 | |
------------------------------------------------------------------------------------ | |
-> yrsinus = 5 years to less than 10 years | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 142 23.40691 .308459 22.79711 24.01672 | |
------------------------------------------------------------------------------------ | |
-> yrsinus = 10 years to less than 15 years | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 123 23.54123 .3113926 22.9248 24.15767 | |
------------------------------------------------------------------------------------ | |
-> yrsinus = 15 years or more | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 591 24.60317 .1529371 24.30281 24.90354 | |
------------------------------------------------------------------------------------ | |
-> yrsinus = . | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 336 25.17433 .2357523 24.71058 25.63807 | |
. | |
. | |
. * Confidence intervals with proportions | |
. * ------------------------------------- | |
. | |
. * A few things about confidence intervals with proportions, for which confidence | |
. * bands follow a different method of calculation. Basically, categorical data are | |
. * just dummies for a set of categories, and the distribution of binary data | |
. * can hardly be normal. The binomial distribution applies instead. | |
. ci sex, binomial | |
-- Binomial Exact -- | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
. | |
. * Categorical variables, which can be described through proportions, also | |
. * come with confidence intervals that reflect the range of values that each | |
. * category might take in the true population. The proportions of ethnic groups | |
. * in the U.S., for instance, are somewhere in these intervals: | |
. prop raceb | |
Proportion estimation Number of obs = 24291 | |
-------------------------------------------------------------- | |
| Proportion Std. Err. [95% Conf. Interval] | |
-------------+------------------------------------------------ | |
raceb | | |
White | .5874192 .0031587 .5812279 .5936105 | |
Black | .1602651 .0023538 .1556514 .1648788 | |
Hispanic | .195875 .0025465 .1908838 .2008662 | |
Asian | .0564407 .0014807 .0535384 .0593429 | |
-------------------------------------------------------------- | |
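.
. * A minimal sketch, not part of the original run: the standard error of a
. * proportion is approximately sqrt(p * (1 - p) / n). The values are copied from
. * the White row above; the scalar names are illustrative, and the normal
. * approximation used here may differ from the -prop- bounds in the last digits.

```stata
* Standard error and approximate 95% CI for the proportion of White respondents.
scalar p_white  = .5874192
scalar n_obs    = 24291
scalar se_white = sqrt(p_white * (1 - p_white) / n_obs)

display "SE = " se_white
display "95% CI: " p_white - invnormal(.975) * se_white ///
    " to " p_white + invnormal(.975) * se_white
```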
. | |
. * Actually, if you want to be completely correct, you need to weight the data | |
. * with the svy: prefix to use the weight settings specified earlier. This will | |
. * have a substantial effect on your estimates in this case, shifting the proportion | |
. * of White respondents from roughly 60% to roughly 70% of all U.S. adults, the | |
. * reason being that other racial-ethnic groups are oversampled in NHIS data. | |
. svy: prop raceb | |
(running proportion on estimation sample) | |
Survey: Proportion estimation | |
Number of strata = 300 Number of obs = 24291 | |
Number of PSUs = 600 Population size = 88553487 | |
Design df = 300 | |
-------------------------------------------------------------- | |
| Linearized | |
| Proportion Std. Err. [95% Conf. Interval] | |
-------------+------------------------------------------------ | |
raceb | | |
White | .7101575 .0049232 .7004691 .719846 | |
Black | .1264287 .0035906 .1193627 .1334947 | |
Hispanic | .1253064 .0030643 .1192762 .1313367 | |
Asian | .0381073 .0015406 .0350757 .041139 | |
-------------------------------------------------------------- | |
. | |
. * As with continuous variables, confidence intervals for categorical data | |
. * widen when the total number of observations decreases. The 95% CI for | |
. * ethnicity among morbidly obese respondents illustrates that issue. | |
. prop raceb if bmi > 40 | |
Proportion estimation Number of obs = 464 | |
-------------------------------------------------------------- | |
| Proportion Std. Err. [95% Conf. Interval] | |
-------------+------------------------------------------------ | |
raceb | | |
White | .5344828 .0231816 .4889285 .580037 | |
Black | .2586207 .0203499 .2186312 .2986102 | |
Hispanic | .2047414 .0187528 .1678902 .2415926 | |
Asian | .0021552 .0021552 -.00208 .0063903 | |
-------------------------------------------------------------- | |
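.
. * A minimal sketch, not part of the original run: the negative lower bound for
. * Asian respondents comes from the normal approximation, since a proportion
. * cannot fall below zero. The Asian row corresponds to 1 respondent out of 464
. * (.0021552 * 464), so an exact binomial interval can be requested instead.

```stata
* Exact binomial 95% CI for 1 success out of 464 observations.
cii 464 1                   // immediate syntax up to Stata 13, the era of this log
* cii proportions 464 1     // corresponding syntax in Stata 14 and later
```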
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require fre scheme-burd spineplot | |
. | |
. * Log results. | |
. cap log using code/week5.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 5 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Social Determinants of Adult Obesity in the United States | |
> | |
> - DATA: U.S. National Health Interview Survey (2009) | |
> | |
> We study variations in the Body Mass Index (BMI) of insured and uninsured | |
> American adults, in order to show how differences observed between racial | |
> backgrounds echo socioeconomic inequalities in education and health care. | |
> | |
> - (H1): We first expect to observe larger numbers of overweight and obese | |
> respondents among non-White males and among older age groups. | |
> | |
> - (H2): We then expect education to be negatively associated with obesity, as | |
> higher attainment indicates access to prevention and higher income. | |
> | |
> - (H3): We finally expect health insurance coverage to limit health consumption | |
> in poorer households, possibly affecting BMI across the life course. | |
> | |
> Our data come from the most recent year of the National Health Interview | |
> Survey (NHIS). The sample used in the analysis contains N = 21,770 individuals | |
> selected through state-level stratified probability sampling. | |
> | |
> - The lines above are a quick example of what you should be planning to write | |
> up in your first draft: a description of your data, followed by a list of | |
> clearly worded and substantively informed hypotheses. | |
> | |
> - Please make sure that your do-file is named like 'Briatte_Petev_1.do' (use | |
> your own family names, in alphabetical order). Name your paper the same way | |
> and print it to PDF format: do not circulate your work in editable formats. | |
> | |
> - To simplify your workflow, the course uses a paper template that you will | |
> share with your research partner(s) using Google Documents. This template | |
> contains more instructions on the first draft. | |
> | |
> - Your first draft must inform the reader about simple things: What is your | |
> research question? Where does your data come from, how large is the sample | |
> and how was it designed? Include references to the data source and codebook. | |
> | |
> - Your paper also explains what choice of variables you have made, and with | |
> what theory to support that choice. You have to substantiate your decisions: | |
> providing a mere description of the measurements is insufficient. | |
> | |
> - In line with that idea, do NOT write your paper as a technical summary of | |
> what your code accomplishes: refer to variables not by names but by what they | |
> actually measure, and explain how they fit in your general reasoning. | |
> | |
> - Remember that you have been provided with example papers: use them to learn | |
> about the writing style and scientific tone to adopt in your own work. This | |
> requirement is covered at more length in the rest of the course material. | |
> | |
> - Your first do-file can imitate the course do-files in its structure. Your | |
> code should assess DV normality and explore differences in the DV with graphs | |
> and confidence intervals over categorical IVs. Analyze results in your paper. | |
> | |
> - Importantly, do NOT produce results in your do-file if you are not going to | |
> interpret them at a later stage: produce meaningful code that leads you to | |
> learn, understand and analyze the data. | |
> | |
> - Use the -stab- command at the end of this do-file to export summary stats | |
> to a simple table. The result will be a plain text file that you can copy | |
> and paste into Google Documents, or import into any other text editor. | |
> | |
> Last updated 2012-11-13. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load NHIS data. | |
. use data/nhis2009, clear | |
(U.S. National Health Interview Survey 2009) | |
. | |
. * Individual survey weights. | |
. svyset psu [pw = perweight], strata(strata) | |
pweight: perweight | |
VCE: linearized | |
Single unit: missing | |
Strata 1: strata | |
SU 1: psu | |
FPC 1: <zero> | |
. | |
. | |
. * Dependent variable: Body Mass Index | |
. * ----------------------------------- | |
. | |
. gen bmi = weight * 703 / height^2 | |
. la var bmi "Body Mass Index" | |
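.
. * A minimal sketch, not part of the original run: the factor 703 converts
. * pounds and inches to the metric BMI scale (kg/m^2), so the formula assumes
. * that weight is in pounds and height in inches. The respondent is hypothetical.

```stata
* BMI of a hypothetical respondent weighing 150 pounds and measuring 65 inches.
display 150 * 703 / 65^2    // roughly 24.96
```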
. | |
. * Detailed summary statistics. | |
. su bmi, d | |
Body Mass Index | |
------------------------------------------------------------- | |
Percentiles Smallest | |
1% 18.30296 14.63388 | |
5% 19.96686 14.92082 | |
10% 21.03148 15.05125 Obs 251589 | |
25% 23.22465 15.06112 Sum of Wgt. 251589 | |
50% 26.07836 Mean 26.8551 | |
Largest Std. Dev. 5.001464 | |
75% 29.75496 51.49813 | |
90% 33.71531 51.70008 Variance 25.01464 | |
95% 36.32167 51.90204 Skewness .7805844 | |
99% 41.19141 52.10399 Kurtosis 3.619894 | |
. | |
. | |
. * Breakdowns | |
. * ---------- | |
. | |
. * Recoding BMI to 6 groups (best method: cutting the data to intervals). | |
. gen bmi6:bmi6 = irecode(bmi, 0, 18.5, 25, 30, 35, 40, .) | |
. la var bmi6 "Body Mass Index (categories)" | |
. | |
. * Define the category labels. | |
. la def bmi6 /// | |
> 1 "Underweight" 2 "Normal" 3 "Overweight" /// | |
> 4 "Obese" 5 "Severely obese" 6 "Morbidly obese", replace | |
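.
. * A minimal sketch, not part of the original run: -irecode- returns the index
. * of the interval in which a value falls, with inclusive upper bounds. The BMI
. * values are illustrative, and the trailing missing-value cutpoint is dropped
. * here for simplicity.

```stata
* Interval index returned for a few illustrative BMI values (run from a do-file).
display irecode(17, 0, 18.5, 25, 30, 35, 40)   // 1: underweight
display irecode(25, 0, 18.5, 25, 30, 35, 40)   // 2: exactly 25 still counts as normal
display irecode(27, 0, 18.5, 25, 30, 35, 40)   // 3: overweight
display irecode(45, 0, 18.5, 25, 30, 35, 40)   // 6: above the last cutpoint
```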
. | |
. * Breakdown of mean BMI by groups. | |
. d bmi bmi6 | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
bmi float %9.0g Body Mass Index | |
bmi6 float %14.0g bmi6 Body Mass Index (categories) | |
. tab bmi6, su(bmi) | |
Body Mass | | |
Index | | |
(categories | Summary of Body Mass Index | |
) | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
Underweig | 17.758399 .62717744 3100 | |
Normal | 22.444485 1.6585607 97083 | |
Overweigh | 27.24205 1.436102 92316 | |
Obese | 32.126483 1.4160294 41238 | |
Severely | 37.045664 1.3697775 13874 | |
Morbidly | 42.417655 2.0628916 3978 | |
------------+------------------------------------ | |
Total | 26.855097 5.0014639 251589 | |
. | |
. * Progression of BMI groups over years. | |
. spineplot bmi6 year, scheme(burd6) /// | |
> name(bmi6, replace) | |
. | |
. * Breakdown of BMI to percentiles. | |
. xtile bmi_qt = bmi, nq(100) | |
. | |
. * Verify the BMI of, e.g., the 90th percentile group. | |
. su bmi if bmi_qt == 90 | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
bmi | 2510 33.52061 .1231341 33.2846 33.71531 | |
. | |
. * Compute the mean BMI for each percentile. | |
. bys bmi_qt: egen bmi_qm = mean(bmi) | |
. | |
. * Plot mean BMI by percentile (the quantile function, or inverse ECDF, of BMI). | |
. sc bmi_qm bmi_qt, m(o) c(l) xla(0(10)100) /// | |
> yti("Body Mass Index") xti("Percentiles") /// | |
> name(bmi_ecdf, replace) | |
. | |
. | |
. * Independent variables | |
. * --------------------- | |
. | |
. fre age sex raceb educrec1 earnings health uninsured ybarcare, r(10) | |
age -- Age | |
----------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------+-------------------------------------------- | |
Valid 18 18 | 2993 1.19 1.19 1.19 | |
19 19 | 3487 1.39 1.39 2.58 | |
20 20 | 3749 1.49 1.49 4.07 | |
21 21 | 3953 1.57 1.57 5.64 | |
22 22 | 4102 1.63 1.63 7.27 | |
: | : : : : | |
80 80 | 1845 0.73 0.73 97.62 | |
81 81 | 1717 0.68 0.68 98.31 | |
82 82 | 1571 0.62 0.62 98.93 | |
83 83 | 1390 0.55 0.55 99.48 | |
84 84 | 1298 0.52 0.52 100.00 | |
Total | 251589 100.00 100.00 | |
----------------------------------------------------------- | |
sex -- Sex | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 1 Male | 113182 44.99 44.99 44.99 | |
2 Female | 138407 55.01 55.01 100.00 | |
Total | 251589 100.00 100.00 | |
-------------------------------------------------------------- | |
raceb -- Race | |
---------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-------------------+-------------------------------------------- | |
Valid 1 White | 160581 63.83 63.83 63.83 | |
2 Black | 36030 14.32 14.32 78.15 | |
3 Hispanic | 45842 18.22 18.22 96.37 | |
4 Asian | 9136 3.63 3.63 100.00 | |
Total | 251589 100.00 100.00 | |
---------------------------------------------------------------- | |
educrec1 -- Educational attainment recode, nonintervalled | |
---------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-------------------------------------+-------------------------------------------- | |
Valid 13 Grade 12 | 117721 46.79 46.79 46.79 | |
14 1 to 3 years of college | 72298 28.74 28.74 75.53 | |
15 4 years of | 40548 16.12 16.12 91.64 | |
college/Bachelor's degree | | |
16 5+ years of college | 21022 8.36 8.36 100.00 | |
Total | 251589 100.00 100.00 | |
---------------------------------------------------------------------------------- | |
earnings -- Person's total earnings, previous calendar year | |
-------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------------------------+-------------------------------------------- | |
Valid 0 NIU | 77313 30.73 30.73 30.73 | |
1 $01 to $4999 | 11154 4.43 4.43 35.16 | |
2 $5000 to $9999 | 10468 4.16 4.16 39.32 | |
3 $10000 to $14999 | 13077 5.20 5.20 44.52 | |
4 $15000 to $19999 | 12189 4.84 4.84 49.37 | |
: | : : : : | |
10 $65000 to $74999 | 4992 1.98 1.98 80.75 | |
11 $75000 and over | 12179 4.84 4.84 85.59 | |
97 Unknown-refused | 21877 8.70 8.70 94.29 | |
98 Unknown-not ascertained | 24 0.01 0.01 94.29 | |
99 Unknown-don't know | 14354 5.71 5.71 100.00 | |
Total | 251589 100.00 100.00 | |
-------------------------------------------------------------------------------- | |
health -- Health status | |
----------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------------+-------------------------------------------- | |
Valid 1 Excellent | 73004 29.02 29.04 29.04 | |
2 Very Good | 80816 32.12 32.14 61.18 | |
3 Good | 65089 25.87 25.89 87.07 | |
4 Fair | 24564 9.76 9.77 96.84 | |
5 Poor | 7951 3.16 3.16 100.00 | |
Total | 251424 99.93 100.00 | |
Missing . | 165 0.07 | |
Total | 251589 100.00 | |
----------------------------------------------------------------- | |
uninsured -- Health Insurance coverage status | |
-------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------------------+-------------------------------------------- | |
Valid 1 Not covered | 43206 17.17 17.17 17.17 | |
2 Covered | 207537 82.49 82.49 99.66 | |
9 Unknown-don't know | 846 0.34 0.34 100.00 | |
Total | 251589 100.00 100.00 | |
-------------------------------------------------------------------------- | |
ybarcare -- Needed but couldn't afford medical care, past 12 months | |
-------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------------------+-------------------------------------------- | |
Valid 1 No | 231191 91.89 91.89 91.89 | |
2 Yes | 20246 8.05 8.05 99.94 | |
7 Unknown-refused | 52 0.02 0.02 99.96 | |
9 Unknown-don't know | 100 0.04 0.04 100.00 | |
Total | 251589 100.00 100.00 | |
-------------------------------------------------------------------------- | |
. | |
. * Recode age to four groups (slow and risky method: using manual categories). | |
. recode age /// | |
> (18/44 = 1 "18-44") /// | |
> (45/64 = 2 "45-64") /// | |
> (65/74 = 3 "65-74") /// | |
> (75/max = 4 "75+") (else = .), gen(age4) | |
(251589 differences between age and age4) | |
. la var age4 "Age groups (4)" | |
. | |
. * Recode age to eight groups (nifty method: using decades, 10-19, 20-29, etc.). | |
. gen age8 = 10 * floor(age / 10) if !mi(age) | |
. la var age8 "Age groups (8)" | |
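.
. * A minimal sketch, not part of the original run: dividing by 10, rounding
. * down and multiplying back by 10 returns the lower bound of each decade.
. * The ages are illustrative.

```stata
* Decade assigned to a few illustrative ages (run from a do-file).
display 10 * floor(18 / 10)   // 10
display 10 * floor(37 / 10)   // 30
display 10 * floor(84 / 10)   // 80
```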
. | |
. * Recode sex to dummy. | |
. gen female:female = (sex == 2) if !mi(sex) | |
. la def female 0 "Male" 1 "Female", replace | |
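.
. * A minimal sketch, not part of the original run: a logical expression returns
. * 1 when true and 0 when false, including when the value tested is missing.

```stata
* Results of the logical test used to build the dummy (run from a do-file).
display (2 == 2)   // 1
display (1 == 2)   // 0
display (. == 2)   // 0, not missing: hence the -if !mi(sex)- qualifier
```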
. | |
. * Recode missing values of income. | |
. replace earnings = . if inlist(earnings, 97, 99) | |
(36231 real changes made, 36231 to missing) | |
. | |
. * Recode missing values of insurance and medical care. | |
. mvdecode ybarcare uninsured, mv(9) | |
ybarcare: 100 missing values generated | |
uninsured: 846 missing values generated | |
. | |
. | |
. * Subsetting | |
. * ---------- | |
. | |
. * Select observations from most recent year. | |
. keep if year == 2009 | |
(227298 observations deleted) | |
. | |
. * Patterns of missing values. | |
. misstable pat bmi age female raceb educrec1 earnings health uninsured ybarcare | |
Missing-value patterns | |
(1 means complete) | |
| Pattern | |
Percent | 1 2 3 4 | |
------------+------------- | |
90% | 1 1 1 1 | |
| | |
10 | 1 1 1 0 | |
<1 | 1 1 0 1 | |
<1 | 1 1 0 0 | |
<1 | 0 1 1 1 | |
<1 | 1 0 1 0 | |
<1 | 1 0 1 1 | |
<1 | 1 0 0 1 | |
------------+------------- | |
100% | | |
Variables are (1) ybarcare (2) health (3) uninsured (4) earnings | |
. | |
. * Delete incomplete observations. | |
. drop if mi(bmi, age, female, raceb, educrec1, earnings, uninsured, ybarcare) | |
(2521 observations deleted) | |
. | |
. * Final data, showing final sample size. | |
. codebook bmi age female raceb educrec1 earnings health uninsured ybarcare, c | |
Variable Obs Unique Mean Min Max Label | |
------------------------------------------------------------------------------------ | |
bmi 21770 2091 27.32691 15.20329 50.48837 Body Mass Index | |
age 21770 67 47.13036 18 84 Age | |
female 21770 2 .5550299 0 1 | |
raceb 21770 4 1.708406 1 4 Race | |
educrec1 21770 4 13.91746 13 16 Educational attainment rec... | |
earnings 21770 12 3.984153 0 11 Person's total earnings, p... | |
health 21767 5 2.312675 1 5 Health status | |
uninsured 21770 2 1.817685 1 2 Health Insurance coverage ... | |
ybarcare 21770 2 1.103904 1 2 Needed but couldn't afford... | |
------------------------------------------------------------------------------------ | |
. | |
. | |
. * Normality | |
. * --------- | |
. | |
. hist bmi, bin(20) normal normopts(lp(dash)) /// | |
> kdensity kdenopts(k(biweight) bw(3) lc(black)) /// | |
> name(dv, replace) | |
(bin=20, start=15.203287, width=1.7642539) | |
. | |
. * Transformations (add 'g' to make the command -gladder- for a graphical check). | |
. ladder bmi | |
Transformation formula chi2(2) P(chi2) | |
------------------------------------------------------------------ | |
cubic bmi^3 . . | |
square bmi^2 . . | |
identity bmi . . | |
square root sqrt(bmi) . 0.000 | |
log log(bmi) . 0.000 | |
1/(square root) 1/sqrt(bmi) . 0.000 | |
inverse 1/bmi . 0.000 | |
1/square 1/(bmi^2) . . | |
1/cubic 1/(bmi^3) . . | |
. | |
. * Log-BMI transformation. | |
. gen logbmi = ln(bmi) | |
. la var logbmi "log(BMI)" | |
. | |
. * Inspect improvement in normality. | |
. tabstat bmi logbmi, s(skewness kurtosis) c(s) | |
variable | skewness kurtosis | |
-------------+-------------------- | |
bmi | .7112319 3.426024 | |
logbmi | .2275015 2.748605 | |
---------------------------------- | |
. | |
. | |
. * ======================== | |
. * = CONFIDENCE INTERVALS = | |
. * ======================== | |
. | |
. | |
. * IV: Age | |
. * ------- | |
. | |
. * Plot BMI groups for each age decade. | |
. spineplot bmi6 age8, scheme(burd6) /// | |
> name(age, replace) | |
. | |
. * 95% CI estimates: | |
. tab age4, su(bmi) // mean BMI in each age group | |
Age groups | Summary of Body Mass Index | |
(4) | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
18-44 | 26.814459 5.1771495 10209 | |
45-64 | 28.023031 5.1792507 7428 | |
65-74 | 27.827794 5.0723462 2464 | |
75+ | 26.623902 4.6287582 1669 | |
------------+------------------------------------ | |
Total | 27.326912 5.1602158 21770 | |
. bys age4: ci bmi // confidence bands | |
------------------------------------------------------------------------------------ | |
-> age4 = 18-44 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 10209 26.81446 .0512388 26.71402 26.9149 | |
------------------------------------------------------------------------------------ | |
-> age4 = 45-64 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 7428 28.02303 .060094 27.90523 28.14083 | |
------------------------------------------------------------------------------------ | |
-> age4 = 65-74 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 2464 27.82779 .1021853 27.62742 28.02817 | |
------------------------------------------------------------------------------------ | |
-> age4 = 75+ | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1669 26.6239 .1133017 26.40167 26.84613 | |
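.
. * A minimal sketch, not part of the original run: the -mean- command reports
. * group means with their confidence intervals in a single table. Its results
. * should be very close to the -ci- output above, though not necessarily
. * identical, since standard errors and degrees of freedom are computed
. * slightly differently.

```stata
* Mean BMI and 95% confidence interval for each age group, in one table.
mean bmi, over(age4)
```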
. | |
. | |
. * IV: Gender | |
. * ---------- | |
. | |
. * Plot mean BMI for each gender group, within each age decade. | |
. gr bar bmi, over(female) asyvars over(age8) yline(27) /// | |
> note("Horizontal line at sample mean.") /// | |
> name(sex_age, replace) | |
. | |
. * 95% CI estimates: | |
. tab female, su(bmi) // mean BMI in each gender group | |
| Summary of Body Mass Index | |
female | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
Male | 27.616417 4.4580387 9687 | |
Female | 27.094814 5.650074 12083 | |
------------+------------------------------------ | |
Total | 27.326912 5.1602158 21770 | |
. bys female: ci bmi // confidence bands | |
------------------------------------------------------------------------------------ | |
-> female = Male | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 9687 27.61642 .0452949 27.52763 27.7052 | |
------------------------------------------------------------------------------------ | |
-> female = Female | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 12083 27.09481 .0514004 26.99406 27.19557 | |
. | |
. | |
. * IV: Race | |
. * -------- | |
. | |
. * Plot BMI groups for each racial background: | |
. spineplot bmi6 raceb, scheme(burd6) /// | |
> name(race, replace) | |
. | |
. * Histogram by race and gender groups. | |
. hist bmi, bin(10) xline(27) /// | |
> by(raceb female, cols(2) /// | |
> note("Vertical line at sample mean.") legend(off)) /// | |
> name(race_sex, replace) | |
. | |
. * 95% CI estimates: | |
. tab raceb, su(bmi) // mean BMI in each racial group | |
| Summary of Body Mass Index | |
Race | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
White | 27.045805 5.0759617 12885 | |
Black | 28.590384 5.5785555 3509 | |
Hispanic | 27.952797 4.9386991 4215 | |
Asian | 24.355698 3.8536771 1161 | |
------------+------------------------------------ | |
Total | 27.326912 5.1602158 21770 | |
. bys raceb: ci bmi // confidence bands | |
------------------------------------------------------------------------------------ | |
-> raceb = White | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 12885 27.04581 .0447174 26.95815 27.13346 | |
------------------------------------------------------------------------------------ | |
-> raceb = Black | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 3509 28.59038 .0941738 28.40574 28.77503 | |
------------------------------------------------------------------------------------ | |
-> raceb = Hispanic | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 4215 27.9528 .0760701 27.80366 28.10193 | |
------------------------------------------------------------------------------------ | |
-> raceb = Asian | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1161 24.3557 .1130991 24.1338 24.5776 | |
. | |
. | |
. * IV: Education | |
. * ------------- | |
. | |
. * Shorter labels for a cleaner graph. | |
. la def edu 13 "Grade 12" 14 "Coll 1-3 yrs" 15 "Coll 4" 16 "Coll 5+" | |
. la val educrec1 edu | |
. | |
. * (Reminder on labels: the first command, -la def-, creates new labels for the | |
. * values of a variable; the second command, -la val-, assigns the value label | |
. * to the target variable, which is educrec1 in this example.) | |
. | |
. * Plot BMI groups for each educational level. | |
. spineplot bmi6 educrec1, scheme(burd6) /// | |
> name(edu, replace) | |
. | |
. * Plot racial backgrounds for each educational level. | |
. spineplot raceb educrec1, /// | |
> name(edu_race, replace) | |
. | |
. * 95% CI estimates: | |
. tab educrec1, su(bmi) // mean BMI at each education level | |
Educational | | |
attainment | | |
recode, | | |
noninterval | Summary of Body Mass Index | |
led | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
Grade 12 | 27.852907 5.2562117 9491 | |
Coll 1-3 | 27.445587 5.2305751 6550 | |
Coll 4 | 26.388879 4.7885939 3764 | |
Coll 5+ | 26.187578 4.702525 1965 | |
------------+------------------------------------ | |
Total | 27.326912 5.1602158 21770 | |
. bys educrec1: ci bmi // confidence bands | |
------------------------------------------------------------------------------------ | |
-> educrec1 = Grade 12 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 9491 27.85291 .0539532 27.74715 27.95867 | |
------------------------------------------------------------------------------------ | |
-> educrec1 = Coll 1-3 yrs | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 6550 27.44559 .0646292 27.31889 27.57228 | |
------------------------------------------------------------------------------------ | |
-> educrec1 = Coll 4 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 3764 26.38888 .0780519 26.23585 26.54191 | |
------------------------------------------------------------------------------------ | |
-> educrec1 = Coll 5+ | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1965 26.18758 .106084 25.97953 26.39563 | |
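The standard errors and 95% bounds reported by -ci- above can be reproduced by hand. A minimal sketch, assuming the "Grade 12" figures printed in the log (n = 9491, mean 27.852907, s.d. 5.2562117) and the t-based interval that -ci- reports for means:

```stata
* Standard error of the mean: standard deviation over the square root of n.
di 5.2562117 / sqrt(9491)

* 95% bounds: mean -/+ critical t value (n - 1 degrees of freedom) times s.e.
di 27.852907 - invttail(9490, .025) * 5.2562117 / sqrt(9491)
di 27.852907 + invttail(9490, .025) * 5.2562117 / sqrt(9491)
```

The three results should be close to the Std. Err. and [95% Conf. Interval] columns of the first -ci- panel, up to rounding of the inputs.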
. | |
. | |
. * IV: Income | |
. * ---------- | |
. | |
. * Generate variable defined by the income ceiling of each category. | |
. gen inc = 5000 * earnings + 5000 * (earnings - 5) * (earnings > 5) | |
. la var inc "Total earnings ($)" | |
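The formula above implies that the first five earnings bands step up by 5,000 and the later ones by 10,000. A short check of that reading, as a sketch that assumes -earnings- holds integer category codes (the log does not show its coding):

```stata
* Bands up to 5: ceiling is 5,000 times the category code.
assert inc == 5000 * earnings if earnings <= 5

* Bands above 5: 5000*e + 5000*(e - 5) simplifies to 10000*e - 25000.
assert inc == 10000 * earnings - 25000 if earnings > 5 & !mi(earnings)

* Cross-tabulate the original categories against the recoded ceilings.
tab earnings inc, missing
```

If either -assert- fails, the recode does not match this interpretation and the category definitions should be checked in the codebook.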
. | |
. * Plot racial backgrounds for each income band. | |
. spineplot raceb inc if inc > 0, xla(,alt axis(2)) /// | |
> name(inc_race, replace) | |
. | |
. * Plot educational levels for each income band. | |
. spineplot educrec1 inc if inc > 0, scheme(burd4) xla(, alt axis(2)) /// | |
> name(inc_edu, replace) | |
. | |
. * Plot income quartiles for each BMI group. | |
. gr box inc if inc > 0, over(bmi6) /// | |
> name(inc, replace) | |
. | |
. * Plot BMI quartiles for each income band (excluding outliers). | |
. gr box bmi if inc > 0, over(inc) noout /// | |
> name(bmi_inc, replace) | |
. | |
. * 95% CI estimates: | |
. tab inc, su(bmi) // mean BMI at each income level | |
Total | | |
earnings | Summary of Body Mass Index | |
($) | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
0 | 27.361109 5.3342393 7662 | |
5000 | 26.298885 5.3198548 1075 | |
10000 | 26.830443 5.1843042 923 | |
15000 | 27.235495 5.4874418 1252 | |
20000 | 27.618259 5.4052868 1097 | |
25000 | 27.440014 5.1785359 1232 | |
35000 | 27.595926 5.0842359 2129 | |
45000 | 27.465086 5.0802835 1775 | |
55000 | 27.467011 4.895058 1397 | |
65000 | 27.582522 4.8332839 885 | |
75000 | 27.516439 4.7213988 603 | |
85000 | 27.098543 4.3958045 1740 | |
------------+------------------------------------ | |
Total | 27.326912 5.1602158 21770 | |
. bys inc: ci bmi // confidence bands | |
------------------------------------------------------------------------------------ | |
-> inc = 0 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 7662 27.36111 .0609399 27.24165 27.48057 | |
------------------------------------------------------------------------------------ | |
-> inc = 5000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1075 26.29888 .1622541 25.98051 26.61726 | |
------------------------------------------------------------------------------------ | |
-> inc = 10000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 923 26.83044 .1706435 26.49555 27.16534 | |
------------------------------------------------------------------------------------ | |
-> inc = 15000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1252 27.2355 .1550843 26.93124 27.53975 | |
------------------------------------------------------------------------------------ | |
-> inc = 20000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1097 27.61826 .1631982 27.29804 27.93848 | |
------------------------------------------------------------------------------------ | |
-> inc = 25000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1232 27.44001 .1475372 27.15056 27.72947 | |
------------------------------------------------------------------------------------ | |
-> inc = 35000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 2129 27.59593 .1101889 27.37984 27.81201 | |
------------------------------------------------------------------------------------ | |
-> inc = 45000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1775 27.46509 .1205837 27.22858 27.70159 | |
------------------------------------------------------------------------------------ | |
-> inc = 55000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1397 27.46701 .1309663 27.2101 27.72392 | |
------------------------------------------------------------------------------------ | |
-> inc = 65000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 885 27.58252 .1624691 27.26365 27.90139 | |
------------------------------------------------------------------------------------ | |
-> inc = 75000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 603 27.51644 .1922702 27.13884 27.89404 | |
------------------------------------------------------------------------------------ | |
-> inc = 85000 | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 1740 27.09854 .1053813 26.89186 27.30523 | |
. | |
. | |
. * IV: Health insurance | |
. * -------------------- | |
. | |
. * Plot BMI distribution for groups who have or do not have health coverage. | |
. kdensity bmi if uninsured == 1, addplot(kdensity bmi if uninsured == 2) /// | |
> legend(order(1 "Not covered" 2 "Covered") row(1)) /// | |
> name(uninsured, replace) | |
. | |
. * Exploration: | |
. tab uninsured, su(bmi) // mean BMI by health coverage status | |
Health | | |
Insurance | | |
coverage | Summary of Body Mass Index | |
status | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
Not cover | 27.298409 5.1020606 3969 | |
Covered | 27.333267 5.1732139 17801 | |
------------+------------------------------------ | |
Total | 27.326912 5.1602158 21770 | |
. bys uninsured: ci bmi // confidence bands | |
------------------------------------------------------------------------------------ | |
-> uninsured = Not covered | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 3969 27.29841 .0809851 27.13963 27.45719 | |
------------------------------------------------------------------------------------ | |
-> uninsured = Covered | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 17801 27.33327 .0387738 27.25727 27.40927 | |
. | |
. | |
. * IV: Health affordability | |
. * ------------------------ | |
. | |
. * Plot BMI distribution for groups who could or could not afford medical care. | |
. kdensity bmi if ybarcare == 1, addplot(kdensity bmi if ybarcare == 2) /// | |
> legend(order(1 "Could afford medical care" 2 "Could not") row(1)) /// | |
> name(ybarcare, replace) | |
. | |
. * Exploration: | |
. tab ybarcare, su(bmi) // mean BMI by ability to afford medical care | |
Needed but | | |
couldn't | | |
afford | | |
medical | | |
care, past | Summary of Body Mass Index | |
12 months | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
No | 27.254224 5.097839 19508 | |
Yes | 27.953789 5.6321726 2262 | |
------------+------------------------------------ | |
Total | 27.326912 5.1602158 21770 | |
. bys ybarcare: ci bmi // confidence bands | |
------------------------------------------------------------------------------------ | |
-> ybarcare = No | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 19508 27.25422 .0364989 27.18268 27.32576 | |
------------------------------------------------------------------------------------ | |
-> ybarcare = Yes | |
Variable | Obs Mean Std. Err. [95% Conf. Interval] | |
-------------+--------------------------------------------------------------- | |
bmi | 2262 27.95379 .1184213 27.72156 28.18601 | |
. | |
. | |
. * ============================= | |
. * = EXPORT SUMMARY STATISTICS = | |
. * ============================= | |
. | |
. | |
. * The reader of your research does not know your data. A solution at that stage | |
. * is therefore to produce a table that holds descriptive (summary) statistics | |
. * for the variables that you have selected for analysis. This requires using a | |
. * command that was written especially for the course, to make it very easy. | |
. | |
. * The next command is part of the SRQM folder. If Stata returns an error when | |
. * you run it, set the folder as your working directory and type -run profile- | |
. * to run the course setup, then try the command again. If you still experience | |
. * problems with the -stab- command, please send a detailed email on the issue. | |
. | |
. stab using week5_stats.txt, replace /// | |
> mean(bmi age) /// | |
> prop(female raceb educrec1 earnings uninsured ybarcare) | |
installing estout first... | |
checking estout consistency and verifying not already installed... | |
installing into /Users/fr/Library/Application Support/Stata/ado/stbplus/... | |
installation complete. | |
(note: file week5_stats.txt not found) | |
Variable mean sd min max mea | |
> n sd min max mean sd min | |
> max mean sd min max mean | |
> sd min max mean sd min m | |
> ax mean sd min max mean sd | |
> min max mean sd min max | |
> mean sd min max | |
% % % % | |
> % % % % % % | |
Race % % % % | |
> % % % % % % | |
Educational attain~n % % % % | |
> % % % % % % | |
Person's total ear~u % % % % | |
> % % % % % % | |
Health Insurance c~s % % % % | |
> % % % % % % | |
Needed but couldn'~c % % % % | |
> % % % % % % | |
N = 217700 | |
File: week5_stats.txt | |
. | |
. /* Syntax of the -stab- command: | |
> | |
> - using FILE - name of the exported file; plain text (.txt) recommended | |
> - replace - overwrite any previously existing file | |
> - mean() - summarizes a list of continuous variables (mean, sd, min, max) | |
> - prop() - summarizes a list of categorical variables (frequencies) | |
> | |
> In the example above, the -stab- command will export a single file to the | |
> working directory (week5_stats.txt) containing summary statistics for the | |
> final sample, as a plain text file of tab-separated values. */ | |
. | |
. * Last reminder: your code is the technical document, whereas your paper is the | |
. * substantive document. Make sure that the paper is not a descriptive write-up | |
. * of what happens in your code: you need to produce analytical value-added by | |
. * explaining what you are hypothesizing about the relationships in the data. | |
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Thanks for following! | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require fre renvars scheme-burd spineplot tab_chi | |
. | |
. * Log results. | |
. cap log using code/week6.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 6 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Opposition to Torture in Israel | |
> | |
> - DATA: European Social Survey Round 4 (2008) | |
> | |
> This do-file introduces the topic of significance tests, i.e. statistical | |
> tools to assess whether an association that shows up in the data is different | |
> from the kind of arrangement that might be observed in random data. | |
> | |
> Associations are relationships between two of your variables. They correspond | |
> to real-world relationships, like the association between income and gender. | |
> Significance tests are helpful to observe and measure these phenomena. | |
> | |
> The null hypothesis, which is the kind of hypothesis that gets tested in a | |
> significance test, is different from the substantive hypotheses that you | |
> previously formulated about your data. It is usually denoted "H_0". | |
> | |
> The null hypothesis states that the association that you observe in the data | |
> is a statistical accident, i.e. that it could arise from chance alone. The | |
> test measures the consistency of your data with that hypothesis. | |
> | |
> A significance test never proves anything. It can only reject the possibility | |
> that an association in your data is consistent with accidental situations. | |
> The aim of a significance test is therefore to reject the null hypothesis. | |
> | |
> To obtain that kind of proof by contradiction, the significance test will | |
> estimate how likely it is to reach an association at least as strong as the | |
> one that you observe from random data. This probability is the p-value. | |
> | |
> A small p-value means that it is highly unlikely to produce the same association | |
> as the one you observe out of randomness. Note how far that result is from an | |
> assessment of whether your hypothesis is right or wrong! | |
> | |
> The notions covered in the paragraphs above cannot be introduced technically, | |
> as short comments accompanying Stata commands. They require that you actually | |
> open your textbooks and read at length about statistical estimation. | |
> | |
> There are many different kinds of hypothesis tests: we will cover the t-test, | |
> the proportions test, the Chi-squared test and finally linear correlation. | |
> The Stata Guide also covers these tests. Make sure to read what you need to! | |
> | |
> Last updated 2013-05-29. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load ESS dataset. | |
. use data/ess2008, clear | |
(European Social Survey 2008) | |
. | |
. * Survey weights. | |
. svyset [pw = dweight] // weighting scheme set to country-specific population | |
pweight: dweight | |
VCE: linearized | |
Single unit: missing | |
Strata 1: <one> | |
SU 1: <observations> | |
FPC 1: <zero> | |
. | |
. * Rename variables to short handles. | |
. renvars agea gndr hinctnta eduyrs \ age sex income edu // socio-demographics | |
. renvars rlgdnm lrscale tvpol \ denom pol tv // religion, politics | |
. | |
. * Have a quick look. | |
. codebook cntry age sex income edu denom pol tv, c | |
Variable Obs Unique Mean Min Max Label | |
------------------------------------------------------------------------------------ | |
cntry 56752 29 . . . Country | |
age 56544 87 47.53717 15 123 Age of respondent, calculated | |
sex 56722 2 1.545379 1 2 Gender | |
income 41120 10 5.26177 1 10 Household's total net income, all so... | |
edu 56238 41 11.93741 0 50 Years of full-time education completed | |
denom 37067 8 2.498907 1 8 Religion or denomination belonging t... | |
pol 47569 11 5.1991 0 10 Placement on left right scale | |
tv 54265 8 1.976191 0 7 TV watching, news/politics/current a... | |
------------------------------------------------------------------------------------ | |
. | |
. | |
. * Subsetting | |
. * ---------- | |
. | |
. * Delete incomplete observations. | |
. drop if mi(age, sex, income, edu, denom, pol, tv) | |
(36034 observations deleted) | |
. | |
. | |
. * Dependent variable: Justifiability of torture to prevent terrorism | |
. * ------------------------------------------------------------------ | |
. | |
. fre trrtort | |
trrtort -- Torture in country never justified even to prevent terrorist attack | |
----------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------------------------------+-------------------------------------------- | |
Valid 1 Agree strongly | 5668 27.36 28.16 28.16 | |
2 Agree | 6639 32.04 32.98 61.13 | |
3 Neither agree nor disagree | 3276 15.81 16.27 77.41 | |
4 Disagree | 3183 15.36 15.81 93.22 | |
5 Disagree strongly | 1365 6.59 6.78 100.00 | |
Total | 20131 97.17 100.00 | |
Missing .a | 24 0.12 | |
.b | 546 2.64 | |
.c | 17 0.08 | |
Total | 587 2.83 | |
Total | 20718 100.00 | |
----------------------------------------------------------------------------------- | |
. | |
. * Generate dummies called 'torture_1 torture_2' etc. for each DV category. | |
. tab trrtort, gen(torture_) | |
Torture in country never | | |
justified even to prevent | | |
terrorist attack | Freq. Percent Cum. | |
---------------------------+----------------------------------- | |
Agree strongly | 5,668 28.16 28.16 | |
Agree | 6,639 32.98 61.13 | |
Neither agree nor disagree | 3,276 16.27 77.41 | |
Disagree | 3,183 15.81 93.22 | |
Disagree strongly | 1,365 6.78 100.00 | |
---------------------------+----------------------------------- | |
Total | 20,131 100.00 | |
. | |
. * Country-level breakdown using stacked bars and 5-pt scale graph scheme. | |
. gr hbar torture_? [aw = dweight], stack /// | |
> over(cntry, sort(1)des lab(labsize(*.8))) /// | |
> yti("Torture is never justified even to prevent terrorism") /// | |
> legend(rows(1) /// | |
> order(1 "Strongly agree" 2 "" 3 "Neither" 4 "" 5 "Strongly disagree")) /// | |
> name(torture1, replace) scheme(burd5) | |
. | |
. * Binary recoding (1 = torture is never justifiable; undecideds removed). | |
. recode trrtort /// | |
> (1/2 = 1 "Never justifiable") /// | |
> (4/5 = 0 "Sometimes justifiable") /// | |
> (3 = .) (else = .), gen(torture) | |
(15050 differences between trrtort and torture) | |
. la var torture "Opposition to torture" | |
. | |
. * Average opposition to torture in Europe. | |
. fre torture | |
torture -- Opposition to torture | |
----------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------------------------+-------------------------------------------- | |
Valid 0 Sometimes justifiable | 4548 21.95 26.98 26.98 | |
1 Never justifiable | 12307 59.40 73.02 100.00 | |
Total | 16855 81.35 100.00 | |
Missing . | 3863 18.65 | |
Total | 20718 100.00 | |
----------------------------------------------------------------------------- | |
. tab torture [aw = dweight * pweight] // weighted by overall European population | |
Opposition to torture | Freq. Percent Cum. | |
----------------------+----------------------------------- | |
Sometimes justifiable | 4,577.4821 27.16 27.16 | |
Never justifiable | 12,277.518 72.84 100.00 | |
----------------------+----------------------------------- | |
Total | 16,855 100.00 | |
. | |
. * Average opposition to torture in each country. | |
. gr dot torture [aw = dweight], over(cntry, sort(1) des) scale(.75) /// | |
> name(torture2, replace) | |
. | |
. * Create a dummy for Israel vs. other European countries. | |
. gen israel:israel = (cntry == "IL") | |
. la def israel 1 "Israel" 0 "Other EU" | |
. | |
. * Estimate DV proportions in Israel. | |
. prop torture if israel | |
Proportion estimation Number of obs = 1039 | |
_prop_1: torture = Sometimes justifiable | |
_prop_2: torture = Never justifiable | |
-------------------------------------------------------------- | |
| Proportion Std. Err. [95% Conf. Interval] | |
-------------+------------------------------------------------ | |
torture | | |
_prop_1 | .4513956 .0154458 .4210871 .4817041 | |
_prop_2 | .5486044 .0154458 .5182959 .5789129 | |
-------------------------------------------------------------- | |
. | |
. * Compare average opposition to torture inside and outside Israel. | |
. prtest torture, by(israel) | |
Two-sample test of proportions Other EU: Number of obs = 15816 | |
Israel: Number of obs = 1039 | |
------------------------------------------------------------------------------ | |
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
Other EU | .7420966 .0034786 .7352786 .7489146 | |
Israel | .5486044 .0154383 .5183458 .578863 | |
-------------+---------------------------------------------------------------- | |
diff | .1934922 .0158254 .162475 .2245094 | |
| under Ho: .0142156 13.61 0.000 | |
------------------------------------------------------------------------------ | |
diff = prop(Other EU) - prop(Israel) z = 13.6112 | |
Ho: diff = 0 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(Z < z) = 1.0000 Pr(|Z| < |z|) = 0.0000 Pr(Z > z) = 0.0000 | |
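The two-sample test above can be reproduced from its summary figures alone. A minimal sketch, using the group sizes and proportions printed in the output and the pooled standard error that applies under H_0 (the local macro names are illustrative):

```stata
* Immediate form: n and proportion of each group, taken from the output above.
prtesti 15816 .7420966 1039 .5486044

* Same z statistic by hand: pooled proportion under H_0, then its s.e.
local p  = (15816 * .7420966 + 1039 * .5486044) / (15816 + 1039)
local se = sqrt(`p' * (1 - `p') * (1/15816 + 1/1039))
di "pooled proportion = " `p'
di "s.e. under H_0    = " `se'
di "z                 = " (.7420966 - .5486044) / `se'
```

The hand computation makes explicit that the z statistic divides the observed difference by the standard error "under Ho", not by the standard error shown on the -diff- row.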
. | |
. * Subset to Israel only (drop all other countries). | |
. keep if israel | |
(19351 observations deleted) | |
. | |
. * Final sample size. | |
. count | |
1367 | |
. | |
. | |
. * ====================== | |
. * = SIGNIFICANCE TESTS = | |
. * ====================== | |
. | |
. | |
. * IV: Age | |
. * ------- | |
. | |
. su age | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
age | 1367 48.46818 18.43854 15 97 | |
. | |
. * Check normality. | |
. hist age, bin(15) normal /// | |
> name(age, replace) | |
(bin=15, start=15, width=5.4666667) | |
. | |
. * Recoding to 4 age groups: | |
. gen age4:age4 = irecode(age, 24, 44, 64) // quick recode | |
. table age4, c(min age max age n age) // check result | |
---------------------------------------------- | |
age4 | min(age) max(age) N(age) | |
----------+----------------------------------- | |
0 | 15 24 151 | |
1 | 25 44 439 | |
2 | 45 64 491 | |
3 | 65 97 286 | |
---------------------------------------------- | |
. la def age4 0 "15-24" 1 "25-44" 2 "45-64" 3 "65+" // value labels | |
. la var age4 "Age (4 groups)" // label result | |
. fre age4 // final result | |
age4 -- Age (4 groups) | |
------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
----------------+-------------------------------------------- | |
Valid 0 15-24 | 151 11.05 11.05 11.05 | |
1 25-44 | 439 32.11 32.11 43.16 | |
2 45-64 | 491 35.92 35.92 79.08 | |
3 65+ | 286 20.92 20.92 100.00 | |
Total | 1367 100.00 100.00 | |
------------------------------------------------------------- | |
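The -irecode- call above assigns 0 to ages up to 24, 1 to ages above 24 and up to 44, 2 to ages above 44 and up to 64, and 3 to ages above 64. A sketch of an equivalent -recode-, where age4_check is a hypothetical scratch variable used only for the comparison:

```stata
* Rules are applied left to right, so a boundary value such as 24 is
* caught by the first rule and is not recoded again by the second.
recode age (min/24 = 0) (24/44 = 1) (44/64 = 2) (64/max = 3), gen(age4_check)

* Both codings should agree for every observation.
assert age4 == age4_check
drop age4_check
```

-irecode- is shorter to type; -recode- has the advantage of accepting value labels inside each rule.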
. | |
. * Spineplot. | |
. spineplot torture age4, /// | |
> name(dv_age, replace) | |
. | |
. * Comparison of average age in each category. | |
. ttest age, by(torture) | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
Sometime | 469 46.07036 .8408054 18.20882 44.41814 47.72258 | |
Never ju | 570 49.4193 .7527722 17.97219 47.94075 50.89785 | |
---------+-------------------------------------------------------------------- | |
combined | 1039 47.9076 .5629982 18.14741 46.80286 49.01235 | |
---------+-------------------------------------------------------------------- | |
diff | -3.348936 1.127112 -5.560616 -1.137255 | |
------------------------------------------------------------------------------ | |
diff = mean(Sometime) - mean(Never ju) t = -2.9713 | |
Ho: diff = 0 degrees of freedom = 1037 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.0015 Pr(|T| > |t|) = 0.0030 Pr(T > t) = 0.9985 | |
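The t-test above can be replayed from its summary statistics, and its two-sided p-value recovered from the t statistic. A minimal sketch using the figures printed in the output; the last line is an optional robustness check, not part of the original do-file:

```stata
* Immediate form: n, mean and s.d. of each group, taken from the output above.
ttesti 469 46.07036 18.20882 570 49.4193 17.97219

* Two-sided p-value from the t statistic and its degrees of freedom.
di 2 * ttail(1037, abs(-2.9713))

* Variant that does not assume equal variances in the two groups.
ttest age, by(torture) unequal
```

With group standard deviations as close as these, the equal and unequal variance versions are expected to give very similar results.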
. | |
. | |
. * IV: Gender | |
. * ---------- | |
. | |
. fre sex | |
sex -- Gender | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 1 Male | 647 47.33 47.33 47.33 | |
2 Female | 720 52.67 52.67 100.00 | |
Total | 1367 100.00 100.00 | |
-------------------------------------------------------------- | |
. | |
. gen female:female = (sex==2) if !mi(sex) // dummify | |
. la def female 0 "Male" 1 "Female" | |
. la var female "Gender" | |
. | |
. * Conditional probabilities: | |
. tab torture female, col nof // column percentages | |
| Gender | |
Opposition to torture | Male Female | Total | |
----------------------+----------------------+---------- | |
Sometimes justifiable | 47.58 42.72 | 45.14 | |
Never justifiable | 52.42 57.28 | 54.86 | |
----------------------+----------------------+---------- | |
Total | 100.00 100.00 | 100.00 | |
. tab torture female, row nof // row percentages | |
| Gender | |
Opposition to torture | Male Female | Total | |
----------------------+----------------------+---------- | |
Sometimes justifiable | 52.45 47.55 | 100.00 | |
Never justifiable | 47.54 52.46 | 100.00 | |
----------------------+----------------------+---------- | |
Total | 49.76 50.24 | 100.00 | |
. | |
. * Comparison of proportions in each category. | |
. prtest female, by(torture) | |
Two-sample test of proportions Sometimes ju: Number of obs = 469 | |
Never justif: Number of obs = 570 | |
------------------------------------------------------------------------------ | |
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
Sometimes ju | .4754797 .0230601 .4302828 .5206767 | |
Never justif | .5245614 .0209174 .483564 .5655588 | |
-------------+---------------------------------------------------------------- | |
diff | -.0490817 .0311337 -.1101025 .0119392 | |
| under Ho: .0311709 -1.57 0.115 | |
------------------------------------------------------------------------------ | |
diff = prop(Sometimes ju) - prop(Never justif) z = -1.5746 | |
Ho: diff = 0 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(Z < z) = 0.0577 Pr(|Z| < |z|) = 0.1153 Pr(Z > z) = 0.9423 | |
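The p-values in the last row follow directly from the z statistic and the standard normal distribution. A short sketch using z = -1.5746 from the output above:

```stata
* Two-sided p-value (Ha: diff != 0): probability that |Z| exceeds |z|.
di 2 * (1 - normal(abs(-1.5746)))

* One-sided p-values for Ha: diff < 0 and Ha: diff > 0.
di normal(-1.5746)
di 1 - normal(-1.5746)
```

Note that the middle figure in the output is the two-sided tail probability, i.e. the probability of a |Z| larger than the observed |z|, whatever the direction of the inequality printed in its label.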
. | |
. | |
. * IV: Income deciles | |
. * ------------------ | |
. | |
. fre income | |
income -- Household's total net income, all sources | |
------------------------------------------------------------------------ | |
| Freq. Percent Valid Cum. | |
---------------------------+-------------------------------------------- | |
Valid 1 J - 1st decile | 109 7.97 7.97 7.97 | |
2 R - 2nd decile | 157 11.49 11.49 19.46 | |
3 C - 3rd decile | 238 17.41 17.41 36.87 | |
4 M - 4th decile | 173 12.66 12.66 49.52 | |
5 F - 5th decile | 155 11.34 11.34 60.86 | |
6 S - 6th decile | 129 9.44 9.44 70.30 | |
7 K - 7th decile | 109 7.97 7.97 78.27 | |
8 P - 8th decile | 101 7.39 7.39 85.66 | |
9 D - 9th decile | 99 7.24 7.24 92.90 | |
10 H - 10th decile | 97 7.10 7.10 100.00 | |
Total | 1367 100.00 100.00 | |
------------------------------------------------------------------------ | |
. | |
. * Simpler coding (no value labels). | |
. gen inc = income | |
. | |
. * Spineplot. | |
. spineplot torture inc | |
. | |
. * Chi-squared test. | |
. tab inc torture, row nof // row percentages | |
| Opposition to torture | |
inc | Sometimes Never jus | Total | |
-----------+----------------------+---------- | |
1 | 44.44 55.56 | 100.00 | |
2 | 45.90 54.10 | 100.00 | |
3 | 51.69 48.31 | 100.00 | |
4 | 41.67 58.33 | 100.00 | |
5 | 42.48 57.52 | 100.00 | |
6 | 43.14 56.86 | 100.00 | |
7 | 43.37 56.63 | 100.00 | |
8 | 49.37 50.63 | 100.00 | |
9 | 48.24 51.76 | 100.00 | |
10 | 35.44 64.56 | 100.00 | |
-----------+----------------------+---------- | |
Total | 45.14 54.86 | 100.00 | |
. tab inc torture, col nof // column percentages | |
| Opposition to torture | |
inc | Sometimes Never jus | Total | |
-----------+----------------------+---------- | |
1 | 8.53 8.77 | 8.66 | |
2 | 11.94 11.58 | 11.74 | |
3 | 19.62 15.09 | 17.13 | |
4 | 9.59 11.05 | 10.39 | |
5 | 10.23 11.40 | 10.88 | |
6 | 9.38 10.18 | 9.82 | |
7 | 7.68 8.25 | 7.99 | |
8 | 8.32 7.02 | 7.60 | |
9 | 8.74 7.72 | 8.18 | |
10 | 5.97 8.95 | 7.60 | |
-----------+----------------------+---------- | |
Total | 100.00 100.00 | 100.00 | |
. tab inc torture, cell nof // cell percentages | |
| Opposition to torture | |
inc | Sometimes Never jus | Total | |
-----------+----------------------+---------- | |
1 | 3.85 4.81 | 8.66 | |
2 | 5.39 6.35 | 11.74 | |
3 | 8.85 8.28 | 17.13 | |
4 | 4.33 6.06 | 10.39 | |
5 | 4.62 6.26 | 10.88 | |
6 | 4.23 5.58 | 9.82 | |
7 | 3.46 4.52 | 7.99 | |
8 | 3.75 3.85 | 7.60 | |
9 | 3.95 4.23 | 8.18 | |
10 | 2.69 4.91 | 7.60 | |
-----------+----------------------+---------- | |
Total | 45.14 54.86 | 100.00 | |
. tab inc torture, chi2 // Chi-squared test | |
| Opposition to torture | |
inc | Sometimes Never jus | Total | |
-----------+----------------------+---------- | |
1 | 40 50 | 90 | |
2 | 56 66 | 122 | |
3 | 92 86 | 178 | |
4 | 45 63 | 108 | |
5 | 48 65 | 113 | |
6 | 44 58 | 102 | |
7 | 36 47 | 83 | |
8 | 39 40 | 79 | |
9 | 41 44 | 85 | |
10 | 28 51 | 79 | |
-----------+----------------------+---------- | |
Total | 469 570 | 1,039 | |
Pearson chi2(9) = 8.1436 Pr = 0.520 | |
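The Chi-squared p-value and the contribution of a single cell can be recomputed from the counts in the table above. A minimal sketch, taking the 10th income decile and "Never justifiable" as the example cell:

```stata
* Degrees of freedom: (rows - 1) * (columns - 1) = (10 - 1) * (2 - 1) = 9.
di chi2tail(9, 8.1436)

* Expected count under independence: row total * column total / N.
di 79 * 570 / 1039

* Contribution of that cell to the statistic: (observed - expected)^2 / expected.
di (51 - 79 * 570 / 1039)^2 / (79 * 570 / 1039)
```

Summing that last quantity over all twenty cells gives the Pearson statistic reported by -tab, chi2-.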
. | |
. | |
. * IV: Education | |
. * ------------- | |
. | |
. fre edu | |
edu -- Years of full-time education completed | |
----------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------+-------------------------------------------- | |
Valid 0 | 7 0.51 0.51 0.51 | |
3 | 5 0.37 0.37 0.88 | |
4 | 10 0.73 0.73 1.61 | |
5 | 9 0.66 0.66 2.27 | |
6 | 13 0.95 0.95 3.22 | |
7 | 13 0.95 0.95 4.17 | |
8 | 88 6.44 6.44 10.61 | |
9 | 35 2.56 2.56 13.17 | |
10 | 71 5.19 5.19 18.36 | |
11 | 65 4.75 4.75 23.12 | |
12 | 501 36.65 36.65 59.77 | |
13 | 40 2.93 2.93 62.69 | |
14 | 86 6.29 6.29 68.98 | |
15 | 111 8.12 8.12 77.10 | |
16 | 157 11.49 11.49 88.59 | |
17 | 50 3.66 3.66 92.25 | |
18 | 46 3.37 3.37 95.61 | |
19 | 23 1.68 1.68 97.29 | |
20 | 18 1.32 1.32 98.61 | |
21 | 6 0.44 0.44 99.05 | |
22 | 4 0.29 0.29 99.34 | |
23 | 1 0.07 0.07 99.41 | |
24 | 2 0.15 0.15 99.56 | |
25 | 4 0.29 0.29 99.85 | |
26 | 2 0.15 0.15 100.00 | |
Total | 1367 100.00 100.00 | |
----------------------------------------------------------- | |
. | |
. * Verify normality. | |
. hist edu, bin(10) normal /// | |
> name(edu, replace) | |
(bin=10, start=0, width=2.6) | |
. | |
. * Comparison of average educational attainment in each category. | |
. ttest edu, by(torture) | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
Sometime | 469 13.02559 .1578023 3.41743 12.7155 13.33568 | |
Never ju | 570 12.51579 .1412224 3.371638 12.23841 12.79317 | |
---------+-------------------------------------------------------------------- | |
combined | 1039 12.74591 .1054875 3.400232 12.53892 12.9529 | |
---------+-------------------------------------------------------------------- | |
diff | .5097969 .2114893 .094801 .9247927 | |
------------------------------------------------------------------------------ | |
diff = mean(Sometime) - mean(Never ju) t = 2.4105 | |
Ho: diff = 0 degrees of freedom = 1037 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.9919 Pr(|T| > |t|) = 0.0161 Pr(T > t) = 0.0081 | |
. | |
. | |
. * IV: Religious faith | |
. * ------------------- | |
. | |
. fre denom | |
denom -- Religion or denomination belonging to at present | |
--------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------------+-------------------------------------------- | |
Valid 1 Roman Catholic | 26 1.90 1.90 1.90 | |
2 Protestant | 1 0.07 0.07 1.98 | |
3 Eastern Orthodox | 15 1.10 1.10 3.07 | |
4 Other Christian | 2 0.15 0.15 3.22 | |
denomination | | |
5 Jewish | 1132 82.81 82.81 86.03 | |
6 Islamic | 186 13.61 13.61 99.63 | |
7 Eastern religions | 3 0.22 0.22 99.85 | |
8 Other non-Christian | 2 0.15 0.15 100.00 | |
religions | | |
Total | 1367 100.00 100.00 | |
--------------------------------------------------------------------------- | |
. | |
. * Recoding to simpler groups. | |
. recode denom (1/4 = 1 "Christian") /// | |
> (5 = 2 "Jewish") (6 = 3 "Muslim") (else = .), gen(faith3) | |
(1341 differences between denom and faith3) | |
. la var faith3 "Religious faith" | |
. | |
. * Conditional probabilities: | |
. tab torture faith3, col nof // column percentages | |
| Religious faith | |
Opposition to torture | Christian Jewish Muslim | Total | |
----------------------+---------------------------------+---------- | |
Sometimes justifiable | 60.61 42.42 56.95 | 45.12 | |
Never justifiable | 39.39 57.58 43.05 | 54.88 | |
----------------------+---------------------------------+---------- | |
Total | 100.00 100.00 100.00 | 100.00 | |
. tab torture faith3, row nof // row percentages | |
| Religious faith | |
Opposition to torture | Christian Jewish Muslim | Total | |
----------------------+---------------------------------+---------- | |
Sometimes justifiable | 4.28 77.30 18.42 | 100.00 | |
Never justifiable | 2.29 86.27 11.44 | 100.00 | |
----------------------+---------------------------------+---------- | |
Total | 3.19 82.22 14.59 | 100.00 | |
. | |
. * Chi-squared test: | |
. tab torture faith3, exp chi2 // expected frequencies | |
+--------------------+ | |
| Key | | |
|--------------------| | |
| frequency | | |
| expected frequency | | |
+--------------------+ | |
| Religious faith | |
Opposition to torture | Christian Jewish Muslim | Total | |
----------------------+---------------------------------+---------- | |
Sometimes justifiable | 20 361 86 | 467 | |
| 14.9 384.0 68.1 | 467.0 | |
----------------------+---------------------------------+---------- | |
Never justifiable | 13 490 65 | 568 | |
| 18.1 467.0 82.9 | 568.0 | |
----------------------+---------------------------------+---------- | |
Total | 33 851 151 | 1,035 | |
| 33.0 851.0 151.0 | 1,035.0 | |
Pearson chi2(2) = 14.2396 Pr = 0.001 | |
. tabchi torture faith3, noe p // Pearson residuals | |
observed frequency | |
Pearson residual | |
------------------------------------------------------- | |
| Religious faith | |
Opposition to torture | Christian Jewish Muslim | |
----------------------+-------------------------------- | |
Sometimes justifiable | 20 361 86 | |
| 1.324 -1.173 2.165 | |
| | |
Never justifiable | 13 490 65 | |
| -1.201 1.063 -1.963 | |
------------------------------------------------------- | |
Pearson chi2(2) = 14.2396 Pr = 0.001 | |
likelihood-ratio chi2(2) = 14.1847 Pr = 0.001 | |
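. | |
. * Sketch, not part of the original run (hence no output below). Assuming the | |
. * observed counts printed above, a Pearson residual is (observed - expected) / | |
. * sqrt(expected), where expected = row total * column total / grand total. | |
. * For Christian respondents answering "Sometimes justifiable", this should | |
. * give roughly 1.32, in line with the -tabchi- table: | |
. di (20 - 33 * 467 / 1035) / sqrt(33 * 467 / 1035) | |
. | |
. * The immediate form of -tabulate- should rebuild the whole test from counts: | |
. tabi 20 361 86 \ 13 490 65, chi2 exp | |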
. | |
. * Create a binary variable for each category. | |
. tab faith3, gen(faith_) | |
Religious | | |
faith | Freq. Percent Cum. | |
------------+----------------------------------- | |
Christian | 44 3.23 3.23 | |
Jewish | 1,132 83.11 86.34 | |
Muslim | 186 13.66 100.00 | |
------------+----------------------------------- | |
Total | 1,362 100.00 | |
. d faith_? | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
faith_1 byte %8.0g faith3==Christian | |
faith_2 byte %8.0g faith3==Jewish | |
faith_3 byte %8.0g faith3==Muslim | |
. codebook faith_?, c | |
Variable Obs Unique Mean Min Max Label | |
------------------------------------------------------------------------------------ | |
faith_1 1362 2 .0323054 0 1 faith3==Christian | |
faith_2 1362 2 .8311307 0 1 faith3==Jewish | |
faith_3 1362 2 .1365639 0 1 faith3==Muslim | |
------------------------------------------------------------------------------------ | |
. | |
. * Inspect underlying distribution by country. | |
. tab cntry faith3 | |
| Religious faith | |
Country | Christian Jewish Muslim | Total | |
-----------+---------------------------------+---------- | |
IL | 44 1,132 186 | 1,362 | |
-----------+---------------------------------+---------- | |
Total | 44 1,132 186 | 1,362 | |
. | |
. * Comparing Christian respondents to all others. | |
. prtest torture, by(faith_1) | |
Two-sample test of proportions 0: Number of obs = 1002 | |
1: Number of obs = 33 | |
------------------------------------------------------------------------------ | |
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
0 | .5538922 .0157036 .5231138 .5846707 | |
1 | .3939394 .0850581 .2272285 .5606502 | |
-------------+---------------------------------------------------------------- | |
diff | .1599528 .0864956 -.0095754 .329481 | |
| under Ho: .0880383 1.82 0.069 | |
------------------------------------------------------------------------------ | |
diff = prop(0) - prop(1) z = 1.8169 | |
Ho: diff = 0 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(Z < z) = 0.9654 Pr(|Z| < |z|) = 0.0692 Pr(Z > z) = 0.0346 | |
. | |
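. * Sketch, not part of the original run (hence no output below). The z statistic | |
. * above divides the difference in proportions by a standard error computed | |
. * from the pooled proportion. The counts 555 and 13 are inferred from the means | |
. * and group sizes shown above (.5538922 * 1002 and .3939394 * 33); the scalar | |
. * names p_pool and se_h0 are illustrative. | |
. scalar p_pool = (555 + 13) / (1002 + 33) | |
. scalar se_h0 = sqrt(p_pool * (1 - p_pool) * (1 / 1002 + 1 / 33)) | |
. di "SE under Ho = " se_h0 ", z = " (555 / 1002 - 13 / 33) / se_h0 | |
. | |
. * The middle p-value, labelled Pr(|Z| < |z|) in this output, is the two-sided | |
. * Pr(|Z| > |z|), and should be about 0.069: | |
. di 2 * (1 - normal(abs(1.8169))) | |
. | |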
. * Comparing Jewish respondents to all others. | |
. prtest torture, by(faith_2) | |
Two-sample test of proportions 0: Number of obs = 184 | |
1: Number of obs = 851 | |
------------------------------------------------------------------------------ | |
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
0 | .423913 .0364312 .3525092 .4953169 | |
1 | .5757932 .0169417 .542588 .6089983 | |
-------------+---------------------------------------------------------------- | |
diff | -.1518801 .0401778 -.2306271 -.0731331 | |
| under Ho: .0404565 -3.75 0.000 | |
------------------------------------------------------------------------------ | |
diff = prop(0) - prop(1) z = -3.7542 | |
Ho: diff = 0 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(Z < z) = 0.0001 Pr(|Z| < |z|) = 0.0002 Pr(Z > z) = 0.9999 | |
. | |
. * Comparing Muslim respondents to all others. | |
. prtest torture, by(faith_3) | |
Two-sample test of proportions 0: Number of obs = 884 | |
1: Number of obs = 151 | |
------------------------------------------------------------------------------ | |
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
0 | .5690045 .0166559 .5363596 .6016495 | |
1 | .4304636 .040294 .3514888 .5094384 | |
-------------+---------------------------------------------------------------- | |
diff | .1385409 .0436008 .053085 .2239969 | |
| under Ho: .0438175 3.16 0.002 | |
------------------------------------------------------------------------------ | |
diff = prop(0) - prop(1) z = 3.1618 | |
Ho: diff = 0 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(Z < z) = 0.9992 Pr(|Z| < |z|) = 0.0016 Pr(Z > z) = 0.0008 | |
. | |
. | |
. * IV: Political positioning | |
. * ------------------------- | |
. | |
. fre pol | |
pol -- Placement on left right scale | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 0 Left | 29 2.12 2.12 2.12 | |
1 1 | 49 3.58 3.58 5.71 | |
2 2 | 100 7.32 7.32 13.02 | |
3 3 | 96 7.02 7.02 20.04 | |
4 4 | 103 7.53 7.53 27.58 | |
5 5 | 303 22.17 22.17 49.74 | |
6 6 | 180 13.17 13.17 62.91 | |
7 7 | 160 11.70 11.70 74.62 | |
8 8 | 153 11.19 11.19 85.81 | |
9 9 | 96 7.02 7.02 92.83 | |
10 Right | 98 7.17 7.17 100.00 | |
Total | 1367 100.00 100.00 | |
-------------------------------------------------------------- | |
. | |
. * Verifying normality. | |
. hist pol, discrete percent addl | |
(start=0, width=1) | |
. | |
. * Recoding to simpler categories | |
. recode pol (0/4 = 1 "Left") (5 = 2 "Centre") (6/10 = 3 "Right"), gen(pol3) | |
(1318 differences between pol and pol3) | |
. la var pol3 "Political positioning" | |
. | |
. * Conditional probabilities: | |
. tab torture pol3, col nof // column percentages | |
| Political positioning | |
Opposition to torture | Left Centre Right | Total | |
----------------------+---------------------------------+---------- | |
Sometimes justifiable | 50.70 42.47 43.26 | 45.14 | |
Never justifiable | 49.30 57.53 56.74 | 54.86 | |
----------------------+---------------------------------+---------- | |
Total | 100.00 100.00 100.00 | 100.00 | |
. tab torture pol3, row nof // row percentages | |
| Political positioning | |
Opposition to torture | Left Centre Right | Total | |
----------------------+---------------------------------+---------- | |
Sometimes justifiable | 30.92 19.83 49.25 | 100.00 | |
Never justifiable | 24.74 22.11 53.16 | 100.00 | |
----------------------+---------------------------------+---------- | |
Total | 27.53 21.08 51.40 | 100.00 | |
. | |
. * Chi-squared test: | |
. tab torture pol3, exp chi2 // expected frequencies | |
+--------------------+ | |
| Key | | |
|--------------------| | |
| frequency | | |
| expected frequency | | |
+--------------------+ | |
| Political positioning | |
Opposition to torture | Left Centre Right | Total | |
----------------------+---------------------------------+---------- | |
Sometimes justifiable | 145 93 231 | 469 | |
| 129.1 98.9 241.0 | 469.0 | |
----------------------+---------------------------------+---------- | |
Never justifiable | 141 126 303 | 570 | |
| 156.9 120.1 293.0 | 570.0 | |
----------------------+---------------------------------+---------- | |
Total | 286 219 534 | 1,039 | |
| 286.0 219.0 534.0 | 1,039.0 | |
Pearson chi2(2) = 4.9652 Pr = 0.084 | |
. | |
. | |
. * IV: Media exposure | |
. * ------------------ | |
. | |
. fre tv | |
tv -- TV watching, news/politics/current affairs on average weekday | |
----------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------------------------------+-------------------------------------------- | |
Valid 0 No time at all | 147 10.75 10.75 10.75 | |
1 Less than 0,5 hour | 323 23.63 23.63 34.38 | |
2 0,5 hour to 1 hour | 325 23.77 23.77 58.16 | |
3 More than 1 hour, up to 1,5 | 264 19.31 19.31 77.47 | |
hours | | |
4 More than 1,5 hours, up to | 84 6.14 6.14 83.61 | |
2 hours | | |
5 More than 2 hours, up to | 86 6.29 6.29 89.90 | |
2,5 hours | | |
6 More than 2,5 hours, up to | 31 2.27 2.27 92.17 | |
3 hours | | |
7 More than 3 hours | 107 7.83 7.83 100.00 | |
Total | 1367 100.00 100.00 | |
----------------------------------------------------------------------------------- | |
. | |
. * Alternative reading (binary mean). The nolabel (nol) option gets rid of the | |
. * value labels and makes the output table a tad softer on the reader's eye. | |
. tab tv, summ(torture) nol | |
TV | | |
watching, | | |
news/politi | | |
cs/current | | |
affairs on | | |
average | Summary of Opposition to torture | |
weekday | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
0 | .57943925 .49597214 107 | |
1 | .52610442 .50032377 249 | |
2 | .47058824 .50018612 238 | |
3 | .54404145 .49935191 193 | |
4 | .63333333 .4859611 60 | |
5 | .66216216 .47620149 74 | |
6 | .46428571 .5078745 28 | |
7 | .66666667 .47404546 90 | |
------------+------------------------------------ | |
Total | .54860443 .49787165 1039 | |
. | |
. * Alternative reading (plot). | |
. tab tv, plot | |
TV watching, news/politics/current | | |
affairs on average weekday | Freq. | |
-----------------------------------+------------+------------------------------ | |
No time at all | 147 |************** | |
Less than 0,5 hour | 323 |****************************** | |
0,5 hour to 1 hour | 325 |****************************** | |
More than 1 hour, up to 1,5 hours | 264 |************************ | |
More than 1,5 hours, up to 2 hours | 84 |******** | |
More than 2 hours, up to 2,5 hours | 86 |******** | |
More than 2,5 hours, up to 3 hours | 31 |*** | |
More than 3 hours | 107 |********** | |
-----------------------------------+------------+------------------------------ | |
Total | 1,367 | |
. | |
. * Recoding to binary. | |
. recode tv (0/3 = 0 "Low") (4/7 = 1 "High"), gen(media) | |
(1220 differences between tv and media) | |
. la var media "Media exposure" | |
. | |
. * Chi-squared test: | |
. tab torture media, exp chi2 // expected frequencies | |
+--------------------+ | |
| Key | | |
|--------------------| | |
| frequency | | |
| expected frequency | | |
+--------------------+ | |
| Media exposure | |
Opposition to torture | Low High | Total | |
----------------------+----------------------+---------- | |
Sometimes justifiable | 377 92 | 469 | |
| 355.2 113.8 | 469.0 | |
----------------------+----------------------+---------- | |
Never justifiable | 410 160 | 570 | |
| 431.8 138.2 | 570.0 | |
----------------------+----------------------+---------- | |
Total | 787 252 | 1,039 | |
| 787.0 252.0 | 1,039.0 | |
Pearson chi2(1) = 10.0094 Pr = 0.002 | |
. tabchi torture media, noe p // Pearson residuals | |
observed frequency | |
Pearson residual | |
-------------------------------------- | |
| Media exposure | |
Opposition to torture | Low High | |
----------------------+--------------- | |
Sometimes justifiable | 377 92 | |
| 1.154 -2.039 | |
| | |
Never justifiable | 410 160 | |
| -1.047 1.850 | |
-------------------------------------- | |
Pearson chi2(1) = 10.0094 Pr = 0.002 | |
likelihood-ratio chi2(1) = 10.1292 Pr = 0.001 | |
. | |
. * Comparing respondents with high TV exposure to others. | |
. prtest torture, by(media) | |
Two-sample test of proportions Low: Number of obs = 787 | |
High: Number of obs = 252 | |
------------------------------------------------------------------------------ | |
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
Low | .5209657 .0178074 .4860638 .5558676 | |
High | .6349206 .0303287 .5754776 .6943637 | |
-------------+---------------------------------------------------------------- | |
diff | -.1139549 .03517 -.1828869 -.045023 | |
| under Ho: .0360187 -3.16 0.002 | |
------------------------------------------------------------------------------ | |
diff = prop(Low) - prop(High) z = -3.1638 | |
Ho: diff = 0 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(Z < z) = 0.0008 Pr(|Z| < |z|) = 0.0016 Pr(Z > z) = 0.9992 | |
. | |
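. * Sketch, not part of the original run (hence no output below). In a 2 x 2 | |
. * table, the Pearson chi-squared statistic equals the square of the z statistic | |
. * from the two-sample test of proportions, so both tests agree (p = 0.002). | |
. * Squaring the rounded z above should give about 10.01: | |
. di (-3.1638)^2 | |
. | |
. * The immediate form of -prtest- should rebuild the test from the group sizes | |
. * and proportions printed above: | |
. prtesti 787 .5209657 252 .6349206 | |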
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require mkcorr renvars | |
. | |
. * Log results. | |
. cap log using code/week7.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 7 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Fertility and Education, Part 1 | |
> | |
> - DATA: Quality of Government (2013) | |
> | |
> This do-file is the last one that we will run on the topic of association. | |
> You are expected to submit the second draft of your work very soon: the draft | |
> paper that you will be submitting will be mostly significance tests, so make | |
> sure that you have done all the necessary readings and practice by then. | |
> | |
> Note that significance tests should not be used blindly: run them only when | |
> you observe a particular association that you want to quantify, such as a | |
> difference in means or proportions. Also remember that a significance test | |
> is not a means to test a substantive hypothesis. | |
> | |
> At that stage, it will become indispensable that you have caught up with the | |
> textbook readings, and that you understand enough about Stata syntax to focus | |
> on interpreting rather than coding. Use the course material to bring yourself | |
> up to speed with both Stata and essential statistical theory. | |
> | |
> Last updated 2013-05-28. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load QOG dataset. | |
. use data/qog2013, clear | |
(Quality of Government 2013) | |
. | |
. * Rename variables to short handles. | |
. renvars wdi_fr bl_asy25mf undp_hdi ti_cpi gid_wip \ births schooling hdi corruptio | |
> n femparl | |
. | |
. * Compute GDP per capita. | |
. gen gdpc = unna_gdp / unna_pop | |
(2 missing values generated) | |
. la var gdpc "Real GDP per capita (constant USD)" | |
. | |
. * Recode to fewer categories with shorter labels. | |
. recode ht_region (6/10 = 6), gen(region) | |
(44 differences between ht_region and region) | |
. la var region "Geographical region" | |
. la val region region | |
. la def region 1 "E. Europe and PSU" 2 "Lat. America" /// | |
> 3 "N. Africa and M. East" 4 "Sub-Sah. Africa" /// | |
> 5 "W. Europe and N. America" 6 "Asia, Pacific and Caribbean" /// | |
> , replace | |
. | |
. | |
. * Finalized sample | |
. * ---------------- | |
. | |
. * Have a quick look. | |
. codebook births schooling gdpc hdi corruption femparl region, c | |
Variable Obs Unique Mean Min Max Label | |
------------------------------------------------------------------------------------ | |
births 187 179 2.900285 1.149 7.115 Fertility Rate (Births per ... | |
schooling 143 143 7.813079 1.202597 13.27008 Average Schooling Years, Fe... | |
gdpc 191 191 10927.42 137.082 129959.4 Real GDP per capita (consta... | |
hdi 185 162 .6554973 .277 .941 Human Development Index | |
corruption 181 68 3.982868 1.009626 9.4 Corruption Perceptions Index | |
femparl 116 93 16.33103 0 56.3 Women in Parliament (%) | |
region 193 6 3.911917 1 6 Geographical region | |
------------------------------------------------------------------------------------ | |
. | |
. * Check missing values. | |
. misstable pat births schooling gdpc hdi corruption femparl region ccodewb, freq | |
Missing-value patterns | |
(1 means complete) | |
| Pattern | |
Frequency | 1 2 3 4 5 6 7 | |
------------+------------------------ | |
91 | 1 1 1 1 1 1 1 | |
| | |
48 | 1 1 1 1 1 1 0 | |
21 | 1 1 1 1 1 0 1 | |
15 | 1 1 1 1 1 0 0 | |
5 | 1 1 1 1 0 0 0 | |
4 | 1 1 0 0 0 0 0 | |
2 | 1 1 1 0 1 0 1 | |
1 | 0 1 1 1 1 0 0 | |
1 | 0 1 1 1 1 1 0 | |
1 | 1 0 0 0 1 1 0 | |
1 | 1 0 1 1 1 1 1 | |
1 | 1 1 0 1 0 0 0 | |
1 | 1 1 1 0 0 0 0 | |
1 | 1 1 1 1 0 1 1 | |
------------+------------------------ | |
193 | | |
Variables are (1) ccodewb (2) gdpc (3) births (4) hdi (5) corruption | |
(6) schooling (7) femparl | |
. | |
. * You would usually delete incomplete observations at this stage, and then count | |
. * the number of observations in your finalized sample. We exceptionally keep the | |
. * missing values here to illustrate how pairwise and listwise correlations work. | |
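. | |
. * Sketch, not part of the original run (hence no output below). One way to | |
. * count complete observations without deleting anything is to count missing | |
. * values per row; the variable name n_miss is illustrative. | |
. egen n_miss = rowmiss(births schooling gdpc hdi corruption femparl) | |
. count if n_miss == 0 | |
. * Listwise deletion would then be: drop if n_miss > 0 | |
. drop n_miss | |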
. | |
. | |
. | |
. * =============== | |
. * = CORRELATION = | |
. * =============== | |
. | |
. | |
. * (1) Fertility rates and schooling years | |
. * --------------------------------------- | |
. | |
. scatter births schooling, /// | |
> name(fert_edu, replace) | |
. | |
. pwcorr births schooling, obs sig | |
| births school~g | |
-------------+------------------ | |
births | 1.0000 | |
| | |
| 187 | |
| | |
schooling | -0.7394 1.0000 | |
| 0.0000 | |
| 142 143 | |
| | |
. | |
. | |
. * (2) Schooling years and (log) Gross Domestic Product | |
. * ---------------------------------------------------- | |
. | |
. sc gdpc schooling, /// | |
> name(gdpc_edu, replace) | |
. | |
. * A first look at the scatterplot shows no clear linear pattern, but we know | |
. * from a previous session that the logarithmic variable transformation can be | |
. * used to visualize exponential relationships differently. Consequently, we | |
. * try to visualize the same variables with a logarithmic scale for GDP per capita. | |
. sc gdpc schooling, ysc(log) /// | |
> name(gdpc_edu, replace) | |
. | |
. * In this classical case, log units are more informative than metric ones to | |
. * identify the relationship between the dependent and independent variables. | |
. gen log_gdpc = ln(gdpc) | |
(2 missing values generated) | |
. la var log_gdpc "Real GDP per capita (log)" | |
. | |
. * Verify the transformation. | |
. sc log_gdpc schooling, /// | |
> name(gdpc_edu, replace) | |
. | |
. * Obtain summary statistics. | |
. su log_gdpc schooling | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
log_gdpc | 191 8.159699 1.592255 4.920579 11.77498 | |
schooling | 143 7.813079 2.904687 1.202597 13.27008 | |
. | |
. * Visual inspection of the relationship within the mean-mean quadrants. | |
. sc log_gdpc schooling, yline(8.16) xline(7.81) /// | |
> name(log_gdpc_schooling, replace) | |
. | |
. * Verify inspection computationally. | |
. pwcorr gdpc log_gdpc schooling, obs sig | |
| gdpc log_gdpc school~g | |
-------------+--------------------------- | |
gdpc | 1.0000 | |
| | |
| 191 | |
| | |
log_gdpc | 0.7657 1.0000 | |
| 0.0000 | |
| 191 191 | |
| | |
schooling | 0.5537 0.7732 1.0000 | |
| 0.0000 0.0000 | |
| 141 141 143 | |
| | |
. | |
. | |
. * (3) Corruption and human development | |
. * ------------------------------------ | |
. | |
. * Before graphing the variables, we need to pass a few graph options, because | |
. * the Corruption Perceptions Index is reverse-coded (0 marks high corruption, | |
. * and 10 marks very low corruption). To enhance visual interpretation, we | |
. * therefore use an inverted axis scale, and add horizontal axis labels to it. | |
. sc corruption hdi, ysc(rev) /// | |
> xla(0 "Low" 1 "High") yla(0 "Highly corrupt" 10 "Lowly corrupt", angle(h)) | |
> /// | |
> name(corruption_hdi, replace) | |
. | |
. * The pattern that appears graphically is not linear: the index is roughly | |
. * flat for low to medium values of HDI, and then climbs steeply towards low | |
. * corruption at high values of HDI. Given its shape, this relationship is more | |
. * likely to follow a power function, i.e. of the form y = x^n where y is the | |
. * corruption index, x is HDI and n > 1 (a quadratic relationship when n = 2). | |
. * A statistically significant correlation coefficient might lead us to treat | |
. * the relationship between corruption and HDI as approximately linear, but we | |
. * will lose some of the information observed visually by doing so. | |
. pwcorr corruption hdi, obs sig | |
| corrup~n hdi | |
-------------+------------------ | |
corruption | 1.0000 | |
| | |
| 181 | |
| | |
hdi | 0.7244 1.0000 | |
| 0.0000 | |
| 178 185 | |
| | |
. | |
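. * Sketch, not part of the original run (hence no output below). To judge the | |
. * curvature visually, a quadratic fit can be overlaid on the scatterplot with | |
. * -qfit-; the graph name corruption_hdi_qfit is illustrative. | |
. tw (sc corruption hdi) (qfit corruption hdi), ysc(rev) /// | |
> name(corruption_hdi_qfit, replace) | |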
. | |
. * (4) Women in parliament and corruption | |
. * -------------------------------------- | |
. | |
. * Obtain summary statistics. | |
. su femparl corruption | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
femparl | 116 16.33103 10.11587 0 56.3 | |
corruption | 181 3.982868 2.089537 1.009626 9.4 | |
. | |
. * Visual inspection of the relationship within the mean-mean quadrants. | |
. sc femparl corruption, yline(15) xline(4) /// | |
> name(femparl_corruption, replace) | |
. | |
. * No clear pattern emerges from the scatterplot above. Never force a pattern | |
. * onto the data: relationships should be apparent, not constructed. If there is | |
. * no straightforward relationship, disregard it. Likewise, never include a | |
. * graph in your work if the relationship that it intends to show will not | |
. * strike the reader between the eyes (i.e. run an intra-ocular trauma test). | |
. * Inconclusive visual inspection can sometimes come with significant | |
. * correlations, which visual inspection and theoretical elaboration then fail | |
. * to justify substantively. Here, the coefficient is itself close to zero | |
. * (r = 0.03 in the correlation matrix below) and not significant at p < .05. | |
. | |
. | |
. * ================ | |
. * = SCATTERPLOTS = | |
. * ================ | |
. | |
. | |
. * Scatterplot matrices | |
. * -------------------- | |
. | |
. * Start with visual inspection of the data organized as a scatterplot matrix. | |
. * A scatterplot matrix contains all possible bivariate relationships between | |
. * any number of variables. Building a matrix of your DV and IVs allows you to | |
. * spot relationships between IVs, which will be useful later on in your analysis. | |
. * Note that the example below shows the log-transformed measure of GDP per capita. | |
. gr mat births schooling log_gdpc corruption femparl, /// | |
> name(gr_matrix, replace) | |
. | |
. * You could also look at a sparser version of the matrix that shows only half of | |
. * all plots for a subset of geographical regions. | |
. gr mat births schooling log_gdpc corruption femparl if inlist(region, 4, 5), half | |
> /// | |
> name(gr_matrix_regions4_5, replace) | |
. | |
. * The most practical way to consider all possible correlations in a list of | |
. * predictors (or independent variables) is to build a correlation matrix out | |
. * of their respective pairwise correlations. "Pairwise" indicates that each | |
. * correlation coefficient uses all observations that are nonmissing on the | |
. * two variables involved, whatever their values on the other variables. | |
. pwcorr births schooling log_gdpc corruption femparl | |
| births school~g log_gdpc corrup~n femparl | |
-------------+--------------------------------------------- | |
births | 1.0000 | |
schooling | -0.7394 1.0000 | |
log_gdpc | -0.7001 0.7732 1.0000 | |
corruption | -0.5175 0.6494 0.8033 1.0000 | |
femparl | 0.0066 -0.0436 -0.0674 0.0315 1.0000 | |
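. | |
. * Sketch, not part of the original run (hence no output below). For contrast, | |
. * -correlate- applies listwise deletion: every coefficient is computed on the | |
. * same observations, those with no missing value on any of the listed variables. | |
. corr births schooling log_gdpc corruption femparl | |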
. | |
. * The most common way to indicate statistically significant correlations in | |
. * a correlation matrix is to use asterisks (stars) to mark them when their | |
. * p-value is below the level of statistical significance. | |
. pwcorr births schooling log_gdpc corruption femparl, star(.05) | |
| births school~g log_gdpc corrup~n femparl | |
-------------+--------------------------------------------- | |
births | 1.0000 | |
schooling | -0.7394* 1.0000 | |
log_gdpc | -0.7001* 0.7732* 1.0000 | |
corruption | -0.5175* 0.6494* 0.8033* 1.0000 | |
femparl | 0.0066 -0.0436 -0.0674 0.0315 1.0000 | |
. | |
. * For exploratory purposes, another option can be used to print out only the | |
. * statistically significant correlations, which comes in handy especially in | |
. * very large matrices with mostly insignificant correlation coefficients. | |
. pwcorr births schooling log_gdpc corruption femparl, print(.05) | |
| births school~g log_gdpc corrup~n femparl | |
-------------+--------------------------------------------- | |
births | 1.0000 | |
schooling | -0.7394 1.0000 | |
log_gdpc | -0.7001 0.7732 1.0000 | |
corruption | -0.5175 0.6494 0.8033 1.0000 | |
femparl | 1.0000 | |
. | |
. * Export a correlation matrix. | |
. mkcorr births schooling gdpc corruption femparl, /// | |
> lab num sig log("week7_correlations.txt") replace | |
(note: file week7_correlations.txt not found) | |
. | |
. | |
. * Scatterplots with marker labels | |
. * ------------------------------- | |
. | |
. * Stata requires passing a lot of options to produce informative graphs. If you | |
. * are using a set of consistent options on several graphs, you can store these | |
. * in a global macro and apply them by calling the macro with a dollar sign ($). | |
. * The following global macro is a list of graph options to make scatterplots | |
. * more informative by showing country codes instead of anonymous data points: | |
. global ccode "ms(i) mlabpos(0) mlab(ccodewb) legend(off)" | |
. | |
. * The options contained in the global macro make the marker symbol invisible, | |
. * then center the marker label and fill it with the ccodewb variable (holding | |
. * country codes from the World Bank) in replacement of the usual dot markers. | |
. * In the following plots, passing the $ccode option will result in actually | |
. * passing these graph options, stored in the ccode ("country codes") macro. | |
. * Note that this is a hack, and that you would not normally fiddle with global | |
. * macros if you were programming Stata at a more advanced level: you would use | |
. * local macros, which are more complex in usage and therefore avoided here. | |
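. | |
. * Sketch, not part of the original run. A local macro holding the same options | |
. * would be called with a backtick and an apostrophe. It exists only while the | |
. * do-file (or selection) that defines it runs, so both commands must be executed | |
. * together; the names ccode_opts and fert_edu_local are illustrative. | |
. local ccode_opts "ms(i) mlabpos(0) mlab(ccodewb) legend(off)" | |
. sc births schooling, `ccode_opts' /// | |
> name(fert_edu_local, replace) | |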
. | |
. * Improve previous example. | |
. sc births schooling, $ccode /// | |
> name(fert_edu1, replace) | |
. | |
. * Add a color difference to Western states by overlaying multiple scatterplots. | |
. sc births schooling, $ccode || /// | |
> sc births schooling if region == 5, $ccode /// | |
> name(fert_edu2, replace) | |
. | |
. * Add a tone and color difference to Sub-Saharan African states (more options!). | |
. sc births schooling, mlabc(gs10) $ccode || /// | |
> sc births schooling if region == 4, $ccode /// | |
> name(fert_edu3, replace) | |
. | |
. * There are binders full of Stata graph options like these. Have a look at the | |
. * help pages for two-way graphs (h tw) for a list that applies to scatterplots. | |
. | |
. | |
. * Scatterplots with histograms | |
. * ---------------------------- | |
. | |
. * Or, how to combine graphs with insane axis options. | |
. sc births schooling, /// | |
> yti("") xti("") ysc(alt) yla(none, angle(v)) xsc(alt) xla(none, grid gmax) | |
> /// | |
> name(plot2, replace) plotregion(style(none)) | |
. | |
. * Plot 1 is top-left. | |
. tw hist births, /// | |
> xsc(alt rev) xla(none) xti("") horiz fxsize(25) /// | |
> name(plot1, replace) plotregion(style(none)) | |
. | |
. * Plot 3 is bottom-right. | |
. tw hist schooling, /// | |
> ysc(alt rev) yla(none, nogrid angle(v)) yti("") xla(,grid gmax) fysize(25) | |
> /// | |
> name(plot3, replace) plotregion(style(none)) | |
. | |
. * Combined plots with square ratio (y-size = x-size). | |
. gr combine plot1 plot2 plot3, /// | |
> imargin(0 0 0 0) hole(3) ysiz(5) xsiz(5) /// | |
> name(fert_edu4, replace) | |
(note: scheme burd not found, using s2color) | |
(note: scheme burd not found, using s2color) | |
(note: scheme burd not found, using s2color) | |
(note: scheme burd not found, using s2color) | |
(note: scheme burd not found, using s2color) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Cleanup, focus on result. | |
. gr drop plot1 plot2 plot3 | |
. gr di fert_edu4 | |
. | |
. | |
. * Scatterplots with smoothed lines | |
. * -------------------------------- | |
. | |
. * Another way to visualize the quality of a linear fit is to plot a smoothed fit | |
. * with the -lowess- command, to show departures from linearity in the IV effect: | |
. lowess births schooling, /// | |
> name(fert_edu_lowess, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * The LOWESS smoother available with -lowess- in Stata can operate as a moving | |
. * average (running mean) or as a least squares estimator, which is the default. | |
. * The core mechanics of a least squares estimator are on next week's menu. | |
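. | |
. * Sketch, not part of the original run. The running-mean variant and a narrower | |
. * bandwidth (the default is 0.8) can be requested through options; the graph | |
. * name fert_edu_lowess_mean is illustrative. | |
. lowess births schooling, mean bwidth(.5) /// | |
> name(fert_edu_lowess_mean, replace) | |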
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Log results. | |
. cap log using code/week8.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 8 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Fertility and Education, Part 2 | |
> | |
> - DATA: Quality of Government (2013) | |
> | |
> This do-file is a continuation of last week's do-file, which we start by | |
> running in the background. This will prepare the data by renaming variables, | |
> logging GDP per capita and recoding geographical regions to fewer categories | |
> and shorter labels. | |
> | |
> We then explore simple linear regression using a similar set of variables as | |
> the one used last week. Some variables are interpreted on non-linear scales. | |
> Dummies (and categorical variables generally) can also be passed to a simple | |
> linear regression equation, with another slight adjustment in interpretation. | |
> | |
> Our next two sessions will move from these fundamentals about regression to | |
> multiple linear regression, and then to logistic models for binary dependent | |
> variables. Make sure that you understand the logic of ordinary least squares | |
> (OLS) in order to include simple linear regression models in your next draft. | |
> | |
> Last updated 2013-05-28. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Replicate last week and clear graphs. The data left in memory is a modified | |
. * version of the Quality of Government dataset, with all necessary recodes and | |
. * renames already performed. It is very common to use different do-files for | |
. * different tasks. In this example, the previous do-file is used for data | |
. * management and the current do-file is used for analysis. | |
. do code/week7.do | |
. | |
. * Check setup. | |
. run setup/require mkcorr renvars | |
. | |
. * Log results. | |
. cap log using code/week7.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 7 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Fertility and Education, Part 1 | |
> | |
> - DATA: Quality of Government (2013) | |
> | |
> This do-file is the last one that we will run on the topic of association. | |
> You are expected to submit the second draft of your work very soon: the draft | |
> paper that you will be submitting will be mostly significance tests, so make | |
> sure that you have done all the necessary readings and practice by then. | |
> | |
> Note that significance tests should not be used blindly: run them only when | |
> you observe a particular association that you want to quantify, such as a | |
> difference in means or proportions. Also remember that a significance test | |
> is not a means to test a substantive hypothesis. | |
> | |
> At that stage, it will become indispensable that you have caught up with the | |
> textbook readings, and that you understand enough about Stata syntax to focus | |
> on interpreting rather than coding. Use the course material to bring yourself | |
> up to speed with both Stata and essential statistical theory. | |
> | |
> Last updated 2013-05-28. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load QOG dataset. | |
. use data/qog2013, clear | |
(Quality of Government 2013) | |
. | |
. * Rename variables to short handles. | |
. renvars wdi_fr bl_asy25mf undp_hdi ti_cpi gid_wip \ births schooling hdi corruptio | |
> n femparl | |
. | |
. * Compute GDP per capita. | |
. gen gdpc = unna_gdp / unna_pop | |
(2 missing values generated) | |
. la var gdpc "Real GDP per capita (constant USD)" | |
. | |
. * Recode to fewer, shorter labels. | |
. recode ht_region (6/10 = 6), gen(region) | |
(44 differences between ht_region and region) | |
. la var region "Geographical region" | |
. la val region region | |
. la def region 1 "E. Europe and PSU" 2 "Lat. America" /// | |
> 3 "N. Africa and M. East" 4 "Sub-Sah. Africa" /// | |
> 5 "W. Europe and N. America" 6 "Asia, Pacific and Carribean" /// | |
> , replace | |
. | |
. | |
. * Finalized sample | |
. * ---------------- | |
. | |
. * Have a quick look. | |
. codebook births schooling gdpc hdi corruption femparl region, c | |
Variable Obs Unique Mean Min Max Label | |
------------------------------------------------------------------------------------ | |
births 187 179 2.900285 1.149 7.115 Fertility Rate (Births per ... | |
schooling 143 143 7.813079 1.202597 13.27008 Average Schooling Years, Fe... | |
gdpc 191 191 10927.42 137.082 129959.4 Real GDP per capita (consta... | |
hdi 185 162 .6554973 .277 .941 Human Development Index | |
corruption 181 68 3.982868 1.009626 9.4 Corruption Perceptions Index | |
femparl 116 93 16.33103 0 56.3 Women in Parliament (%) | |
region 193 6 3.911917 1 6 Geographical region | |
------------------------------------------------------------------------------------ | |
. | |
. * Check missing values. | |
. misstable pat births schooling gdpc hdi corruption femparl region ccodewb, freq | |
Missing-value patterns | |
(1 means complete) | |
| Pattern | |
Frequency | 1 2 3 4 5 6 7 | |
------------+------------------------ | |
91 | 1 1 1 1 1 1 1 | |
| | |
48 | 1 1 1 1 1 1 0 | |
21 | 1 1 1 1 1 0 1 | |
15 | 1 1 1 1 1 0 0 | |
5 | 1 1 1 1 0 0 0 | |
4 | 1 1 0 0 0 0 0 | |
2 | 1 1 1 0 1 0 1 | |
1 | 0 1 1 1 1 0 0 | |
1 | 0 1 1 1 1 1 0 | |
1 | 1 0 0 0 1 1 0 | |
1 | 1 0 1 1 1 1 1 | |
1 | 1 1 0 1 0 0 0 | |
1 | 1 1 1 0 0 0 0 | |
1 | 1 1 1 1 0 1 1 | |
------------+------------------------ | |
193 | | |
Variables are (1) ccodewb (2) gdpc (3) births (4) hdi (5) corruption | |
(6) schooling (7) femparl | |
. | |
. * You would usually delete incomplete observations at this stage, and then count | |
. * the number of observations in your finalized sample. We exceptionally keep the | |
. * missing values here to illustrate how pairwise and listwise correlations work. | |
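. | |
. * A minimal sketch of that usual step, which this log deliberately skips. The | |
. * variable list is copied from the -codebook- call above, and -preserve- and | |
. * -restore- are added so that the data used in the rest of the log is unchanged. | |

```stata
* Not run in this log: drop rows with a missing value on any analysis variable.
preserve
drop if missing(births, schooling, gdpc, hdi, corruption, femparl)
* Count the observations left in the finalized (listwise-complete) sample.
count
* Bring back the full data, missing values included.
restore
```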
. | |
. | |
. | |
. * =============== | |
. * = CORRELATION = | |
. * =============== | |
. | |
. | |
. * (1) Fertility rates and schooling years | |
. * --------------------------------------- | |
. | |
. scatter births schooling, /// | |
> name(fert_edu, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. pwcorr births schooling, obs sig | |
| births school~g | |
-------------+------------------ | |
births | 1.0000 | |
| | |
| 187 | |
| | |
schooling | -0.7394 1.0000 | |
| 0.0000 | |
| 142 143 | |
| | |
. | |
. | |
. * (2) Schooling years and (log) Gross Domestic Product | |
. * ---------------------------------------------------- | |
. | |
. sc gdpc schooling, /// | |
> name(gdpc_edu, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * A first look at the scatterplot shows no clear linear pattern, but we know | |
. * from a previous session that the logarithmic variable transformation can be | |
. * used to visualize exponential relationships differently. Consequently, we | |
. * try to visualize the same variables with a logarithmic scale for GDP per capita. | |
. sc gdpc schooling, ysc(log) /// | |
> name(gdpc_edu, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * In this classical case, log units are more informative than metric ones to | |
. * identify the relationship between the dependent and independent variables. | |
. gen log_gdpc = ln(gdpc) | |
(2 missing values generated) | |
. la var log_gdpc "Real GDP per capita (log)" | |
. | |
. * Verify the transformation. | |
. sc log_gdpc schooling, /// | |
> name(gdpc_edu, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Obtain summary statistics. | |
. su log_gdpc schooling | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
log_gdpc | 191 8.159699 1.592255 4.920579 11.77498 | |
schooling | 143 7.813079 2.904687 1.202597 13.27008 | |
. | |
. * Visual inspection of the relationship within quadrants. The reference lines | |
. * are set by hand at 7.5 and 6, below the means reported above (8.16 and 7.81). | |
. sc log_gdpc schooling, yline(7.5) xline(6) /// | |
> name(log_gdpc_schooling, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Verify inspection computationally. | |
. pwcorr gdpc log_gdpc schooling, obs sig | |
| gdpc log_gdpc school~g | |
-------------+--------------------------- | |
gdpc | 1.0000 | |
| | |
| 191 | |
| | |
log_gdpc | 0.7657 1.0000 | |
| 0.0000 | |
| 191 191 | |
| | |
schooling | 0.5537 0.7732 1.0000 | |
| 0.0000 0.0000 | |
| 141 141 143 | |
| | |
. | |
. | |
. * (3) Corruption and human development | |
. * ------------------------------------ | |
. | |
. * Before graphing the variables, we need to pass a few graph options, because | |
. * the Corruption Perceptions Index is reverse-coded (0 marks high corruption, | |
. * and 10 marks very low corruption). To enhance visual interpretation, we | |
. * therefore use an inverted axis scale, and add horizontal axis labels to it. | |
. sc corruption hdi, ysc(rev) /// | |
> xla(0 "Low" 1 "High") yla(0 "Highly corrupt" 10 "Lowly corrupt", angle(h)) | |
> /// | |
> name(corruption_hdi, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * The pattern that appears graphically is not linear: corruption is roughly | |
. * constant for low to medium values of HDI, and then drops rapidly at high | |
. * values. Given its shape, this relationship is more likely to be curvilinear, | |
. * e.g. of the form y = x^n where y is corruption, x is HDI and n > 1 is a power | |
. * (n = 2 for a quadratic). Even if the correlation coefficient is statistically | |
. * significant, treating the relationship between corruption and HDI as | |
. * approximately linear will lose some of the information observed visually. | |
. pwcorr corruption hdi, obs sig | |
| corrup~n hdi | |
-------------+------------------ | |
corruption | 1.0000 | |
| | |
| 181 | |
| | |
hdi | 0.7244 1.0000 | |
| 0.0000 | |
| 178 185 | |
| | |
. | |
. | |
. * (4) Women in parliament and corruption | |
. * -------------------------------------- | |
. | |
. * Obtain summary statistics. | |
. su femparl corruption | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
femparl | 116 16.33103 10.11587 0 56.3 | |
corruption | 181 3.982868 2.089537 1.009626 9.4 | |
. | |
. * Visual inspection of the relationship within the mean-mean quadrants. | |
. sc femparl corruption, yline(15) xline(4) /// | |
> name(femparl_corruption, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * No clear pattern emerges from the scatterplot above. Never force a pattern | |
. * onto the data: relationships should be apparent, not constructed. If there is | |
. * no straightforward relationship, disregard it. Likewise, never include a | |
. * graph in your work if the relationship that it intends to show will not | |
. * strike the reader between the eyes (i.e. run an intra-ocular trauma test). | |
. * Inconclusive visual inspection can sometimes come with significant | |
. * correlations, but that is not the case here: the coefficient computed in the | |
. * correlation matrix further below (r = 0.03) is not significant at p < .05. | |
. | |
. | |
. * ================ | |
. * = SCATTERPLOTS = | |
. * ================ | |
. | |
. | |
. * Scatterplot matrices | |
. * -------------------- | |
. | |
. * Start with visual inspection of the data organized as a scatterplot matrix. | |
. * A scatterplot matrix contains all possible bivariate relationships between | |
. * any number of variables. Building a matrix of your DV and IVs allows you to | |
. * spot relationships between IVs, which will be useful later on in your analysis. | |
. * Note that the example below uses the logged measure of GDP per capita. | |
. gr mat births schooling log_gdpc corruption femparl, /// | |
> name(gr_matrix, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * You could also look at a sparser version of the matrix that shows only half of | |
. * all plots for a subset of geographical regions. | |
. gr mat births schooling log_gdpc corruption femparl if inlist(region, 4, 5), half | |
> /// | |
> name(gr_matrix_regions4_5, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * The most practical way to consider all possible correlations in a list of | |
. * predictors (or independent variables) is to build a correlation matrix out | |
. * of their respective pairwise correlations. "Pairwise" indicates that each | |
. * coefficient uses all observations that are nonmissing for that pair of | |
. * variables, whereas listwise deletion (as in -correlate-) disregards all | |
. * observations where any of the variables in the list is missing. | |
. pwcorr births schooling log_gdpc corruption femparl | |
| births school~g log_gdpc corrup~n femparl | |
-------------+--------------------------------------------- | |
births | 1.0000 | |
schooling | -0.7394 1.0000 | |
log_gdpc | -0.7001 0.7732 1.0000 | |
corruption | -0.5175 0.6494 0.8033 1.0000 | |
femparl | 0.0066 -0.0436 -0.0674 0.0315 1.0000 | |
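. | |
. * To see what listwise deletion changes, a sketch along these lines could be | |
. * run next. It is not part of the original log, so no output is shown. | |

```stata
* Listwise deletion: -correlate- only uses rows complete on all five variables.
corr births schooling log_gdpc corruption femparl
* Pairwise deletion again, printing the number of observations behind each pair.
pwcorr births schooling log_gdpc corruption femparl, obs
```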
. | |
. * The most common way to indicate statistically significant correlations in | |
. * a correlation matrix is to use asterisks (stars) to mark them when their | |
. * p-value is below the level of statistical significance. | |
. pwcorr births schooling log_gdpc corruption femparl, star(.05) | |
| births school~g log_gdpc corrup~n femparl | |
-------------+--------------------------------------------- | |
births | 1.0000 | |
schooling | -0.7394* 1.0000 | |
log_gdpc | -0.7001* 0.7732* 1.0000 | |
corruption | -0.5175* 0.6494* 0.8033* 1.0000 | |
femparl | 0.0066 -0.0436 -0.0674 0.0315 1.0000 | |
. | |
. * For exploratory purposes, another option can be used to print out only the | |
. * statistically significant correlations, which comes in handy especially in | |
. * very large matrices with mostly insignificant correlation coefficients. | |
. pwcorr births schooling log_gdpc corruption femparl, print(.05) | |
| births school~g log_gdpc corrup~n femparl | |
-------------+--------------------------------------------- | |
births | 1.0000 | |
schooling | -0.7394 1.0000 | |
log_gdpc | -0.7001 0.7732 1.0000 | |
corruption | -0.5175 0.6494 0.8033 1.0000 | |
femparl | 1.0000 | |
. | |
. * Export a correlation matrix. | |
. mkcorr births schooling gdpc corruption femparl, /// | |
> lab num sig log("week7_correlations.txt") replace | |
. | |
. | |
. * Scatterplots with marker labels | |
. * ------------------------------- | |
. | |
. * Stata requires passing a lot of options to produce informative graphs. If you | |
. * are using a set of consistent options on several graphs, you can store these | |
. * in a global macro and apply them by calling the macro with a dollar sign ($). | |
. * The following global macro is a list of graph options to make scatterplots | |
. * more informative by showing country codes instead of anonymous data points: | |
. global ccode "ms(i) mlabpos(0) mlab(ccodewb) legend(off)" | |
. | |
. * The options contained in the global macro make the marker symbol invisible, | |
. * then center the marker label and fill it with the ccodewb variable (holding | |
. * country codes from the World Bank) in replacement of the usual dot markers. | |
. * In the following plots, passing the $ccode option will result in actually | |
. * passing these graph options, stored in the ccode ("country codes") macro. | |
. * Note that this is a hack, and that you would not normally fiddle with global | |
. * macros if you were programming Stata at a more advanced level: you would use | |
. * local macros, which are more complex in usage and therefore avoided here. | |
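. | |
. * For comparison, a hedged sketch of the local-macro equivalent (not run here). | |
. * A local macro only exists while the do-file or selection that defines it is | |
. * running, so both commands must be executed together. The graph name | |
. * fert_edu1_local is made up for the example. | |

```stata
* Store the same graph options in a local macro instead of a global one.
local ccode "ms(i) mlabpos(0) mlab(ccodewb) legend(off)"
* Locals are expanded with a left single quote and an apostrophe, not a dollar sign.
sc births schooling, `ccode' name(fert_edu1_local, replace)
```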
. | |
. * Improve previous example. | |
. sc births schooling, $ccode /// | |
> name(fert_edu1, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Add a color difference to Western states by overlaying multiple scatterplots. | |
. sc births schooling, $ccode || /// | |
> sc births schooling if region == 5, $ccode /// | |
> name(fert_edu2, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Add a tone and color difference to Sub-Saharan African states (more options!). | |
. sc births schooling, mlabc(gs10) $ccode || /// | |
> sc births schooling if region == 4, $ccode /// | |
> name(fert_edu3, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * There are binders full of Stata graph options like these. Have a look at the | |
. * help pages for two-way graphs (h tw) for a list that applies to scatterplots. | |
. | |
. | |
. * Scatterplots with histograms | |
. * ---------------------------- | |
. | |
. * Or, how to combine graphs with insane axis options. | |
. sc births schooling, /// | |
> yti("") xti("") ysc(alt) yla(none, angle(v)) xsc(alt) xla(none, grid gmax) | |
> /// | |
> name(plot2, replace) plotregion(style(none)) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Plot 1 is top-left. | |
. tw hist births, /// | |
> xsc(alt rev) xla(none) xti("") horiz fxsize(25) /// | |
> name(plot1, replace) plotregion(style(none)) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Plot 3 is bottom-right. | |
. tw hist schooling, /// | |
> ysc(alt rev) yla(none, nogrid angle(v)) yti("") xla(,grid gmax) fysize(25) | |
> /// | |
> name(plot3, replace) plotregion(style(none)) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Combined plots with square ratio (y-size = x-size). | |
. gr combine plot1 plot2 plot3, /// | |
> imargin(0 0 0 0) hole(3) ysiz(5) xsiz(5) /// | |
> name(fert_edu4, replace) | |
(note: scheme burd not found, using s2color) | |
(note: scheme burd not found, using s2color) | |
(note: scheme burd not found, using s2color) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Cleanup, focus on result. | |
. gr drop plot1 plot2 plot3 | |
. gr di fert_edu4 | |
. | |
. | |
. * Scatterplots with smoothed lines | |
. * -------------------------------- | |
. | |
. * Another way to visualize the quality of a linear fit is to plot a smoothed fit | |
. * with the -lowess- command, to show departures from linearity in the IV effect: | |
. lowess births schooling, /// | |
> name(fert_edu_lowess, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * The LOWESS smoother available with -lowess- in Stata can operate as a moving | |
. * average (running mean) or as a least squares estimator, which is the default. | |
. * The core mechanics of a least squares estimator are on next week's menu. | |
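. | |
. * A hedged sketch of the running-mean variant mentioned above (not run here). | |
. * The graph names and the bandwidth of .4 are arbitrary choices for the example. | |

```stata
* Running-mean smoother instead of the default running-line least squares.
lowess births schooling, mean ///
    name(fert_edu_lowess_mean, replace)
* A narrower bandwidth (the default is .8) follows local variation more closely.
lowess births schooling, bwidth(.4) ///
    name(fert_edu_lowess_bw, replace)
```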
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file | |
. gr drop _all | |
. | |
. * Graph macro. If you remember what we did last week, we used a macro to label | |
. * the data points with country codes instead of using anonymous dots. Since we | |
. * have executed last week's do-file in the background, this macro is available | |
. * in memory, so we will be able to use '$ccode' to produce better scatterplots | |
. * in this do-file too. We will also be able to use the following macro, which | |
. * will remove the legend and dash the regression line of our linear fits. | |
. global ci "legend(off) lp(dash)" | |
. | |
. | |
. * ===================== | |
. * = REGRESSION MODELS = | |
. * ===================== | |
. | |
. | |
. * (1) Fertility Rates and Schooling Years | |
. * --------------------------------------- | |
. | |
. * We are looking again at the relationship between fertility and education that | |
. * we already observed in our previous do-file. At this stage, we assume that you | |
. * have a substantive model to explain the relationship that you are studying; | |
. * otherwise, the results of the model will serve no analytical purpose. | |
. | |
. * Visual fit. | |
. sc births schooling, $ccode /// | |
> legend(off) yti("Fertility rate (births per woman)") /// | |
> name(fert_edu1, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Linear fit. | |
. tw (sc births schooling, $ccode) (lfit births schooling, $ci), /// | |
> yti("Fertility rate (births per woman)") /// | |
> name(fert_edu2, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Add 95% CI. | |
. tw (sc births schooling, $ccode) (lfitci births schooling, $ci), /// | |
> yti("Fertility rate (births per woman)") /// | |
> name(fert_edu3, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Estimate the predicted effect of the education level on the fertility rate. | |
. * Function: number of births = _cons (alpha) + Coef (beta) * schooling years. | |
. * Equation: Y (DV) = alpha + beta X (IV) + epsilon (error term), so that the | |
. * predicted Y is alpha + beta X. | |
. reg births schooling | |
Source | SS df MS Number of obs = 142 | |
-------------+------------------------------ F( 1, 140) = 168.81 | |
Model | 148.247435 1 148.247435 Prob > F = 0.0000 | |
Residual | 122.945776 140 .878184113 R-squared = 0.5466 | |
-------------+------------------------------ Adj R-squared = 0.5434 | |
Total | 271.193211 141 1.9233561 Root MSE = .93711 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.3533024 .0271923 -12.99 0.000 -.4070631 -.2995418 | |
_cons | 5.528379 .2259655 24.47 0.000 5.081633 5.975125 | |
------------------------------------------------------------------------------ | |
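. | |
. * As a worked example of the function above, the fitted line can be evaluated | |
. * by hand. With the coefficients printed above, 10 schooling years (an arbitrary | |
. * value) give 5.528 - 0.353 * 10, or about 2.0 births per woman. A sketch, not | |
. * run in this log: | |

```stata
* Predicted fertility rate at 10 schooling years, from the stored coefficients.
display _b[_cons] + _b[schooling] * 10
* Same linear combination, with a standard error and confidence interval.
lincom _cons + 10 * schooling
```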
. | |
. | |
. * Plotting regression results | |
. * --------------------------- | |
. | |
. * Simple residuals-versus-fitted plot. | |
. rvfplot, yline(0) /// | |
> name(rvfplot, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Get fitted values. | |
. cap drop yhat | |
. predict yhat | |
(option xb assumed; fitted values) | |
(50 missing values generated) | |
. | |
. * Get residuals. | |
. cap drop r | |
. predict r, resid | |
(51 missing values generated) | |
. | |
. * Plot residuals against fitted values of the DV. | |
. sc r yhat, yline(0) $ccode /// | |
> name(rvfplot2, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Plot observed and predicted values of the DV against the IV. | |
. sc births schooling || conn yhat schooling, /// | |
> name(dv_yhat, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. | |
. * Small multiples | |
. * --------------- | |
. | |
. * Draw scatterplots and linear fits for each region. Visualizing small multiples | |
. * requires a grouping variable with a limited number of categories and | |
. * might reveal additional strengths or weaknesses of your model. | |
. sc births schooling || lfit births schooling, by(region) /// | |
> name(lfit_region, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Run the linear regression models for each region. Observe how the standard | |
. * errors and p-values of the regression coefficients increase when the regional | |
. * sample size falls to lower numbers of observations. | |
. bys region: reg births schooling | |
------------------------------------------------------------------------------------ | |
-> region = E. Europe and PSU | |
Source | SS df MS Number of obs = 20 | |
-------------+------------------------------ F( 1, 18) = 2.45 | |
Model | .711741135 1 .711741135 Prob > F = 0.1349 | |
Residual | 5.22767201 18 .290426223 R-squared = 0.1198 | |
-------------+------------------------------ Adj R-squared = 0.0709 | |
Total | 5.93941314 19 .312600692 Root MSE = .53891 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.1997857 .1276207 -1.57 0.135 -.4679069 .0683355 | |
_cons | 3.833607 1.363468 2.81 0.012 .9690679 6.698146 | |
------------------------------------------------------------------------------ | |
------------------------------------------------------------------------------------ | |
-> region = Lat. America | |
Source | SS df MS Number of obs = 20 | |
-------------+------------------------------ F( 1, 18) = 14.09 | |
Model | 3.27708082 1 3.27708082 Prob > F = 0.0015 | |
Residual | 4.1868556 18 .232603089 R-squared = 0.4391 | |
-------------+------------------------------ Adj R-squared = 0.4079 | |
Total | 7.46393642 19 .392838759 Root MSE = .48229 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.2559176 .0681812 -3.75 0.001 -.3991609 -.1126743 | |
_cons | 4.50041 .5331816 8.44 0.000 3.380237 5.620583 | |
------------------------------------------------------------------------------ | |
------------------------------------------------------------------------------------ | |
-> region = N. Africa and M. East | |
Source | SS df MS Number of obs = 18 | |
-------------+------------------------------ F( 1, 16) = 3.87 | |
Model | 3.34242722 1 3.34242722 Prob > F = 0.0668 | |
Residual | 13.8214319 16 .863839496 R-squared = 0.1947 | |
-------------+------------------------------ Adj R-squared = 0.1444 | |
Total | 17.1638592 17 1.00963877 Root MSE = .92943 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.2038483 .1036317 -1.97 0.067 -.4235377 .0158411 | |
_cons | 4.182561 .7720988 5.42 0.000 2.545785 5.819337 | |
------------------------------------------------------------------------------ | |
------------------------------------------------------------------------------------ | |
-> region = Sub-Sah. Africa | |
Source | SS df MS Number of obs = 32 | |
-------------+------------------------------ F( 1, 30) = 29.16 | |
Model | 23.3848885 1 23.3848885 Prob > F = 0.0000 | |
Residual | 24.0580149 30 .80193383 R-squared = 0.4929 | |
-------------+------------------------------ Adj R-squared = 0.4760 | |
Total | 47.4429034 31 1.53041624 Root MSE = .89551 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.4246033 .0786294 -5.40 0.000 -.5851859 -.2640206 | |
_cons | 6.665125 .4098737 16.26 0.000 5.828051 7.502198 | |
------------------------------------------------------------------------------ | |
------------------------------------------------------------------------------------ | |
-> region = W. Europe and N. America | |
Source | SS df MS Number of obs = 23 | |
-------------+------------------------------ F( 1, 21) = 6.13 | |
Model | .395655645 1 .395655645 Prob > F = 0.0219 | |
Residual | 1.35538605 21 .064542193 R-squared = 0.2260 | |
-------------+------------------------------ Adj R-squared = 0.1891 | |
Total | 1.75104169 22 .079592804 Root MSE = .25405 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | .1026995 .0414793 2.48 0.022 .0164386 .1889605 | |
_cons | .6359472 .4511737 1.41 0.173 -.30232 1.574214 | |
------------------------------------------------------------------------------ | |
------------------------------------------------------------------------------------ | |
-> region = Asia, Pacific and Carribean | |
Source | SS df MS Number of obs = 29 | |
-------------+------------------------------ F( 1, 27) = 5.72 | |
Model | 5.47598716 1 5.47598716 Prob > F = 0.0240 | |
Residual | 25.8508772 27 .957439897 R-squared = 0.1748 | |
-------------+------------------------------ Adj R-squared = 0.1442 | |
Total | 31.3268644 28 1.11881658 Root MSE = .97849 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.1658877 .0693647 -2.39 0.024 -.3082123 -.023563 | |
_cons | 3.682463 .5326461 6.91 0.000 2.589564 4.775363 | |
------------------------------------------------------------------------------ | |
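. | |
. * To compare the regional slopes side by side, the coefficients can be collected | |
. * in one small table. A hedged sketch (not run here), which uses -preserve- and | |
. * -restore- because -statsby- replaces the data in memory: | |

```stata
* Collect the regional coefficients and standard errors in one small dataset.
preserve
statsby _b _se, by(region) clear: reg births schooling
* One row per region: slope of schooling and its standard error.
list region _b_schooling _se_schooling, noobs
* Bring back the country-level data.
restore
```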
. | |
. * Detailed residuals-versus-fitted plots. | |
. sc r yhat, yline(0) by(region, total) $ccode /// | |
> name(rvfplot2, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. | |
. * Fitting a transformed IV | |
. * ------------------------ | |
. | |
. * The -qfit- plot shows that a more advanced model might better explain the | |
. * DV-IV relationship, as it looks less linear than quadratic: Y = a + bX could | |
. * be replaced with Y = a + bX + cX^2 to observe a more correct fit. | |
. tw (sc births schooling, $ccode) (qfit births schooling, $ci), /// | |
> name(fert_edu_qfit, replace) | |
(note: scheme burd not found, using s2color) | |
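. | |
. * The curve drawn by -qfit- can also be estimated. A hedged sketch using | |
. * factor-variable notation, available from Stata 11 onwards (not run here): | |

```stata
* Quadratic model: births = a + b * schooling + c * schooling^2.
reg births c.schooling##c.schooling
* Test whether the squared term adds anything to the linear fit.
test c.schooling#c.schooling
```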
. | |
. * In this case, using the square root of the independent variable might provide | |
. * better estimates of its actual effect on the dependent variable. We could have | |
. * diagnosed that earlier by looking at the normality of the schooling variable, | |
. * for which a square root transformation is recommended by the ladder commands. | |
. | |
. * Variable transformation. | |
. gen sqrt_schooling = sqrt(schooling) | |
(50 missing values generated) | |
. la var sqrt_schooling "Average schooling years (sqrt)" | |
. | |
. * Visual inspection. | |
. tw (sc births sqrt_schooling, $ccode) (lfit births sqrt_schooling, $ci), /// | |
> name(fert_edu_qfit, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Regression model of the form Y = alpha + beta sqrt(X). | |
. reg births sqrt_schooling | |
Source | SS df MS Number of obs = 142 | |
-------------+------------------------------ F( 1, 140) = 190.62 | |
Model | 156.35743 1 156.35743 Prob > F = 0.0000 | |
Residual | 114.835781 140 .820255576 R-squared = 0.5766 | |
-------------+------------------------------ Adj R-squared = 0.5735 | |
Total | 271.193211 141 1.9233561 Root MSE = .90568 | |
-------------------------------------------------------------------------------- | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
---------------+---------------------------------------------------------------- | |
sqrt_schooling | -1.857605 .1345453 -13.81 0.000 -2.123608 -1.591601 | |
_cons | 7.85353 .375534 20.91 0.000 7.111079 8.595981 | |
-------------------------------------------------------------------------------- | |
. | |
. * Reading the regression coefficient for schooling is less intuitive when it is | |
. * computed on the square root of the variable: it requires a short equation to | |
. * produce real-world examples of what the model means. However, more variance | |
. * in the data is explained when the model is written in this more complex form. | |
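. | |
. * The short equation amounts to evaluating the model at chosen values of | |
. * schooling. With the coefficients printed above, 4 and 9 schooling years | |
. * (arbitrary examples) give about 4.1 and 2.3 births per woman. A sketch, not | |
. * run in this log: | |

```stata
* Predicted fertility at 4 and at 9 schooling years (square roots 2 and 3).
display _b[_cons] + _b[sqrt_schooling] * sqrt(4)
display _b[_cons] + _b[sqrt_schooling] * sqrt(9)
* The gap between the two equals the coefficient, since sqrt(9) - sqrt(4) = 1.
display _b[sqrt_schooling] * (sqrt(9) - sqrt(4))
```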
. | |
. * Visualization with solved square root units. | |
. tw (sc births sqrt_schooling, $ccode) (lfit births sqrt_schooling, $ci), /// | |
> xla(1 "1" 1.5 "2.25" 2 "4" 2.5 "6.25" 3 "9" 3.5 "12.25") /// | |
> xti("Average schooling years") note("Horizontal axis in squared units.") /// | |
> name(fert_edu_sqrt, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. | |
. * (2) Fertility Rates and (Log) Gross Domestic Product | |
. * ---------------------------------------------------- | |
. | |
. * As always, start with a visual inspection of the relationship. | |
. tw (sc births log_gdpc, $ccode) (lfit births log_gdpc, $ci), /// | |
> name(fert_gdpc, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * The interpretation of the coefficient for GDP per capita is going to be less | |
. * intuitive due to its logarithmic units, but the transformation was necessary | |
. * to identify the linear relationship between the two variables. | |
. | |
. * Regression model of the form Y = alpha + beta ln(X). | |
. reg births log_gdpc | |
Source | SS df MS Number of obs = 186 | |
-------------+------------------------------ F( 1, 184) = 176.93 | |
Model | 186.917353 1 186.917353 Prob > F = 0.0000 | |
Residual | 194.390066 184 1.05646775 R-squared = 0.4902 | |
-------------+------------------------------ Adj R-squared = 0.4874 | |
Total | 381.307418 185 2.06112118 Root MSE = 1.0278 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
log_gdpc | -.6375227 .0479291 -13.30 0.000 -.7320839 -.5429615 | |
_cons | 8.065113 .396594 20.34 0.000 7.282656 8.847569 | |
------------------------------------------------------------------------------ | |
. | |
. | |
. * Fitting 'lin-log' equations | |
. * --------------------------- | |
. | |
. * The relationship is a 'lin-log' equation, such that a 1% increase in X (IV) is | |
. * associated with a 0.01 * beta unit change in Y (DV). In this model, it means | |
. * that a 15% increase in GDP per cap. is associated with -.64 * ln(1.15) = -.09 | |
. * births per woman. For GDP per capita to reduce fertility by 1 birth per woman, | |
. * this model would require exp(1/.6375) = 4.8, a 380% increase in GDP per capita. | |
. * This is easy to observe from the reverse equation: -.6375 * ln(4.8) = -1. | |
. | |
. * Why is that number so high? Recall how linear regression works: by computing | |
. * the average marginal change that occurs in the DV (the coefficient) for each | |
. * unit of the IV. This is the average marginal effect, computed over the whole | |
. * sample. If GDP per capita expresses decreasing returns on fertility, then the | |
. * average effect is bound to be higher than what is actually required at lower | |
. * levels of GDP per capita. What an econometrician would do in that case is to | |
. * compute semi-elasticities (because the model is semi-logarithmic), but if you | |
. * only need to quantify the average relationship, converting by hand is enough. | |
. | |
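. * A minimal sketch of the hand conversion discussed above, using the log_gdpc
. * coefficient copied from the regression table of this section. The scalar
. * names (b_gdpc, change15, factor) are illustrative, not part of the course code.

```stata
* Coefficient of log_gdpc, copied from the regression table above.
scalar b_gdpc = -.6375227

* Change in births per woman associated with a 15% increase in GDP per capita.
scalar change15 = b_gdpc * ln(1.15)
di "15% more GDP/capita: " %6.3f change15 " births per woman"

* Factor by which GDP per capita is multiplied for one birth fewer per woman.
scalar factor = exp(-1 / b_gdpc)
di "One birth fewer: GDP/capita multiplied by " %4.2f factor
```

. * The factor converts to a percentage increase as 100 * (factor - 1).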
. | |
. * (3) Corruption and Human Development | |
. * ------------------------------------ | |
. | |
. * Visualizing a nonlinear, quadratic fit with corruption as the DV. | |
. tw (sc corruption hdi, $ccode) (qfit corruption hdi, $ci), /// | |
> ysc(rev) yla(0 "High" 10 "Low") yti("Level of corruption") /// | |
> name(cpi_hdi, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Before interpreting the model, deal with the reverse-coding issue. | |
. gen corrupt = 10 - corruption | |
(12 missing values generated) | |
. la var corrupt "Corruption Perception Index" | |
. | |
. * Regression model in first approximation (linear form). | |
. reg corrupt hdi | |
Source | SS df MS Number of obs = 178 | |
-------------+------------------------------ F( 1, 176) = 194.29 | |
Model | 401.935273 1 401.935273 Prob > F = 0.0000 | |
Residual | 364.107271 176 2.06879131 R-squared = 0.5247 | |
-------------+------------------------------ Adj R-squared = 0.5220 | |
Total | 766.042544 177 4.32792398 Root MSE = 1.4383 | |
------------------------------------------------------------------------------ | |
corrupt | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
hdi | -8.630701 .6191935 -13.94 0.000 -9.852701 -7.408702 | |
_cons | 11.61443 .4174372 27.82 0.000 10.7906 12.43825 | |
------------------------------------------------------------------------------ | |
. | |
. | |
. * Fitting a quadratic term | |
. * ------------------------ | |
. | |
. * A more thorough exploration of residuals will be covered in later sessions | |
. * on regression diagnostics, but here is a snapshot of what we can do and | |
. * understand by studying residuals in a bit more depth. | |
. cap drop yhat | |
. predict yhat | |
(option xb assumed; fitted values) | |
(8 missing values generated) | |
. | |
. * Plot of linear fitted values. | |
. sc corrupt yhat hdi, yla(0 "Lowly corrupt" 10 "Highly corrupt") /// | |
> connect(i l) sort(yhat) /// | |
> name(r_linear, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * The curvilinearity approaches the function f: y = x^2 and can be taken care | |
. * of by squaring HDI and fitting the model again with the quadratic term. The | |
. * final model therefore takes the form Y = alpha + beta_1 X + beta_2 X^2. | |
. gen hdi2 = hdi^2 | |
(8 missing values generated) | |
. | |
. * Regression model in second approximation (added quadratic term). | |
. reg corrupt hdi hdi2 | |
Source | SS df MS Number of obs = 178 | |
-------------+------------------------------ F( 2, 175) = 204.26 | |
Model | 536.306972 2 268.153486 Prob > F = 0.0000 | |
Residual | 229.735572 175 1.3127747 R-squared = 0.7001 | |
-------------+------------------------------ Adj R-squared = 0.6967 | |
Total | 766.042544 177 4.32792398 Root MSE = 1.1458 | |
------------------------------------------------------------------------------ | |
corrupt | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
hdi | 27.96508 3.650673 7.66 0.000 20.76007 35.1701 | |
hdi2 | -29.54748 2.92053 -10.12 0.000 -35.31148 -23.78349 | |
_cons | 1.209076 1.080905 1.12 0.265 -.9242118 3.342363 | |
------------------------------------------------------------------------------ | |
. | |
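. * Because the coefficient of hdi2 is negative, the fitted parabola opens downward
. * and reaches its maximum at X = -beta_1 / (2 * beta_2). The sketch below works
. * this out from the coefficients printed above; the scalar names are illustrative.

```stata
* Coefficients copied from the quadratic regression table above.
scalar b_hdi  = 27.96508
scalar b_hdi2 = -29.54748

* Turning point of Y = alpha + beta_1 X + beta_2 X^2.
scalar peak = -b_hdi / (2 * b_hdi2)
di "Fitted corruption peaks at HDI = " %4.2f peak
```

. * Below that HDI value the fitted curve rises, and above it the curve falls.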
. * Fitted values of the quadratic model. | |
. cap drop yhat2 | |
. predict yhat2 | |
(option xb assumed; fitted values) | |
(8 missing values generated) | |
. | |
. * Comparison of both fits. | |
. sc corrupt yhat2 hdi, yla(0 "Lowly corrupt" 10 "Highly corrupt") /// | |
> c(i l) sort(yhat) || sc yhat hdi, c(l) legend(order(2 3) /// | |
> lab(2 "Quadratic fit") lab(3 "Linear fit")) /// | |
> name(r_curvilinear, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. | |
. * (4) Fertility and Democracy | |
. * --------------------------- | |
. | |
. * Create dummy. | |
. gen democracy:democracy = (chga_hinst < 3) if !mi(chga_hinst) | |
(1 missing value generated) | |
. la def democracy 0 "Dictatorship" 1 "Democracy", replace | |
. | |
. * Visualization of the difference in mean of the DV. | |
. gr bar births, over(democracy) asyvars over(region, lab(alt)) /// | |
> name(fert_democ, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. | |
. * Fitting a dummy predictor | |
. * ------------------------- | |
. | |
. * Visualization of the "linear" fit using the dummy. | |
. sc births democracy || lfit births democracy, $ci /// | |
> xsc(r(-.5 1.5)) xla(0 "Dictatorship" 1 "Democracy") xti("") /// | |
> name(fert_democ, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * You actually know this result in a different form: | |
. ttest births, by(democracy) | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
Dictator | 74 3.427811 .171837 1.478198 3.08534 3.770282 | |
Democrac | 113 2.554826 .1240722 1.318906 2.308992 2.800659 | |
---------+-------------------------------------------------------------------- | |
combined | 187 2.900285 .1056745 1.445077 2.69181 3.10876 | |
---------+-------------------------------------------------------------------- | |
diff | .8729851 .2069604 .4646792 1.281291 | |
------------------------------------------------------------------------------ | |
diff = mean(Dictator) - mean(Democrac) t = 4.2181 | |
Ho: diff = 0 degrees of freedom = 185 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000 | |
. | |
. * This is actually identical to the following model: | |
. reg births i.democracy | |
Source | SS df MS Number of obs = 187 | |
-------------+------------------------------ F( 1, 185) = 17.79 | |
Model | 34.0786397 1 34.0786397 Prob > F = 0.0000 | |
Residual | 354.335548 185 1.91532729 R-squared = 0.0877 | |
-------------+------------------------------ Adj R-squared = 0.0828 | |
Total | 388.414188 186 2.08824832 Root MSE = 1.384 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.democracy | -.8729851 .2069604 -4.22 0.000 -1.281291 -.4646792 | |
_cons | 3.427811 .1608813 21.31 0.000 3.110413 3.745209 | |
------------------------------------------------------------------------------ | |
. | |
. * In this model, democracy is understood as a categorical variable because we | |
. * added the "i." prefix to it. The coefficient reveals that the fertility rate | |
. * of democracies is, on average, significantly lower than in non-democracies. | |
. * There is no regression coefficient for dictatorships: since democracy is a | |
. * dummy, it takes only two values, 0 or 1. The coefficient is therefore null | |
. * when democracy equals 0. Let's look at null models (Y = alpha) for a proof. | |
. | |
. * Y = alpha + beta (democracy = 0) = alpha. | |
. reg births if !democracy | |
Source | SS df MS Number of obs = 74 | |
-------------+------------------------------ F( 0, 73) = 0.00 | |
Model | 0 0 . Prob > F = . | |
Residual | 159.51009 73 2.18506972 R-squared = 0.0000 | |
-------------+------------------------------ Adj R-squared = 0.0000 | |
Total | 159.51009 73 2.18506972 Root MSE = 1.4782 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
_cons | 3.427811 .171837 19.95 0.000 3.08534 3.770282 | |
------------------------------------------------------------------------------ | |
. | |
. * Y = alpha + beta (democracy = 1) = alpha + beta. | |
. reg births if democracy | |
Source | SS df MS Number of obs = 113 | |
-------------+------------------------------ F( 0, 112) = 0.00 | |
Model | 0 0 . Prob > F = . | |
Residual | 194.825458 112 1.73951302 R-squared = 0.0000 | |
-------------+------------------------------ Adj R-squared = 0.0000 | |
Total | 194.825458 112 1.73951302 Root MSE = 1.3189 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
_cons | 2.554826 .1240722 20.59 0.000 2.308992 2.800659 | |
------------------------------------------------------------------------------ | |
. | |
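. * A short arithmetic check of the proof above, using only figures printed in
. * this log: the constant plus the dummy coefficient should return the mean of
. * democracies, and the squared t statistic of the t-test should return the F
. * statistic of the regression. Scalar names are illustrative.

```stata
* Constant and 1.democracy coefficient from -reg births i.democracy-.
scalar alpha = 3.427811
scalar beta  = -.8729851

* alpha + beta: compare with _cons = 2.554826 in the democracies-only model.
di "Mean fertility of democracies: " %8.6f (alpha + beta)

* t from -ttest- squared: compare with F(1, 185) = 17.79 in the regression.
scalar t_squared = 4.2181^2
di "Squared t statistic: " %5.2f t_squared
```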
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require estout fre leanout mkcorr renvars | |
. | |
. * Log results. | |
. cap log using code/week9.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 9 ------------------- | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Fertility and Education, Part 3 | |
> | |
> - DATA: Quality of Government (2013) | |
> | |
> This is our final do-file with the Quality of Government example that we have | |
> been running over three sessions. It explains how to build on correlation and | |
> simple linear regression to produce complete linear regression models. | |
> | |
> The code contains details on several aspects of multiple linear regression. | |
> It also shows how to use the -estout- command to store and export the results | |
> of regression models. | |
> | |
> For your second draft, go as far as possible with multiple linear regression. | |
> Start with correlations if applicable, then go forward with simple linear | |
> regressions (add scatterplots if your predictors are continuous). | |
> | |
> Follow the instructions from the draft paper template. If you manage to go as | |
> far as diagnosing your model, discuss the diagnostics and add interaction terms | |
> if you detect issues of multicollinearity. | |
> | |
> The next sessions will provide another way to model the data for dependent | |
> variables that are (or are closer to being) categorical in nature, and will | |
> go deeper into the core mechanics of regression modelling. | |
> | |
> Last updated 2013-05-28. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load QOG dataset. | |
. use data/qog2013, clear | |
(Quality of Government 2013) | |
. | |
. * Rename variables to short handles. | |
. renvars wdi_fr bl_asy25mf wdi_hiv ciri_wosoc \ births schooling hiv womenrights | |
. | |
. * Transformation of real GDP per capita to logged units. | |
. gen log_gdpc = ln(unna_gdp / unna_pop) | |
(2 missing values generated) | |
. la var log_gdpc "Real GDP/capita (constant USD, logged)" | |
. | |
. * Dummy for the highest quartile of HIV/AIDS prevalence. | |
. su hiv, d | |
Prevalence of HIV (% of Population Aged 15-49) | |
------------------------------------------------------------- | |
Percentiles Smallest | |
1% .1 .1 | |
5% .1 .1 | |
10% .1 .1 Obs 147 | |
25% .2 .1 Sum of Wgt. 147 | |
50% .4 Mean 1.922449 | |
Largest Std. Dev. 4.32927 | |
75% 1.3 17.2 | |
90% 4.8 23 Variance 18.74257 | |
95% 11.3 24.1 Skewness 3.769518 | |
99% 24.1 25.8 Kurtosis 17.76868 | |
. gen aids = (hiv > 1.5) if !mi(hiv) | |
(46 missing values generated) | |
. la var aids "Highest HIV/AIDS prevalence quartile" | |
. | |
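. * Note that the summary above reports a 75th percentile of 1.3, while the dummy
. * uses a cutoff of 1.5. If the intent is a strict top quartile, the cutoff can be
. * read from the stored results of -summarize, detail- instead of typed by hand.
. * The sketch below assumes the auto dataset shipped with Stata as a stand-in;
. * the variable name top_quartile is illustrative.

```stata
* Load an example dataset that ships with Stata.
sysuse auto, clear

* The detailed summary stores the 75th percentile in r(p75).
su price, d

* Dummy equal to 1 above the 75th percentile, missing if price is missing.
gen top_quartile = (price > r(p75)) if !mi(price)
la var top_quartile "Highest price quartile"

* Check the resulting split.
tab top_quartile
```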
. * Recode regions into fewer categories with shorter labels. | |
. recode ht_region (6/10 = 6), gen(region) | |
(44 differences between ht_region and region) | |
. la var region "Geographical region" | |
. la val region region | |
. la def region 1 "E. Europe and PSU" 2 "Lat. America" /// | |
> 3 "N. Africa and M. East" 4 "Sub-Sah. Africa" /// | |
> 5 "W. Europe and N. America" 6 "Asia, Pacific and Caribbean" /// | |
> , replace | |
. | |
. | |
. * Subsetting | |
. * ---------- | |
. | |
. * Check missing values. | |
. misstable pat births schooling log_gdpc aids, freq | |
Missing-value patterns | |
(1 means complete) | |
| Pattern | |
Frequency | 1 2 3 4 | |
------------+------------- | |
124 | 1 1 1 1 | |
| | |
23 | 1 1 0 0 | |
22 | 1 1 1 0 | |
17 | 1 1 0 1 | |
5 | 1 0 0 0 | |
1 | 0 0 0 1 | |
1 | 0 1 1 1 | |
------------+------------- | |
193 | | |
Variables are (1) log_gdpc (2) births (3) aids (4) schooling | |
. | |
. * Check sampling bias due to low availability of schooling years. | |
. gen mi = mi(schooling) | |
. gr hbar (count) schooling (count) mi, over(region, sort(2)des) stack /// | |
> legend(order(1 "N(schooling)" 2 "Missing data")) /// | |
> name(mi, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Delete incomplete observations. | |
. drop if mi(births, schooling, log_gdpc, aids, womenrights) | |
(74 observations deleted) | |
. | |
. * Final sample size. | |
. count | |
119 | |
. | |
. | |
. * Export summary statistics | |
. * ------------------------- | |
. | |
. * The next command is part of the SRQM folder. If Stata returns an error when | |
. * you run it, set the folder as your working directory and type -run profile- | |
. * to run the course setup, then try the command again. If you still experience | |
. * problems with the -stab- command, please send a detailed email on the issue. | |
. | |
. stab using week9_stats.txt, replace /// | |
> mean(births schooling log_gdpc) /// | |
> prop(aids region) | |
(note: file week9_stats.txt not found) | |
Variable mean sd min max mea | |
> n sd min max mean sd min | |
> max mean sd min max mean | |
> sd min max mean sd min m | |
> ax mean sd min max mean sd | |
> min max mean sd min max | |
> mean sd min max | |
Highest HIV/AIDS p~r % % % % | |
> % % % % % % | |
Geographical region % % % % | |
> % % % % % % | |
N = 1190 | |
File: week9_stats.txt | |
. | |
. /* Syntax of the -stab- command: | |
> | |
> - using FILE - name of the exported file; plain text (.txt) recommended | |
> - replace - overwrite any previously existing file | |
> - mean() - summarizes a list of continuous variables (mean, sd, min, max) | |
> - prop() - summarizes a list of categorical variables (frequencies) | |
> | |
> In the example above, the -stab- command exports one file to the working | |
> directory, containing summary statistics (week9_stats.txt). A correlation | |
> matrix (week9_correlations.txt) is exported only with the -corr()- argument. */ | |
. | |
. | |
. * ===================== | |
. * = ASSOCIATION TESTS = | |
. * ===================== | |
. | |
. | |
. * Correlation matrix. | |
. corr births schooling log_gdpc | |
(obs=119) | |
| births school~g log_gdpc | |
-------------+--------------------------- | |
births | 1.0000 | |
schooling | -0.7473 1.0000 | |
log_gdpc | -0.7311 0.8013 1.0000 | |
. | |
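. * Squaring a correlation coefficient gives the share of variance that the two
. * variables have in common, which is what R-squared reports in a simple linear
. * regression. A quick sketch with the coefficients printed above (the scalar
. * names are illustrative):

```stata
* Correlations of births with each predictor, copied from the matrix above.
scalar r_school = -0.7473
scalar r_gdpc   = -0.7311

* Shared variance (r squared) for each pair.
scalar r2_school = r_school^2
scalar r2_gdpc   = r_gdpc^2
di "births and schooling: " %5.3f r2_school
di "births and log_gdpc:  " %5.3f r2_gdpc
```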
. * Scatterplot matrix. | |
. gr mat births schooling log_gdpc, half /// | |
> name(mat, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Export method using -mkcorr-. | |
. mkcorr births schooling log_gdpc, /// | |
> lab num sig log("week9_mkcorr.txt") replace | |
(note: file week9_mkcorr.txt not found) | |
. | |
. * Export method using -estout-. | |
. eststo clear | |
. qui estpost correlate births schooling log_gdpc, matrix listwise | |
. esttab using "week9_estpost.txt", unstack not compress label replace | |
(note: file week9_estpost.txt not found) | |
(output written to week9_estpost.txt) | |
. | |
. | |
. * ===================== | |
. * = REGRESSION MODELS = | |
. * ===================== | |
. | |
. | |
. * Simple linear regressions | |
. * ------------------------- | |
. | |
. * We covered simple linear regression last week and briefly mentioned 'lin-log' | |
. * equations then. There are more situations to cover in theory, so let's review | |
. * both notions together. Recall, first, the regression equation in the simplest | |
. * case, where all variables are linear: Y = a + BX. | |
. | |
. * IV: Education. | |
. sc births schooling || lfit births schooling, /// | |
> name(simplereg1, replace) | |
(note: scheme burd not found, using s2color) | |
. reg births schooling | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 1, 117) = 147.96 | |
Model | 132.04738 1 132.04738 Prob > F = 0.0000 | |
Residual | 104.415906 117 .892443638 R-squared = 0.5584 | |
-------------+------------------------------ Adj R-squared = 0.5547 | |
Total | 236.463285 118 2.00392615 Root MSE = .94469 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.3511008 .0288641 -12.16 0.000 -.4082646 -.2939371 | |
_cons | 5.503641 .2408313 22.85 0.000 5.026687 5.980595 | |
------------------------------------------------------------------------------ | |
. | |
. * One additional year of schooling is associated with a decrease of .35 births | |
. * per woman, or rather, about 3 additional years of schooling are associated | |
. * with birth rates that are one child lower on average. When the IV is logged, | |
. * things get more complex because the interpretation rule changes. | |
. | |
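. * The conversion from the schooling coefficient to years of schooling per child
. * can be sketched as follows, using the coefficient printed in the table above
. * (scalar names are illustrative):

```stata
* Coefficient of schooling, copied from -reg births schooling-.
scalar b_school = -.3511008

* Years of schooling associated with one birth fewer per woman.
scalar years = -1 / b_school
di "Years of schooling per one birth fewer: " %4.2f years
```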
. * IV: Real GDP per capita. | |
. sc births log_gdpc || lfit births log_gdpc, /// | |
> name(simplereg2, replace) | |
(note: scheme burd not found, using s2color) | |
. reg births log_gdpc | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 1, 117) = 134.38 | |
Model | 126.40484 1 126.40484 Prob > F = 0.0000 | |
Residual | 110.058446 117 .940670476 R-squared = 0.5346 | |
-------------+------------------------------ Adj R-squared = 0.5306 | |
Total | 236.463285 118 2.00392615 Root MSE = .96988 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
log_gdpc | -.6289964 .0542607 -11.59 0.000 -.7364567 -.521536 | |
_cons | 7.916398 .4527607 17.48 0.000 7.01973 8.813067 | |
------------------------------------------------------------------------------ | |
. | |
. * In this 'lin-log' equation, a 1% increase in GDP per capita is associated with | |
. * a 0.01 * -.63 change in the birth rate, or more exactly, -.63 * log(1.01). The | |
. * mathematical trick is now to reverse the equation to understand the mechanism: | |
. | |
. * Inverting the terms. | |
. reg log_gdpc births | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 1, 117) = 134.38 | |
Model | 170.79196 1 170.79196 Prob > F = 0.0000 | |
Residual | 148.705522 117 1.27098737 R-squared = 0.5346 | |
-------------+------------------------------ Adj R-squared = 0.5306 | |
Total | 319.497482 118 2.70760578 Root MSE = 1.1274 | |
------------------------------------------------------------------------------ | |
log_gdpc | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
births | -.8498687 .0733143 -11.59 0.000 -.9950639 -.7046736 | |
_cons | 10.53596 .2278731 46.24 0.000 10.08467 10.98725 | |
------------------------------------------------------------------------------ | |
. | |
. * The equation is now log-linear ('log-lin') instead of being 'lin-log'. The | |
. * interpretation is: an increase of one child per woman is associated with GDP | |
. * per capita that is roughly 100 * -.85 = 85% lower (remember: on average). This | |
. * approximation is crude for large coefficients: exp(-.85) - 1 is about -57%. | |
. | |
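. * Two checks on the reversed model, sketched with the coefficients printed in
. * the tables above. First, the '100 * beta' reading of a log-lin coefficient is
. * only an approximation: the exact percentage change is 100 * (exp(beta) - 1).
. * Second, reversing a simple regression is not an algebraic inversion: the
. * product of the two slopes equals R-squared, not 1. Scalar names are
. * illustrative.

```stata
* Slopes copied from -reg log_gdpc births- and -reg births log_gdpc-.
scalar b_births = -.8498687
scalar b_gdpc   = -.6289964

* Exact percentage change in GDP per capita for one additional birth.
scalar pct = 100 * (exp(b_births) - 1)
di "Exact change in GDP/capita per additional birth: " %5.1f pct "%"

* Product of the two slopes: compare with R-squared = 0.5346 in both tables.
scalar r2 = b_births * b_gdpc
di "Product of slopes: " %6.4f r2
```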
. * Illustrate the principle with two regions different by one child per woman. | |
. tab region if region > 4, su(births) | |
| Summary of Fertility Rate (Births | |
Geographica | per Woman) | |
l region | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
W. Europe | 1.7452913 .28212197 23 | |
Asia, Pac | 2.465 1.0945649 24 | |
------------+------------------------------------ | |
Total | 2.1128021 .87712753 47 | |
. tab region if region > 4, su(wdi_gdpc) | |
| Summary of GDP per Capita, PPP | |
Geographica | (Constant International USD) | |
l region | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
W. Europe | 33611.455 9581.3219 23 | |
Asia, Pac | 8293.5917 10993.016 22 | |
------------+------------------------------------ | |
Total | 21233.833 16351.978 45 | |
. | |
. * In 'lin-log' and 'log-lin' equations, changes are proportionate on one side | |
. * of the equation and absolute on the other. In a 'log-log' model, changes are | |
. * proportionate on both sides: a 1% change in X is associated with a B% change | |
. * in Y. | |
. | |
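. * The log does not fit a 'log-log' model, so here is a minimal sketch of one,
. * assuming the auto dataset shipped with Stata rather than the QOG data. The
. * variable names ln_price and ln_weight are illustrative.

```stata
* Load an example dataset that ships with Stata.
sysuse auto, clear

* Log both sides of the equation.
gen ln_price  = ln(price)
gen ln_weight = ln(weight)

* In a log-log model, the slope reads as an elasticity.
reg ln_price ln_weight
di "A 1% heavier car is associated with a " %4.2f _b[ln_weight] "% change in price"
```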
. * Association between the two IVs. | |
. sc schooling log_gdpc || lfit schooling log_gdpc, /// | |
> name(simplereg3, replace) | |
(note: scheme burd not found, using s2color) | |
. reg schooling log_gdpc | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 1, 117) = 209.83 | |
Model | 687.719811 1 687.719811 Prob > F = 0.0000 | |
Residual | 383.469122 117 3.27751387 R-squared = 0.6420 | |
-------------+------------------------------ Adj R-squared = 0.6390 | |
Total | 1071.18893 118 9.07787231 Root MSE = 1.8104 | |
------------------------------------------------------------------------------ | |
schooling | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
log_gdpc | 1.467142 .1012835 14.49 0.000 1.266555 1.667728 | |
_cons | -4.218189 .8451274 -4.99 0.000 -5.89192 -2.544458 | |
------------------------------------------------------------------------------ | |
. | |
. | |
. * Multiple linear regression | |
. * -------------------------- | |
. | |
. * With schooling in metric units. | |
. reg births schooling log_gdpc | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 2, 116) = 89.72 | |
Model | 143.622115 2 71.8110573 Prob > F = 0.0000 | |
Residual | 92.8411708 116 .80035492 R-squared = 0.6074 | |
-------------+------------------------------ Adj R-squared = 0.6006 | |
Total | 236.463285 118 2.00392615 Root MSE = .89463 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.2118932 .0456853 -4.64 0.000 -.3023786 -.1214078 | |
log_gdpc | -.318119 .0836518 -3.80 0.000 -.483802 -.152436 | |
_cons | 7.022593 .459947 15.27 0.000 6.11161 7.933576 | |
------------------------------------------------------------------------------ | |
. | |
. * Recall the last model. | |
. reg | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 2, 116) = 89.72 | |
Model | 143.622115 2 71.8110573 Prob > F = 0.0000 | |
Residual | 92.8411708 116 .80035492 R-squared = 0.6074 | |
-------------+------------------------------ Adj R-squared = 0.6006 | |
Total | 236.463285 118 2.00392615 Root MSE = .89463 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.2118932 .0456853 -4.64 0.000 -.3023786 -.1214078 | |
log_gdpc | -.318119 .0836518 -3.80 0.000 -.483802 -.152436 | |
_cons | 7.022593 .459947 15.27 0.000 6.11161 7.933576 | |
------------------------------------------------------------------------------ | |
. | |
. * Recall the last model, with cleaner output. | |
. leanout: | |
Dependent variable: births | |
Variable Coef SE 95% CI | |
----------------------------------------------- | |
schooling -0.2 0.0 ( -0.3, -0.1) | |
log_gdpc -0.3 0.1 ( -0.5, -0.2) | |
_cons 7.0 0.5 ( 6.1, 7.9) | |
----------------------------------------------- | |
Number of observations = 119 | |
Root Mean Squared Error = 0.9 | |
. | |
. | |
. * Standardised ('beta') coefficients | |
. * ---------------------------------- | |
. | |
. * With standardised, or 'beta', coefficients (abbreviated to -b- hereinafter). | |
. reg births schooling log_gdpc, beta | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 2, 116) = 89.72 | |
Model | 143.622115 2 71.8110573 Prob > F = 0.0000 | |
Residual | 92.8411708 116 .80035492 R-squared = 0.6074 | |
-------------+------------------------------ Adj R-squared = 0.6006 | |
Total | 236.463285 118 2.00392615 Root MSE = .89463 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| Beta | |
-------------+---------------------------------------------------------------- | |
schooling | -.2118932 .0456853 -4.64 0.000 -.4509913 | |
log_gdpc | -.318119 .0836518 -3.80 0.000 -.3697784 | |
_cons | 7.022593 .459947 15.27 0.000 . | |
------------------------------------------------------------------------------ | |
. | |
. * Proof of concept: Each variable in the equation has a different distribution | |
. * and therefore a different standard deviation. As such, regression with metric | |
. * coefficients cannot inform us of how variables perform against each other in | |
. * explaining variance, because different metrics make coefficients incomparable. | |
. * One unit of schooling, for instance, is one year, while one log-unit of GDP per | |
. * capita multiplies GDP per capita by e (about 2.7): their coefficients are | |
. * produced in these units and their values are therefore incommensurable. | |
. su births schooling log_gdpc | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
births | 119 2.770129 1.415601 1.149 7.115 | |
schooling | 119 7.785548 3.012951 1.202597 13.27008 | |
log_gdpc | 119 8.181716 1.64548 5.194377 11.30008 | |
. | |
. * If each variable had a mean of 0 and a standard deviation of 1, then the | |
. * coefficients would become comparable because they would all be expressed in | |
. * standard deviations. Standardising is the name of that process: it gives up | |
. * the metric, sensible units of variables to create a fictional view of the | |
. * coefficients that indicates which predictor is associated with the largest | |
. * change in the dependent variable, in standard deviation units. | |
. egen std_births = std(births) | |
. egen std_schooling = std(schooling) | |
. egen std_log_gdpc = std(log_gdpc) | |
. su std_* | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
std_births | 119 -6.38e-10 1 -1.145187 3.069277 | |
std_school~g | 119 4.76e-09 1 -2.184885 1.820319 | |
std_log_gdpc | 119 -2.85e-09 1 -1.815482 1.895109 | |
. | |
. * Compare both regression outputs. The first one is a linear regression on the | |
. * standardised variables: its coefficients match the 'Beta' column of the second. | |
. reg std_* | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 2, 116) = 89.72 | |
Model | 71.6703624 2 35.8351812 Prob > F = 0.0000 | |
Residual | 46.3296378 116 .39939343 R-squared = 0.6074 | |
-------------+------------------------------ Adj R-squared = 0.6006 | |
Total | 118 118 1 Root MSE = .63198 | |
------------------------------------------------------------------------------- | |
std_births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
--------------+---------------------------------------------------------------- | |
std_schooling | -.4509913 .097236 -4.64 0.000 -.6435796 -.2584031 | |
std_log_gdpc | -.3697784 .097236 -3.80 0.000 -.5623666 -.1771901 | |
_cons | 4.53e-10 .0579331 0.00 1.000 -.1147439 .1147439 | |
------------------------------------------------------------------------------- | |
. reg births schooling log_gdpc, b | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 2, 116) = 89.72 | |
Model | 143.622115 2 71.8110573 Prob > F = 0.0000 | |
Residual | 92.8411708 116 .80035492 R-squared = 0.6074 | |
-------------+------------------------------ Adj R-squared = 0.6006 | |
Total | 236.463285 118 2.00392615 Root MSE = .89463 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| Beta | |
-------------+---------------------------------------------------------------- | |
schooling | -.2118932 .0456853 -4.64 0.000 -.4509913 | |
log_gdpc | -.318119 .0836518 -3.80 0.000 -.3697784 | |
_cons | 7.022593 .459947 15.27 0.000 . | |
------------------------------------------------------------------------------ | |
. | |
. * Using the second command shown above is much quicker than using the 'std_*' | |
. * trick that is featured here only as a teaching example. Note, finally, that | |
. * you should NOT report standardized coefficients: their use is controversial, | |
. * and their interpretation is less substantive than unstandardized ones. Your | |
. * focus should always be on results expressed in meaningful units. | |
. | |
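. * For the record, a standardised coefficient is the metric coefficient rescaled
. * by the ratio of standard deviations: beta = b * sd(X) / sd(Y). The sketch
. * below reproduces the 'Beta' column from figures printed earlier in this log;
. * the scalar names are illustrative.

```stata
* Standard deviations copied from -su births schooling log_gdpc-.
scalar sd_births = 1.415601
scalar sd_school = 3.012951
scalar sd_gdpc   = 1.64548

* Metric coefficients copied from -reg births schooling log_gdpc-.
scalar beta_school = -.2118932 * sd_school / sd_births
scalar beta_gdpc   = -.318119  * sd_gdpc   / sd_births

* Compare with the Beta column: -.4509913 and -.3697784.
di "schooling: " %7.4f beta_school
di "log_gdpc:  " %7.4f beta_gdpc
```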
. | |
. * Dummies (categorical variables) | |
. * ------------------------------- | |
. | |
. * Visualizing two categories (Asia and Africa) within the sample. | |
. tw (sc births schooling if region == 4, ms(O)) /// | |
> (sc births schooling if region == 6, ms(O)) /// | |
> (sc births schooling if !inlist(region,4,6), mc(gs10)) /// | |
> (lfit births schooling, lc(gs10)), /// | |
> legend(order(1 "African countries" 3 "Rest of sample" /// | |
> 2 "Asian countries" 4 "Fitted values") row(2)) yti("Fertility rate") /// | |
> name(reg_geo1, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Previous regression model. | |
. reg births schooling log_gdpc | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 2, 116) = 89.72 | |
Model | 143.622115 2 71.8110573 Prob > F = 0.0000 | |
Residual | 92.8411708 116 .80035492 R-squared = 0.6074 | |
-------------+------------------------------ Adj R-squared = 0.6006 | |
Total | 236.463285 118 2.00392615 Root MSE = .89463 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.2118932 .0456853 -4.64 0.000 -.3023786 -.1214078 | |
log_gdpc | -.318119 .0836518 -3.80 0.000 -.483802 -.152436 | |
_cons | 7.022593 .459947 15.27 0.000 6.11161 7.933576 | |
------------------------------------------------------------------------------ | |
. | |
. * Previous regression model with geographical region dummies. | |
. reg births schooling log_gdpc i.region | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 7, 111) = 49.80 | |
Model | 179.35545 7 25.6222071 Prob > F = 0.0000 | |
Residual | 57.1078356 111 .514485006 R-squared = 0.7585 | |
-------------+------------------------------ Adj R-squared = 0.7433 | |
Total | 236.463285 118 2.00392615 Root MSE = .71728 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.0848665 .0461435 -1.84 0.069 -.1763029 .0065699 | |
log_gdpc | -.4523074 .0892088 -5.07 0.000 -.6290805 -.2755342 | |
| | |
region | | |
2 | .3142904 .269067 1.17 0.245 -.2188838 .8474646 | |
3 | .6165616 .3584902 1.72 0.088 -.0938108 1.326934 | |
4 | 1.499294 .2951472 5.08 0.000 .9144403 2.084148 | |
5 | .9249256 .2878394 3.21 0.002 .3545527 1.495298 | |
6 | -.0055409 .2628438 -0.02 0.983 -.5263834 .5153016 | |
| | |
_cons | 6.48954 .5903739 10.99 0.000 5.319675 7.659405 | |
------------------------------------------------------------------------------ | |
. | |
. * Proof of concept: A dummy simply codes for a particular category against all | |
. * others. Including a dummy in a regression model adds a component to the | |
. * linear equation for which the variable is equal to either 0 or 1. Its | |
. * coefficient therefore indicates how that category differs from the baseline, | |
. * holding the other predictors constant. The baseline is, by default, the first | |
. * category of the variable. Looking at predicted values, we can draw parallel | |
. * regression lines for dummies. | |
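.
. * A minimal sketch of what the "i." prefix does behind the scenes, assuming that
. * region is coded 1 to 6 as in the output above. The region_1 ... region_6
. * variables are hypothetical names created for this illustration only.

```stata
* Create one 0/1 dummy per category of region (hypothetical names region_1-region_6).
cap drop region_*
tab region, gen(region_)

* Leaving region_1 out of the model makes the first category the baseline.
reg births schooling region_2-region_6

* Same model, written with factor-variable notation.
reg births schooling i.region
```

. * Both commands should return the same coefficients, since "i.region" simply
. * builds the dummies for you and omits the first category.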
. | |
. * Simplified regression model (schooling and region only), for demonstration. | |
. reg births schooling i.region | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 6, 112) = 44.09 | |
Model | 166.129565 6 27.6882609 Prob > F = 0.0000 | |
Residual | 70.3337198 112 .627979641 R-squared = 0.7026 | |
-------------+------------------------------ Adj R-squared = 0.6866 | |
Total | 236.463285 118 2.00392615 Root MSE = .79245 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.2413827 .0378915 -6.37 0.000 -.3164598 -.1663056 | |
| | |
region | | |
2 | .0798961 .2928465 0.27 0.785 -.5003418 .6601339 | |
3 | -.0090852 .3718599 -0.02 0.981 -.745878 .7277076 | |
4 | 1.413288 .3255417 4.34 0.000 .7682689 2.058307 | |
5 | .0439706 .2535334 0.17 0.863 -.4583734 .5463145 | |
6 | -.1728805 .2880932 -0.60 0.550 -.7437003 .3979393 | |
| | |
_cons | 4.308699 .4467725 9.64 0.000 3.423477 5.193922 | |
------------------------------------------------------------------------------ | |
. | |
. * Storing fitted (predicted) values. | |
. cap drop yhat | |
. predict yhat | |
(option xb assumed; fitted values) | |
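.
. * A worked reading of the coefficients above, rounded to three decimals. Each
. * region dummy shifts the intercept and leaves the slope of schooling unchanged,
. * which is why the fitted lines are parallel. Region 4 is plotted as Africa and
. * region 6 as Asia in the graphs of this section.

```latex
\begin{align*}
\hat{y}_{\text{baseline}} &= 4.309 - 0.241 \times \text{schooling} \\
\hat{y}_{\text{region 4}} &= (4.309 + 1.413) - 0.241 \times \text{schooling}
                           = 5.722 - 0.241 \times \text{schooling} \\
\hat{y}_{\text{region 6}} &= (4.309 - 0.173) - 0.241 \times \text{schooling}
                           = 4.136 - 0.241 \times \text{schooling}
\end{align*}
```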
. | |
. * Regression lines for the predicted values of Asia and Africa. | |
. tw (sc births schooling if region == 4, mc(blue) ms(O)) /// | |
> (sc births schooling if region == 6, mc(red) ms(O)) /// | |
> (sc births schooling if !inlist(region,4,6), mc(gs10)) /// | |
> (rcap yhat births schooling if region == 4, /// | |
> c(l) lc(blue) lp(dash) msize(tiny)) /// | |
> (rcap yhat births schooling if region == 6, /// | |
> c(l) lc(red) lp(dash) msize(tiny)) /// | |
> (sc yhat schooling if region == 4, c(l) ms(i) mc(blue) lc(blue)) /// | |
> (sc yhat schooling if region == 6, c(l) ms(i) mc(red) lc(red)), /// | |
> legend(order(1 "African countries" 6 "Fitted values (Africa)" /// | |
> 4 "Residuals (Africa)" /// | |
> 2 "Asian countries" 7 "Fitted values (Asia)" /// | |
> 5 "Residuals (Asia)") row(2)) /// | |
> name(reg_geo2, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * The example above is just a teaching demonstration: geographical continents | |
. * are not appropriate as predictors. Let's now run some substantive examples, | |
. * using a dummy and a 4-level categorical predictor. | |
. | |
. * Visualizing HIV/AIDS dummy within the sample. | |
. tw (sc births schooling if !aids, ms(O)) /// | |
> (sc births schooling if aids, ms(O)) /// | |
> (lfit births schooling, lc(gs10)), /// | |
> legend(order(2 "High AIDS prevalence" 1 "Rest of sample") row(1)) /// | |
> yti("Fertility rate") /// | |
> name(reg_aids1, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Regression line for the HIV/AIDS dummy. | |
. tw (sc yhat aids) (lfit yhat aids), xlab(0 "Low" 1 "High") /// | |
> name(reg_aids2, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Comparison of t-test and regression results for a single dummy. | |
. ttest yhat, by(aids) | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
0 | 96 2.417 .1005316 .9850044 2.21742 2.616581 | |
1 | 23 4.244056 .1541236 .7391509 3.924423 4.563689 | |
---------+-------------------------------------------------------------------- | |
combined | 119 2.770129 .10877 1.18654 2.554734 2.985523 | |
---------+-------------------------------------------------------------------- | |
diff | -1.827056 .2190775 -2.260927 -1.393184 | |
------------------------------------------------------------------------------ | |
diff = mean(0) - mean(1) t = -8.3398 | |
Ho: diff = 0 degrees of freedom = 117 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000 | |
. reg yhat i.aids | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 1, 117) = 69.55 | |
Model | 61.937793 1 61.937793 Prob > F = 0.0000 | |
Residual | 104.191771 117 .890527959 R-squared = 0.3728 | |
-------------+------------------------------ Adj R-squared = 0.3675 | |
Total | 166.129564 118 1.40787766 Root MSE = .94368 | |
------------------------------------------------------------------------------ | |
yhat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.aids | 1.827056 .2190775 8.34 0.000 1.393184 2.260927 | |
_cons | 2.417 .0963137 25.10 0.000 2.226256 2.607744 | |
------------------------------------------------------------------------------ | |
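.
. * Reading the two outputs side by side: with a single dummy, the constant is
. * the mean of the group coded 0, and the coefficient is the difference between
. * the two group means. The figures below are taken from the outputs above.

```latex
\begin{align*}
\texttt{\_cons}  &= \bar{y}_{0} = 2.417 \\
\texttt{1.aids}  &= \bar{y}_{1} - \bar{y}_{0} = 4.244056 - 2.417 = 1.827056
\end{align*}
```

. * The t-statistics have the same magnitude (8.34) and opposite signs, because
. * -ttest- reports mean(0) - mean(1) whereas the regression coefficient measures
. * mean(1) - mean(0).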
. | |
. * Switching to fertility and women's rights. | |
. fre womenrights | |
womenrights -- Women's Social Rights | |
----------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------+-------------------------------------------- | |
Valid 0 | 27 22.69 22.69 22.69 | |
1 | 44 36.97 36.97 59.66 | |
2 | 27 22.69 22.69 82.35 | |
3 | 21 17.65 17.65 100.00 | |
Total | 119 100.00 100.00 | |
----------------------------------------------------------- | |
. | |
. * We start by visualizing the average fertility for each level of rights. The | |
. * plot contains a LOWESS smoothed trend to show the DV mean at each level. | |
. sc births womenrights, yti("Fertility rate") || lowess births womenrights, /// | |
> name(fert_womenrights, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Regression model. | |
. reg births i.womenrights | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 3, 115) = 13.41 | |
Model | 61.2711874 3 20.4237291 Prob > F = 0.0000 | |
Residual | 175.192098 115 1.52340955 R-squared = 0.2591 | |
-------------+------------------------------ Adj R-squared = 0.2398 | |
Total | 236.463285 118 2.00392615 Root MSE = 1.2343 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
womenrights | | |
1 | -.5677339 .3017375 -1.88 0.062 -1.165418 .02995 | |
2 | -1.565604 .3359243 -4.66 0.000 -2.231005 -.9002023 | |
3 | -1.937111 .3591182 -5.39 0.000 -2.648455 -1.225767 | |
| | |
_cons | 3.677111 .2375344 15.48 0.000 3.206601 4.147621 | |
------------------------------------------------------------------------------ | |
. | |
. * The baseline category here is womenrights = 0 (no women's rights). Compared | |
. * to countries in this category, other countries have lower mean fertility rates, | |
. * and the gap widens as women's rights increase from category 1 to category 3. | |
. | |
. * Change the baseline category to level "1" of women's rights with the "ib1." | |
. * prefix. This is convenient when you need to compare from a reference category | |
. * that is not the first one in the coding of the variable (the Stata default). | |
. reg births ib1.womenrights | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 3, 115) = 13.41 | |
Model | 61.2711874 3 20.4237291 Prob > F = 0.0000 | |
Residual | 175.192098 115 1.52340955 R-squared = 0.2591 | |
-------------+------------------------------ Adj R-squared = 0.2398 | |
Total | 236.463285 118 2.00392615 Root MSE = 1.2343 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
womenrights | | |
0 | .5677339 .3017375 1.88 0.062 -.02995 1.165418 | |
2 | -.9978699 .3017375 -3.31 0.001 -1.595554 -.4001859 | |
3 | -1.369377 .3273626 -4.18 0.000 -2.01782 -.720935 | |
| | |
_cons | 3.109377 .1860724 16.71 0.000 2.740804 3.477951 | |
------------------------------------------------------------------------------ | |
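.
. * A short sketch of other ways to pick the reference category, assuming the
. * same variables as above. These commands were not part of the original run.

```stata
* Use the highest level of women's rights (3) as the baseline.
reg births ib3.womenrights

* Equivalent shortcut: use whichever category is coded last as the baseline.
reg births ib(last).womenrights
```

. * Changing the baseline does not change the fit of the model (same R-squared
. * and F-test); it only changes the category against which each coefficient is
. * compared.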
. | |
. | |
. * ========================= | |
. * = REGRESSION DIAGNOSTICS = | |
. * ========================= | |
. | |
. | |
. * Rerun regression model. Note that the "i." prefix is optional for dummies. | |
. reg births schooling log_gdpc aids i.womenrights | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 6, 112) = 38.25 | |
Model | 158.915569 6 26.4859282 Prob > F = 0.0000 | |
Residual | 77.547716 112 .692390321 R-squared = 0.6721 | |
-------------+------------------------------ Adj R-squared = 0.6545 | |
Total | 236.463285 118 2.00392615 Root MSE = .8321 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.2113981 .0454762 -4.65 0.000 -.3015035 -.1212928 | |
log_gdpc | -.307614 .0913113 -3.37 0.001 -.4885357 -.1266923 | |
aids | .8945646 .217741 4.11 0.000 .4631387 1.32599 | |
| | |
womenrights | | |
1 | .268684 .2215173 1.21 0.228 -.1702241 .7075921 | |
2 | .1263621 .2696793 0.47 0.640 -.4079728 .660697 | |
3 | .6720761 .331005 2.03 0.045 .0162321 1.32792 | |
| | |
_cons | 6.513273 .5711178 11.40 0.000 5.381676 7.644869 | |
------------------------------------------------------------------------------ | |
. | |
. * Storing fitted (predicted) values. | |
. cap drop yhat | |
. predict yhat | |
(option xb assumed; fitted values) | |
. | |
. | |
. * (1) Standardized residuals | |
. * -------------------------- | |
. | |
. * Store the unstandardized (metric) residuals. | |
. cap drop r | |
. predict r, resid | |
. | |
. * Assess the normality of residuals. | |
. kdensity r, norm legend(off) ti("") /// | |
> name(diag_kdens, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Homoskedasticity of the residuals versus fitted values (DV). | |
. rvfplot, yline(0) ms(i) mlab(ccodewb) name(diag_rvf, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * Store the standardized residuals. | |
. cap drop rsta | |
. predict rsta, rsta | |
. | |
. * Identify outliers beyond 2 standard deviation units. | |
. sc rsta yhat, yline(-2 2) || sc rsta yhat if abs(rsta) > 2, /// | |
> ylab(-3(1)3) mlab(ccodewb) legend(lab(2 "Outliers")) /// | |
> name(diag_rsta, replace) | |
(note: scheme burd not found, using s2color) | |
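.
. * To read the outliers as a table instead of a graph, a minimal sketch that
. * reuses the variables created above (ccodewb, births, rsta):

```stata
* Refit the model and store standardized residuals, as in the steps above.
qui reg births schooling log_gdpc aids i.womenrights
cap drop rsta
predict rsta, rsta

* List countries beyond 2 standard deviation units. Missing values are excluded
* explicitly, because Stata treats missing as larger than any number.
list ccodewb births rsta if abs(rsta) > 2 & !mi(rsta), clean noobs
```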
. | |
. | |
. * (2) Heteroskedasticity | |
. * ---------------------- | |
. | |
. * Residuals versus one predictor (IV). This diagnostic shows whether the spread | |
. * of the residuals changes with the values of that predictor, which would hint | |
. * at heteroskedasticity. If the residuals fan out or cluster at some values of | |
. * the predictor, that variable contributes to the non-constant variance of the | |
. * error term, and the standard errors of the model should be read with caution. | |
. sc r schooling, /// | |
> yline(0) mlab(ccodewb) legend(lab(2 "Outliers")) /// | |
> name(diag_edu1, replace) | |
(note: scheme burd not found, using s2color) | |
. | |
. * The trend in the residuals can be visualized as a LOWESS curve, to show where | |
. * they depart from zero as a function of the predictor. If the residuals show | |
. * a pattern, the LOWESS curve will deviate from the horizontal line at zero at | |
. * values of the IV where the residuals are "clustered" above or below their | |
. * expected mean of zero. A curve that stays close to zero suggests that the | |
. * predictor is adequately specified; it does not by itself establish constant | |
. * variance, which shows in the vertical spread of the points. | |
. lowess rsta schooling, bw(.5) yline(0) /// | |
> name(diag_edu2, replace) | |
(note: scheme burd not found, using s2color) | |
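.
. * The graphs above are visual checks. As a complement, Stata also offers formal
. * tests of heteroskedasticity after -regress-. The sketch below was not part of
. * the original run and is shown only as a possible next step.

```stata
* Refit the model so that the postestimation tests apply to it.
qui reg births schooling log_gdpc aids i.womenrights

* Breusch-Pagan / Cook-Weisberg test (null hypothesis: constant variance).
estat hettest

* White's general test for heteroskedasticity.
estat imtest, white
```

. * A small p-value in either test is evidence against constant variance.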
. | |
. | |
. * (3) Variance inflation and interaction terms | |
. * -------------------------------------------- | |
. | |
. * The Variance Inflation Factor (VIF) diagnoses an issue with 'kitchen sink' | |
. * models that include many correlated predictors, which measure the same | |
. * effect several times and create multicollinearity. This problem inflates the | |
. * standard errors and makes the regression coefficients unstable. Common | |
. * rule-of-thumb cut-off points are VIF > 10 or 1/VIF < .1 (tolerance). Each | |
. * VIF is computed as 1/(1-R^2), where R^2 is the R-squared obtained by | |
. * regressing that predictor on all other predictors in the model. | |
. vif | |
Variable | VIF 1/VIF | |
-------------+---------------------- | |
schooling | 3.20 0.312547 | |
log_gdpc | 3.85 0.259917 | |
aids | 1.27 0.787079 | |
womenrights | | |
1 | 1.97 0.508825 | |
2 | 2.19 0.456091 | |
3 | 2.74 0.365413 | |
-------------+---------------------- | |
Mean VIF | 2.54 | |
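.
. * A sketch of how one VIF can be recomputed by hand, using an auxiliary
. * regression of one predictor on all the others. The variable name insample
. * is a hypothetical helper introduced for this illustration.

```stata
* Refit the main model and flag its estimation sample.
qui reg births schooling log_gdpc aids i.womenrights
cap drop insample
gen byte insample = e(sample)

* Auxiliary regression: schooling explained by the other predictors.
qui reg schooling log_gdpc aids i.womenrights if insample

* VIF = 1 / (1 - R-squared of the auxiliary regression).
di "VIF (schooling) = " %4.2f 1 / (1 - e(r2))
```

. * The result should match the value that -vif- reports for schooling.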
. | |
. * Adding an interaction term lets the effect of one predictor vary with the | |
. * level of another. The term is calculated by multiplying the two variables | |
. * together and adding that product to the regression model. The regression | |
. * coefficient for this product is the interaction effect. When the model | |
. * includes it, the coefficients of the two component variables are read as | |
. * their effects when the other variable is equal to zero. | |
. gen schoolingXlog_gdpc = schooling * log_gdpc | |
. la var schoolingXlog_gdpc "GDP * Education" | |
. | |
. * Regression model. | |
. reg births schooling log_gdpc aids | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 3, 115) = 72.95 | |
Model | 155.009323 3 51.6697745 Prob > F = 0.0000 | |
Residual | 81.4539618 115 .70829532 R-squared = 0.6555 | |
-------------+------------------------------ Adj R-squared = 0.6465 | |
Total | 236.463285 118 2.00392615 Root MSE = .8416 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
schooling | -.2028221 .0430371 -4.71 0.000 -.2880703 -.1175739 | |
log_gdpc | -.2464903 .0806962 -3.05 0.003 -.4063338 -.0866467 | |
aids | .8600229 .2144907 4.01 0.000 .435158 1.284888 | |
_cons | 6.1997 .4788918 12.95 0.000 5.251108 7.148293 | |
------------------------------------------------------------------------------ | |
. | |
. * Regression model with an interaction term. | |
. reg births schooling log_gdpc schoolingXlog_gdpc aids | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 4, 114) = 74.29 | |
Model | 170.899523 4 42.7248808 Prob > F = 0.0000 | |
Residual | 65.5637622 114 .575120721 R-squared = 0.7227 | |
-------------+------------------------------ Adj R-squared = 0.7130 | |
Total | 236.463285 118 2.00392615 Root MSE = .75837 | |
----------------------------------------------------------------------------------- | |
births | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
------------------+---------------------------------------------------------------- | |
schooling | -.8492374 .1289475 -6.59 0.000 -1.104681 -.5937934 | |
log_gdpc | -1.012535 .1628702 -6.22 0.000 -1.33518 -.6898907 | |
schoolingXlog_g~c | .087058 .0165624 5.26 0.000 .054248 .119868 | |
aids | .8010199 .193603 4.14 0.000 .4174938 1.184546 | |
_cons | 11.62292 1.118353 10.39 0.000 9.407471 13.83837 | |
----------------------------------------------------------------------------------- | |
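.
. * A worked reading of the interaction, using the coefficients above rounded to
. * three decimals. The two values of log_gdpc (7 and 10) are hypothetical and
. * chosen only for illustration.

```latex
\begin{align*}
\frac{\partial\, \widehat{\text{births}}}{\partial\, \text{schooling}}
  &= -0.849 + 0.087 \times \texttt{log\_gdpc} \\
\texttt{log\_gdpc} = 7:  \quad & -0.849 + 0.087 \times 7  \approx -0.240 \\
\texttt{log\_gdpc} = 10: \quad & -0.849 + 0.087 \times 10 \approx 0.021
\end{align*}
```

. * In this model, an additional year of schooling is associated with a sizeable
. * drop in fertility at lower levels of wealth, and with almost no change at the
. * higher level.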
. | |
. * Standardised coefficients reveal the extent to which the interaction actually | |
. * influences the model, in comparison to all other included predictors (IVs). | |
. reg births schooling log_gdpc schoolingXlog_gdpc aids, b | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 4, 114) = 74.29 | |
Model | 170.899523 4 42.7248808 Prob > F = 0.0000 | |
Residual | 65.5637622 114 .575120721 R-squared = 0.7227 | |
-------------+------------------------------ Adj R-squared = 0.7130 | |
Total | 236.463285 118 2.00392615 Root MSE = .75837 | |
----------------------------------------------------------------------------------- | |
births | Coef. Std. Err. t P>|t| Beta | |
------------------+---------------------------------------------------------------- | |
schooling | -.8492374 .1289475 -6.59 0.000 -1.807508 | |
log_gdpc | -1.012535 .1628702 -6.22 0.000 -1.176961 | |
schoolingXlog_g~c | .087058 .0165624 5.26 0.000 2.165221 | |
aids | .8010199 .193603 4.14 0.000 .2243817 | |
_cons | 11.62292 1.118353 10.39 0.000 . | |
----------------------------------------------------------------------------------- | |
. | |
. * Last, a shorter way to write up an interaction for two continuous predictors. | |
. reg births c.schooling##c.log_gdpc aids, b | |
Source | SS df MS Number of obs = 119 | |
-------------+------------------------------ F( 4, 114) = 74.29 | |
Model | 170.899526 4 42.7248815 Prob > F = 0.0000 | |
Residual | 65.5637592 114 .575120695 R-squared = 0.7227 | |
-------------+------------------------------ Adj R-squared = 0.7130 | |
Total | 236.463285 118 2.00392615 Root MSE = .75837 | |
------------------------------------------------------------------------------ | |
births | Coef. Std. Err. t P>|t| Beta | |
-------------+---------------------------------------------------------------- | |
schooling | -.8492375 .1289475 -6.59 0.000 -1.807508 | |
log_gdpc | -1.012536 .1628702 -6.22 0.000 -1.176961 | |
| | |
c.schooling#| | |
c.log_gdpc | .087058 .0165624 5.26 0.000 2.165222 | |
| | |
aids | .8010198 .193603 4.14 0.000 .2243817 | |
_cons | 11.62292 1.118353 10.39 0.000 . | |
------------------------------------------------------------------------------ | |
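.
. * With the factor-variable notation, Stata can compute the effect of schooling
. * at different levels of wealth. This sketch was not part of the original run,
. * and the range of log_gdpc values (6 to 10) is an assumption to adjust after
. * checking the variable with -su log_gdpc-.

```stata
* Refit the interaction model with factor-variable notation.
qui reg births c.schooling##c.log_gdpc aids

* Marginal effect of schooling at selected values of log_gdpc (assumed range).
margins, dydx(schooling) at(log_gdpc = (6(1)10))

* Plot the marginal effects with their confidence intervals.
marginsplot, yline(0) name(mfx_schooling, replace)
```

. * This only works when the interaction is written with "##" or "#": -margins-
. * cannot link a product variable created by hand to its components.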
. | |
. | |
. * ======================== | |
. * = EXPORT MODEL RESULTS = | |
. * ======================== | |
. | |
. | |
. * This section shows how to export regression results, in order to avoid having | |
. * to copy out the results by hand, copy-paste or any other risky (non)technique | |
. * that you might come up with at that stage. Exporting regression results also | |
. * makes it easier to build several regression models based on varying sets of | |
. * covariates (independent variables), in order to compare their coefficients. | |
. | |
. * The next commands require that you install the -estout- package first. Other | |
. * frequently used commands for the same task are -outreg- and -outreg2-, which | |
. * can also be downloaded with -ssc install-. | |
. | |
. * Wipe any previous regression estimates. | |
. eststo clear | |
. | |
. * Model 1: 'Baseline model'. | |
. eststo M1: qui reg births schooling log_gdpc | |
. | |
. * Re-read, in simplified form. | |
. leanout: | |
Dependent variable: births | |
Variable Coef SE 95% CI | |
----------------------------------------------- | |
schooling -0.2 0.0 ( -0.3, -0.1) | |
log_gdpc -0.3 0.1 ( -0.5, -0.2) | |
_cons 7.0 0.5 ( 6.1, 7.9) | |
----------------------------------------------- | |
Number of observations = 119 | |
Root Mean Squared Error = 0.9 | |
. | |
. * Model 2: Adding the HIV/AIDS dummy. | |
. eststo M2: qui reg births schooling log_gdpc aids | |
. | |
. * Re-read, in simplified form. | |
. leanout: | |
Dependent variable: births | |
Variable Coef SE 95% CI | |
----------------------------------------------- | |
schooling -0.2 0.0 ( -0.3, -0.1) | |
log_gdpc -0.2 0.1 ( -0.4, -0.1) | |
aids 0.9 0.2 ( 0.4, 1.3) | |
_cons 6.2 0.5 ( 5.3, 7.1) | |
----------------------------------------------- | |
Number of observations = 119 | |
Root Mean Squared Error = 0.8 | |
. | |
. * Model 3: Adding the interaction between education and wealth. | |
. eststo M3: qui reg births c.schooling##c.log_gdpc aids | |
. | |
. * Re-read, in simplified form. | |
. leanout: | |
Dependent variable: births | |
Variable Coef SE 95% CI | |
----------------------------------------------- | |
schooling -0.8 0.1 ( -1.1, -0.6) | |
log_gdpc -1.0 0.2 ( -1.3, -0.7) | |
c.schooling# | |
c.log_gdpc 0.1 0.0 ( 0.1, 0.1) | |
aids 0.8 0.2 ( 0.4, 1.2) | |
_cons 11.6 1.1 ( 9.4, 13.8) | |
----------------------------------------------- | |
Number of observations = 119 | |
Root Mean Squared Error = 0.8 | |
. | |
. * Compare all models on screen. | |
. esttab M1 M2 M3, lab b(1) se(1) sca(rmse) /// | |
> mti("Baseline" "Control" "Interaction") | |
-------------------------------------------------------------------- | |
(1) (2) (3) | |
Baseline Control Interaction | |
-------------------------------------------------------------------- | |
Average Schooling ~ -0.2*** -0.2*** -0.8*** | |
(0.0) (0.0) (0.1) | |
Real GDP/capita (c~l -0.3*** -0.2** -1.0*** | |
(0.1) (0.1) (0.2) | |
Highest HIV/AIDS p~r 0.9*** 0.8*** | |
(0.2) (0.2) | |
c.schooling#c.log_~c 0.1*** | |
(0.0) | |
Constant 7.0*** 6.2*** 11.6*** | |
(0.5) (0.5) (1.1) | |
-------------------------------------------------------------------- | |
Observations 119 119 119 | |
rmse 0.9 0.8 0.8 | |
-------------------------------------------------------------------- | |
Standard errors in parentheses | |
* p<0.05, ** p<0.01, *** p<0.001 | |
. | |
. * Export all models for comparison and reporting. | |
. esttab M1 M2 M3 using week9_regressions.txt, replace /// | |
> lab b(1) se(1) sca(rmse) /// | |
> mti("Baseline" "Controls" "Interactions") | |
(note: file week9_regressions.txt not found) | |
(output written to week9_regressions.txt) | |
. | |
. /* Basic usage of -estout- commands: | |
> | |
> - The -estout- commands work by storing model estimates with -eststo- and then | |
> putting them into tables with -esttab-. Use these commands at the end of your | |
> models: start with -reg- and -leanout-, then use -eststo- and -esttab-. | |
> | |
> - The -estout- command is especially practical when you run many models, as | |
> shown here when we compare a baseline model to models that add a control | |
> variable and an interaction term. | |
> | |
> - Check the -estout- online documentation for more examples. */ | |
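.
. * A possible variation on the export above, assuming that models M1, M2 and M3
. * are still stored in memory. The file name is hypothetical; -esttab- picks
. * the export format from the file extension.

```stata
* Export the same three models to a CSV file that opens in a spreadsheet editor.
esttab M1 M2 M3 using week9_regressions.csv, replace ///
    lab b(2) se(2) sca(rmse) ///
    mti("Baseline" "Control" "Interaction")
```

. * Two decimals are used here instead of one, so that small coefficients such
. * as the interaction term do not get rounded to 0.1.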
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require estout fre tab_chi renvars scheme-burd | |
. | |
. * Log results. | |
. cap log using code/week10.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 10 ------------------ | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Attitudes Towards Immigration in Europe | |
> | |
> - DATA: European Social Survey Round 4 (2008) | |
> | |
> This do-file complements the series that we finished running last week using | |
> the Quality of Government dataset. It shows how multiple regression can apply | |
> to survey data, and introduces a different form of regression model. | |
> | |
> Survey data commonly feature response items that are discrete rather than | |
> continuous. This means that linear regression models will be of limited use | |
> with this type of data. | |
> | |
> When the dependent variable cannot be normally distributed, a solution is to | |
> simplify it to a dummy and to estimate a logistic regression model, which is | |
> a generalization of the linear model. | |
> | |
> This do-file introduces logistic models. For your own work, decide whether a | |
> logistic estimator is more appropriate than a linear one, and include draft | |
> models in your revised draft. | |
> | |
> Last updated 2013-05-31. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load ESS dataset. | |
. use data/ess2008, clear | |
(European Social Survey 2008) | |
. | |
. * Subsetting to respondents age 25+ with full data. | |
. drop if agea < 25 | mi(imdfetn, agea, gndr, brncntr, eduyrs, hinctnta, lrscale) | |
(25884 observations deleted) | |
. | |
. * Survey weights (design weight by country, multiplied by population weight). | |
. gen dpw = dweight * pweight | |
. la var dpw "Survey weight (population*design)" | |
. | |
. * Numeric country identifier (used for clustered standard errors). | |
. encode cntry, gen(cid) | |
. | |
. | |
. * DV: Allow many/few immigrants of different race/ethnic group from majority | |
. * -------------------------------------------------------------------------- | |
. | |
. fre imdfetn | |
imdfetn -- Allow many/few immigrants of different race/ethnic group from majority | |
----------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------------------------------+-------------------------------------------- | |
Valid 1 Allow many to come and live | 3959 12.83 12.83 12.83 | |
here | | |
2 Allow some | 11913 38.59 38.59 51.42 | |
3 Allow a few | 10386 33.65 33.65 85.07 | |
4 Allow none | 4610 14.93 14.93 100.00 | |
Total | 30868 100.00 100.00 | |
----------------------------------------------------------------------------------- | |
. | |
. * Relabel for concise legends in graphs. | |
. la def imdfetn 1 "Many" 2 "Some" 3 "Few" 4 "None", replace | |
. | |
. * Normality: the distribution looks symmetrical, but a 4-point scale has too | |
. * few values to vary much, which will create postestimation issues. | |
. hist imdfetn, discrete percent addl /// | |
> name(dv, replace) | |
(start=1, width=1) | |
. | |
. * Dummy: 1 = allow many/some immigrants. | |
. gen diff = (imdfetn < 3) | |
. la var diff "Allow many/some migrants of different race/ethnicity from majority" | |
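.
. * For reference, the logistic model estimated on a dummy such as diff can be
. * written as follows, where x_1 ... x_k stand for the predictors and the betas
. * for the coefficients (generic notation, not taken from the course material).

```latex
\begin{align*}
\Pr(\texttt{diff} = 1 \mid x)
  &= \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)\right)} \\
\ln \frac{\Pr(\texttt{diff} = 1 \mid x)}{1 - \Pr(\texttt{diff} = 1 \mid x)}
  &= \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\end{align*}
```

. * The right-hand side of the second line is the same linear equation as in a
. * linear regression; what changes is that it predicts the log-odds of diff = 1
. * instead of the value of the DV.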
. | |
. | |
. * IVs: age, gender, country of birth, education, income, left-right scale | |
. * ----------------------------------------------------------------------- | |
. | |
. d agea gndr brncntr eduyrs hinctnta lrscale | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
agea int %3.0f agea Age of respondent, calculated | |
gndr byte %1.0f gndr Gender | |
brncntr byte %1.0f brncntr Born in country | |
eduyrs byte %2.0f eduyrs Years of full-time education completed | |
hinctnta byte %2.0f hinctnta Household's total net income, all | |
sources | |
lrscale byte %2.0f lrscale Placement on left right scale | |
. | |
. * Renaming. | |
. renvars agea hinctnta lrscale \ age income rightwing | |
. | |
. * Create age groups. | |
. gen cohort = irecode(age, 24, 34, 44, 54, 64, 74) | |
. replace cohort = 15 + 10 * cohort | |
(30868 real changes made) | |
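.
. * How the two lines above work: -irecode- returns the number of the interval
. * in which age falls (1 for ages 25-34, 2 for 35-44, and so on up to 6 for
. * ages 75+), and the second line turns that number into the lower bound of the
. * age group (25, 35, ... 75). A quick check, shown as a sketch:

```stata
* Minimum and maximum age within each cohort, with the number of respondents.
tabstat age, by(cohort) stats(min max n)
```

. * Each cohort should start at the age shown in its name.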
. | |
. * Dummify sex. | |
. gen female:sex = (gndr == 2) | |
. la def sex 0 "Male" 1 "Female", replace | |
. | |
. * Dummify country of birth. | |
. gen born:born = (brncntr == 1) | |
. la def born 0 "Foreign-born" 1 "Born in country", replace | |
. | |
. * Recode education years. | |
. su eduyrs, d | |
Years of full-time education completed | |
------------------------------------------------------------- | |
Percentiles Smallest | |
1% 1 0 | |
5% 5 0 | |
10% 7 0 Obs 30868 | |
25% 10 0 Sum of Wgt. 30868 | |
50% 12 Mean 12.41982 | |
Largest Std. Dev. 4.257545 | |
75% 15 39 | |
90% 18 40 Variance 18.12669 | |
95% 19 43 Skewness -.0218349 | |
99% 22 48 Kurtosis 3.770817 | |
. xtile edu3 = eduyrs if eduyrs < 22, nq(3) | |
. la var edu3 "Education level" | |
. la def edu3 1 "Low" 2 "Medium" 3 "High" | |
. la val edu3 edu3 | |
. | |
. | |
. * Export summary statistics | |
. * ------------------------- | |
. | |
. * The next command is part of the SRQM folder. If Stata returns an error when | |
. * you run it, set the folder as your working directory and type -run profile- | |
. * to run the course setup, then try the command again. If you still experience | |
. * problems with the -stab- command, please send a detailed email on the issue. | |
. | |
. stab using week10_stats.txt, replace /// | |
> mean(age rightwing) /// | |
> prop(imdfetn female born edu3 income) | |
(note: file week10_stats.txt not found) | |
Variable mean sd min max mea | |
> n sd min max mean sd min | |
> max | |
Allow many/few imm~f % % % | |
% % % | |
% % % | |
Education level % % % | |
Household's total ~l % % % | |
N = 30399 (excluding 469 incomplete observations) | |
File: week10_stats.txt | |
. | |
. /* Syntax of the -stab- command: | |
> | |
> - using FILE - name of the exported file; plain text (.txt) recommended | |
> - replace - overwrite any previously existing file | |
> - mean() - summarizes a list of continuous variables (mean, sd, min, max) | |
> - prop() - summarizes a list of categorical variables (frequencies) | |
> | |
> In the example above, the -stab- command will export one file to the working | |
> directory, containing summary statistics for the full European sample. */ | |
. | |
. | |
. * ===================== | |
. * = ASSOCIATION TESTS = | |
. * ===================== | |
. | |
. | |
. * Dummify the DV categories. | |
. tab imdfetn, gen(immig_) | |
Allow | | |
many/few | | |
immigrants | | |
of | | |
different | | |
race/ethnic | | |
group from | | |
majority | Freq. Percent Cum. | |
------------+----------------------------------- | |
Many | 3,959 12.83 12.83 | |
Some | 11,913 38.59 51.42 | |
Few | 10,386 33.65 85.07 | |
None | 4,610 14.93 100.00 | |
------------+----------------------------------- | |
Total | 30,868 100.00 | |
. | |
. * Crossvisualize DV with basic demographics. | |
. gr bar immig_*, stack percent over(cohort) by(female born, note("")) yti("") /// | |
> legend(order(1 "Many" 2 "Some" 3 "Few" 4 "None") row(1)) /// | |
> scheme(burd4) name(demog, replace) | |
. | |
. * Crosstabulation: DV by gender. | |
. tab female imdfetn, row nof chi2 // Chi-squared test | |
| Allow many/few immigrants of different | |
| race/ethnic group from majority | |
female | Many Some Few None | Total | |
-----------+--------------------------------------------+---------- | |
Male | 13.28 38.76 33.34 14.63 | 100.00 | |
Female | 12.41 38.44 33.93 15.21 | 100.00 | |
-----------+--------------------------------------------+---------- | |
Total | 12.83 38.59 33.65 14.93 | 100.00 | |
Pearson chi2(3) = 7.2476 Pr = 0.064 | |
. tabchi female imdfetn, p noo noe // Pearson residuals | |
Pearson residual | |
------------------------------------------ | |
| Allow many/few immigrants of | |
| different race/ethnic group | |
| from majority | |
female | Many Some Few None | |
----------+------------------------------- | |
Male | 1.526 0.328 -0.650 -0.965 | |
Female | -1.457 -0.313 0.621 0.922 | |
------------------------------------------ | |
Pearson chi2(3) = 7.2476 Pr = 0.064 | |
likelihood-ratio chi2(3) = 7.2453 Pr = 0.064 | |
. | |
. * Crosstabulation: DV by country of birth. | |
. tab born imdfetn, row nof chi2 | |
| Allow many/few immigrants of different | |
| race/ethnic group from majority | |
born | Many Some Few None | Total | |
----------------+--------------------------------------------+---------- | |
Foreign-born | 17.48 42.11 30.28 10.13 | 100.00 | |
Born in country | 12.33 38.22 34.01 15.45 | 100.00 | |
----------------+--------------------------------------------+---------- | |
Total | 12.83 38.59 33.65 14.93 | 100.00 | |
Pearson chi2(3) = 129.0198 Pr = 0.000 | |
. tabchi born imdfetn, p noo noe | |
Pearson residual | |
------------------------------------------------ | |
| Allow many/few immigrants of | |
| different race/ethnic group | |
| from majority | |
born | Many Some Few None | |
----------------+------------------------------- | |
Foreign-born | 7.109 3.098 -3.174 -6.805 | |
Born in country | -2.329 -1.015 1.040 2.229 | |
------------------------------------------------ | |
Pearson chi2(3) = 129.0198 Pr = 0.000 | |
likelihood-ratio chi2(3) = 129.8765 Pr = 0.000 | |
. | |
. * Crosstabulation: DV by age cohort. | |
. tab cohort imdfetn, row nof chi2 | |
| Allow many/few immigrants of different | |
| race/ethnic group from majority | |
cohort | Many Some Few None | Total | |
-----------+--------------------------------------------+---------- | |
25 | 15.39 42.12 30.57 11.92 | 100.00 | |
35 | 14.60 40.67 31.45 13.28 | 100.00 | |
45 | 14.15 39.81 32.06 13.98 | 100.00 | |
55 | 11.77 37.80 34.72 15.71 | 100.00 | |
65 | 9.45 34.73 37.81 18.01 | 100.00 | |
75 | 7.93 31.45 39.89 20.73 | 100.00 | |
-----------+--------------------------------------------+---------- | |
Total | 12.83 38.59 33.65 14.93 | 100.00 | |
Pearson chi2(15) = 456.2647 Pr = 0.000 | |
. tabchi cohort imdfetn, p noo noe | |
Pearson residual | |
------------------------------------------ | |
| Allow many/few immigrants of | |
| different race/ethnic group | |
| from majority | |
cohort | Many Some Few None | |
----------+------------------------------- | |
25 | 5.421 4.300 -4.014 -5.910 | |
35 | 3.929 2.649 -3.000 -3.396 | |
45 | 2.884 1.529 -2.134 -1.928 | |
55 | -2.218 -0.966 1.392 1.520 | |
65 | -6.177 -4.069 4.702 5.207 | |
75 | -7.173 -6.026 5.645 7.861 | |
------------------------------------------ | |
Pearson chi2(15) = 456.2647 Pr = 0.000 | |
likelihood-ratio chi2(15) = 460.9260 Pr = 0.000 | |
. | |
. * Dummify educational attainment. | |
. tab edu3, gen(edu_) | |
Education | | |
level | Freq. Percent Cum. | |
------------+----------------------------------- | |
Low | 12,017 39.53 39.53 | |
Medium | 9,199 30.26 69.79 | |
High | 9,183 30.21 100.00 | |
------------+----------------------------------- | |
Total | 30,399 100.00 | |
. | |
. * Clarify x-axis by dropping labels on income deciles. | |
. la def inc10 1 "D1" 10 "D10", replace | |
. la val income inc10 | |
. | |
. * Visualization of education with income, sex and country of birth. | |
. gr bar edu_*, stack percent over(income) by(female born, note("")) yti("") /// | |
> legend(order(1 "Low" 2 "Medium" 3 "High") row(1) pos(11)) /// | |
> scheme(burd3) name(edu_inc, replace) | |
. | |
. * Simplified political scale. | |
. recode rightwing /// | |
> (0/4 = 1 "Left-wing") /// | |
> (5 = 2 "Centre") /// | |
> (6/11 = 3 "Right-wing") /// | |
> (else = .), gen(wing) | |
(30123 differences between rightwing and wing) | |
. tab wing, gen(wing_) | |
RECODE of | | |
rightwing | | |
(Placement | | |
on left | | |
right | | |
scale) | Freq. Percent Cum. | |
------------+----------------------------------- | |
Left-wing | 9,633 31.21 31.21 | |
Centre | 9,860 31.94 63.15 | |
Right-wing | 11,375 36.85 100.00 | |
------------+----------------------------------- | |
Total | 30,868 100.00 | |
. | |
. * Visualization of left-right political leaning by income decile and age cohort. | |
. gr bar wing_*, stack percent over(income) by(cohort, note("")) yti("") /// | |
> legend(order(1 "Left-wing" 2 "Centre" 3 "Right-wing") row(1)) /// | |
> scheme(burd3) name(pol_inc, replace) | |
. | |
. * Crosstabulation. | |
. tab income wing, row nof chi2 | |
Household' | | |
s total | | |
net | | |
income, | RECODE of rightwing (Placement | |
all | on left right scale) | |
sources | Left-wing Centre Right-win | Total | |
-----------+---------------------------------+---------- | |
D1 | 31.69 36.71 31.60 | 100.00 | |
2 | 29.92 35.66 34.42 | 100.00 | |
3 | 31.89 33.83 34.27 | 100.00 | |
4 | 32.42 33.58 33.99 | 100.00 | |
5 | 31.03 33.11 35.85 | 100.00 | |
6 | 30.87 32.43 36.70 | 100.00 | |
7 | 30.96 32.39 36.65 | 100.00 | |
8 | 33.42 28.13 38.45 | 100.00 | |
9 | 30.66 28.27 41.07 | 100.00 | |
D10 | 28.85 24.41 46.73 | 100.00 | |
-----------+---------------------------------+---------- | |
Total | 31.21 31.94 36.85 | 100.00 | |
Pearson chi2(18) = 252.3559 Pr = 0.000 | |
. tabchi income wing, p noo noe | |
Pearson residual | |
---------------------------------------------- | |
Household | | |
's total | | |
net | | |
income, | RECODE of rightwing (Placement on | |
all | left right scale) | |
sources | Left-wing Centre Right-wing | |
----------+----------------------------------- | |
D1 | 0.404 3.950 -4.049 | |
2 | -1.323 3.777 -2.298 | |
3 | 0.744 2.025 -2.570 | |
4 | 1.314 1.748 -2.836 | |
5 | -0.179 1.193 -0.946 | |
6 | -0.335 0.480 -0.139 | |
7 | -0.248 0.444 -0.185 | |
8 | 2.167 -3.685 1.436 | |
9 | -0.527 -3.469 3.714 | |
D10 | -2.198 -6.954 8.496 | |
---------------------------------------------- | |
Pearson chi2(18) = 252.3559 Pr = 0.000 | |
likelihood-ratio chi2(18) = 251.4758 Pr = 0.000 | |
. | |
. | |
. * ===================== | |
. * = REGRESSION MODELS = | |
. * ===================== | |
. | |
. | |
. * Linear regression | |
. * ----------------- | |
. | |
. global bl "age i.female i.born i.edu3 income rightwing" // store IV names | |
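. * A minimal sketch of how the global macro works, assuming Stata's bundled
. * -auto- dataset instead of the course data. The macro name "demo" and the
. * scalar name are hypothetical. Stata substitutes $demo with its contents
. * before running the command, so both regressions below are the same model.

```stata
* Sketch only: not part of the original course do-file.
sysuse auto, clear
global demo "mpg weight"             // store a list of IV names
quietly regress price $demo          // expands to: regress price mpg weight
scalar b_macro = _b[mpg]
quietly regress price mpg weight     // same model, typed in full
assert "$demo" == "mpg weight"
assert reldif(b_macro, _b[mpg]) < 1e-12
```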
. | |
. * Baseline OLS model. | |
. reg imdfetn $bl [pw = dpw] | |
(sum of wgt is 3.1650e+04) | |
Linear regression Number of obs = 30399 | |
F( 7, 30391) = 111.42 | |
Prob > F = 0.0000 | |
R-squared = 0.0746 | |
Root MSE = .86365 | |
------------------------------------------------------------------------------ | |
| Robust | |
imdfetn | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | .0006248 .0005947 1.05 0.293 -.0005408 .0017905 | |
1.female | .0455134 .0175361 2.60 0.009 .0111419 .0798848 | |
1.born | .24353 .0313883 7.76 0.000 .1820076 .3050524 | |
| | |
edu3 | | |
2 | -.1804094 .0221801 -8.13 0.000 -.2238834 -.1369353 | |
3 | -.3767914 .0225044 -16.74 0.000 -.4209009 -.3326819 | |
| | |
income | -.033461 .0033486 -9.99 0.000 -.0400244 -.0268976 | |
rightwing | .0374404 .0042896 8.73 0.000 .0290325 .0458483 | |
_cons | 2.388615 .0542555 44.03 0.000 2.282272 2.494958 | |
------------------------------------------------------------------------------ | |
. | |
. * Adjusted OLS model: observations clustered by country. | |
. reg imdfetn $bl [pw = dpw], vce(cluster cid) | |
(sum of wgt is 3.1650e+04) | |
Linear regression Number of obs = 30399 | |
F( 7, 25) = 94.48 | |
Prob > F = 0.0000 | |
R-squared = 0.0746 | |
Root MSE = .86365 | |
(Std. Err. adjusted for 26 clusters in cid) | |
------------------------------------------------------------------------------ | |
| Robust | |
imdfetn | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | .0006248 .0012625 0.49 0.625 -.0019754 .0032251 | |
1.female | .0455134 .0161814 2.81 0.009 .0121871 .0788397 | |
1.born | .24353 .0505925 4.81 0.000 .1393329 .3477271 | |
| | |
edu3 | | |
2 | -.1804094 .0738232 -2.44 0.022 -.332451 -.0283677 | |
3 | -.3767914 .0747884 -5.04 0.000 -.530821 -.2227617 | |
| | |
income | -.033461 .0053758 -6.22 0.000 -.0445326 -.0223894 | |
rightwing | .0374404 .0106904 3.50 0.002 .0154231 .0594577 | |
_cons | 2.388615 .1149658 20.78 0.000 2.151838 2.625391 | |
------------------------------------------------------------------------------ | |
. | |
. * The last option reads as 'variance-covariance estimation is clustered by cid'.
. * This specification requests cluster-robust standard errors, with respondents
. * grouped by country of residence. The coefficients are unchanged (compare the
. * two tables above): only the standard errors, test statistics and confidence
. * intervals are adjusted. Clustering matters when observations within the same
. * group are likely to resemble each other, which violates the assumption that
. * the error term is independently distributed across observations.
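. * A minimal sketch of what the cluster option changes, assuming Stata's bundled
. * -auto- dataset and using rep78 as a stand-in grouping variable (only five
. * groups, so this is for illustration, not inference). Scalar names are
. * hypothetical. The point estimate is identical across both fits; only the
. * standard error differs.

```stata
* Sketch only: not part of the original course do-file.
sysuse auto, clear
* Restrict both fits to the same sample (rep78 has missing values).
quietly regress price mpg weight if !missing(rep78)
scalar b_plain  = _b[mpg]
scalar se_plain = _se[mpg]
quietly regress price mpg weight if !missing(rep78), vce(cluster rep78)
scalar b_clust  = _b[mpg]
scalar se_clust = _se[mpg]
* Same coefficient, different standard error.
assert reldif(b_plain, b_clust) < 1e-10
display "SE (default) = " se_plain "   SE (clustered) = " se_clust
```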
. | |
. * Variance inflation. | |
. vif | |
Variable | VIF 1/VIF | |
-------------+---------------------- | |
age | 1.09 0.914717 | |
1.female | 1.01 0.993207 | |
1.born | 1.01 0.994724 | |
edu3 | | |
2 | 1.32 0.754937 | |
3 | 1.50 0.668159 | |
income | 1.19 0.837507 | |
rightwing | 1.01 0.989781 | |
-------------+---------------------- | |
Mean VIF | 1.16 | |
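. * A minimal sketch of where a VIF comes from, assuming Stata's bundled -auto-
. * dataset; scalar names are hypothetical. The VIF of a predictor is
. * 1 / (1 - R2), where R2 comes from regressing that predictor on the other
. * predictors, and the 1/VIF column is the corresponding tolerance. With only
. * two predictors, both share the same VIF.

```stata
* Sketch only: not part of the original course do-file.
sysuse auto, clear
quietly regress price mpg weight
quietly estat vif
scalar vif_reported = r(vif_1)        // VIF as reported by -estat vif-
quietly regress mpg weight            // auxiliary regression among the IVs
scalar vif_manual = 1 / (1 - e(r2))
assert reldif(vif_reported, vif_manual) < 1e-6
```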
. | |
. * Inspect residuals. | |
. predict r, resid | |
(469 missing values generated) | |
. | |
. * Diagnostic plots. | |
. hist r, normal /// | |
> name(r, replace) // distribution of residuals | |
(bin=44, start=-2.0704632, width=.09725008) | |
. rvfplot, yli(0) /// | |
> name(rvf, replace) // residuals vs. fitted values | |
. | |
. * Export. | |
. eststo clear | |
. eststo lin_1: reg imdfetn $bl [pw = dpw] | |
(sum of wgt is 3.1650e+04) | |
Linear regression Number of obs = 30399 | |
F( 7, 30391) = 111.42 | |
Prob > F = 0.0000 | |
R-squared = 0.0746 | |
Root MSE = .86365 | |
------------------------------------------------------------------------------ | |
| Robust | |
imdfetn | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | .0006248 .0005947 1.05 0.293 -.0005408 .0017905 | |
1.female | .0455134 .0175361 2.60 0.009 .0111419 .0798848 | |
1.born | .24353 .0313883 7.76 0.000 .1820076 .3050524 | |
| | |
edu3 | | |
2 | -.1804094 .0221801 -8.13 0.000 -.2238834 -.1369353 | |
3 | -.3767914 .0225044 -16.74 0.000 -.4209009 -.3326819 | |
| | |
income | -.033461 .0033486 -9.99 0.000 -.0400244 -.0268976 | |
rightwing | .0374404 .0042896 8.73 0.000 .0290325 .0458483 | |
_cons | 2.388615 .0542555 44.03 0.000 2.282272 2.494958 | |
------------------------------------------------------------------------------ | |
. eststo lin_2: reg imdfetn $bl [pw = dpw], vce(cluster cid) | |
(sum of wgt is 3.1650e+04) | |
Linear regression Number of obs = 30399 | |
F( 7, 25) = 94.48 | |
Prob > F = 0.0000 | |
R-squared = 0.0746 | |
Root MSE = .86365 | |
(Std. Err. adjusted for 26 clusters in cid) | |
------------------------------------------------------------------------------ | |
| Robust | |
imdfetn | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | .0006248 .0012625 0.49 0.625 -.0019754 .0032251 | |
1.female | .0455134 .0161814 2.81 0.009 .0121871 .0788397 | |
1.born | .24353 .0505925 4.81 0.000 .1393329 .3477271 | |
| | |
edu3 | | |
2 | -.1804094 .0738232 -2.44 0.022 -.332451 -.0283677 | |
3 | -.3767914 .0747884 -5.04 0.000 -.530821 -.2227617 | |
| | |
income | -.033461 .0053758 -6.22 0.000 -.0445326 -.0223894 | |
rightwing | .0374404 .0106904 3.50 0.002 .0154231 .0594577 | |
_cons | 2.388615 .1149658 20.78 0.000 2.151838 2.625391 | |
------------------------------------------------------------------------------ | |
. esttab lin_? using week10_regressions.txt, mti("OLS" "Adj. OLS") replace | |
(note: file week10_regressions.txt not found) | |
(output written to week10_regressions.txt) | |
. | |
. * The diagnostics clearly identify the issue here: the limited number of levels | |
. * in the DV is causing residuals to follow a low-dimensional pattern that does | |
. * not approximate a normal distribution. The residuals, for instance, follow a | |
. * quadrimodal distribution that reflects the number of levels in the DV. The data
. * therefore fail to fit the assumptions of the model by design. | |
. | |
. * We turn to a logistic regression (logit) model, which accepts only dichotomous | |
. * outcomes. The binary/dummy recoding of the DV was computed earlier as follows: | |
. tab diff imdfetn | |
Allow | | |
many/some | | |
migrants | | |
of | | |
different | | |
race/ethni | Allow many/few immigrants of different | |
city from | race/ethnic group from majority | |
majority | Many Some Few None | Total | |
-----------+--------------------------------------------+---------- | |
0 | 0 0 10,386 4,610 | 14,996 | |
1 | 3,959 11,913 0 0 | 15,872 | |
-----------+--------------------------------------------+---------- | |
Total | 3,959 11,913 10,386 4,610 | 30,868 | |
. | |
. * You are very welcome to consult the UCLA Stata FAQ pages to learn how logistic | |
. * regression works if you are interested in estimating a logit model. Otherwise, | |
. * just follow the code and comments below to get some basic ideas. The following | |
. * is a very short demo: it would take a full course to explain logistic models | |
. * properly, and you are very welcome to ask for one :) | |
. | |
. | |
. * Logistic regression | |
. * ------------------- | |
. | |
. * Binarize the DV again to have 1 = few or no immigrants ("Few" or "None").
. gen nomigrants = (imdfetn > 2) | |
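. * A caveat on this kind of recoding, sketched on toy data (variable names are
. * hypothetical). Stata treats missing values as larger than any number, so a
. * bare (x > 2) codes missing observations as 1. In this log the counts match
. * (14,996 "Few" or "None" answers out of 30,868), so the result is unaffected,
. * but guarding against missing values is a safer habit.

```stata
* Sketch only: not part of the original course do-file.
clear
input x
1
2
3
4
.
end
gen byte naive = (x > 2)                   // missing x is coded 1
gen byte safe  = (x > 2) if !missing(x)    // missing x stays missing
assert naive == 1 in 5
assert missing(safe) in 5
```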
. | |
. * Column percentages (conditional probabilities). | |
. tab cohort nomigrants, col nof | |
| nomigrants | |
cohort | 0 1 | Total | |
-----------+----------------------+---------- | |
25 | 20.77 16.24 | 18.57 | |
35 | 21.92 18.78 | 20.39 | |
45 | 20.81 18.80 | 19.83 | |
55 | 17.75 19.11 | 18.41 | |
65 | 11.93 15.96 | 13.89 | |
75 | 6.82 11.12 | 8.91 | |
-----------+----------------------+---------- | |
Total | 100.00 100.00 | 100.00 | |
. | |
. * Odds of Y = 1 (cases/controls) by cohort; the log-odds are ln(odds).
. tabodds nomigrants cohort | |
-------------------------------------------------------------------------- | |
cohort | cases controls odds [95% Conf. Interval] | |
------------+------------------------------------------------------------- | |
25 | 2435 3296 0.73877 0.70108 0.77850 | |
35 | 2816 3479 0.80943 0.77020 0.85066 | |
45 | 2819 3303 0.85347 0.81163 0.89745 | |
55 | 2866 2817 1.01739 0.96584 1.07170 | |
65 | 2393 1894 1.26346 1.18955 1.34197 | |
75 | 1667 1083 1.53924 1.42589 1.66161 | |
-------------------------------------------------------------------------- | |
Test of homogeneity (equal odds): chi2(5) = 395.42 | |
Pr>chi2 = 0.0000 | |
Score test for trend of odds: chi2(1) = 372.18 | |
Pr>chi2 = 0.0000 | |
. | |
. * Odds ratios: odds in each cohort relative to the baseline cohort (25).
. tabodds nomigrants cohort, or | |
--------------------------------------------------------------------------- | |
cohort | Odds Ratio chi2 P>chi2 [95% Conf. Interval] | |
-------------+------------------------------------------------------------- | |
25 | 1.000000 . . . . | |
35 | 1.095636 6.15 0.0131 1.019308 1.177681 | |
45 | 1.155247 15.19 0.0001 1.074309 1.242282 | |
55 | 1.377138 72.37 0.0000 1.278854 1.482976 | |
65 | 1.710216 174.57 0.0000 1.577840 1.853698 | |
75 | 2.083509 244.56 0.0000 1.896433 2.289040 | |
--------------------------------------------------------------------------- | |
Test of homogeneity (equal odds): chi2(5) = 395.42 | |
Pr>chi2 = 0.0000 | |
Score test for trend of odds: chi2(1) = 372.18 | |
Pr>chi2 = 0.0000 | |
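. * A minimal sketch of the arithmetic behind the two tables above, using the
. * case and control counts they report for cohorts 25 and 35 (scalar names are
. * hypothetical). The odds are cases divided by controls, and each odds ratio
. * divides a cohort's odds by the odds of the baseline cohort.

```stata
* Sketch only: not part of the original course do-file.
scalar odds_25 = 2435 / 3296          // cases / controls, cohort 25
scalar odds_35 = 2816 / 3479          // cases / controls, cohort 35
scalar or_35   = odds_35 / odds_25    // odds ratio against the baseline
assert abs(odds_25 - 0.73877)  < 1e-4
assert abs(odds_35 - 0.80943)  < 1e-4
assert abs(or_35   - 1.095636) < 1e-5
```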
. | |
. * Logistic regression with log-odds. | |
. logit nomigrants i.cohort | |
Iteration 0: log likelihood = -21383.636 | |
Iteration 1: log likelihood = -21185.227 | |
Iteration 2: log likelihood = -21185.21 | |
Iteration 3: log likelihood = -21185.21 | |
Logistic regression Number of obs = 30868 | |
LR chi2(5) = 396.85 | |
Prob > chi2 = 0.0000 | |
Log likelihood = -21185.21 Pseudo R2 = 0.0093 | |
------------------------------------------------------------------------------ | |
nomigrants | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
cohort | | |
35 | .0913354 .0368324 2.48 0.013 .0191452 .1635256 | |
45 | .1443139 .0370347 3.90 0.000 .0717273 .2169005 | |
55 | .3200077 .0376561 8.50 0.000 .2462031 .3938123 | |
65 | .5366197 .0407424 13.17 0.000 .456766 .6164733 | |
75 | .7340535 .0473003 15.52 0.000 .6413466 .8267603 | |
| | |
_cons | -.3027629 .0267222 -11.33 0.000 -.3551374 -.2503883 | |
------------------------------------------------------------------------------ | |
. | |
. * Logistic regression with odds ratios. | |
. logit nomigrants i.cohort, or | |
Iteration 0: log likelihood = -21383.636 | |
Iteration 1: log likelihood = -21185.227 | |
Iteration 2: log likelihood = -21185.21 | |
Iteration 3: log likelihood = -21185.21 | |
Logistic regression Number of obs = 30868 | |
LR chi2(5) = 396.85 | |
Prob > chi2 = 0.0000 | |
Log likelihood = -21185.21 Pseudo R2 = 0.0093 | |
------------------------------------------------------------------------------ | |
nomigrants | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
cohort | | |
35 | 1.095636 .040355 2.48 0.013 1.01933 1.177656 | |
45 | 1.155247 .0427842 3.90 0.000 1.074362 1.242221 | |
55 | 1.377138 .0518577 8.50 0.000 1.279159 1.482622 | |
65 | 1.710216 .0696783 13.17 0.000 1.578959 1.852384 | |
75 | 2.083509 .0985506 15.52 0.000 1.899036 2.285901 | |
| | |
_cons | .7387743 .0197417 -11.33 0.000 .7010771 .7784984 | |
------------------------------------------------------------------------------ | |
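. * A minimal sketch linking the two logit tables above, with coefficients
. * copied from them (scalar names are hypothetical). Exponentiating a log-odds
. * coefficient gives the odds ratio, and -invlogit()- turns the intercept into
. * the share of Y = 1 in the baseline cohort (2,435 cases out of 5,731).

```stata
* Sketch only: not part of the original course do-file.
scalar b_cons = -.3027629             // intercept, in log-odds
scalar b_35   =  .0913354             // cohort 35, in log-odds
assert abs(exp(b_cons) - .7387743) < 1e-6    // odds in cohort 25
assert abs(exp(b_35)   - 1.095636) < 1e-5    // odds ratio for cohort 35
scalar p_25 = invlogit(b_cons)        // probability of Y = 1 in cohort 25
assert abs(p_25 - 2435 / (2435 + 3296)) < 1e-6
```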
. | |
. * Baseline model. | |
. logit nomigrants $bl [pw = dpw] // coefficients are log-odds | |
Iteration 0: log pseudolikelihood = -21922.411 | |
Iteration 1: log pseudolikelihood = -20962.96 | |
Iteration 2: log pseudolikelihood = -20960.853 | |
Iteration 3: log pseudolikelihood = -20960.853 | |
Logistic regression Number of obs = 30399 | |
Wald chi2(7) = 572.04 | |
Prob > chi2 = 0.0000 | |
Log pseudolikelihood = -20960.853 Pseudo R2 = 0.0439 | |
------------------------------------------------------------------------------ | |
| Robust | |
nomigrants | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | .0020519 .0013714 1.50 0.135 -.0006359 .0047398 | |
1.female | .1106589 .0407043 2.72 0.007 .0308799 .1904379 | |
1.born | .4735991 .0783621 6.04 0.000 .3200122 .6271859 | |
| | |
edu3 | | |
2 | -.345609 .0500825 -6.90 0.000 -.4437688 -.2474492 | |
3 | -.7982442 .0537806 -14.84 0.000 -.9036522 -.6928362 | |
| | |
income | -.0678676 .0079785 -8.51 0.000 -.0835052 -.05223 | |
rightwing | .0752383 .0094785 7.94 0.000 .0566609 .0938158 | |
_cons | -.3394748 .1285284 -2.64 0.008 -.5913859 -.0875636 | |
------------------------------------------------------------------------------ | |
. | |
. * Log-odds coefficients are changes in the logarithm of the odds that the DV is
. * equal to 1. Negative log-odds imply that an increase in the IV, or its presence,
. * reduces the probability of the DV being equal to 1. Log-odds can be compared by
. * magnitude, but at this stage, it is usually simpler to read only the sign of the
. * coefficient and its significance level (p-value, or whether the confidence
. * interval excludes zero).
. | |
. * Odds ratios. | |
. logit nomigrants $bl [pw = dpw], or | |
Iteration 0: log pseudolikelihood = -21922.411 | |
Iteration 1: log pseudolikelihood = -20962.96 | |
Iteration 2: log pseudolikelihood = -20960.853 | |
Iteration 3: log pseudolikelihood = -20960.853 | |
Logistic regression Number of obs = 30399 | |
Wald chi2(7) = 572.04 | |
Prob > chi2 = 0.0000 | |
Log pseudolikelihood = -20960.853 Pseudo R2 = 0.0439 | |
------------------------------------------------------------------------------ | |
| Robust | |
nomigrants | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | 1.002054 .0013742 1.50 0.135 .9993643 1.004751 | |
1.female | 1.117014 .0454673 2.72 0.007 1.031362 1.209779 | |
1.born | 1.605763 .1258309 6.04 0.000 1.377145 1.872334 | |
| | |
edu3 | | |
2 | .7077892 .0354478 -6.90 0.000 .6416137 .7807899 | |
3 | .4501186 .0242076 -14.84 0.000 .4050875 .5001555 | |
| | |
income | .9343842 .007455 -8.51 0.000 .9198863 .9491106 | |
rightwing | 1.078141 .0102191 7.94 0.000 1.058297 1.098357 | |
_cons | .7121443 .0915308 -2.64 0.008 .5535596 .9161606 | |
------------------------------------------------------------------------------ | |
. | |
. * Odds ratios provide an easier means of comparison between coefficients: for
. * example, in this model, the highest level of education (edu3 = 3) multiplies
. * the odds of answering "Few" or "None" by 0.45 compared to the lowest level.
. * Equivalently, the odds that higher-educated respondents answered "Some" or
. * "Many" to the original question are 1 / 0.45 = 2.22 times higher.
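. * A minimal sketch of how to read an odds ratio below 1 in the other
. * direction, using the edu3 = 3 estimates from the two baseline tables above
. * (scalar names are hypothetical). Because the DV is coded 1 for "Few" or
. * "None", the reciprocal of the odds ratio gives the odds of "Some" or "Many".

```stata
* Sketch only: not part of the original course do-file.
scalar or_high = .4501186             // odds ratio for edu3 = 3 (Few/None)
scalar or_flip = 1 / or_high          // odds ratio for Some/Many
* The reciprocal equals the exponential of the sign-flipped coefficient.
assert abs(or_flip - exp(.7982442)) < 1e-5
assert abs(or_flip - 2.2216) < 1e-3
```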
. | |
. * Adjusted model. | |
. logit nomigrants $bl [pw = dpw], vce(cluster cid) | |
Iteration 0: log pseudolikelihood = -21922.411 | |
Iteration 1: log pseudolikelihood = -20962.96 | |
Iteration 2: log pseudolikelihood = -20960.853 | |
Iteration 3: log pseudolikelihood = -20960.853 | |
Logistic regression Number of obs = 30399 | |
Wald chi2(7) = 329.05 | |
Prob > chi2 = 0.0000 | |
Log pseudolikelihood = -20960.853 Pseudo R2 = 0.0439 | |
(Std. Err. adjusted for 26 clusters in cid) | |
------------------------------------------------------------------------------ | |
| Robust | |
nomigrants | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | .0020519 .0030196 0.68 0.497 -.0038664 .0079703 | |
1.female | .1106589 .0423726 2.61 0.009 .0276101 .1937077 | |
1.born | .4735991 .1093272 4.33 0.000 .2593217 .6878765 | |
| | |
edu3 | | |
2 | -.345609 .1412374 -2.45 0.014 -.6224291 -.0687888 | |
3 | -.7982442 .1598086 -5.00 0.000 -1.111463 -.4850251 | |
| | |
income | -.0678676 .0123987 -5.47 0.000 -.0921686 -.0435666 | |
rightwing | .0752383 .0258223 2.91 0.004 .0246275 .1258492 | |
_cons | -.3394748 .2798738 -1.21 0.225 -.8880173 .2090677 | |
------------------------------------------------------------------------------ | |
. | |
. * Odds ratios. | |
. logit nomigrants $bl [pw = dpw], vce(cluster cid) or | |
Iteration 0: log pseudolikelihood = -21922.411 | |
Iteration 1: log pseudolikelihood = -20962.96 | |
Iteration 2: log pseudolikelihood = -20960.853 | |
Iteration 3: log pseudolikelihood = -20960.853 | |
Logistic regression Number of obs = 30399 | |
Wald chi2(7) = 329.05 | |
Prob > chi2 = 0.0000 | |
Log pseudolikelihood = -20960.853 Pseudo R2 = 0.0439 | |
(Std. Err. adjusted for 26 clusters in cid) | |
------------------------------------------------------------------------------ | |
| Robust | |
nomigrants | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | 1.002054 .0030258 0.68 0.497 .9961411 1.008002 | |
1.female | 1.117014 .0473308 2.61 0.009 1.027995 1.213741 | |
1.born | 1.605763 .1755536 4.33 0.000 1.296051 1.989486 | |
| | |
edu3 | | |
2 | .7077892 .0999663 -2.45 0.014 .5366393 .9335238 | |
3 | .4501186 .0719328 -5.00 0.000 .329077 .6156818 | |
| | |
income | .9343842 .0115851 -5.47 0.000 .9119514 .9573688 | |
rightwing | 1.078141 .0278401 2.91 0.004 1.024933 1.134111 | |
_cons | .7121443 .1993105 -1.21 0.225 .4114708 1.232528 | |
------------------------------------------------------------------------------ | |
. | |
. * Export. | |
. eststo clear | |
. eststo log_1: logit nomigrants $bl [pw = dpw] | |
Iteration 0: log pseudolikelihood = -21922.411 | |
Iteration 1: log pseudolikelihood = -20962.96 | |
Iteration 2: log pseudolikelihood = -20960.853 | |
Iteration 3: log pseudolikelihood = -20960.853 | |
Logistic regression Number of obs = 30399 | |
Wald chi2(7) = 572.04 | |
Prob > chi2 = 0.0000 | |
Log pseudolikelihood = -20960.853 Pseudo R2 = 0.0439 | |
------------------------------------------------------------------------------ | |
| Robust | |
nomigrants | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | .0020519 .0013714 1.50 0.135 -.0006359 .0047398 | |
1.female | .1106589 .0407043 2.72 0.007 .0308799 .1904379 | |
1.born | .4735991 .0783621 6.04 0.000 .3200122 .6271859 | |
| | |
edu3 | | |
2 | -.345609 .0500825 -6.90 0.000 -.4437688 -.2474492 | |
3 | -.7982442 .0537806 -14.84 0.000 -.9036522 -.6928362 | |
| | |
income | -.0678676 .0079785 -8.51 0.000 -.0835052 -.05223 | |
rightwing | .0752383 .0094785 7.94 0.000 .0566609 .0938158 | |
_cons | -.3394748 .1285284 -2.64 0.008 -.5913859 -.0875636 | |
------------------------------------------------------------------------------ | |
. eststo log_2: logit nomigrants $bl [pw = dpw], vce(cluster cid) | |
Iteration 0: log pseudolikelihood = -21922.411 | |
Iteration 1: log pseudolikelihood = -20962.96 | |
Iteration 2: log pseudolikelihood = -20960.853 | |
Iteration 3: log pseudolikelihood = -20960.853 | |
Logistic regression Number of obs = 30399 | |
Wald chi2(7) = 329.05 | |
Prob > chi2 = 0.0000 | |
Log pseudolikelihood = -20960.853 Pseudo R2 = 0.0439 | |
(Std. Err. adjusted for 26 clusters in cid) | |
------------------------------------------------------------------------------ | |
| Robust | |
nomigrants | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | .0020519 .0030196 0.68 0.497 -.0038664 .0079703 | |
1.female | .1106589 .0423726 2.61 0.009 .0276101 .1937077 | |
1.born | .4735991 .1093272 4.33 0.000 .2593217 .6878765 | |
| | |
edu3 | | |
2 | -.345609 .1412374 -2.45 0.014 -.6224291 -.0687888 | |
3 | -.7982442 .1598086 -5.00 0.000 -1.111463 -.4850251 | |
| | |
income | -.0678676 .0123987 -5.47 0.000 -.0921686 -.0435666 | |
rightwing | .0752383 .0258223 2.91 0.004 .0246275 .1258492 | |
_cons | -.3394748 .2798738 -1.21 0.225 -.8880173 .2090677 | |
------------------------------------------------------------------------------ | |
. esttab log_? using week10_logits.txt, mti("Logit" "Adj. logit") replace | |
(note: file week10_logits.txt not found) | |
(output written to week10_logits.txt) | |
. | |
. | |
. * Marginal effects | |
. * ---------------- | |
. | |
. * Predictive margins of political attitude: estimated probability of the DV at
. * each value of the 11-point (0-10) left/right scale used in the model, averaged
. * over the observed values of all other factors (demographics, education and
. * income).
. margins, at(rightwing = (0(1)10)) | |
Predictive margins Number of obs = 30399 | |
Model VCE : Robust | |
Expression : Pr(nomigrants), predict() | |
1._at : rightwing = 0 | |
2._at : rightwing = 1 | |
3._at : rightwing = 2 | |
4._at : rightwing = 3 | |
5._at : rightwing = 4 | |
6._at : rightwing = 5 | |
7._at : rightwing = 6 | |
8._at : rightwing = 7 | |
9._at : rightwing = 8 | |
10._at : rightwing = 9 | |
11._at : rightwing = 10 | |
------------------------------------------------------------------------------ | |
| Delta-method | |
| Margin Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
_at | | |
1 | .3941882 .0538393 7.32 0.000 .2886651 .4997112 | |
2 | .4113932 .0494438 8.32 0.000 .3144852 .5083012 | |
3 | .4288029 .0449954 9.53 0.000 .3406135 .5169923 | |
4 | .4463777 .0405952 11.00 0.000 .3668126 .5259428 | |
5 | .4640767 .0363763 12.76 0.000 .3927805 .5353728 | |
6 | .4818579 .0325163 14.82 0.000 .4181271 .5455887 | |
7 | .4996788 .0292476 17.08 0.000 .4423546 .557003 | |
8 | .5174964 .0268496 19.27 0.000 .4648722 .5701205 | |
9 | .5352678 .0255948 20.91 0.000 .4851029 .5854327 | |
10 | .5529507 .0256391 21.57 0.000 .502699 .6032025 | |
11 | .5705037 .0269283 21.19 0.000 .5177252 .6232822 | |
------------------------------------------------------------------------------ | |
. marginsplot, xla(minmax) recast(line) recastci(rarea) ciopts(col(*.6)) /// | |
> name(mfx_right, replace) | |
Variables that uniquely identify margins: rightwing | |
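. * A minimal sketch of what a predictive margin computes, assuming Stata's
. * bundled -auto- dataset and a toy logit model; variable and scalar names are
. * hypothetical. The margin at a given value is the average predicted
. * probability after setting that predictor to the value for every
. * observation, leaving the other predictors as observed.

```stata
* Sketch only: not part of the original course do-file.
sysuse auto, clear
quietly logit foreign mpg weight
quietly margins, at(mpg = 20)
matrix m = r(b)
scalar margin_20 = m[1,1]             // margin reported by -margins-
preserve
quietly replace mpg = 20              // same value for every observation
predict double p_hat, pr              // predicted probabilities
quietly summarize p_hat
scalar manual_20 = r(mean)            // average prediction, by hand
restore
assert reldif(margin_20, manual_20) < 1e-6
```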
. | |
. * Marginal effects of educational attainment, by gender and country of birth. | |
. * The margins command will generate estimates for all possible combinations of
. * the IV values provided, and then plot them with confidence intervals.
. margins born#female, at(edu3 = (1(1)3)) | |
Predictive margins Number of obs = 30399 | |
Model VCE : Robust | |
Expression : Pr(nomigrants), predict() | |
1._at : edu3 = 1 | |
2._at : edu3 = 2 | |
3._at : edu3 = 3 | |
--------------------------------------------------------------------------------- | |
| Delta-method | |
| Margin Std. Err. z P>|z| [95% Conf. Interval] | |
----------------+---------------------------------------------------------------- | |
_at#born#female | | |
1 0 0 | .44634 .0258202 17.29 0.000 .3957333 .4969468 | |
1 0 1 | .4733903 .0293437 16.13 0.000 .4158777 .5309029 | |
1 1 0 | .5623557 .0273294 20.58 0.000 .508791 .6159204 | |
1 1 1 | .5889661 .0317194 18.57 0.000 .5267972 .6511349 | |
2 0 0 | .364512 .0412211 8.84 0.000 .2837202 .4453038 | |
2 0 1 | .3901174 .0459254 8.49 0.000 .3001052 .4801296 | |
2 1 0 | .477645 .0412908 11.57 0.000 .3967165 .5585734 | |
2 1 1 | .5048604 .0466906 10.81 0.000 .4133486 .5963722 | |
3 0 0 | .2684944 .0343446 7.82 0.000 .2011804 .3358085 | |
3 0 1 | .2904831 .0391938 7.41 0.000 .2136647 .3673016 | |
3 1 0 | .36931 .0379644 9.73 0.000 .2949012 .4437188 | |
3 1 1 | .3950411 .0440971 8.96 0.000 .3086122 .4814699 | |
--------------------------------------------------------------------------------- | |
. marginsplot, xla(minmax) by(female born) /// | |
> name(mfx_demog, replace) | |
Variables that uniquely identify margins: edu3 born female | |
. | |
. * Effect of increasing age on the probability of the DV being equal to 1, by sex | |
. * and country of birth. The overlap in confidence intervals illustrates the weak | |
. * value of age as a predictor for the DV: the marginal effect of age is residual | |
. * in the model, at least in comparison to other predictors. | |
. margins born#female, at(age=(25(5)85)) | |
Predictive margins Number of obs = 30399 | |
Model VCE : Robust | |
Expression : Pr(nomigrants), predict() | |
1._at : age = 25 | |
2._at : age = 30 | |
3._at : age = 35 | |
4._at : age = 40 | |
5._at : age = 45 | |
6._at : age = 50 | |
7._at : age = 55 | |
8._at : age = 60 | |
9._at : age = 65 | |
10._at : age = 70 | |
11._at : age = 75 | |
12._at : age = 80 | |
13._at : age = 85 | |
--------------------------------------------------------------------------------- | |
| Delta-method | |
| Margin Std. Err. z P>|z| [95% Conf. Interval] | |
----------------+---------------------------------------------------------------- | |
_at#born#female | | |
1 0 0 | .3584472 .0367548 9.75 0.000 .2864091 .4304852 | |
1 0 1 | .3830138 .0395727 9.68 0.000 .3054528 .4605748 | |
1 1 0 | .4670019 .039842 11.72 0.000 .3889129 .5450908 | |
1 1 1 | .4931773 .0432812 11.39 0.000 .4083478 .5780069 | |
2 0 0 | .3606976 .0347461 10.38 0.000 .2925966 .4287987 | |
2 0 1 | .3853225 .0378633 10.18 0.000 .3111117 .4595332 | |
2 1 0 | .4694238 .0374101 12.55 0.000 .3961014 .5427462 | |
2 1 1 | .4956078 .0412816 12.01 0.000 .4146974 .5765182 | |
3 0 0 | .3629539 .0329269 11.02 0.000 .2984183 .4274895 | |
3 0 1 | .387636 .0363767 10.66 0.000 .316339 .458933 | |
3 1 0 | .4718471 .0351654 13.42 0.000 .4029242 .54077 | |
3 1 1 | .4980384 .0395035 12.61 0.000 .420613 .5754638 | |
4 0 0 | .3652159 .0313362 11.65 0.000 .3037982 .4266337 | |
4 0 1 | .3899543 .0351455 11.10 0.000 .3210704 .4588383 | |
4 1 0 | .4742716 .0331478 14.31 0.000 .409303 .5392401 | |
4 1 1 | .5004691 .0379786 13.18 0.000 .4260324 .5749057 | |
5 0 0 | .3674836 .030016 12.24 0.000 .3086534 .4263138 | |
5 0 1 | .3922774 .0342019 11.47 0.000 .3252429 .4593119 | |
5 1 0 | .4766972 .0314029 15.18 0.000 .4151485 .5382458 | |
5 1 1 | .5028997 .0367388 13.69 0.000 .4308929 .5749065 | |
6 0 0 | .3697568 .0290093 12.75 0.000 .3128996 .4266139 | |
6 0 1 | .394605 .0335745 11.75 0.000 .3288002 .4604098 | |
6 1 0 | .4791237 .02998 15.98 0.000 .420364 .5378835 | |
6 1 1 | .5053301 .0358141 14.11 0.000 .4351357 .5755245 | |
7 0 0 | .3720355 .0283555 13.12 0.000 .3164597 .4276112 | |
7 0 1 | .3969372 .0332854 11.93 0.000 .3316989 .4621754 | |
7 1 0 | .4815512 .0289281 16.65 0.000 .4248533 .5382492 | |
7 1 1 | .5077602 .0352293 14.41 0.000 .438712 .5768084 | |
8 0 0 | .3743195 .0280851 13.33 0.000 .3192737 .4293653 | |
8 0 1 | .3992738 .0333477 11.97 0.000 .3339135 .464634 | |
8 1 0 | .4839795 .0282897 17.11 0.000 .4285327 .5394262 | |
8 1 1 | .51019 .0350013 14.58 0.000 .4415887 .5787912 | |
9 0 0 | .3766089 .0282149 13.35 0.000 .3213088 .431909 | |
9 0 1 | .4016147 .0337632 11.90 0.000 .33544 .4677893 | |
9 1 0 | .4864084 .0280941 17.31 0.000 .431345 .5414718 | |
9 1 1 | .5126192 .0351367 14.59 0.000 .4437526 .5814857 | |
10 0 0 | .3789035 .0287447 13.18 0.000 .3225649 .4352422 | |
10 0 1 | .4039598 .034523 11.70 0.000 .3362961 .4716236 | |
10 1 0 | .4888379 .0283512 17.24 0.000 .4332706 .5444051 | |
10 1 1 | .5150478 .0356307 14.46 0.000 .4452128 .5848828 | |
11 0 0 | .3812033 .0296584 12.85 0.000 .3230738 .4393327 | |
11 0 1 | .4063091 .0356083 11.41 0.000 .3365181 .4761001 | |
11 1 0 | .4912678 .0290494 16.91 0.000 .434332 .5482036 | |
11 1 1 | .5174757 .0364683 14.19 0.000 .4459992 .5889522 | |
12 0 0 | .383508 .0309267 12.40 0.000 .3228929 .4441232 | |
12 0 1 | .4086624 .0369938 11.05 0.000 .3361559 .4811689 | |
12 1 0 | .4936981 .0301585 16.37 0.000 .4345886 .5528076 | |
12 1 1 | .5199027 .0376254 13.82 0.000 .4461582 .5936473 | |
13 0 0 | .3858177 .0325124 11.87 0.000 .3220947 .4495408 | |
13 0 1 | .4110196 .03865 10.63 0.000 .3352671 .4867722 | |
13 1 0 | .4961286 .0316351 15.68 0.000 .4341249 .5581323 | |
13 1 1 | .5223289 .0390729 13.37 0.000 .4457474 .5989103 | |
--------------------------------------------------------------------------------- | |
. marginsplot, by(female) recast(line) recastci(rarea) ciopts(col(*.6)) /// | |
> name(mfx_age, replace) | |
Variables that uniquely identify margins: age born female | |
. | |
. | |
. * Sensitivity analysis | |
. * -------------------- | |
. | |
. * Ordered logistic regression, to test the cut point that we chose when recoding | |
. * the DV to a dummy. The results should show identical signs on the coefficients | |
. * and their order of magnitude should also stay stable. If not, then the model | |
. * is sensitive to the choice of cutoff point that we made earlier. Signs are only | |
. * comparable if the dummy runs in the same direction as the original variable: | |
. * here, the margins shown above and the ordered logit coefficients agree in | |
. * direction for age, sex and country of birth. | |
. ologit imdfetn $bl [pw = dpw], vce(cluster cid) | |
Iteration 0: log pseudolikelihood = -40526.845 | |
Iteration 1: log pseudolikelihood = -39309.792 | |
Iteration 2: log pseudolikelihood = -39302.192 | |
Iteration 3: log pseudolikelihood = -39302.188 | |
Iteration 4: log pseudolikelihood = -39302.188 | |
Ordered logistic regression Number of obs = 30399 | |
Wald chi2(7) = 443.81 | |
Prob > chi2 = 0.0000 | |
Log pseudolikelihood = -39302.188 Pseudo R2 = 0.0302 | |
(Std. Err. adjusted for 26 clusters in cid) | |
------------------------------------------------------------------------------ | |
| Robust | |
imdfetn | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age | .0012223 .0027681 0.44 0.659 -.0042031 .0066477 | |
1.female | .100056 .0331588 3.02 0.003 .0350661 .165046 | |
1.born | .4944239 .1045674 4.73 0.000 .2894756 .6993722 | |
| | |
edu3 | | |
2 | -.3890632 .1585103 -2.45 0.014 -.6997377 -.0783887 | |
3 | -.8055187 .1636851 -4.92 0.000 -1.126336 -.4847017 | |
| | |
income | -.0710706 .0116204 -6.12 0.000 -.0938461 -.0482951 | |
rightwing | .0839352 .0273557 3.07 0.002 .0303189 .1375515 | |
-------------+---------------------------------------------------------------- | |
/cut1 | -1.794026 .1873578 -2.161241 -1.426812 | |
/cut2 | .3108741 .2636075 -.2057871 .8275354 | |
/cut3 | 2.053481 .3593456 1.349176 2.757785 | |
------------------------------------------------------------------------------ | |
. | |
. | |
. * Export model results | |
. * -------------------- | |
. | |
. eststo clear | |
. eststo lin_1: qui reg imdfetn $bl [pw = dpw], b | |
. eststo lin_2: qui reg imdfetn $bl [pw = dpw], vce(cluster cid) | |
. eststo log_1: qui logit nomigrants $bl [pw = dpw] | |
. eststo log_2: qui logit nomigrants $bl [pw = dpw], vce(cluster cid) | |
. eststo log_3: qui ologit imdfetn $bl [pw = dpw], vce(cluster cid) | |
. esttab lin_* log_* using week10_models.txt, constant label beta(2) se(2) r2(2) /// | |
> mti("OLS" "Adj. OLS" "Logit" "Adj. logit" "Ord. logit") replace | |
(note: file week10_models.txt not found) | |
(output written to week10_models.txt) | |
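. | |
. /* Sketch (not part of the original run): with the models stored above, the | |
> signs of the coefficients can be compared side by side on screen, which is | |
> the point of the sensitivity analysis. This assumes that the estout package | |
> is installed and that the stored estimates are still in memory: | |
> | |
> esttab lin_2 log_2 log_3, b(3) not mti("Adj. OLS" "Adj. logit" "Ord. logit") | |
> | |
> The -not- option hides the t statistics to keep the table short. */ | |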
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Thanks for following! And all the best for the future. | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require estout fre leanout renvars scheme-burd spineplot | |
. | |
. * Log results. | |
. cap log using code/week11.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 11 ------------------ | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Satisfaction with Health Services in Britain and France | |
> | |
> - DATA: European Social Survey Round 4 (2008) | |
> | |
> We explore patterns of satisfaction with the state of health services in | |
> the UK and France, two countries with extensive public healthcare systems | |
> and where health services play different roles in political competition. | |
> | |
> - (H1): We expect to observe high satisfaction on average, except among those | |
> in ill health, who we expect to report lower satisfaction regardless of age, | |
> sex, income or political views. | |
> | |
> - (H2): We also expect respondents in political opposition to the government to | |
> report less satisfaction with the state of health services in the country, | |
> independently of all other characteristics. | |
> | |
> - (H3): We finally expect to find lower levels of satisfaction among those | |
> who report financial difficulties, as evidence of an income effect that | |
> we expect to exist independently of all others. | |
> | |
> We use data from the European Social Survey (ESS) Round 4. The sample used in | |
> the analysis contains N = 1,942 French and N = 2,079 UK individuals selected | |
> through stratified probability sampling and interviewed face-to-face in 2008. | |
> | |
> We run linear regressions for each country to assess whether satisfaction | |
> with health services can be predicted from political views, independently | |
> of age, sex, health status and financial situation. | |
> | |
> Last updated 2013-05-31. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load ESS dataset. | |
. use data/ess2008, clear | |
(European Social Survey 2008) | |
. | |
. * Country-specific design weight, multiplied by country-level population weight. | |
. gen dpw = dweight * pweight | |
. la var dpw "Survey weight (population * design)" | |
. | |
. * Survey weights. | |
. svyset [pw = dpw] | |
pweight: dpw | |
VCE: linearized | |
Single unit: missing | |
Strata 1: <one> | |
SU 1: <observations> | |
FPC 1: <zero> | |
. | |
. * Numeric country identifier (used for clustered standard errors). | |
. encode cntry, gen(cid) | |
. | |
. | |
. * Dependent variable | |
. * ------------------ | |
. | |
. d stf* | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
stflife byte %2.0f stflife How satisfied with life as a whole | |
stfeco byte %2.0f stfeco How satisfied with present state of | |
economy in country | |
stfgov byte %2.0f stfgov How satisfied with the national | |
government | |
stfdem byte %2.0f stfdem How satisfied with the way democracy | |
works in country | |
stfedu byte %2.0f stfedu State of education in country nowadays | |
stfhlth byte %2.0f stfhlth State of health services in country | |
nowadays | |
. | |
. * Rename DV and a bunch of covariates | |
. renvars stfhlth stfedu stfgov \ hsat esat gsat | |
. | |
. * Country-specific distributions. | |
. tab cntry, su(hsat) | |
| Summary of State of health services | |
| in country nowadays | |
Country | Mean Std. Dev. Freq. | |
------------+------------------------------------ | |
BE | 7 2 1758 | |
BG | 3 2 2163 | |
CH | 7 2 1811 | |
CY | 6 2 1184 | |
CZ | 5 2 1999 | |
DE | 5 2 2723 | |
DK | 6 2 1593 | |
EE | 5 2 1631 | |
ES | 6 2 2552 | |
FI | 7 2 2190 | |
FR | 6 2 2065 | |
GB | 6 2 2343 | |
GR | 3 2 2057 | |
HR | 4 2 1467 | |
HU | 4 2 1519 | |
IE | 4 2 1758 | |
IL | 6 2 2421 | |
LV | 4 2 1948 | |
NL | 6 2 1764 | |
NO | 6 2 1548 | |
PL | 4 2 1601 | |
PT | 4 2 2334 | |
RO | 4 3 2105 | |
RU | 4 2 2461 | |
SE | 6 2 1812 | |
SI | 5 2 1271 | |
SK | 4 2 1796 | |
TR | 5 3 2363 | |
UA | 2 2 1789 | |
------------+------------------------------------ | |
Total | 5 3 56026 | |
. hist hsat, discrete by(cntry, note("")) /// | |
> name(dv_bins, replace) | |
. | |
. * Detailed summary statistics. | |
. su hsat, d | |
State of health services in country nowadays | |
------------------------------------------------------------- | |
Percentiles Smallest | |
1% 0 0 | |
5% 0 0 | |
10% 1 0 Obs 56026 | |
25% 3 0 Sum of Wgt. 56026 | |
50% 5 Mean 4.999893 | |
Largest Std. Dev. 2.607701 | |
75% 7 10 | |
90% 8 10 Variance 6.800107 | |
95% 9 10 Skewness -.1756378 | |
99% 10 10 Kurtosis 2.165036 | |
. | |
. | |
. * Cross-country comparisons | |
. * ------------------------- | |
. | |
. * Cross-country visualization (mean). | |
. gr dot hsat, over(cntry, sort(1)des) yla(0 "Min" 10 "Max") /// | |
> yti("Satisfaction in health services") /// | |
> name(dv_dots, replace) | |
. | |
. * Cross-country visualization (median). | |
. gr box hsat, noout over(cntry, sort(1)des) yla(0 "Min" 10 "Max") /// | |
> yti("Satisfaction in health services") /// | |
> name(dv_boxes, replace) | |
. | |
. * Generate dummies for the full 11-pt scale DV. | |
. cap drop hsat11_* | |
. tab hsat, gen(hsat11_) | |
State of | | |
health | | |
services in | | |
country | | |
nowadays | Freq. Percent Cum. | |
---------------+----------------------------------- | |
Extremely bad | 3,298 5.89 5.89 | |
1 | 3,022 5.39 11.28 | |
2 | 4,618 8.24 19.52 | |
3 | 6,050 10.80 30.32 | |
4 | 5,767 10.29 40.62 | |
5 | 8,318 14.85 55.46 | |
6 | 6,326 11.29 66.75 | |
7 | 7,571 13.51 80.27 | |
8 | 6,865 12.25 92.52 | |
9 | 2,725 4.86 97.38 | |
Extremely good | 1,466 2.62 100.00 | |
---------------+----------------------------------- | |
Total | 56,026 100.00 | |
. | |
. * Cross-country visualization (proportions). | |
. gr hbar hsat11_*, over(cntry, sort(1)des) stack legend(off) /// | |
> yti("Satisfaction in health services") /// | |
> scheme(burd11) name(dv_bars, replace) | |
. | |
. | |
. * Independent variables | |
. * --------------------- | |
. | |
. fre agea gndr health hincfel lrscale, r(10) | |
agea -- Age of respondent, calculated | |
----------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------+-------------------------------------------- | |
Valid 15 | 356 0.63 0.63 0.63 | |
16 | 623 1.10 1.10 1.73 | |
17 | 729 1.28 1.29 3.02 | |
18 | 674 1.19 1.19 4.21 | |
19 | 824 1.45 1.46 5.67 | |
: | : : : : | |
97 | 4 0.01 0.01 99.99 | |
98 | 2 0.00 0.00 99.99 | |
99 | 2 0.00 0.00 99.99 | |
105 | 2 0.00 0.00 100.00 | |
123 | 1 0.00 0.00 100.00 | |
Total | 56544 99.63 100.00 | |
Missing .a | 208 0.37 | |
Total | 56752 100.00 | |
----------------------------------------------------------- | |
gndr -- Gender | |
--------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------+-------------------------------------------- | |
Valid 1 Male | 25787 45.44 45.46 45.46 | |
2 Female | 30935 54.51 54.54 100.00 | |
Total | 56722 99.95 100.00 | |
Missing .a | 30 0.05 | |
Total | 56752 100.00 | |
--------------------------------------------------------------- | |
health -- Subjective general health | |
------------------------------------------------------------------ | |
| Freq. Percent Valid Cum. | |
---------------------+-------------------------------------------- | |
Valid 1 Very good | 12245 21.58 21.61 21.61 | |
2 Good | 22949 40.44 40.49 62.10 | |
3 Fair | 15863 27.95 27.99 90.09 | |
4 Bad | 4633 8.16 8.17 98.27 | |
5 Very bad | 983 1.73 1.73 100.00 | |
Total | 56673 99.86 100.00 | |
Missing .a | 10 0.02 | |
.b | 53 0.09 | |
.c | 16 0.03 | |
Total | 79 0.14 | |
Total | 56752 100.00 | |
------------------------------------------------------------------ | |
hincfel -- Feeling about household's income nowadays | |
---------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-------------------------------------+-------------------------------------------- | |
Valid 1 Living comfortably on | 13135 23.14 23.41 23.41 | |
present income | | |
2 Coping on present income | 24544 43.25 43.74 67.15 | |
3 Difficult on present | 12681 22.34 22.60 89.76 | |
income | | |
4 Very difficult on present | 5748 10.13 10.24 100.00 | |
income | | |
Total | 56108 98.87 100.00 | |
Missing .a | 97 0.17 | |
.b | 442 0.78 | |
.c | 105 0.19 | |
Total | 644 1.13 | |
Total | 56752 100.00 | |
---------------------------------------------------------------------------------- | |
lrscale -- Placement on left right scale | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 0 Left | 1589 2.80 3.34 3.34 | |
1 1 | 1261 2.22 2.65 5.99 | |
2 2 | 2590 4.56 5.44 11.44 | |
3 3 | 4644 8.18 9.76 21.20 | |
4 4 | 4695 8.27 9.87 31.07 | |
5 5 | 15412 27.16 32.40 63.47 | |
6 6 | 4498 7.93 9.46 72.92 | |
7 7 | 4960 8.74 10.43 83.35 | |
8 8 | 4102 7.23 8.62 91.97 | |
9 9 | 1601 2.82 3.37 95.34 | |
10 Right | 2217 3.91 4.66 100.00 | |
Total | 47569 83.82 100.00 | |
Missing .a | 856 1.51 | |
.b | 8191 14.43 | |
.c | 136 0.24 | |
Total | 9183 16.18 | |
Total | 56752 100.00 | |
-------------------------------------------------------------- | |
. | |
. * Recode sex to dummy. | |
. gen female:female = (gndr == 2) if !mi(gndr) | |
(30 missing values generated) | |
. la def female 0 "Male" 1 "Female", replace | |
. la var female "Gender" | |
. | |
. * Fix age variable name. | |
. ren agea age | |
. | |
. * Generate six age groups (15-24, 25-34, ..., 65+). | |
. gen age6:age6 = irecode(age, 24, 34, 44, 54, 64, .) | |
(208 missing values generated) | |
. replace age6 = 10 * age6 + 15 | |
(56544 real changes made) | |
. la def age6 15 "15-24" 25 "25-34" 35 "35-44" /// | |
> 45 "45-54" 55 "55-64" 65 "65+", replace | |
. la var age6 "Age groups" | |
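. | |
. /* Sketch (not part of the original run): the recode above relies on -irecode- | |
> returning 0 for age <= 24, 1 for 25-34, and so on up to 5 for age > 64, which | |
> the -replace- then maps to 15, 25, ..., 65. Assuming the variables created | |
> above, the age range covered by each group can be checked with: | |
> | |
> tabstat age, by(age6) s(min max n) | |
> | |
> Each row should span the bounds announced in its value label. */ | |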
. | |
. * Subjective low income dummy. | |
. gen lowinc = (hincfel > 2) if !mi(hincfel) | |
(644 missing values generated) | |
. la var lowinc "Subjective low income" | |
. | |
. * Recode left-right scale. | |
. recode lrscale (0/4 = 1 "Left") (5 = 2 "Centre") (6/10 = 3 "Right"), gen(pol3) | |
(46308 differences between lrscale and pol3) | |
. la var pol3 "Political views (left-right)" | |
. | |
. | |
. * Subsetting | |
. * ---------- | |
. | |
. * Check missing values. | |
. misstable pat hsat age6 female health pol3 lowinc if cntry == "FR" | |
Missing-value patterns | |
(1 means complete) | |
| Pattern | |
Percent | 1 2 3 | |
------------+------------- | |
94% | 1 1 1 | |
| | |
6 | 1 1 0 | |
<1 | 0 1 1 | |
<1 | 1 0 0 | |
<1 | 1 0 1 | |
<1 | 0 1 0 | |
<1 | 0 0 0 | |
------------+------------- | |
100% | | |
Variables are (1) lowinc (2) hsat (3) pol3 | |
. misstable pat hsat age6 female health pol3 lowinc if cntry == "GB" | |
Missing-value patterns | |
(1 means complete) | |
| Pattern | |
Percent | 1 2 3 4 5 6 | |
------------+--------------------- | |
88% | 1 1 1 1 1 1 | |
| | |
10 | 1 1 1 1 1 0 | |
<1 | 1 1 1 1 0 1 | |
<1 | 1 1 0 0 0 1 | |
<1 | 1 0 1 1 1 1 | |
<1 | 1 1 1 0 0 1 | |
<1 | 1 1 1 0 1 0 | |
<1 | 1 1 1 0 1 1 | |
<1 | 1 1 1 0 0 0 | |
<1 | 0 1 1 1 1 1 | |
<1 | 1 0 1 1 1 0 | |
<1 | 1 1 0 0 0 0 | |
<1 | 1 1 1 1 0 0 | |
------------+--------------------- | |
100% | | |
Variables are (1) health (2) hsat (3) female (4) lowinc (5) age6 (6) pol3 | |
. | |
. * Select case studies. | |
. keep if inlist(cntry, "FR", "GB") | |
(52327 observations deleted) | |
. | |
. * Delete incomplete observations. | |
. drop if mi(hsat, age6, female, health, pol3, lowinc) | |
(404 observations deleted) | |
. | |
. * Final sample sizes. | |
. bys cntry: count | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
1942 | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
2079 | |
. | |
. | |
. * Normality | |
. * --------- | |
. | |
. * Distribution of the DV in the case studies. | |
. hist hsat, discrete normal xla(0 10) by(cntry, legend(off) note("")) /// | |
> name(dv_histograms, replace) | |
. | |
. * Generate strictly positive DV recode. | |
. gen hsat1 = hsat + 1 | |
. | |
. * Visual check of common transformations. | |
. gladder hsat1, bin(11) /// | |
> name(gladder, replace) | |
. | |
. /* Notes: | |
> | |
> - There are more missing observations for Britain than for France, and this | |
> might distort the results if the non-respondents come, for example, from the | |
> same end of the political spectrum. We'll be careful. | |
> | |
> - The distribution of the DV is concentrated towards the upper end of the | |
> scale in both case studies, which is consistent with the hypothesis that | |
> extensive healthcare states like the ones found in Britain and France | |
> enjoy higher popular support. | |
> | |
> - To allow for a log-transformation, the variable should be strictly positive | |
> since the function f: y = log(x) is undefined for x = 0. We use a recode of | |
> the DV of strictly positive range to test for transformations. | |
> | |
> - The square root comes only marginally closer to a normal distribution. With | |
> little improvement in normality, transforming the DV would be overkill. It is | |
> reasonable to carry on with the untransformed DV. */ | |
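. | |
. /* Sketch (not part of the original run): the notes above can be backed with | |
> numbers. Assuming the hsat and hsat1 variables created above, the first line | |
> reports skewness by country, and the second one is the tabular counterpart | |
> to -gladder-, testing each transformation for normality: | |
> | |
> bys cntry: su hsat, d | |
> ladder hsat1 | |
> | |
> A negative skewness value indicates a longer tail towards low satisfaction. */ | |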
. | |
. | |
. * Export summary statistics | |
. * ------------------------- | |
. | |
. * The next command is part of the SRQM folder. If Stata returns an error when | |
. * you run it, set the folder as your working directory and type -run profile- | |
. * to run the course setup, then try the command again. If you still experience | |
. * problems with the -stab- command, please send a detailed email on the issue. | |
. | |
. stab using week11_stats_FR.txt if cntry == "FR", replace /// | |
> mean(hsat) /// | |
> prop(female age6 health lowinc pol3) | |
(note: file week11_stats_FR.txt not found) | |
Variable mean sd min max mea | |
> n sd min max mean sd min | |
> max mean sd min max mean | |
> sd min max | |
Gender % % % % | |
> % | |
Age groups % % % % | |
> % | |
Subjective general~h % % % % | |
> % | |
Subjective low inc~e % % % % | |
> % | |
Political views (l~) % % % % | |
> % | |
N = 19420 | |
File: week11_stats_FR.txt | |
. | |
. stab using week11_stats_GB.txt if cntry == "GB", replace /// | |
> mean(hsat) /// | |
> prop(female age6 health lowinc pol3) | |
(note: file week11_stats_GB.txt not found) | |
Variable mean sd min max mea | |
> n sd min max mean sd min | |
> max mean sd min max mean | |
> sd min max | |
Gender % % % % | |
> % | |
Age groups % % % % | |
> % | |
Subjective general~h % % % % | |
> % | |
Subjective low inc~e % % % % | |
> % | |
Political views (l~) % % % % | |
> % | |
N = 20790 | |
File: week11_stats_GB.txt | |
. | |
. /* Syntax of the -stab- command: | |
> | |
> - using FILE - name of the exported file; plain text (.txt) recommended | |
> - replace - overwrite any previously existing file | |
> - mean() - summarizes a list of continuous variables (mean, sd, min, max) | |
> - prop() - summarizes a list of categorical variables (frequencies) | |
> | |
> In the example above, the -stab- command will export two files to the working | |
> directory, containing summary statistics for France (week11_stats_FR.txt) and | |
> Britain (week11_stats_GB.txt). */ | |
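. | |
. /* Sketch (not part of the original run): -stab- ships with the course folder. | |
> If it is unavailable, similar figures can be displayed (but not exported) | |
> with built-in commands, shown here for France: | |
> | |
> tabstat hsat if cntry == "FR", s(mean sd min max n) | |
> tab1 female age6 health lowinc pol3 if cntry == "FR" | |
> | |
> This is a way to check the exported tables, not the code behind -stab-. */ | |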
. | |
. | |
. * ===================== | |
. * = ASSOCIATION TESTS = | |
. * ===================== | |
. | |
. | |
. * Relationships with socio-demographics | |
. * ------------------------------------- | |
. | |
. * Line graph using DV means computed for each age and gender group. | |
. cap drop msat_? | |
. bys cntry age6: egen msat_1 = mean(hsat) if female | |
(1880 missing values generated) | |
. bys cntry age6: egen msat_2 = mean(hsat) if !female | |
(2141 missing values generated) | |
. tw conn msat_? age6, by(cntry, note("")) /// | |
> xti("Age") yti("Mean level of satisfaction") /// | |
> legend(row(1) order(1 "Female" 2 "Male")) /// | |
> name(hsat_age_sex, replace) | |
. | |
. * Association between DV and gender. | |
. by cntry: ttest hsat, by(female) | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
Male | 898 6.203786 .0707298 2.119535 6.064971 6.342601 | |
Female | 1044 5.822797 .0637708 2.060499 5.697663 5.947931 | |
---------+-------------------------------------------------------------------- | |
combined | 1942 5.99897 .0475649 2.096094 5.905687 6.092254 | |
---------+-------------------------------------------------------------------- | |
diff | .3809893 .0950314 .1946148 .5673637 | |
------------------------------------------------------------------------------ | |
diff = mean(Male) - mean(Female) t = 4.0091 | |
Ho: diff = 0 degrees of freedom = 1940 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0001 Pr(T > t) = 0.0000 | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
Male | 982 6.255601 .0679263 2.128599 6.122303 6.388898 | |
Female | 1097 5.717411 .0676622 2.24104 5.584649 5.850173 | |
---------+-------------------------------------------------------------------- | |
combined | 2079 5.971621 .04835 2.204568 5.876802 6.06644 | |
---------+-------------------------------------------------------------------- | |
diff | .5381897 .096149 .3496312 .7267482 | |
------------------------------------------------------------------------------ | |
diff = mean(Male) - mean(Female) t = 5.5975 | |
Ho: diff = 0 degrees of freedom = 2077 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000 | |
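. | |
. /* Sketch (not part of the original run): the t-tests above assume equal | |
> variances in both groups. Assuming the same variables, the assumption can be | |
> checked with a variance ratio test, and relaxed with the unequal option: | |
> | |
> bys cntry: sdtest hsat, by(female) | |
> bys cntry: ttest hsat, by(female) unequal | |
> | |
> With standard deviations as close as those reported above, both versions of | |
> the test can be expected to lead to similar conclusions. */ | |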
. | |
. * Correlation between DV and age. | |
. by cntry: pwcorr hsat age, obs star(.01) | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
| hsat age | |
-------------+------------------ | |
hsat | 1.0000 | |
| 1942 | |
| | |
age | -0.0655* 1.0000 | |
| 1942 1942 | |
| | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
| hsat age | |
-------------+------------------ | |
hsat | 1.0000 | |
| 2079 | |
| | |
age | 0.2042* 1.0000 | |
| 2079 2079 | |
| | |
. | |
. * Generate a dummy from extreme categories of age. | |
. cap drop agex | |
. gen agex:agex = . | |
(4021 missing values generated) | |
. replace agex = 0 if age6 == 15 | |
(378 real changes made) | |
. replace agex = 1 if age6 == 65 | |
(915 real changes made) | |
. la def agex 0 "15-24" 1 "65+", replace | |
. | |
. * Difference between age extremes. | |
. bys cntry: ttest hsat, by(agex) | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
15-24 | 202 6.475248 .1298296 1.845225 6.219245 6.73125 | |
65+ | 420 5.961905 .1027907 2.106582 5.759855 6.163954 | |
---------+-------------------------------------------------------------------- | |
combined | 622 6.128617 .081723 2.038167 5.96813 6.289104 | |
---------+-------------------------------------------------------------------- | |
diff | .5133428 .1734354 .1727508 .8539347 | |
------------------------------------------------------------------------------ | |
diff = mean(15-24) - mean(65+) t = 2.9599 | |
Ho: diff = 0 degrees of freedom = 620 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.9984 Pr(|T| > |t|) = 0.0032 Pr(T > t) = 0.0016 | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
15-24 | 176 5.784091 .1443521 1.915046 5.499196 6.068986 | |
65+ | 495 6.856566 .0958556 2.132653 6.668231 7.044901 | |
---------+-------------------------------------------------------------------- | |
combined | 671 6.575261 .0822037 2.129378 6.413853 6.736669 | |
---------+-------------------------------------------------------------------- | |
diff | -1.072475 .1823618 -1.430545 -.7144044 | |
------------------------------------------------------------------------------ | |
diff = mean(15-24) - mean(65+) t = -5.8810 | |
Ho: diff = 0 degrees of freedom = 669 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000 | |
. | |
. | |
. * Relationship to health status | |
. * ----------------------------- | |
. | |
. * DV by health. | |
. gr dot hsat, over(health) over(cntry) /// | |
> yti("Satisfaction in health services") /// | |
> name(dv_health, replace) | |
. | |
. * Line graph using DV means computed for each health status and gender group. | |
. cap drop mu_hsat_* | |
. bys health female: egen mu_hsat_FR = mean(hsat) if cntry == "FR" | |
(2079 missing values generated) | |
. bys health female: egen mu_hsat_GB = mean(hsat) if cntry == "GB" | |
(1942 missing values generated) | |
. tw conn mu_hsat_* health, by(female, note("")) /// | |
> xti("Health status") yti("Mean level of satisfaction") /// | |
> xlab(1 "Good" 5 "Bad") /// | |
> legend(row(1) order(1 "FR" 2 "GB")) /// | |
> name(hsat_health, replace) | |
. | |
. * Generate a dummy from health status (bad/very bad = 0, good/very good = 1). | |
. cap drop health01 | |
. recode health (1/2 = 1 "Good") (4/5 = 0 "Poor") (else = .), gen(health01) | |
(2988 differences between health and health01) | |
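. | |
. /* Sketch (not part of the original run): the recode sends "Fair" (3) to | |
> missing, so the t-tests below compare only the two ends of the scale. | |
> Assuming the variables created above, a cross-tabulation shows which | |
> categories end up in each group: | |
> | |
> tab health health01, mi | |
> */ | |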
. | |
. * Association between DV and health status. | |
. bys cntry: ttest hsat, by(health01) | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
Poor | 143 5.545455 .2064476 2.468755 5.137347 5.953563 | |
Good | 1243 6.132743 .0577299 2.035335 6.019485 6.246002 | |
---------+-------------------------------------------------------------------- | |
combined | 1386 6.07215 .056162 2.090858 5.961978 6.182322 | |
---------+-------------------------------------------------------------------- | |
diff | -.5872888 .1840209 -.9482789 -.2262988 | |
------------------------------------------------------------------------------ | |
diff = mean(Poor) - mean(Good) t = -3.1914 | |
Ho: diff = 0 degrees of freedom = 1384 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.0007 Pr(|T| > |t|) = 0.0014 Pr(T > t) = 0.9993 | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
Poor | 141 5.929078 .2209048 2.623099 5.492337 6.365819 | |
Good | 1501 5.965356 .0541266 2.097014 5.859185 6.071528 | |
---------+-------------------------------------------------------------------- | |
combined | 1642 5.962241 .0529676 2.146332 5.85835 6.066132 | |
---------+-------------------------------------------------------------------- | |
diff | -.0362784 .1891085 -.4071979 .3346411 | |
------------------------------------------------------------------------------ | |
diff = mean(Poor) - mean(Good) t = -0.1918 | |
Ho: diff = 0 degrees of freedom = 1640 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.4239 Pr(|T| > |t|) = 0.8479 Pr(T > t) = 0.5761 | |
. | |
. | |
. * Relationship to low income status | |
. * --------------------------------- | |
. | |
. * DV by income. | |
. bys cntry: ttest hsat, by(lowinc) | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
0 | 1647 6.061324 .0505408 2.05111 5.962193 6.160455 | |
1 | 295 5.650847 .1341597 2.304268 5.386812 5.914882 | |
---------+-------------------------------------------------------------------- | |
combined | 1942 5.99897 .0475649 2.096094 5.905687 6.092254 | |
---------+-------------------------------------------------------------------- | |
diff | .4104762 .132225 .1511582 .6697941 | |
------------------------------------------------------------------------------ | |
diff = mean(0) - mean(1) t = 3.1044 | |
Ho: diff = 0 degrees of freedom = 1940 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.9990 Pr(|T| > |t|) = 0.0019 Pr(T > t) = 0.0010 | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
0 | 1730 6.064162 .0516962 2.150212 5.962768 6.165555 | |
1 | 349 5.512894 .1288758 2.407598 5.259421 5.766367 | |
---------+-------------------------------------------------------------------- | |
combined | 2079 5.971621 .04835 2.204568 5.876802 6.06644 | |
---------+-------------------------------------------------------------------- | |
diff | .5512679 .128829 .2986205 .8039152 | |
------------------------------------------------------------------------------ | |
diff = mean(0) - mean(1) t = 4.2791 | |
Ho: diff = 0 degrees of freedom = 2077 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000 | |
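The French t test above can be reproduced without the dataset. This is a minimal sketch using Stata's immediate command -ttesti- with the group sizes, means and standard deviations printed in the table. The inputs are rounded as shown in the log, so the result should agree with it only up to rounding.

```stata
* Immediate two-sample t test (equal variances) from the FR summary statistics:
* arguments are N, mean and SD for lowinc == 0, then for lowinc == 1.
ttesti 1647 6.061324 2.05111 295 5.650847 2.304268

* The t statistic and two-sided p-value are left behind in r().
display "t = " %6.4f r(t) "  p = " %6.4f r(p)
```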
. | |
. * Association between IV and political attitude. | |
. bys cntry: tab lowinc pol3, col chi2 nokey | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
Subjective | Political views (left-right) | |
low income | Left Centre Right | Total | |
-----------+---------------------------------+---------- | |
0 | 605 449 593 | 1,647 | |
| 81.54 82.54 90.40 | 84.81 | |
-----------+---------------------------------+---------- | |
1 | 137 95 63 | 295 | |
| 18.46 17.46 9.60 | 15.19 | |
-----------+---------------------------------+---------- | |
Total | 742 544 656 | 1,942 | |
| 100.00 100.00 100.00 | 100.00 | |
Pearson chi2(2) = 24.2449 Pr = 0.000 | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
Subjective | Political views (left-right) | |
low income | Left Centre Right | Total | |
-----------+---------------------------------+---------- | |
0 | 500 709 521 | 1,730 | |
| 83.47 80.94 86.26 | 83.21 | |
-----------+---------------------------------+---------- | |
1 | 99 167 83 | 349 | |
| 16.53 19.06 13.74 | 16.79 | |
-----------+---------------------------------+---------- | |
Total | 599 876 604 | 2,079 | |
| 100.00 100.00 100.00 | 100.00 | |
Pearson chi2(2) = 7.2899 Pr = 0.026 | |
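The chi-squared statistic depends only on the cell counts, so the French table can be re-entered by hand. This is a sketch using the immediate command -tabi- with the counts printed above; it does not need the course data.

```stata
* Immediate tabulation of the FR cell counts: rows are lowinc = 0 and 1,
* columns are Left, Centre and Right; "\" separates the rows.
tabi 605 449 593 \ 137 95 63, chi2

* Pearson chi-squared and its p-value are stored in r().
display "chi2 = " %7.4f r(chi2) "  p = " %5.3f r(p)
```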
. | |
. * Proportions test (the mean of the lowinc dummy is the sample proportion of | |
. * respondents who report a low income). | |
. bys cntry: prtest lowinc if pol3 != 2, by(pol3) | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
Two-sample test of proportions Left: Number of obs = 742 | |
Right: Number of obs = 656 | |
------------------------------------------------------------------------------ | |
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
Left | .1846361 .014244 .1567184 .2125539 | |
Right | .0960366 .0115038 .0734895 .1185836 | |
-------------+---------------------------------------------------------------- | |
diff | .0885995 .0183093 .052714 .124485 | |
| under Ho: .0187645 4.72 0.000 | |
------------------------------------------------------------------------------ | |
diff = prop(Left) - prop(Right) z = 4.7217 | |
Ho: diff = 0 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(Z < z) = 1.0000 Pr(|Z| < |z|) = 0.0000 Pr(Z > z) = 0.0000 | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
Two-sample test of proportions Left: Number of obs = 599 | |
Right: Number of obs = 604 | |
------------------------------------------------------------------------------ | |
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
Left | .1652755 .0151762 .1355307 .1950202 | |
Right | .1374172 .0140089 .1099604 .1648741 | |
-------------+---------------------------------------------------------------- | |
diff | .0278582 .0206534 -.0126217 .0683382 | |
| under Ho: .0206625 1.35 0.178 | |
------------------------------------------------------------------------------ | |
diff = prop(Left) - prop(Right) z = 1.3482 | |
Ho: diff = 0 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(Z < z) = 0.9112 Pr(|Z| < |z|) = 0.1776 Pr(Z > z) = 0.0888 | |
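The French proportions test can also be checked from counts alone. This is a sketch using the immediate command -prtesti- with its count option; the counts (137 of 742 on the left, 63 of 656 on the right) come from the cross-tabulation of lowinc by pol3 shown earlier.

```stata
* Two-sample test of proportions from counts: 137 of 742 left-wing and
* 63 of 656 right-wing French respondents report a low income.
prtesti 742 137 656 63, count

* The z statistic is stored in r(z).
display "z = " %6.4f r(z)
```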
. | |
. | |
. * Relationship to left-right attitude | |
. * ----------------------------------- | |
. | |
. * Correlation between DV and political attitude (left 0-10 right). | |
. by cntry: pwcorr hsat lrscale, obs sig | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
| hsat lrscale | |
-------------+------------------ | |
hsat | 1.0000 | |
| | |
| 1942 | |
| | |
lrscale | 0.1998 1.0000 | |
| 0.0000 | |
| 1942 1942 | |
| | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
| hsat lrscale | |
-------------+------------------ | |
hsat | 1.0000 | |
| | |
| 2079 | |
| | |
lrscale | 0.0153 1.0000 | |
| 0.4853 | |
| 2079 2079 | |
| | |
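The significance level reported under each correlation comes from a t test on the coefficient. This is a sketch of that computation for the British case, using the rounded r and the N printed above; the scalar names are illustrative and the result should match the log only approximately, because r is rounded to four decimals.

```stata
* GB correlation and sample size, copied from the table above.
scalar rho  = 0.0153
scalar nobs = 2079

* t statistic for a correlation: r * sqrt(n - 2) / sqrt(1 - r^2).
scalar tstat = rho * sqrt(nobs - 2) / sqrt(1 - rho^2)

* Two-sided p-value from the t distribution with n - 2 degrees of freedom.
scalar pval = 2 * ttail(nobs - 2, abs(tstat))
display "t = " %6.4f tstat "  p = " %6.4f pval
```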
. | |
. * Association between DV and political attitude (left, centre, right). | |
. gr box hsat, noout note("") over(pol3) asyvars over(cntry) legend(row(1)) /// | |
> scheme(burd4) name(dv_pol3, replace) | |
. | |
. * Comparison with covariates | |
. * -------------------------- | |
. | |
. d hsat esat gsat | |
storage display value | |
variable name type format label variable label | |
------------------------------------------------------------------------------------ | |
hsat byte %2.0f stfhlth State of health services in country | |
nowadays | |
esat byte %2.0f stfedu State of education in country nowadays | |
gsat byte %2.0f stfgov How satisfied with the national | |
government | |
. | |
. * DV and other ESS satisfaction items (edu = education, gov = government). | |
. cap drop msat* | |
. bys cntry lrscale: egen msat1 = mean(hsat) | |
. bys cntry lrscale: egen msat2 = mean(esat) | |
. bys cntry lrscale: egen msat3 = mean(gsat) | |
. | |
. * Line graph, using the means computed above for each left-right group. | |
. tw conn msat? lrscale, by(cntry, note("")) /// | |
> xla(0 "Left" 10 "Right") xti("") yti("Mean level of satisfaction") /// | |
> legend(row(1) order(1 "Health services" 2 "Education" 3 "Government")) /// | |
> name(stf_lrscale, replace) | |
. | |
. /* Notes: | |
> | |
> - As expected with such a large N, many significance tests return very low | |
> p-values. The risk here is to make Type I errors (false positives), even | |
> though the variations between age groups in each country seem statistically | |
> robust. | |
> | |
> - Health status seems important in France but not in Britain, whereas old age | |
> seems important in Britain but not in France. It will be interesting to see | |
> if any of these effects persist after controlling for income. | |
> | |
> - The relationship between financial difficulties and political leaning shows | |
> how your independent variables are interacting with each other. | |
> | |
> - Other measures of satisfaction (which are not part of the model itself) show | |
> how satisfaction with health services correlates with other measures of | |
> public sector performance when the measures are examined by left-right | |
> positioning. Politics matter. */ | |
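The first note's point about large samples can be made concrete by setting the size of a difference against its significance. This is a sketch that computes a standardized mean difference (difference in means over the pooled SD) for the French low-income comparison, using only the figures printed in the t test table; the scalar names are illustrative.

```stata
* Pooled standard deviation from the two group SDs in the FR t test.
scalar pooled_sd = sqrt(((1647 - 1) * 2.05111^2 + (295 - 1) * 2.304268^2) / (1647 + 295 - 2))

* Standardized mean difference: difference in group means over the pooled SD.
scalar std_diff = (6.061324 - 5.650847) / pooled_sd
display "pooled SD = " %6.4f pooled_sd "  standardized difference = " %5.3f std_diff
```

A standardized difference of roughly 0.2 alongside a p-value of 0.002 is the pattern to expect when N is large: the test detects the gap easily, but the gap itself is small.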
. | |
. | |
. * ===================== | |
. * = REGRESSION MODELS = | |
. * ===================== | |
. | |
. | |
. * Multiple linear regression model for each country case. | |
. bys cntry: reg hsat ib45.age6 female i.health lowinc ib2.pol3 | |
------------------------------------------------------------------------------------ | |
-> cntry = FR | |
Source | SS df MS Number of obs = 1942 | |
-------------+------------------------------ F( 13, 1928) = 10.15 | |
Model | 546.017183 13 42.0013217 Prob > F = 0.0000 | |
Residual | 7981.98076 1928 4.14003151 R-squared = 0.0640 | |
-------------+------------------------------ Adj R-squared = 0.0577 | |
Total | 8527.99794 1941 4.39361048 Root MSE = 2.0347 | |
------------------------------------------------------------------------------ | |
hsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .5757433 .1831204 3.14 0.002 .2166085 .9348781 | |
25 | .4181367 .1620445 2.58 0.010 .1003357 .7359377 | |
35 | .2737408 .1555479 1.76 0.079 -.031319 .5788006 | |
55 | .0178386 .1577826 0.11 0.910 -.2916038 .327281 | |
65 | .1721822 .1509785 1.14 0.254 -.1239161 .4682806 | |
| | |
female | -.3929954 .0930814 -4.22 0.000 -.5755461 -.2104446 | |
| | |
health | | |
2 | -.367027 .1261025 -2.91 0.004 -.6143386 -.1197154 | |
3 | -.4762367 .1408647 -3.38 0.001 -.7524999 -.1999735 | |
4 | -.5536348 .2210454 -2.50 0.012 -.9871479 -.1201216 | |
5 | -1.020825 .4543408 -2.25 0.025 -1.911876 -.1297743 | |
| | |
lowinc | -.2263043 .1330334 -1.70 0.089 -.4872088 .0346003 | |
| | |
pol3 | | |
1 | -.5218848 .1154915 -4.52 0.000 -.7483861 -.2953834 | |
3 | .3431802 .1195869 2.87 0.004 .1086469 .5777134 | |
| | |
_cons | 6.458039 .1756112 36.77 0.000 6.113631 6.802447 | |
------------------------------------------------------------------------------ | |
------------------------------------------------------------------------------------ | |
-> cntry = GB | |
Source | SS df MS Number of obs = 2079 | |
-------------+------------------------------ F( 13, 2065) = 13.25 | |
Model | 777.733419 13 59.8256476 Prob > F = 0.0000 | |
Residual | 9321.59222 2065 4.51408824 R-squared = 0.0770 | |
-------------+------------------------------ Adj R-squared = 0.0712 | |
Total | 10099.3256 2078 4.86011821 Root MSE = 2.1246 | |
------------------------------------------------------------------------------ | |
hsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .1577784 .1958508 0.81 0.421 -.2263073 .541864 | |
25 | -.0492595 .1659278 -0.30 0.767 -.3746627 .2761437 | |
35 | -.0429892 .1537828 -0.28 0.780 -.3445747 .2585963 | |
55 | .3753688 .1624 2.31 0.021 .056884 .6938537 | |
65 | 1.240436 .1496894 8.29 0.000 .946878 1.533994 | |
| | |
female | -.5118235 .0939119 -5.45 0.000 -.6959954 -.3276516 | |
| | |
health | | |
2 | -.293325 .1115856 -2.63 0.009 -.5121571 -.0744928 | |
3 | -.3068507 .1356245 -2.26 0.024 -.5728256 -.0408758 | |
4 | -.450338 .2229726 -2.02 0.044 -.8876125 -.0130635 | |
5 | -.0378434 .4049108 -0.09 0.926 -.8319194 .7562325 | |
| | |
lowinc | -.3094155 .129387 -2.39 0.017 -.563158 -.055673 | |
| | |
pol3 | | |
1 | .1433662 .1130665 1.27 0.205 -.07837 .3651024 | |
3 | .0290743 .1146745 0.25 0.800 -.1958155 .2539641 | |
| | |
_cons | 6.101753 .1548257 39.41 0.000 5.798122 6.405383 | |
------------------------------------------------------------------------------ | |
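The fit statistics in the header of each regression table follow from its sums of squares. This is a sketch that recomputes them for the British model from the figures printed above; the scalar names are illustrative.

```stata
* Sums of squares and degrees of freedom from the GB ANOVA table.
scalar ss_model = 777.733419
scalar ss_resid = 9321.59222
scalar df_model = 13
scalar df_resid = 2065

* R-squared: share of the total sum of squares captured by the model.
scalar r2 = ss_model / (ss_model + ss_resid)

* Adjusted R-squared penalizes the number of estimated coefficients.
scalar r2_adj = 1 - (1 - r2) * (df_model + df_resid) / df_resid

* Root MSE: square root of the residual mean square.
scalar root_mse = sqrt(ss_resid / df_resid)

* F statistic: model mean square over residual mean square.
scalar f_stat = (ss_model / df_model) / (ss_resid / df_resid)

display "R2 = " %6.4f r2 "  adj. R2 = " %6.4f r2_adj
display "Root MSE = " %6.4f root_mse "  F = " %5.2f f_stat
```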
. | |
. * Cleaner output with the -leanout- command. | |
. leanout: reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "FR" | |
Dependent variable: hsat | |
Variable Coef SE 95% CI | |
----------------------------------------------- | |
age6 | |
15 0.6 0.2 ( 0.2, 0.9) | |
25 0.4 0.2 ( 0.1, 0.7) | |
35 0.3 0.2 ( -0.0, 0.6) | |
55 0.0 0.2 ( -0.3, 0.3) | |
65 0.2 0.2 ( -0.1, 0.5) | |
female -0.4 0.1 ( -0.6, -0.2) | |
health | |
2 -0.4 0.1 ( -0.6, -0.1) | |
3 -0.5 0.1 ( -0.8, -0.2) | |
4 -0.6 0.2 ( -1.0, -0.1) | |
5 -1.0 0.5 ( -1.9, -0.1) | |
lowinc -0.2 0.1 ( -0.5, 0.0) | |
pol3 | |
1 -0.5 0.1 ( -0.7, -0.3) | |
3 0.3 0.1 ( 0.1, 0.6) | |
_cons 6.5 0.2 ( 6.1, 6.8) | |
----------------------------------------------- | |
Number of observations = 1942 | |
Root Mean Squared Error = 2.0 | |
. leanout: reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "GB" | |
Dependent variable: hsat | |
Variable Coef SE 95% CI | |
----------------------------------------------- | |
age6 | |
15 0.2 0.2 ( -0.2, 0.5) | |
25 -0.0 0.2 ( -0.4, 0.3) | |
35 -0.0 0.2 ( -0.3, 0.3) | |
55 0.4 0.2 ( 0.1, 0.7) | |
65 1.2 0.1 ( 0.9, 1.5) | |
female -0.5 0.1 ( -0.7, -0.3) | |
health | |
2 -0.3 0.1 ( -0.5, -0.1) | |
3 -0.3 0.1 ( -0.6, -0.0) | |
4 -0.5 0.2 ( -0.9, -0.0) | |
5 -0.0 0.4 ( -0.8, 0.8) | |
lowinc -0.3 0.1 ( -0.6, -0.1) | |
pol3 | |
1 0.1 0.1 ( -0.1, 0.4) | |
3 0.0 0.1 ( -0.2, 0.3) | |
_cons 6.1 0.2 ( 5.8, 6.4) | |
----------------------------------------------- | |
Number of observations = 2079 | |
Root Mean Squared Error = 2.1 | |
. | |
. /* Notes: | |
> | |
> - This model is specified as a multiple linear regression. It captures linear | |
> relationships by estimating the partial derivative of the DV with respect to | |
> each variable, which is that variable's effect on the DV when all other | |
> variables are held constant. | |
> | |
> We will therefore read the coefficient of an IV as its net effect on the DV, | |
> independently of all other variables in the model. This interpretation gives | |
> its meaning to the idiom of 'all other things being equal' (ceteris paribus). | |
> | |
> - The baseline age category is set to the category that contains the average | |
> population age (45-54 years old) and is coded 'ib45' because the categories | |
> of 'age6' are coded 15, 25, 35 etc. | |
> | |
> - The baseline health status is set to the default reference category, 1 = very good. | |
> Categories 2-5 code for 2 = good to 5 = poor health. | |
> | |
> - The baseline political attitude is the modal (and central) category 2 = centre | |
> so that 1 = leftwing and 3 = rightwing. | |
> | |
> The baseline model, given by the constant, is therefore the predicted mean of | |
> the DV for respondents who are males, aged 45-54, in very good health, at the | |
> centre politically and who did not report financial difficulties ('lowinc'). | |
> | |
> - Let's manually check whether the model does a good job at predicting the | |
> constant (the baseline model) in the second country case: | |
> | |
> su hsat if age6 == 45 & !female & health == 1 & !lowinc & pol3 == 2 & cntry == | |
> "GB" | |
> | |
> - For the same country case, the model predicts a higher value for respondents | |
> aged 65+, keeping all other variables equal. Let's check that too: | |
> | |
> su hsat if age6 == 65 & !female & health == 1 & !lowinc & pol3 == 2 & cntry == | |
> "GB" | |
> | |
> Not so bad for a model explaining less than 8% of the variance, but remember | |
> that the predicted values are only means, that only some of the coefficients | |
> are significant, and that each check covers only a fraction of all | |
> observations. | |
> | |
> - To assess the overall quality of the models, you should rather read the RMSE. | |
> The Root-Mean-Square Error is the standard error of the regression: it shows | |
> by how much we mispredict the DV on average. | |
> | |
> We later turn to regression diagnostics to explore the error term. */ | |
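The two manual checks in the notes compare observed group means with what the model predicts for the same profile. This is a sketch of the predicted side, using the British coefficients printed in the regression table; the scalar names are illustrative.

```stata
* GB coefficients copied from the regression table above.
scalar b_cons  = 6.101753
scalar b_age65 = 1.240436

* Baseline profile: male, aged 45-54, very good health, centre, not low income.
* Its predicted mean is the constant.
display "baseline prediction: " %5.3f b_cons

* Same profile aged 65+: add the 65+ coefficient to the constant.
scalar pred_age65 = b_cons + b_age65
display "65+ prediction:      " %5.3f pred_age65
```

With the estimation results still in memory (that is, right after running the British regression on the course data), -lincom _cons + 65.age6- should return the same sum along with a standard error.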
. | |
. | |
. * Using the -estout- command | |
. * -------------------------- | |
. | |
. * Store model estimates. | |
. eststo clear | |
. bys cntry: eststo: qui reg hsat ib45.age6 female i.health lowinc ib2.pol3 | |
------------------------------------------------------------------------------------ | |
-> FR | |
(est1 stored) | |
------------------------------------------------------------------------------------ | |
-> GB | |
(est2 stored) | |
. | |
. * View stored model estimates. | |
. eststo dir | |
------------------------------------------------------- | |
name | command depvar npar title | |
-------------+----------------------------------------- | |
est1 | regress hsat 17 FR | |
est2 | regress hsat 17 GB | |
------------------------------------------------------- | |
. | |
. * View standardized coefficients. | |
. esttab, wide nogaps beta(2) se(2) sca(rmse) mti("FR" "GB") | |
---------------------------------------------------------------------- | |
(1) (2) | |
FR GB | |
---------------------------------------------------------------------- | |
15.age6 0.08** (0.18) 0.02 (0.20) | |
25.age6 0.07** (0.16) -0.01 (0.17) | |
35.age6 0.05 (0.16) -0.01 (0.15) | |
45b.age6 0.00 (.) 0.00 (.) | |
55.age6 0.00 (0.16) 0.06* (0.16) | |
65.age6 0.03 (0.15) 0.24*** (0.15) | |
female -0.09*** (0.09) -0.12*** (0.09) | |
1b.health 0.00 (.) 0.00 (.) | |
2.health -0.09** (0.13) -0.07** (0.11) | |
3.health -0.10*** (0.14) -0.06* (0.14) | |
4.health -0.06* (0.22) -0.05* (0.22) | |
5.health -0.05* (0.45) -0.00 (0.40) | |
lowinc -0.04 (0.13) -0.05* (0.13) | |
1.pol3 -0.12*** (0.12) 0.03 (0.11) | |
2b.pol3 0.00 (.) 0.00 (.) | |
3.pol3 0.08** (0.12) 0.01 (0.11) | |
---------------------------------------------------------------------- | |
N 1942 2079 | |
rmse 2.035 2.125 | |
---------------------------------------------------------------------- | |
Standardized beta coefficients; Standard errors in parentheses | |
* p<0.05, ** p<0.01, *** p<0.001 | |
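A standardized coefficient rescales the raw coefficient by the spread of the predictor and of the DV. This is a sketch for the French lowinc coefficient, assuming the usual formula b * SD(x) / SD(y) and using figures found elsewhere in this log (the combined SD of hsat and the count of low-income respondents from the t test table); the scalar names are illustrative.

```stata
* FR values from the log: lowinc coefficient, SD of hsat (combined row of the
* t test), and 295 low-income respondents out of 1,942.
scalar b_lowinc = -0.2263043
scalar sd_hsat  = 2.096094
scalar n_obs    = 1942
scalar p_lowinc = 295 / n_obs

* Sample SD of a 0/1 dummy: sqrt(p * (1 - p) * n / (n - 1)).
scalar sd_lowinc = sqrt(p_lowinc * (1 - p_lowinc) * n_obs / (n_obs - 1))

* Standardized coefficient: b * SD(x) / SD(y).
scalar beta_lowinc = b_lowinc * sd_lowinc / sd_hsat
display "beta = " %5.2f beta_lowinc
```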
. | |
. * Export unstandardized coefficients. | |
. esttab using week11_regressions.txt, replace /// | |
> nolines wide nogaps b(1) se(1) sca(rmse) mti("FR" "GB") | |
(note: file week11_regressions.txt not found) | |
(output written to week11_regressions.txt) | |
. | |
. | |
. * Models with covariates | |
. * ---------------------- | |
. | |
. * Store model estimates (again). | |
. eststo clear | |
. bys cntry: eststo: reg hsat ib45.age6 female i.health lowinc ib2.pol3 | |
------------------------------------------------------------------------------------ | |
-> FR | |
Source | SS df MS Number of obs = 1942 | |
-------------+------------------------------ F( 13, 1928) = 10.15 | |
Model | 546.017183 13 42.0013217 Prob > F = 0.0000 | |
Residual | 7981.98076 1928 4.14003151 R-squared = 0.0640 | |
-------------+------------------------------ Adj R-squared = 0.0577 | |
Total | 8527.99794 1941 4.39361048 Root MSE = 2.0347 | |
------------------------------------------------------------------------------ | |
hsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .5757433 .1831204 3.14 0.002 .2166085 .9348781 | |
25 | .4181367 .1620445 2.58 0.010 .1003357 .7359377 | |
35 | .2737408 .1555479 1.76 0.079 -.031319 .5788006 | |
55 | .0178386 .1577826 0.11 0.910 -.2916038 .327281 | |
65 | .1721822 .1509785 1.14 0.254 -.1239161 .4682806 | |
| | |
female | -.3929954 .0930814 -4.22 0.000 -.5755461 -.2104446 | |
| | |
health | | |
2 | -.367027 .1261025 -2.91 0.004 -.6143386 -.1197154 | |
3 | -.4762367 .1408647 -3.38 0.001 -.7524999 -.1999735 | |
4 | -.5536348 .2210454 -2.50 0.012 -.9871479 -.1201216 | |
5 | -1.020825 .4543408 -2.25 0.025 -1.911876 -.1297743 | |
| | |
lowinc | -.2263043 .1330334 -1.70 0.089 -.4872088 .0346003 | |
| | |
pol3 | | |
1 | -.5218848 .1154915 -4.52 0.000 -.7483861 -.2953834 | |
3 | .3431802 .1195869 2.87 0.004 .1086469 .5777134 | |
| | |
_cons | 6.458039 .1756112 36.77 0.000 6.113631 6.802447 | |
------------------------------------------------------------------------------ | |
(est1 stored) | |
------------------------------------------------------------------------------------ | |
-> GB | |
Source | SS df MS Number of obs = 2079 | |
-------------+------------------------------ F( 13, 2065) = 13.25 | |
Model | 777.733419 13 59.8256476 Prob > F = 0.0000 | |
Residual | 9321.59222 2065 4.51408824 R-squared = 0.0770 | |
-------------+------------------------------ Adj R-squared = 0.0712 | |
Total | 10099.3256 2078 4.86011821 Root MSE = 2.1246 | |
------------------------------------------------------------------------------ | |
hsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .1577784 .1958508 0.81 0.421 -.2263073 .541864 | |
25 | -.0492595 .1659278 -0.30 0.767 -.3746627 .2761437 | |
35 | -.0429892 .1537828 -0.28 0.780 -.3445747 .2585963 | |
55 | .3753688 .1624 2.31 0.021 .056884 .6938537 | |
65 | 1.240436 .1496894 8.29 0.000 .946878 1.533994 | |
| | |
female | -.5118235 .0939119 -5.45 0.000 -.6959954 -.3276516 | |
| | |
health | | |
2 | -.293325 .1115856 -2.63 0.009 -.5121571 -.0744928 | |
3 | -.3068507 .1356245 -2.26 0.024 -.5728256 -.0408758 | |
4 | -.450338 .2229726 -2.02 0.044 -.8876125 -.0130635 | |
5 | -.0378434 .4049108 -0.09 0.926 -.8319194 .7562325 | |
| | |
lowinc | -.3094155 .129387 -2.39 0.017 -.563158 -.055673 | |
| | |
pol3 | | |
1 | .1433662 .1130665 1.27 0.205 -.07837 .3651024 | |
3 | .0290743 .1146745 0.25 0.800 -.1958155 .2539641 | |
| | |
_cons | 6.101753 .1548257 39.41 0.000 5.798122 6.405383 | |
------------------------------------------------------------------------------ | |
(est2 stored) | |
. | |
. * Run identical model on satisfaction with education. | |
. bys cntry: eststo: reg esat ib45.age6 female i.health lowinc ib2.pol3 | |
------------------------------------------------------------------------------------ | |
-> FR | |
Source | SS df MS Number of obs = 1918 | |
-------------+------------------------------ F( 13, 1904) = 4.38 | |
Model | 239.038913 13 18.3876087 Prob > F = 0.0000 | |
Residual | 7986.26714 1904 4.19446803 R-squared = 0.0291 | |
-------------+------------------------------ Adj R-squared = 0.0224 | |
Total | 8225.30605 1917 4.29071781 Root MSE = 2.048 | |
------------------------------------------------------------------------------ | |
esat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .338069 .1848041 1.83 0.068 -.0243707 .7005088 | |
25 | .1280543 .1638014 0.78 0.434 -.1931947 .4493033 | |
35 | .3197301 .1573435 2.03 0.042 .0111463 .6283139 | |
55 | .138363 .1598155 0.87 0.387 -.1750689 .4517949 | |
65 | .0144126 .1534757 0.09 0.925 -.2865856 .3154108 | |
| | |
female | .0175677 .0942759 0.19 0.852 -.1673271 .2024626 | |
| | |
health | | |
2 | -.1528419 .1275963 -1.20 0.231 -.403085 .0974013 | |
3 | -.3176335 .1430491 -2.22 0.027 -.598183 -.037084 | |
4 | -.406681 .2250693 -1.81 0.071 -.8480894 .0347273 | |
5 | -.7693203 .4576307 -1.68 0.093 -1.666831 .1281899 | |
| | |
lowinc | -.4343483 .1345077 -3.23 0.001 -.6981462 -.1705504 | |
| | |
pol3 | | |
1 | -.3903843 .116867 -3.34 0.001 -.6195851 -.1611835 | |
3 | .0564588 .1212735 0.47 0.642 -.1813841 .2943017 | |
| | |
_cons | 5.209674 .1780503 29.26 0.000 4.86048 5.558868 | |
------------------------------------------------------------------------------ | |
(est3 stored) | |
------------------------------------------------------------------------------------ | |
-> GB | |
Source | SS df MS Number of obs = 2028 | |
-------------+------------------------------ F( 13, 2014) = 3.10 | |
Model | 171.674796 13 13.2057536 Prob > F = 0.0001 | |
Residual | 8578.75666 2014 4.2595614 R-squared = 0.0196 | |
-------------+------------------------------ Adj R-squared = 0.0133 | |
Total | 8750.43146 2027 4.31693708 Root MSE = 2.0639 | |
------------------------------------------------------------------------------ | |
esat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .7480398 .1915159 3.91 0.000 .3724498 1.12363 | |
25 | .3473778 .1630304 2.13 0.033 .0276521 .6671036 | |
35 | .2107234 .1509536 1.40 0.163 -.0853182 .5067649 | |
55 | .103085 .1603123 0.64 0.520 -.2113102 .4174802 | |
65 | .3305423 .1478204 2.24 0.025 .0406455 .6204391 | |
| | |
female | -.0976372 .0924553 -1.06 0.291 -.2789551 .0836808 | |
| | |
health | | |
2 | -.0752296 .1096759 -0.69 0.493 -.2903197 .1398605 | |
3 | -.1966405 .1336877 -1.47 0.141 -.458821 .0655401 | |
4 | -.4759092 .2179303 -2.18 0.029 -.9033015 -.0485169 | |
5 | -.0885606 .3994685 -0.22 0.825 -.8719753 .6948541 | |
| | |
lowinc | -.1847112 .1266079 -1.46 0.145 -.4330074 .0635851 | |
| | |
pol3 | | |
1 | .0261182 .1114108 0.23 0.815 -.1923743 .2446108 | |
3 | -.3412746 .1128567 -3.02 0.003 -.5626027 -.1199465 | |
| | |
_cons | 5.729339 .1530038 37.45 0.000 5.429277 6.029401 | |
------------------------------------------------------------------------------ | |
(est4 stored) | |
. | |
. * Run identical model on satisfaction with government. | |
. bys cntry: eststo: reg gsat ib45.age6 female i.health lowinc ib2.pol3 | |
------------------------------------------------------------------------------------ | |
-> FR | |
Source | SS df MS Number of obs = 1927 | |
-------------+------------------------------ F( 13, 1913) = 62.78 | |
Model | 3143.99841 13 241.846032 Prob > F = 0.0000 | |
Residual | 7368.8267 1913 3.85197423 R-squared = 0.2991 | |
-------------+------------------------------ Adj R-squared = 0.2943 | |
Total | 10512.8251 1926 5.45837233 Root MSE = 1.9626 | |
------------------------------------------------------------------------------ | |
gsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .1415863 .1774439 0.80 0.425 -.2064176 .4895902 | |
25 | -.1676006 .1563996 -1.07 0.284 -.4743322 .139131 | |
35 | -.1988654 .1504437 -1.32 0.186 -.4939163 .0961855 | |
55 | .0137179 .1526804 0.09 0.928 -.2857197 .3131554 | |
65 | .2527315 .1462162 1.73 0.084 -.0340285 .5394915 | |
| | |
female | -.0071429 .090083 -0.08 0.937 -.1838141 .1695283 | |
| | |
health | | |
2 | -.2660753 .1221563 -2.18 0.030 -.5056489 -.0265017 | |
3 | -.3098499 .136446 -2.27 0.023 -.5774485 -.0422513 | |
4 | -.5489245 .214653 -2.56 0.011 -.969903 -.127946 | |
5 | -.6821846 .4385135 -1.56 0.120 -1.542199 .1778301 | |
| | |
lowinc | -.5982746 .1294904 -4.62 0.000 -.8522318 -.3443175 | |
| | |
pol3 | | |
1 | -1.447156 .1121175 -12.91 0.000 -1.667042 -1.227271 | |
3 | 1.368768 .1158465 11.82 0.000 1.141569 1.595967 | |
| | |
_cons | 4.312006 .1701966 25.34 0.000 3.978216 4.645797 | |
------------------------------------------------------------------------------ | |
(est5 stored) | |
------------------------------------------------------------------------------------ | |
-> GB | |
Source | SS df MS Number of obs = 2070 | |
-------------+------------------------------ F( 13, 2056) = 6.55 | |
Model | 442.594151 13 34.045704 Prob > F = 0.0000 | |
Residual | 10690.3604 2056 5.19959165 R-squared = 0.0398 | |
-------------+------------------------------ Adj R-squared = 0.0337 | |
Total | 11132.9546 2069 5.38083837 Root MSE = 2.2803 | |
------------------------------------------------------------------------------ | |
gsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .8183095 .2110845 3.88 0.000 .4043478 1.232271 | |
25 | .1340198 .178367 0.75 0.453 -.2157789 .4838186 | |
35 | -.0549785 .1651793 -0.33 0.739 -.3789146 .2689576 | |
55 | .0306191 .1744077 0.18 0.861 -.311415 .3726533 | |
65 | .3743036 .1611006 2.32 0.020 .0583662 .6902409 | |
| | |
female | -.2991263 .1010176 -2.96 0.003 -.4972338 -.1010188 | |
| | |
health | | |
2 | -.1736806 .1198973 -1.45 0.148 -.4088133 .0614521 | |
3 | -.495565 .1459817 -3.39 0.001 -.7818524 -.2092776 | |
4 | -.5286341 .2402121 -2.20 0.028 -.9997185 -.0575497 | |
5 | -.3879851 .4346074 -0.89 0.372 -1.240302 .4643315 | |
| | |
lowinc | -.4232963 .1390684 -3.04 0.002 -.6960259 -.1505666 | |
| | |
pol3 | | |
1 | .4348585 .1217063 3.57 0.000 .196178 .673539 | |
3 | -.2616768 .1232299 -2.12 0.034 -.5033452 -.0200084 | |
| | |
_cons | 3.765473 .1664704 22.62 0.000 3.439004 4.091941 | |
------------------------------------------------------------------------------ | |
(est6 stored) | |
. | |
. * View updated list of model estimates. | |
. eststo dir | |
------------------------------------------------------- | |
name | command depvar npar title | |
-------------+----------------------------------------- | |
est1 | regress hsat 17 FR | |
est2 | regress hsat 17 GB | |
est3 | regress esat 17 FR | |
est4 | regress esat 17 GB | |
est5 | regress gsat 17 FR | |
est6 | regress gsat 17 GB | |
------------------------------------------------------- | |
. | |
. * Compare DV and covariates in each country, using standardized coefficients, | |
. * RMSE and R-squared to compare predicted variance across the models. | |
. esttab est1 est3 est5, lab nogaps beta(2) se(2) sca(rmse) r2 /// | |
> mti("Health" "Education" "Government") ti("France") | |
France | |
-------------------------------------------------------------------- | |
(1) (2) (3) | |
Health Education Government | |
-------------------------------------------------------------------- | |
15.Age groups 0.08** 0.05 0.02 | |
(0.18) (0.18) (0.18) | |
25.Age groups 0.07** 0.02 -0.03 | |
(0.16) (0.16) (0.16) | |
35.Age groups 0.05 0.06* -0.03 | |
(0.16) (0.16) (0.15) | |
45b.Age groups 0.00 0.00 0.00 | |
(.) (.) (.) | |
55.Age groups 0.00 0.03 0.00 | |
(0.16) (0.16) (0.15) | |
65.Age groups 0.03 0.00 0.04 | |
(0.15) (0.15) (0.15) | |
Gender -0.09*** 0.00 -0.00 | |
(0.09) (0.09) (0.09) | |
1b.Subjective gene~h 0.00 0.00 0.00 | |
(.) (.) (.) | |
2.Subjective gener~h -0.09** -0.04 -0.06* | |
(0.13) (0.13) (0.12) | |
3.Subjective gener~h -0.10*** -0.07* -0.06* | |
(0.14) (0.14) (0.14) | |
4.Subjective gener~h -0.06* -0.05 -0.06* | |
(0.22) (0.23) (0.21) | |
5.Subjective gener~h -0.05* -0.04 -0.03 | |
(0.45) (0.46) (0.44) | |
Subjective low inc~e -0.04 -0.08** -0.09*** | |
(0.13) (0.13) (0.13) | |
1.Political views ~) -0.12*** -0.09*** -0.30*** | |
(0.12) (0.12) (0.11) | |
2b.Political views~) 0.00 0.00 0.00 | |
(.) (.) (.) | |
3.Political views ~) 0.08** 0.01 0.28*** | |
(0.12) (0.12) (0.12) | |
-------------------------------------------------------------------- | |
Observations 1942 1918 1927 | |
R-squared 0.064 0.029 0.299 | |
rmse 2.035 2.048 1.963 | |
-------------------------------------------------------------------- | |
Standardized beta coefficients; Standard errors in parentheses | |
* p<0.05, ** p<0.01, *** p<0.001 | |
. | |
. esttab est2 est4 est6, lab nogaps beta(2) se(2) sca(rmse) r2 /// | |
> mti("Health" "Education" "Government") ti("UK") | |
UK | |
-------------------------------------------------------------------- | |
(1) (2) (3) | |
Health Education Government | |
-------------------------------------------------------------------- | |
15.Age groups 0.02 0.10*** 0.10*** | |
(0.20) (0.19) (0.21) | |
25.Age groups -0.01 0.06* 0.02 | |
(0.17) (0.16) (0.18) | |
35.Age groups -0.01 0.04 -0.01 | |
(0.15) (0.15) (0.17) | |
45b.Age groups 0.00 0.00 0.00 | |
(.) (.) (.) | |
55.Age groups 0.06* 0.02 0.00 | |
(0.16) (0.16) (0.17) | |
65.Age groups 0.24*** 0.07* 0.07* | |
(0.15) (0.15) (0.16) | |
Gender -0.12*** -0.02 -0.06** | |
(0.09) (0.09) (0.10) | |
1b.Subjective gene~h 0.00 0.00 0.00 | |
(.) (.) (.) | |
2.Subjective gener~h -0.07** -0.02 -0.04 | |
(0.11) (0.11) (0.12) | |
3.Subjective gener~h -0.06* -0.04 -0.09*** | |
(0.14) (0.13) (0.15) | |
4.Subjective gener~h -0.05* -0.05* -0.05* | |
(0.22) (0.22) (0.24) | |
5.Subjective gener~h -0.00 -0.01 -0.02 | |
(0.40) (0.40) (0.43) | |
Subjective low inc~e -0.05* -0.03 -0.07** | |
(0.13) (0.13) (0.14) | |
1.Political views ~) 0.03 0.01 0.08*** | |
(0.11) (0.11) (0.12) | |
2b.Political views~) 0.00 0.00 0.00 | |
(.) (.) (.) | |
3.Political views ~) 0.01 -0.07** -0.05* | |
(0.11) (0.11) (0.12) | |
-------------------------------------------------------------------- | |
Observations 2079 2028 2070 | |
R-squared 0.077 0.020 0.040 | |
rmse 2.125 2.064 2.280 | |
-------------------------------------------------------------------- | |
Standardized beta coefficients; Standard errors in parentheses | |
* p<0.05, ** p<0.01, *** p<0.001 | |
. | |
. /* Basic usage of -estout- commands: | |
> | |
> - The -estout- commands work by storing model estimates with -eststo- and then | |
> putting them into tables with -esttab-. Use these commands at the end of your | |
> models: start with -reg- and -leanout-, then use -eststo- and -esttab-. | |
> | |
> - The -estout- command is especially practical when you run many models, as | |
> shown here when we compare the model between country cases and then check | |
> how the DV model compares to other satisfaction measures (covariates). */ | |
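The store-then-tabulate workflow described above can be tried on any dataset. This is a hypothetical sketch on Stata's built-in auto dataset rather than the course data; the model specifications and titles are placeholders, and it assumes the estout package is installed.

```stata
* Hypothetical example on the built-in auto dataset, not the course data.
* Requires the estout package: ssc install estout
sysuse auto, clear

* Store two nested models under the default names est1 and est2.
eststo clear
eststo: quietly regress price mpg weight
eststo: quietly regress price mpg weight i.foreign

* Inspect what was stored, then print both models side by side.
eststo dir
esttab, se r2 mtitles("Base" "With origin")
```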
. | |
. | |
. * ========================== | |
. * = REGRESSION DIAGNOSTICS = | |
. * ========================== | |
. | |
. | |
. * Note: what we call 'diagnostics' at this stage actually covers a broader range | |
. * of postestimation commands like -margins- and -marginsplot- (marginal effects) | |
. * or seemingly unrelated regression (SUREG). The overall logic of these commands | |
. * is to help with the detection of patterns that are not taken into account by | |
. * our 'front-end' linear regression model. | |
. | |
. | |
. * (1) France: Residuals | |
. * --------------------- | |
. | |
. reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "FR" | |
Source | SS df MS Number of obs = 1942 | |
-------------+------------------------------ F( 13, 1928) = 10.15 | |
Model | 546.017183 13 42.0013217 Prob > F = 0.0000 | |
Residual | 7981.98076 1928 4.14003151 R-squared = 0.0640 | |
-------------+------------------------------ Adj R-squared = 0.0577 | |
Total | 8527.99794 1941 4.39361048 Root MSE = 2.0347 | |
------------------------------------------------------------------------------ | |
hsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .5757433 .1831204 3.14 0.002 .2166085 .9348781 | |
25 | .4181367 .1620445 2.58 0.010 .1003357 .7359377 | |
35 | .2737408 .1555479 1.76 0.079 -.031319 .5788006 | |
55 | .0178386 .1577826 0.11 0.910 -.2916038 .327281 | |
65 | .1721822 .1509785 1.14 0.254 -.1239161 .4682806 | |
| | |
female | -.3929954 .0930814 -4.22 0.000 -.5755461 -.2104446 | |
| | |
health | | |
2 | -.367027 .1261025 -2.91 0.004 -.6143386 -.1197154 | |
3 | -.4762367 .1408647 -3.38 0.001 -.7524999 -.1999735 | |
4 | -.5536348 .2210454 -2.50 0.012 -.9871479 -.1201216 | |
5 | -1.020825 .4543408 -2.25 0.025 -1.911876 -.1297743 | |
| | |
lowinc | -.2263043 .1330334 -1.70 0.089 -.4872088 .0346003 | |
| | |
pol3 | | |
1 | -.5218848 .1154915 -4.52 0.000 -.7483861 -.2953834 | |
3 | .3431802 .1195869 2.87 0.004 .1086469 .5777134 | |
| | |
_cons | 6.458039 .1756112 36.77 0.000 6.113631 6.802447 | |
------------------------------------------------------------------------------ | |
. | |
. * Variance inflation. | |
. vif | |
Variable | VIF 1/VIF | |
-------------+---------------------- | |
age6 | | |
15 | 1.47 0.682149 | |
25 | 1.60 0.623269 | |
35 | 1.66 0.601749 | |
55 | 1.64 0.608562 | |
65 | 1.81 0.551771 | |
female | 1.01 0.989806 | |
health | | |
2 | 1.84 0.544567 | |
3 | 1.90 0.525788 | |
4 | 1.34 0.746783 | |
5 | 1.08 0.922074 | |
lowinc | 1.07 0.935009 | |
pol3 | | |
1 | 1.48 0.676967 | |
3 | 1.50 0.666409 | |
-------------+---------------------- | |
Mean VIF | 1.49 | |
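.
. /* Where the VIF comes from (a sketch, not run in this log): the VIF of a
> predictor is 1 / (1 - R2), where R2 is obtained by regressing that predictor
> on all the other predictors of the model. For the female dummy, assuming the
> French model is still in memory (the name m_fr is a placeholder):
>
>     estimates store m_fr
>     reg female ib45.age6 i.health lowinc ib2.pol3 if e(sample)
>     di "VIF(female) = " 1 / (1 - e(r2))
>     estimates restore m_fr
>
> The result should match the value reported by -vif- above. Restoring the
> stored estimates matters because the auxiliary regression replaces the model
> that later postestimation commands rely on. */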
. | |
. * Residuals-versus-fitted values plot. | |
. rvfplot, yline(0) /// | |
> name(rvf_fr, replace) | |
. | |
. * Store the standardized residuals for the estimation sample (France only). | |
. cap drop rst_fr | |
. predict rst_fr if e(sample), rsta | |
(2079 missing values generated) | |
. | |
. * Distribution of the standardized residuals. | |
. hist rst_fr, normal /// | |
> name(rst_fr_1, replace) | |
(bin=32, start=-3.3201849, width=.17776279) | |
. | |
. * Store the predicted values for the estimation sample (France only). | |
. cap drop yhat_fr | |
. predict yhat_fr if e(sample) | |
(option xb assumed; fitted values) | |
(2079 missing values generated) | |
. | |
. * Plot the distribution of the standardized residuals over socio-demographics. | |
. hist rst_fr, normal by(female age6, legend(off)) bin(10) xline(0) /// | |
> name(rst_fr_2, replace) | |
. | |
. * Plot the residuals-versus-fitted values by income and political views. | |
. sc rst_fr yhat_fr, by(pol3 lowinc, col(2) legend(off)) yline(0) /// | |
> name(rst_fr_3, replace) | |
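.
. /* The plots above can be complemented with formal checks (a sketch, not run in
> this log, to be typed right after the -reg- and -predict- commands above):
>
>     estat hettest      // Breusch-Pagan test of constant residual variance
>     qnorm rst_fr       // quantile plot of the residuals against the normal
>     sktest rst_fr      // skewness and kurtosis test of normality
>
> With almost 2,000 observations, these tests can flag departures that are too
> small to matter, so read them together with the plots. */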
. | |
. | |
. * (2) France: Marginal effects | |
. * ---------------------------- | |
. | |
. * Briefly recall the model by calling -reg- without any new specification. | |
. reg | |
Source | SS df MS Number of obs = 1942 | |
-------------+------------------------------ F( 13, 1928) = 10.15 | |
Model | 546.017183 13 42.0013217 Prob > F = 0.0000 | |
Residual | 7981.98076 1928 4.14003151 R-squared = 0.0640 | |
-------------+------------------------------ Adj R-squared = 0.0577 | |
Total | 8527.99794 1941 4.39361048 Root MSE = 2.0347 | |
------------------------------------------------------------------------------ | |
hsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .5757433 .1831204 3.14 0.002 .2166085 .9348781 | |
25 | .4181367 .1620445 2.58 0.010 .1003357 .7359377 | |
35 | .2737408 .1555479 1.76 0.079 -.031319 .5788006 | |
55 | .0178386 .1577826 0.11 0.910 -.2916038 .327281 | |
65 | .1721822 .1509785 1.14 0.254 -.1239161 .4682806 | |
| | |
female | -.3929954 .0930814 -4.22 0.000 -.5755461 -.2104446 | |
| | |
health | | |
2 | -.367027 .1261025 -2.91 0.004 -.6143386 -.1197154 | |
3 | -.4762367 .1408647 -3.38 0.001 -.7524999 -.1999735 | |
4 | -.5536348 .2210454 -2.50 0.012 -.9871479 -.1201216 | |
5 | -1.020825 .4543408 -2.25 0.025 -1.911876 -.1297743 | |
| | |
lowinc | -.2263043 .1330334 -1.70 0.089 -.4872088 .0346003 | |
| | |
pol3 | | |
1 | -.5218848 .1154915 -4.52 0.000 -.7483861 -.2953834 | |
3 | .3431802 .1195869 2.87 0.004 .1086469 .5777134 | |
| | |
_cons | 6.458039 .1756112 36.77 0.000 6.113631 6.802447 | |
------------------------------------------------------------------------------ | |
. | |
. * What is observable above is the (positive) linear association between one
. * predictor and the DV: all other things kept equal, rightwing views go with a
. * higher level of satisfaction with health services, independently of age,
. * gender, income and so on. You can show the same thing by computing the
. * predictive margins (adjusted predictions) of the DV at each level of the IV
. * with the -margins- command.
. margins pol3 | |
Predictive margins Number of obs = 1942 | |
Model VCE : OLS | |
Expression : Linear prediction, predict() | |
------------------------------------------------------------------------------ | |
| Delta-method | |
| Margin Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
pol3 | | |
1 | 5.560562 .0752119 73.93 0.000 5.41315 5.707975 | |
2 | 6.082447 .0876891 69.36 0.000 5.91058 6.254315 | |
3 | 6.425627 .0804466 79.87 0.000 6.267955 6.5833 | |
------------------------------------------------------------------------------ | |
. marginsplot, /// | |
> name(margins_pol3_fr, replace) | |
Variables that uniquely identify margins: pol3 | |
. | |
. * Let's plot the predictions for political views and health status combined.
. * The model has no interaction term, so the gaps between political views are
. * the same at every level of health. What changes is the width of the confidence
. * intervals, which grow as health degrades until the political groups become
. * indistinguishable.
. margins health#pol3 | |
Predictive margins Number of obs = 1942 | |
Model VCE : OLS | |
Expression : Linear prediction, predict() | |
------------------------------------------------------------------------------ | |
| Delta-method | |
| Margin Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
health#pol3 | | |
1 1 | 5.903804 .1219382 48.42 0.000 5.66481 6.142799 | |
1 2 | 6.425689 .1317597 48.77 0.000 6.167445 6.683933 | |
1 3 | 6.768869 .1224024 55.30 0.000 6.528965 7.008773 | |
2 1 | 5.536777 .0923518 59.95 0.000 5.355771 5.717783 | |
2 2 | 6.058662 .1015952 59.64 0.000 5.859539 6.257785 | |
2 3 | 6.401842 .0963706 66.43 0.000 6.212959 6.590725 | |
3 1 | 5.427567 .1057111 51.34 0.000 5.220377 5.634757 | |
3 2 | 5.949452 .1147927 51.83 0.000 5.724463 6.174442 | |
3 3 | 6.292632 .1109473 56.72 0.000 6.07518 6.510085 | |
4 1 | 5.350169 .1974214 27.10 0.000 4.963231 5.737108 | |
4 2 | 5.872054 .203375 28.87 0.000 5.473446 6.270662 | |
4 3 | 6.215234 .2017018 30.81 0.000 5.819906 6.610562 | |
5 1 | 4.882979 .4425538 11.03 0.000 4.015589 5.750368 | |
5 2 | 5.404864 .4447404 12.15 0.000 4.533189 6.276539 | |
5 3 | 5.748044 .4453698 12.91 0.000 4.875135 6.620953 | |
------------------------------------------------------------------------------ | |
. marginsplot, recast(line) recastci(rarea) ciopts(fi(25)) legend(row(1)) /// | |
> name(margins_health_pol3_fr, replace) | |
Variables that uniquely identify margins: health pol3 | |
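.
. /* The model above contains no health-by-politics interaction term. A sketch
> of how one could be added with the ## operator (not run in this log):
>
>     reg hsat ib45.age6 female i.health##ib2.pol3 lowinc if cntry == "FR"
>     margins health#pol3
>     marginsplot, recast(line) recastci(rarea) ciopts(fi(25)) legend(row(1))
>
> In that specification the gaps between political views are free to vary with
> health status. Cells at the worst health levels may hold very few
> respondents, so expect wide intervals or margins that cannot be estimated. */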
. | |
. | |
. * (3) Britain: Exercise | |
. * --------------------- | |
. | |
. reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "GB" | |
Source | SS df MS Number of obs = 2079 | |
-------------+------------------------------ F( 13, 2065) = 13.25 | |
Model | 777.733419 13 59.8256476 Prob > F = 0.0000 | |
Residual | 9321.59222 2065 4.51408824 R-squared = 0.0770 | |
-------------+------------------------------ Adj R-squared = 0.0712 | |
Total | 10099.3256 2078 4.86011821 Root MSE = 2.1246 | |
------------------------------------------------------------------------------ | |
hsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .1577784 .1958508 0.81 0.421 -.2263073 .541864 | |
25 | -.0492595 .1659278 -0.30 0.767 -.3746627 .2761437 | |
35 | -.0429892 .1537828 -0.28 0.780 -.3445747 .2585963 | |
55 | .3753688 .1624 2.31 0.021 .056884 .6938537 | |
65 | 1.240436 .1496894 8.29 0.000 .946878 1.533994 | |
| | |
female | -.5118235 .0939119 -5.45 0.000 -.6959954 -.3276516 | |
| | |
health | | |
2 | -.293325 .1115856 -2.63 0.009 -.5121571 -.0744928 | |
3 | -.3068507 .1356245 -2.26 0.024 -.5728256 -.0408758 | |
4 | -.450338 .2229726 -2.02 0.044 -.8876125 -.0130635 | |
5 | -.0378434 .4049108 -0.09 0.926 -.8319194 .7562325 | |
| | |
lowinc | -.3094155 .129387 -2.39 0.017 -.563158 -.055673 | |
| | |
pol3 | | |
1 | .1433662 .1130665 1.27 0.205 -.07837 .3651024 | |
3 | .0290743 .1146745 0.25 0.800 -.1958155 .2539641 | |
| | |
_cons | 6.101753 .1548257 39.41 0.000 5.798122 6.405383 | |
------------------------------------------------------------------------------ | |
. | |
. * As an exercise, run your own selection of regression diagnostics and marginal | |
. * effects for the British model. Compare the predictors in each country and see, | |
. * for instance, if age and political views have the same effects in Britain. | |
. | |
. | |
. * ============== | |
. * = EXTENSIONS = | |
. * ============== | |
. | |
. | |
. * Note: this section showcases some methods that are related to the content of | |
. * the course, but go beyond its scope. Both techniques yield corrected standard | |
. * errors, which is crucial for panel data analysis. These methods require more | |
. * theoretical support (and possibly different data) to operate, and are shown | |
. * here for demonstration purposes only. | |
. | |
. | |
. * (1) Bootstrapping | |
. * ----------------- | |
. | |
. reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "FR", /// | |
> vce(bootstrap, r(100)) | |
(running regress on estimation sample) | |
Bootstrap replications (100) | |
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 | |
.................................................. 50 | |
.................................................. 100 | |
Linear regression Number of obs = 1942 | |
Replications = 100 | |
Wald chi2(13) = 190.75 | |
Prob > chi2 = 0.0000 | |
R-squared = 0.0640 | |
Adj R-squared = 0.0577 | |
Root MSE = 2.0347 | |
------------------------------------------------------------------------------ | |
| Observed Bootstrap Normal-based | |
hsat | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .5757433 .1552132 3.71 0.000 .271531 .8799556 | |
25 | .4181367 .1511703 2.77 0.006 .1218484 .714425 | |
35 | .2737408 .1576567 1.74 0.083 -.0352607 .5827422 | |
55 | .0178386 .1424735 0.13 0.900 -.2614044 .2970815 | |
65 | .1721822 .1371706 1.26 0.209 -.0966671 .4410316 | |
| | |
female | -.3929954 .0873834 -4.50 0.000 -.5642636 -.2217271 | |
| | |
health | | |
2 | -.367027 .1157262 -3.17 0.002 -.5938462 -.1402078 | |
3 | -.4762367 .1525456 -3.12 0.002 -.7752206 -.1772528 | |
4 | -.5536348 .2265738 -2.44 0.015 -.9977112 -.1095583 | |
5 | -1.020825 .5374374 -1.90 0.058 -2.074183 .0325327 | |
| | |
lowinc | -.2263043 .1409385 -1.61 0.108 -.5025387 .0499301 | |
| | |
pol3 | | |
1 | -.5218848 .1215707 -4.29 0.000 -.7601589 -.2836106 | |
3 | .3431802 .1201522 2.86 0.004 .1076861 .5786742 | |
| | |
_cons | 6.458039 .1466394 44.04 0.000 6.170631 6.745447 | |
------------------------------------------------------------------------------ | |
. reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "GB", /// | |
> vce(bootstrap, r(100)) | |
(running regress on estimation sample) | |
Bootstrap replications (100) | |
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 | |
.................................................. 50 | |
.................................................. 100 | |
Linear regression Number of obs = 2079 | |
Replications = 100 | |
Wald chi2(13) = 184.49 | |
Prob > chi2 = 0.0000 | |
R-squared = 0.0770 | |
Adj R-squared = 0.0712 | |
Root MSE = 2.1246 | |
------------------------------------------------------------------------------ | |
| Observed Bootstrap Normal-based | |
hsat | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .1577784 .1683583 0.94 0.349 -.1721979 .4877546 | |
25 | -.0492595 .1620299 -0.30 0.761 -.3668323 .2683134 | |
35 | -.0429892 .1435742 -0.30 0.765 -.3243894 .238411 | |
55 | .3753688 .1788985 2.10 0.036 .0247341 .7260035 | |
65 | 1.240436 .14953 8.30 0.000 .9473627 1.533509 | |
| | |
female | -.5118235 .086252 -5.93 0.000 -.6808744 -.3427726 | |
| | |
health | | |
2 | -.293325 .1110796 -2.64 0.008 -.5110371 -.0756128 | |
3 | -.3068507 .1450168 -2.12 0.034 -.5910784 -.022623 | |
4 | -.450338 .227803 -1.98 0.048 -.8968236 -.0038524 | |
5 | -.0378434 .4747934 -0.08 0.936 -.9684214 .8927345 | |
| | |
lowinc | -.3094155 .1357964 -2.28 0.023 -.5755715 -.0432595 | |
| | |
pol3 | | |
1 | .1433662 .1139235 1.26 0.208 -.0799197 .3666521 | |
3 | .0290743 .1141306 0.25 0.799 -.1946176 .2527663 | |
| | |
_cons | 6.101753 .1599751 38.14 0.000 5.788207 6.415298 | |
------------------------------------------------------------------------------ | |
. | |
. /* What happened here:
>
> - Bootstrapping is a simulation technique that resamples the data as many times
> as you ask it to (here we ran 100 replications), re-estimates the model on
> each resample, and then computes the standard error of each coefficient from
> the standard deviation of its estimates across replications.
>
> - Resampling means that the data used in each simulation are randomly selected
> from the original dataset, with replacement: one observation may appear many
> times. The result is 100 simulated datasets of the same size as the original,
> each with a slightly different composition.
>
> - Bootstrapping is particularly useful when the analytical formula for the
> standard error relies on assumptions that the data might violate. It applies
> to estimation commands like -reg- through the -vce(bootstrap)- option, and to
> other commands like -su- through the -bootstrap- prefix. */
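.
. /* Bootstrap results change from one run to the next unless the random-number
> seed is fixed. A sketch (not run in this log; the seed value and the number
> of replications are arbitrary choices):
>
>     reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "FR", ///
>         vce(bootstrap, reps(1000) seed(12345))
>     estat bootstrap, percentile
>
> The last command reports percentile confidence intervals, which do not rely
> on the normal approximation used in the tables above. */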
. | |
. | |
. * (2) Clustered standard errors | |
. * ----------------------------- | |
. | |
. * Remember that we saved the initial models as 'est1' (FR) and 'est2' (GB). | |
. eststo dir | |
------------------------------------------------------- | |
name | command depvar npar title | |
-------------+----------------------------------------- | |
est1 | regress hsat 17 FR | |
est2 | regress hsat 17 GB | |
est3 | regress esat 17 FR | |
est4 | regress esat 17 GB | |
est5 | regress gsat 17 FR | |
est6 | regress gsat 17 GB | |
------------------------------------------------------- | |
. | |
. * The next command stores the right-hand side of the regression equation, i.e. | |
. * the list of predictors (IVs), into a convenient string of text handled by | |
. * Stata as a local macro. This works almost like the global macro trick we saw | |
. * before, and becomes useful when you have long lists of predictors. | |
. local rhs "ib45.age6 female i.health lowinc ib2.pol3" | |
. | |
. * IMPORTANT: storing the variable names into a local macro is technically more
. * appropriate than using a global one as we did in an earlier do-file. However,
. * this comes with additional constraints: local macros are handled with `ticks'
. * instead of the $dollar sign, and they have to be run in the same sequence as
. * the regression commands to work properly, WITHOUT stopping execution. This
. * means that your local macros will work only if you run the whole code block
. * (the -local- line above AND the -reg- commands), or the whole do-file.
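.
. /* The difference in practice (a sketch, not run in this log):
>
>     global rhs "ib45.age6 female i.health lowinc ib2.pol3"
>     reg hsat $rhs if cntry == "FR"     // works until the global is dropped
>
>     local rhs "ib45.age6 female i.health lowinc ib2.pol3"
>     reg hsat `rhs' if cntry == "FR"    // works only if run with the line above
>
> If the last line is run on its own, `rhs' is empty and Stata silently fits
> -reg hsat- with a constant only, so check the output for missing predictors. */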
. | |
. * Store robust models. | |
. eststo FRr: reg hsat `rhs' if cntry == "FR", vce(cluster regionfr) | |
Linear regression Number of obs = 1942 | |
F( 7, 8) = . | |
Prob > F = . | |
R-squared = 0.0640 | |
Root MSE = 2.0347 | |
(Std. Err. adjusted for 9 clusters in regionfr) | |
------------------------------------------------------------------------------ | |
| Robust | |
hsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .5757433 .1394916 4.13 0.003 .2540751 .8974116 | |
25 | .4181367 .2011238 2.08 0.071 -.0456555 .8819289 | |
35 | .2737408 .1915253 1.43 0.191 -.1679173 .7153988 | |
55 | .0178386 .2260695 0.08 0.939 -.5034787 .5391558 | |
65 | .1721822 .1770583 0.97 0.359 -.2361149 .5804794 | |
| | |
female | -.3929954 .0839611 -4.68 0.002 -.5866099 -.1993808 | |
| | |
health | | |
2 | -.367027 .1075007 -3.41 0.009 -.6149241 -.1191299 | |
3 | -.4762367 .1366204 -3.49 0.008 -.791284 -.1611894 | |
4 | -.5536348 .1513388 -3.66 0.006 -.9026228 -.2046468 | |
5 | -1.020825 .4291435 -2.38 0.045 -2.010432 -.0312185 | |
| | |
lowinc | -.2263043 .136934 -1.65 0.137 -.5420745 .089466 | |
| | |
pol3 | | |
1 | -.5218848 .1065375 -4.90 0.001 -.7675607 -.2762088 | |
3 | .3431802 .1310755 2.62 0.031 .0409196 .6454407 | |
| | |
_cons | 6.458039 .1159184 55.71 0.000 6.190731 6.725348 | |
------------------------------------------------------------------------------ | |
. eststo GBr: reg hsat `rhs' if cntry == "GB", vce(cluster regiongb) | |
Linear regression Number of obs = 2079 | |
F( 10, 11) = . | |
Prob > F = . | |
R-squared = 0.0770 | |
Root MSE = 2.1246 | |
(Std. Err. adjusted for 12 clusters in regiongb) | |
------------------------------------------------------------------------------ | |
| Robust | |
hsat | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
age6 | | |
15 | .1577784 .1483508 1.06 0.310 -.1687396 .4842963 | |
25 | -.0492595 .2052812 -0.24 0.815 -.5010803 .4025613 | |
35 | -.0429892 .1393327 -0.31 0.763 -.3496585 .2636801 | |
55 | .3753688 .1441162 2.60 0.024 .0581713 .6925664 | |
65 | 1.240436 .1346074 9.22 0.000 .9441671 1.536705 | |
| | |
female | -.5118235 .0689489 -7.42 0.000 -.663579 -.3600681 | |
| | |
health | | |
2 | -.293325 .138153 -2.12 0.057 -.5973977 .0107478 | |
3 | -.3068507 .1851052 -1.66 0.126 -.7142644 .100563 | |
4 | -.450338 .1975345 -2.28 0.044 -.8851085 -.0155674 | |
5 | -.0378434 .5442734 -0.07 0.946 -1.235781 1.160094 | |
| | |
lowinc | -.3094155 .0971279 -3.19 0.009 -.5231926 -.0956384 | |
| | |
pol3 | | |
1 | .1433662 .1649275 0.87 0.403 -.2196369 .5063693 | |
3 | .0290743 .1177467 0.25 0.810 -.2300844 .2882331 | |
| | |
_cons | 6.101753 .124926 48.84 0.000 5.826793 6.376713 | |
------------------------------------------------------------------------------ | |
. | |
. * Compare both versions for a more realistic assessment of the standard errors. | |
. esttab est1 FRr est2 GBr, nogaps b(2) se(2) sca(rmse) compress /// | |
> mti("FR" "FR robust" "GB" "GB robust") | |
-------------------------------------------------------------- | |
(1) (2) (3) (4) | |
FR FR robust GB GB robust | |
-------------------------------------------------------------- | |
15.age6 0.58** 0.58** 0.16 0.16 | |
(0.18) (0.14) (0.20) (0.15) | |
25.age6 0.42** 0.42 -0.05 -0.05 | |
(0.16) (0.20) (0.17) (0.21) | |
35.age6 0.27 0.27 -0.04 -0.04 | |
(0.16) (0.19) (0.15) (0.14) | |
45b.age6 0.00 0.00 0.00 0.00 | |
(.) (.) (.) (.) | |
55.age6 0.02 0.02 0.38* 0.38* | |
(0.16) (0.23) (0.16) (0.14) | |
65.age6 0.17 0.17 1.24*** 1.24*** | |
(0.15) (0.18) (0.15) (0.13) | |
female -0.39*** -0.39** -0.51*** -0.51*** | |
(0.09) (0.08) (0.09) (0.07) | |
1b.health 0.00 0.00 0.00 0.00 | |
(.) (.) (.) (.) | |
2.health -0.37** -0.37** -0.29** -0.29 | |
(0.13) (0.11) (0.11) (0.14) | |
3.health -0.48*** -0.48** -0.31* -0.31 | |
(0.14) (0.14) (0.14) (0.19) | |
4.health -0.55* -0.55** -0.45* -0.45* | |
(0.22) (0.15) (0.22) (0.20) | |
5.health -1.02* -1.02* -0.04 -0.04 | |
(0.45) (0.43) (0.40) (0.54) | |
lowinc -0.23 -0.23 -0.31* -0.31** | |
(0.13) (0.14) (0.13) (0.10) | |
1.pol3 -0.52*** -0.52** 0.14 0.14 | |
(0.12) (0.11) (0.11) (0.16) | |
2b.pol3 0.00 0.00 0.00 0.00 | |
(.) (.) (.) (.) | |
3.pol3 0.34** 0.34* 0.03 0.03 | |
(0.12) (0.13) (0.11) (0.12) | |
_cons 6.46*** 6.46*** 6.10*** 6.10*** | |
(0.18) (0.12) (0.15) (0.12) | |
-------------------------------------------------------------- | |
N 1942 1942 2079 2079 | |
rmse 2.03 2.03 2.12 2.12 | |
-------------------------------------------------------------- | |
Standard errors in parentheses | |
* p<0.05, ** p<0.01, *** p<0.001 | |
. | |
. /* What happened here:
>
> - We clustered the data by geographical region in each regression, which means
> that the standard errors now allow observations from the same region to be
> correlated. They will usually increase if respondents resemble each other
> within regions, indicating some macro-level effect, but they can also
> decrease, as several do here.
>
> - In this example, we assume that poorer and/or less populated regions will not
> benefit from the same health care facilities as others, which will create
> differences between predicted means of the DV clustered by region.
>
> - The results show that the clustered models lose some significant coefficients
> in comparison to the original ones (and gain significance on a few others),
> which should invite us to correct some of our initial interpretations, or
> consider more advanced modelling. Note that we have only 9 and 12 clusters,
> fewer than the number of coefficients: this is why Stata reports no F-test,
> and it makes the clustered standard errors themselves imprecise.
>
> - Robust (corrected) standard errors become crucial when the data form a panel,
> as with cross-sectional time-series (CSTS) data, because the observations are
> then country-years and variance will exist between and within them. */
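.
. /* One way to see how much regions matter (a sketch, not run in this log) is
> the intraclass correlation of the DV, shown here for France:
>
>     loneway hsat regionfr if cntry == "FR"
>
> An intraclass correlation close to zero means that respondents from the same
> region are barely more alike than respondents from different regions, in
> which case clustering should change the standard errors very little. */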
. | |
. | |
. * ======= | |
. * = END = | |
. * ======= | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done, and we have covered tons of stuff. Thanks for following! | |
. * exit, clear | |
. | |
end of do-file | |
. | |
. * Check setup. | |
. run setup/require estout fre scheme-burd spineplot | |
. | |
. * Log results. | |
. cap log using code/week12.log, replace | |
. | |
. /* ------------------------------------------ SRQM Session 12 ------------------ | |
> | |
> F. Briatte and I. Petev | |
> | |
> - TOPIC: Sexual Partners in the United States | |
> | |
> - DATA: U.S. General Social Survey (2010) | |
> | |
> What makes Americans likely to report high numbers of sexual partners in the | |
> last five years? What makes them more likely to report low numbers? | |
> | |
> For this session, all hypotheses are to be provided by the students. | |
> | |
> Last updated 2013-05-31. | |
> | |
> ----------------------------------------------------------------------------- */ | |
. | |
. * Load GSS dataset for selected survey year. | |
. use data/gss0012 if year == 2010, clear | |
(U.S. General Social Survey 2000-2012) | |
. | |
. * Inspect DV. | |
. fre partnrs5 | |
partnrs5 -- how many sex partners r had in last 5 years | |
------------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
----------------------------------+-------------------------------------------- | |
Valid 0 no partners | 261 12.77 14.40 14.40 | |
1 1 partner | 963 47.11 53.12 67.51 | |
2 2 partners | 175 8.56 9.65 77.16 | |
3 3 partners | 114 5.58 6.29 83.45 | |
4 4 partners | 77 3.77 4.25 87.70 | |
5 5-10 partners | 123 6.02 6.78 94.48 | |
6 11-20 partners | 40 1.96 2.21 96.69 | |
7 21-100 partners | 11 0.54 0.61 97.30 | |
8 more than 100 partners | 3 0.15 0.17 97.46 | |
9 1 or more, dk # | 46 2.25 2.54 100.00 | |
Total | 1813 88.70 100.00 | |
Missing .d | 2 0.10 | |
.i | 202 9.88 | |
.n | 27 1.32 | |
Total | 231 11.30 | |
Total | 2044 100.00 | |
------------------------------------------------------------------------------- | |
. | |
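. * Keep only valid observations, excluding respondents who reported "1 or more"
. * partners without giving a number (category 9). Missing values are excluded
. * too, because Stata treats them as larger than any number.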
. clonevar sxp = partnrs5 if partnrs5 < 9 | |
(277 missing values generated) | |
. | |
. * Code missing values for deeper inspection. | |
. gen missing = mi(sxp) | |
. | |
. * Generate six age groups (15-24, 25-34, ..., 65+). | |
. gen age6:age6 = irecode(age, 24, 34, 44, 54, 64, .) | |
(3 missing values generated) | |
. | |
. * Code the value as the lower bound of the age groups (the data buckets). | |
. replace age6 = 10 * age6 + 15 | |
(2041 real changes made) | |
. | |
. * Assign value labels. | |
. la def age6 15 "15-24" 25 "25-34" 35 "35-44" /// | |
> 45 "45-54" 55 "55-64" 65 "65+", replace | |
. la var age6 "Age groups" | |
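.
. /* A quick check of the recoding (a sketch, not run in this log): the minimum
> and maximum age within each group should fall inside its label.
>
>     tabstat age, by(age6) s(min max n)
> */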
. | |
. * Inspect missing values by age and sex. | |
. gr bar (count) age, over(missing) asyvars stack over(age6) over(sex) /// | |
> name(missing_agesex, replace) | |
. | |
. * Chi-squared test for age groups. | |
. bys sex: tab age6 missing, col nof chi2 | |
------------------------------------------------------------------------------------ | |
-> sex = male | |
| missing | |
Age groups | 0 1 | Total | |
-----------+----------------------+---------- | |
15-24 | 9.27 9.73 | 9.33 | |
25-34 | 18.40 7.08 | 16.97 | |
35-44 | 17.76 17.70 | 17.75 | |
45-54 | 19.69 24.78 | 20.34 | |
55-64 | 18.28 16.81 | 18.09 | |
65+ | 16.60 23.89 | 17.53 | |
-----------+----------------------+---------- | |
Total | 100.00 100.00 | 100.00 | |
Pearson chi2(5) = 11.8447 Pr = 0.037 | |
------------------------------------------------------------------------------------ | |
-> sex = female | |
| missing | |
Age groups | 0 1 | Total | |
-----------+----------------------+---------- | |
15-24 | 8.60 7.98 | 8.51 | |
25-34 | 20.14 16.56 | 19.64 | |
35-44 | 19.03 14.11 | 18.33 | |
45-54 | 17.21 14.72 | 16.85 | |
55-64 | 16.60 12.88 | 16.07 | |
65+ | 18.42 33.74 | 20.59 | |
-----------+----------------------+---------- | |
Total | 100.00 100.00 | 100.00 | |
Pearson chi2(5) = 20.4871 Pr = 0.001 | |
. | |
. * Proportions test for sex groups. | |
. prtest missing, by(sex) | |
Two-sample test of proportions male: Number of obs = 891 | |
female: Number of obs = 1153 | |
------------------------------------------------------------------------------ | |
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
male | .1268238 .0111484 .1049733 .1486743 | |
female | .1422376 .0102867 .1220761 .1623992 | |
-------------+---------------------------------------------------------------- | |
diff | -.0154138 .0151691 -.0451448 .0143171 | |
| under Ho: .0152674 -1.01 0.313 | |
------------------------------------------------------------------------------ | |
diff = prop(male) - prop(female) z = -1.0096 | |
Ho: diff = 0 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(Z < z) = 0.1563 Pr(|Z| < |z|) = 0.3127 Pr(Z > z) = 0.8437 | |
. | |
. * Comparison of average age between missing and nonmissing groups, by sex. | |
. bys sex: ttest age, by(missing) | |
------------------------------------------------------------------------------------ | |
-> sex = male | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
0 | 777 47.25097 .6029825 16.80797 46.0673 48.43464 | |
1 | 113 51.41593 1.717378 18.25598 48.01316 54.81869 | |
---------+-------------------------------------------------------------------- | |
combined | 890 47.77978 .5713296 17.0444 46.65846 48.90109 | |
---------+-------------------------------------------------------------------- | |
diff | -4.164964 1.711306 -7.523641 -.8062872 | |
------------------------------------------------------------------------------ | |
diff = mean(0) - mean(1) t = -2.4338 | |
Ho: diff = 0 degrees of freedom = 888 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.0076 Pr(|T| > |t|) = 0.0151 Pr(T > t) = 0.9924 | |
------------------------------------------------------------------------------------ | |
-> sex = female | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
0 | 988 47.26113 .5589403 17.56887 46.16429 48.35798 | |
1 | 163 53.26994 1.622315 20.71233 50.06633 56.47355 | |
---------+-------------------------------------------------------------------- | |
combined | 1151 48.11208 .5352406 18.15878 47.06192 49.16223 | |
---------+-------------------------------------------------------------------- | |
diff | -6.008805 1.525558 -9.001996 -3.015614 | |
------------------------------------------------------------------------------ | |
diff = mean(0) - mean(1) t = -3.9388 | |
Ho: diff = 0 degrees of freedom = 1149 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0001 Pr(T > t) = 1.0000 | |
. | |
. * Inspect DV by age. | |
. spineplot sxp age6, scheme(burd8) name(sp, replace) | |
. | |
. * Inspect DV by age, sex and interviewer's sex. | |
. gr bar sxp, over(sex) asyvars over(age6) by(intsex) /// | |
> name(dv_agesexint, replace) | |
. | |
. * Inspect IVs. | |
. fre sex age coninc educ marital wrkstat size, r(10) | |
sex -- respondents sex | |
-------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------+-------------------------------------------- | |
Valid 1 male | 891 43.59 43.59 43.59 | |
2 female | 1153 56.41 56.41 100.00 | |
Total | 2044 100.00 100.00 | |
-------------------------------------------------------------- | |
age -- age of respondent | |
-------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-----------------------+-------------------------------------------- | |
Valid 18 | 10 0.49 0.49 0.49 | |
19 | 24 1.17 1.18 1.67 | |
20 | 24 1.17 1.18 2.84 | |
21 | 35 1.71 1.71 4.56 | |
22 | 19 0.93 0.93 5.49 | |
: | : : : : | |
85 | 6 0.29 0.29 98.04 | |
86 | 7 0.34 0.34 98.38 | |
87 | 4 0.20 0.20 98.58 | |
88 | 9 0.44 0.44 99.02 | |
89 89 or older | 20 0.98 0.98 100.00 | |
Total | 2041 99.85 100.00 | |
Missing .n | 3 0.15 | |
Total | 2044 100.00 | |
-------------------------------------------------------------------- | |
coninc -- family income in constant dollars | |
---------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-------------------+-------------------------------------------- | |
Valid 401.5 | 43 2.10 2.38 2.38 | |
1606 | 24 1.17 1.33 3.71 | |
2810.5 | 17 0.83 0.94 4.65 | |
3613.5 | 8 0.39 0.44 5.10 | |
4416.5 | 19 0.93 1.05 6.15 | |
: | : : : : | |
66247.5 | 129 6.31 7.15 80.50 | |
80300 | 111 5.43 6.15 86.65 | |
96360 | 69 3.38 3.82 90.47 | |
112420 | 57 2.79 3.16 93.63 | |
152927.23 | 115 5.63 6.37 100.00 | |
Total | 1805 88.31 100.00 | |
Missing .i | 239 11.69 | |
Total | 2044 100.00 | |
---------------------------------------------------------------- | |
educ -- highest year of school completed | |
----------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------+-------------------------------------------- | |
Valid 0 | 5 0.24 0.25 0.25 | |
1 | 1 0.05 0.05 0.29 | |
2 | 5 0.24 0.25 0.54 | |
3 | 4 0.20 0.20 0.74 | |
4 | 9 0.44 0.44 1.18 | |
: | : : : : | |
16 | 334 16.34 16.38 86.41 | |
17 | 71 3.47 3.48 89.90 | |
18 | 101 4.94 4.95 94.85 | |
19 | 33 1.61 1.62 96.47 | |
20 | 72 3.52 3.53 100.00 | |
Total | 2039 99.76 100.00 | |
Missing .d | 1 0.05 | |
.n | 4 0.20 | |
Total | 5 0.24 | |
Total | 2044 100.00 | |
----------------------------------------------------------- | |
marital -- marital status | |
---------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
-------------------------+-------------------------------------------- | |
Valid 1 married | 891 43.59 43.61 43.61 | |
2 widowed | 181 8.86 8.86 52.47 | |
3 divorced | 341 16.68 16.69 69.16 | |
4 separated | 65 3.18 3.18 72.34 | |
5 never married | 565 27.64 27.66 100.00 | |
Total | 2043 99.95 100.00 | |
Missing .n | 1 0.05 | |
Total | 2044 100.00 | |
---------------------------------------------------------------------- | |
wrkstat -- labor force status | |
------------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
----------------------------+-------------------------------------------- | |
Valid 1 working fulltime | 917 44.86 44.93 44.93 | |
2 working parttime | 234 11.45 11.46 56.39 | |
3 temp not working | 33 1.61 1.62 58.01 | |
4 unempl, laid off | 145 7.09 7.10 65.12 | |
5 retired | 319 15.61 15.63 80.74 | |
6 school | 93 4.55 4.56 85.30 | |
7 keeping house | 235 11.50 11.51 96.82 | |
8 other | 65 3.18 3.18 100.00 | |
Total | 2041 99.85 100.00 | |
Missing .n | 3 0.15 | |
Total | 2044 100.00 | |
------------------------------------------------------------------------- | |
size -- size of place in 1000s | |
----------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
--------------+-------------------------------------------- | |
Valid 0 | 52 2.54 2.54 2.54 | |
1 | 62 3.03 3.03 5.58 | |
2 | 71 3.47 3.47 9.05 | |
3 | 80 3.91 3.91 12.96 | |
4 | 137 6.70 6.70 19.67 | |
: | : : : : | |
1518 | 16 0.78 0.78 95.40 | |
1954 | 6 0.29 0.29 95.69 | |
2896 | 21 1.03 1.03 96.72 | |
3695 | 19 0.93 0.93 97.65 | |
8008 | 48 2.35 2.35 100.00 | |
Total | 2044 100.00 100.00 | |
----------------------------------------------------------- | |
. | |
. * Drop missing values. | |
. drop if mi(sxp, age, coninc, educ, marital, wrkstat) | |
(439 observations deleted) | |
. | |
. * Drop ambiguous wrkstat category "Other". | |
. drop if wrkstat == 8 | |
(45 observations deleted) | |
. | |
. * Recode sex. | |
. gen female = (sex == 1) if !mi(sex) | |
. | |
. * Final sample size. | |
. count | |
1560 | |
. | |
. * Survey weights. | |
. svyset vpsu [weight = wtssall], strata (vstrat) | |
(sampling weights assumed) | |
pweight: wtssall | |
VCE: linearized | |
Single unit: missing | |
Strata 1: vstrat | |
SU 1: vpsu | |
FPC 1: <zero> | |
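* The svyset call above only declares the survey design; the models estimated
* later in this do-file are run without the svy prefix, so they do not use it.
* A minimal sketch, assuming the svyset declaration above is in memory, of
* design-based estimates for the DV and for the first regression model.
* With "Single unit: missing", standard errors may come out missing if some
* strata are left with a single sampling unit after the drops above.

```stata
* Design-based mean of the DV.
svy: mean sxp

* Design-based version of the simple regression of the DV on sex.
svy: regress sxp i.female
```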
. | |
. * Export summary stats. | |
. stab using week12_stats.txt, replace /// | |
> mean(coninc educ size) /// | |
> prop(age6 marital wrkstat) | |
(note: file week12_stats.txt not found) | |
Variable mean sd min max mea | |
> n sd min max mean sd min | |
> max mean sd min max mean | |
> sd min max mean sd min m | |
> ax mean sd min max mean sd | |
> min max | |
Age groups % % % % | |
> % % % % | |
marital status % % % % | |
> % % % % | |
labor force status % % % % | |
> % % % % | |
N = 15600 | |
File: week12_stats.txt | |
. | |
. | |
. * =================== | |
. * = DV DISTRIBUTION = | |
. * =================== | |
. | |
. | |
. * Explore the DV. | |
. fre sxp | |
sxp -- how many sex partners r had in last 5 years | |
------------------------------------------------------------------------------ | |
| Freq. Percent Valid Cum. | |
---------------------------------+-------------------------------------------- | |
Valid 0 no partners | 201 12.88 12.88 12.88 | |
1 1 partner | 856 54.87 54.87 67.76 | |
2 2 partners | 161 10.32 10.32 78.08 | |
3 3 partners | 106 6.79 6.79 84.87 | |
4 4 partners | 70 4.49 4.49 89.36 | |
5 5-10 partners | 116 7.44 7.44 96.79 | |
6 11-20 partners | 37 2.37 2.37 99.17 | |
7 21-100 partners | 10 0.64 0.64 99.81 | |
8 more than 100 partners | 3 0.19 0.19 100.00 | |
Total | 1560 100.00 100.00 | |
------------------------------------------------------------------------------ | |
. | |
. * Histogram for normality assessment. | |
. hist sxp, bin(10) percent addl norm /// | |
> name(dv_hist, replace) | |
(bin=10, start=0, width=.8) | |
. | |
. * Bivariate hypothesis test: mean DV by sex. | |
. ttest sxp, by(female) | |
Two-sample t test with equal variances | |
------------------------------------------------------------------------------ | |
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] | |
---------+-------------------------------------------------------------------- | |
0 | 861 1.513357 .0483794 1.419587 1.418401 1.608312 | |
1 | 699 1.958512 .065634 1.735272 1.829648 2.087376 | |
---------+-------------------------------------------------------------------- | |
combined | 1560 1.712821 .0401031 1.583944 1.634159 1.791482 | |
---------+-------------------------------------------------------------------- | |
diff | -.4451556 .0798758 -.6018309 -.2884803 | |
------------------------------------------------------------------------------ | |
diff = mean(0) - mean(1) t = -5.5731 | |
Ho: diff = 0 degrees of freedom = 1558 | |
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 | |
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000 | |
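* The test above assumes equal variances, while the two group standard
* deviations differ (1.419587 versus 1.735272). A minimal sketch of how that
* assumption could be checked and relaxed; this is a suggested extension, not
* part of the original do-file.

```stata
* Variance-comparison test across the two groups.
sdtest sxp, by(female)

* t-test without the equal-variance assumption (Welch's approximation).
ttest sxp, by(female) unequal welch
```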
. | |
. | |
. * ===================== | |
. * = REGRESSION MODELS = | |
. * ===================== | |
. | |
. | |
. * A simple linear regression model test. | |
. reg sxp i.female | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 1, 1558) = 31.06 | |
Model | 76.4503376 1 76.4503376 Prob > F = 0.0000 | |
Residual | 3834.89325 1558 2.46142057 R-squared = 0.0195 | |
-------------+------------------------------ Adj R-squared = 0.0189 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.5689 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4451556 .0798758 5.57 0.000 .2884803 .6018309 | |
_cons | 1.513357 .0534677 28.30 0.000 1.40848 1.618233 | |
------------------------------------------------------------------------------ | |
. | |
. * Let's add some of our control variables one by one. Let's first control for | |
. * income: is higher income associated with a higher number of partners? | |
. reg sxp i.female coninc | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 2, 1557) = 29.90 | |
Model | 144.654674 2 72.3273369 Prob > F = 0.0000 | |
Residual | 3766.68892 1557 2.41919648 R-squared = 0.0370 | |
-------------+------------------------------ Adj R-squared = 0.0357 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.5554 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4884352 .0796061 6.14 0.000 .3322887 .6445816 | |
coninc | -5.25e-06 9.89e-07 -5.31 0.000 -7.19e-06 -3.31e-06 | |
_cons | 1.743573 .0684809 25.46 0.000 1.609248 1.877898 | |
------------------------------------------------------------------------------ | |
. | |
. * Let's put income on a more meaningful scale: a one-dollar change in income is
. * too small to have a visible effect. Let's measure income in 10,000s of USD.
. gen inc = coninc / 10^4 | |
. | |
. * Regress again. | |
. reg sxp i.female inc | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 2, 1557) = 29.90 | |
Model | 144.654674 2 72.3273371 Prob > F = 0.0000 | |
Residual | 3766.68892 1557 2.41919648 R-squared = 0.0370 | |
-------------+------------------------------ Adj R-squared = 0.0357 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.5554 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4884352 .0796061 6.14 0.000 .3322887 .6445816 | |
inc | -.0525298 .0098932 -5.31 0.000 -.0719352 -.0331245 | |
_cons | 1.743573 .0684809 25.46 0.000 1.609248 1.877898 | |
------------------------------------------------------------------------------ | |
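* Dividing a predictor by 10^4 multiplies its coefficient and standard error
* by 10^4 and leaves the t statistic, R-squared and other coefficients as they
* were. A worked check using the rounded coninc estimates printed in the
* earlier model; results should match the inc row up to rounding.

```stata
* Rescaled coefficient and standard error of income.
display -5.25e-06 * 10^4
display 9.89e-07 * 10^4

* The ratio (the t statistic) is unaffected by the change of units.
display (-5.25e-06 * 10^4) / (9.89e-07 * 10^4)
```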
. | |
. * Let's control for education as well. | |
. reg sxp i.female inc educ | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 3, 1556) = 20.81 | |
Model | 150.858853 3 50.2862843 Prob > F = 0.0000 | |
Residual | 3760.48474 1556 2.41676397 R-squared = 0.0386 | |
-------------+------------------------------ Adj R-squared = 0.0367 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.5546 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4983867 .0798081 6.24 0.000 .3418439 .6549295 | |
inc | -.0602989 .0110131 -5.48 0.000 -.0819011 -.0386968 | |
educ | .0236624 .0147684 1.60 0.109 -.0053057 .0526304 | |
_cons | 1.449399 .195946 7.40 0.000 1.065053 1.833745 | |
------------------------------------------------------------------------------ | |
. | |
. * Let's control for urban size. | |
. reg sxp i.female inc educ size | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 4, 1555) = 17.99 | |
Model | 172.9803 4 43.245075 Prob > F = 0.0000 | |
Residual | 3738.36329 1555 2.40409215 R-squared = 0.0442 | |
-------------+------------------------------ Adj R-squared = 0.0418 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.5505 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4968538 .0796002 6.24 0.000 .3407187 .6529889 | |
inc | -.0609605 .0109864 -5.55 0.000 -.0825102 -.0394109 | |
educ | .0218699 .0147415 1.48 0.138 -.0070454 .0507851 | |
size | .0001062 .000035 3.03 0.002 .0000375 .0001748 | |
_cons | 1.444007 .1954397 7.39 0.000 1.060654 1.82736 | |
------------------------------------------------------------------------------ | |
. | |
. * How about working status? | |
. reg sxp i.female inc educ size i.wrkstat | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 10, 1549) = 16.98 | |
Model | 386.460207 10 38.6460207 Prob > F = 0.0000 | |
Residual | 3524.88338 1549 2.27558643 R-squared = 0.0988 | |
-------------+------------------------------ Adj R-squared = 0.0930 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.5085 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .473281 .0797203 5.94 0.000 .3169099 .629652 | |
inc | -.0637253 .0108897 -5.85 0.000 -.0850853 -.0423652 | |
educ | .0178777 .014525 1.23 0.219 -.010613 .0463683 | |
size | .000102 .0000341 2.99 0.003 .0000351 .000169 | |
| | |
wrkstat | | |
2 | -.2652899 .124414 -2.13 0.033 -.5093275 -.0212522 | |
3 | -.3174371 .2858415 -1.11 0.267 -.8781141 .24324 | |
4 | .2735554 .1528662 1.79 0.074 -.026291 .5734019 | |
5 | -.9783119 .1173846 -8.33 0.000 -1.208561 -.7480624 | |
6 | .3471192 .1837131 1.89 0.059 -.0132334 .7074719 | |
7 | -.373904 .1331162 -2.81 0.005 -.6350109 -.1127971 | |
| | |
_cons | 1.699544 .2026013 8.39 0.000 1.302142 2.096945 | |
------------------------------------------------------------------------------ | |
. fre wrkstat | |
wrkstat -- labor force status | |
------------------------------------------------------------------------ | |
| Freq. Percent Valid Cum. | |
---------------------------+-------------------------------------------- | |
Valid 1 working fulltime | 770 49.36 49.36 49.36 | |
2 working parttime | 186 11.92 11.92 61.28 | |
3 temp not working | 29 1.86 1.86 63.14 | |
4 unempl, laid off | 115 7.37 7.37 70.51 | |
5 retired | 214 13.72 13.72 84.23 | |
6 school | 76 4.87 4.87 89.10 | |
7 keeping house | 170 10.90 10.90 100.00 | |
Total | 1560 100.00 100.00 | |
------------------------------------------------------------------------ | |
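* Each wrkstat coefficient above compares one category to the omitted one
* (1, working fulltime), so the individual t-tests do not say whether labor
* force status matters as a whole. A minimal sketch of a joint test and of a
* different reference category; the choice of category 5 is only illustrative.

```stata
* Re-fit the model quietly, then test all wrkstat coefficients jointly.
quietly regress sxp i.female inc educ size i.wrkstat
testparm i.wrkstat

* Same model with "retired" (5) as the reference category.
regress sxp i.female inc educ size ib5.wrkstat
```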
. | |
. * Let's add a control for marital status. | |
. reg sxp i.female inc educ size i.wrkstat i.marital | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 14, 1545) = 29.71 | |
Model | 829.670175 14 59.2621554 Prob > F = 0.0000 | |
Residual | 3081.67341 1545 1.99461062 R-squared = 0.2121 | |
-------------+------------------------------ Adj R-squared = 0.2050 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.4123 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4165171 .0754064 5.52 0.000 .2686074 .5644268 | |
inc | -.0091845 .0110717 -0.83 0.407 -.0309016 .0125327 | |
educ | -.0104019 .0137466 -0.76 0.449 -.0373658 .016562 | |
size | .0000473 .0000323 1.46 0.143 -.0000161 .0001106 | |
| | |
wrkstat | | |
2 | -.2570435 .1167104 -2.20 0.028 -.485971 -.0281161 | |
3 | -.3358877 .268278 -1.25 0.211 -.8621152 .1903397 | |
4 | .1626127 .1433551 1.13 0.257 -.1185785 .4438038 | |
5 | -.6387619 .1158716 -5.51 0.000 -.8660441 -.4114796 | |
6 | -.0355131 .1753451 -0.20 0.840 -.3794527 .3084266 | |
7 | -.2376881 .1254588 -1.89 0.058 -.4837755 .0083994 | |
| | |
marital | | |
2 | -.1436741 .1535087 -0.94 0.349 -.4447816 .1574333 | |
3 | .6954988 .1079756 6.44 0.000 .4837047 .907293 | |
4 | .5578938 .2157927 2.59 0.010 .1346163 .9811713 | |
5 | 1.346379 .0951923 14.14 0.000 1.159659 1.533098 | |
| | |
_cons | 1.33734 .1949692 6.86 0.000 .9549077 1.719772 | |
------------------------------------------------------------------------------ | |
. fre marital | |
marital -- marital status | |
--------------------------------------------------------------------- | |
| Freq. Percent Valid Cum. | |
------------------------+-------------------------------------------- | |
Valid 1 married | 704 45.13 45.13 45.13 | |
2 widowed | 113 7.24 7.24 52.37 | |
3 divorced | 254 16.28 16.28 68.65 | |
4 separated | 47 3.01 3.01 71.67 | |
5 never married | 442 28.33 28.33 100.00 | |
Total | 1560 100.00 100.00 | |
--------------------------------------------------------------------- | |
. | |
. * Finally, let's control for age. | |
. reg sxp i.female inc educ size i.wrkstat i.marital age | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 15, 1544) = 41.57 | |
Model | 1125.14908 15 75.009939 Prob > F = 0.0000 | |
Residual | 2786.19451 1544 1.80453012 R-squared = 0.2877 | |
-------------+------------------------------ Adj R-squared = 0.2807 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.3433 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4744982 .0718664 6.60 0.000 .3335321 .6154644 | |
inc | .0000264 .0105555 0.00 0.998 -.0206783 .020731 | |
educ | -.0030519 .0130878 -0.23 0.816 -.0287236 .0226198 | |
size | .0000364 .0000307 1.18 0.236 -.0000239 .0000966 | |
| | |
wrkstat | | |
2 | -.1573256 .1112833 -1.41 0.158 -.3756079 .0609567 | |
3 | -.1842932 .2554498 -0.72 0.471 -.6853584 .316772 | |
4 | .2808993 .1366664 2.06 0.040 .0128279 .5489708 | |
5 | .1310702 .1255631 1.04 0.297 -.115222 .3773624 | |
6 | -.3849779 .1690023 -2.28 0.023 -.7164761 -.0534797 | |
7 | -.0950985 .1198503 -0.79 0.428 -.3301852 .1399881 | |
| | |
marital | | |
2 | .4698645 .153682 3.06 0.002 .168417 .7713121 | |
3 | .8262775 .1032092 8.01 0.000 .6238326 1.028722 | |
4 | .5124794 .2052838 2.50 0.013 .1098149 .9151439 | |
5 | .8806497 .0975843 9.02 0.000 .689238 1.072061 | |
| | |
age | -.0378645 .002959 -12.80 0.000 -.0436687 -.0320603 | |
_cons | 2.877276 .2210723 13.02 0.000 2.443643 3.31091 | |
------------------------------------------------------------------------------ | |
. | |
. | |
. * Reinterpretation of the constant | |
. * -------------------------------- | |
. | |
. * Lastly, the constant reflects the value of y when the IVs are equal to the
. * reference category for the categorical IVs (i.e., males, full-time employment,
. * married) or 0 for the continuous IVs (income = 0, education = 0, age = 0,
. * size = 0). However, for continuous variables, a value of 0 is often unlikely
. * (educ = 0 and income = 0) or impossible (age = 0), as is the case here.
. * The constant is then neither meaningful nor interpretable. In such cases, it's
. * best to recode your continuous IVs so that their mean is equal to 0, making
. * the reference point for the constant the sample mean of each continuous IV.
. * To do so, we simply need to subtract from each variable its mean.
. | |
. su inc | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
inc | 1560 4.751758 4.002815 .04015 15.29272 | |
. gen zinc = inc - r(mean) | |
. | |
. su size | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
size | 1560 319.8955 1123.792 0 8008 | |
. gen zsize = size - r(mean) | |
. | |
. su age | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
age | 1560 46.68269 16.85957 18 89 | |
. gen zage = age - r(mean) | |
. | |
. su educ | |
Variable | Obs Mean Std. Dev. Min Max | |
-------------+-------------------------------------------------------- | |
educ | 1560 13.80385 2.970237 2 20 | |
. gen zeduc = educ - r(mean) | |
. | |
. * Replicate the final regression model with transformed continuous variables. | |
. reg sxp i.female zinc zeduc zsize i.wrkstat i.marital zage | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 15, 1544) = 41.57 | |
Model | 1125.14908 15 75.0099388 Prob > F = 0.0000 | |
Residual | 2786.19451 1544 1.80453012 R-squared = 0.2877 | |
-------------+------------------------------ Adj R-squared = 0.2807 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.3433 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4744982 .0718664 6.60 0.000 .3335321 .6154644 | |
zinc | .0000264 .0105555 0.00 0.998 -.0206783 .020731 | |
zeduc | -.0030519 .0130878 -0.23 0.816 -.0287236 .0226198 | |
zsize | .0000364 .0000307 1.18 0.236 -.0000239 .0000966 | |
| | |
wrkstat | | |
2 | -.1573256 .1112833 -1.41 0.158 -.3756079 .0609567 | |
3 | -.1842932 .2554498 -0.72 0.471 -.6853584 .316772 | |
4 | .2808993 .1366664 2.06 0.040 .0128279 .5489708 | |
5 | .1310702 .1255631 1.04 0.297 -.115222 .3773624 | |
6 | -.3849779 .1690023 -2.28 0.023 -.7164761 -.0534797 | |
7 | -.0950985 .1198503 -0.79 0.428 -.3301852 .1399881 | |
| | |
marital | | |
2 | .4698645 .153682 3.06 0.002 .168417 .7713121 | |
3 | .8262775 .1032092 8.01 0.000 .6238326 1.028722 | |
4 | .5124794 .2052838 2.50 0.013 .1098149 .9151439 | |
5 | .8806497 .0975843 9.02 0.000 .689238 1.072061 | |
| | |
zage | -.0378645 .002959 -12.80 0.000 -.0436687 -.0320603 | |
_cons | 1.079297 .0741387 14.56 0.000 .9338733 1.22472 | |
------------------------------------------------------------------------------ | |
. | |
. * The results do not change except for the constant. For this model, the constant
. * stands for the predicted value of the DV among respondents who are:
. * - Male (female = 0)
. * - With average income (zinc = 0)
. * - With average education (...)
. * - Living in a place of average size
. * - Employed full-time
. * - Married
. * - Of average age
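* The centering steps can also be written as a loop, and the new constant can
* be recovered by hand: it equals the uncentered constant plus each continuous
* coefficient times the mean of its variable. A minimal sketch for a do-file;
* the c_ prefix is a hypothetical naming choice, and the numbers are the
* estimates and means printed in this log.

```stata
* Center each continuous predictor at its sample mean.
foreach v of varlist inc educ size age {
    quietly summarize `v', meanonly
    generate c_`v' = `v' - r(mean)
}

* Uncentered constant + sum of (coefficient x mean); this should come out
* close to the constant of the centered model, up to rounding.
display 2.877276 + .0000264*4.751758 - .0030519*13.80385 ///
    + .0000364*319.8955 - .0378645*46.68269
```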
. | |
. | |
. * Standardized coefficients | |
. * ------------------------- | |
. | |
. * Model with metric coefficients (in units of each variable). | |
. reg sxp i.female zinc zeduc zsize i.wrkstat i.marital zage | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 15, 1544) = 41.57 | |
Model | 1125.14908 15 75.0099388 Prob > F = 0.0000 | |
Residual | 2786.19451 1544 1.80453012 R-squared = 0.2877 | |
-------------+------------------------------ Adj R-squared = 0.2807 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.3433 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4744982 .0718664 6.60 0.000 .3335321 .6154644 | |
zinc | .0000264 .0105555 0.00 0.998 -.0206783 .020731 | |
zeduc | -.0030519 .0130878 -0.23 0.816 -.0287236 .0226198 | |
zsize | .0000364 .0000307 1.18 0.236 -.0000239 .0000966 | |
| | |
wrkstat | | |
2 | -.1573256 .1112833 -1.41 0.158 -.3756079 .0609567 | |
3 | -.1842932 .2554498 -0.72 0.471 -.6853584 .316772 | |
4 | .2808993 .1366664 2.06 0.040 .0128279 .5489708 | |
5 | .1310702 .1255631 1.04 0.297 -.115222 .3773624 | |
6 | -.3849779 .1690023 -2.28 0.023 -.7164761 -.0534797 | |
7 | -.0950985 .1198503 -0.79 0.428 -.3301852 .1399881 | |
| | |
marital | | |
2 | .4698645 .153682 3.06 0.002 .168417 .7713121 | |
3 | .8262775 .1032092 8.01 0.000 .6238326 1.028722 | |
4 | .5124794 .2052838 2.50 0.013 .1098149 .9151439 | |
5 | .8806497 .0975843 9.02 0.000 .689238 1.072061 | |
| | |
zage | -.0378645 .002959 -12.80 0.000 -.0436687 -.0320603 | |
_cons | 1.079297 .0741387 14.56 0.000 .9338733 1.22472 | |
------------------------------------------------------------------------------ | |
. | |
. * Same model with standardized (beta) coefficients, in standard deviation units.
. reg sxp i.female zinc zeduc zsize i.wrkstat i.marital zage, b | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 15, 1544) = 41.57 | |
Model | 1125.14908 15 75.0099388 Prob > F = 0.0000 | |
Residual | 2786.19451 1544 1.80453012 R-squared = 0.2877 | |
-------------+------------------------------ Adj R-squared = 0.2807 | |
Total | 3911.34359 1559 2.50887979 Root MSE = 1.3433 | |
------------------------------------------------------------------------------ | |
sxp | Coef. Std. Err. t P>|t| Beta | |
-------------+---------------------------------------------------------------- | |
1.female | .4744982 .0718664 6.60 0.000 .1490217 | |
zinc | .0000264 .0105555 0.00 0.998 .0000666 | |
zeduc | -.0030519 .0130878 -0.23 0.816 -.005723 | |
zsize | .0000364 .0000307 1.18 0.236 .0258157 | |
| | |
wrkstat | | |
2 | -.1573256 .1112833 -1.41 0.158 -.0321976 | |
3 | -.1842932 .2554498 -0.72 0.471 -.0157207 | |
4 | .2808993 .1366664 2.06 0.040 .0463562 | |
5 | .1310702 .1255631 1.04 0.297 .0284779 | |
6 | -.3849779 .1690023 -2.28 0.023 -.0523401 | |
7 | -.0950985 .1198503 -0.79 0.428 -.0187146 | |
| | |
marital | | |
2 | .4698645 .153682 3.06 0.002 .0769167 | |
3 | .8262775 .1032092 8.01 0.000 .1926589 | |
4 | .5124794 .2052838 2.50 0.013 .0553248 | |
5 | .8806497 .0975843 9.02 0.000 .2506167 | |
| | |
zage | -.0378645 .002959 -12.80 0.000 -.4030311 | |
_cons | 1.079297 .0741387 14.56 0.000 . | |
------------------------------------------------------------------------------ | |
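* A standardized coefficient is the metric coefficient multiplied by the
* standard deviation of the predictor and divided by the standard deviation
* of the DV. A worked check using values printed in this log: the standard
* deviations come from the summarize and ttest output, and the group sizes
* (699 and 861) from the ttest table. Results should match the Beta column up
* to rounding.

```stata
* Age: b * sd(age) / sd(sxp).
display -.0378645 * 16.85957 / 1.583944

* Female: the standard deviation of a dummy follows from its two group sizes.
display .4744982 * sqrt((699/1560) * (861/1560) * 1560/1559) / 1.583944
```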
. | |
. | |
. * Residuals | |
. * --------- | |
. | |
. * Get residuals. | |
. predict r, resid | |
. | |
. * Distribution of the residuals. | |
. kdensity r, norm | |
. | |
. * Residuals-versus-fitted values plot. | |
. rvfplot | |
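* The two plots above can be complemented with formal checks. A minimal
* sketch, assuming the residuals are stored in r as above; these commands are
* suggested additions, not part of the original do-file, and the graph name
* qq_resid is hypothetical.

```stata
* Re-fit the final model quietly so postestimation commands refer to it.
quietly regress sxp i.female zinc zeduc zsize i.wrkstat i.marital zage

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity.
estat hettest

* Variance inflation factors, to check for multicollinearity.
estat vif

* Normal quantile plot of the residuals.
qnorm r, name(qq_resid, replace)
```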
. | |
. | |
. * Extensions | |
. * ---------- | |
. | |
. * Recode the partnrs5 categories to approximate partner counts.
. recode partnrs5 (0 = 0) (1 = 1) (2 = 2) (3 = 3) (4 = 4) ///
>     (5 = 8) (6 = 15) (7 = 60) (8 = 120) (else = .), gen(sxp_count)
(166 differences between partnrs5 and sxp_count) | |
. | |
. * Multiple linear regression. | |
. eststo LIN: reg sxp_count i.female inc educ size i.wrkstat i.marital age | |
Source | SS df MS Number of obs = 1560 | |
-------------+------------------------------ F( 15, 1544) = 8.80 | |
Model | 6862.01082 15 457.467388 Prob > F = 0.0000 | |
Residual | 80250.7578 1544 51.9758794 R-squared = 0.0788 | |
-------------+------------------------------ Adj R-squared = 0.0698 | |
Total | 87112.7686 1559 55.8773371 Root MSE = 7.2094 | |
------------------------------------------------------------------------------ | |
sxp_count | Coef. Std. Err. t P>|t| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | 1.244053 .3856958 3.23 0.001 .4875102 2.000596 | |
inc | -.0379668 .0566498 -0.67 0.503 -.1490854 .0731519 | |
educ | -.1309148 .0702401 -1.86 0.063 -.2686908 .0068612 | |
size | .0000231 .0001649 0.14 0.889 -.0003003 .0003465 | |
| | |
wrkstat | | |
2 | -.9158042 .5972397 -1.53 0.125 -2.087291 .2556825 | |
3 | -1.379695 1.370959 -1.01 0.314 -4.068833 1.309443 | |
4 | .6738155 .7334673 0.92 0.358 -.7648818 2.112513 | |
5 | .1258675 .6738774 0.19 0.852 -1.195944 1.447679 | |
6 | -1.345912 .9070085 -1.48 0.138 -3.125011 .4331864 | |
7 | -1.048245 .6432179 -1.63 0.103 -2.309918 .2134281 | |
| | |
marital | | |
2 | 1.339347 .8247873 1.62 0.105 -.278475 2.957168 | |
3 | .9072996 .5539073 1.64 0.102 -.1791905 1.99379 | |
4 | 2.18047 1.101726 1.98 0.048 .0194335 4.341507 | |
5 | 1.998603 .5237195 3.82 0.000 .971326 3.025879 | |
| | |
age | -.0861799 .0158807 -5.43 0.000 -.1173299 -.0550299 | |
_cons | 7.521305 1.18646 6.34 0.000 5.194062 9.848548 | |
------------------------------------------------------------------------------ | |
. | |
. * Negative binomial regression (for count data). | |
. eststo NBR: nbreg sxp_count i.female inc educ size i.wrkstat i.marital age | |
Fitting Poisson model: | |
Iteration 0: log likelihood = -4765.8027 | |
Iteration 1: log likelihood = -4765.5101 | |
Iteration 2: log likelihood = -4765.51 | |
Fitting constant-only model: | |
Iteration 0: log likelihood = -3370.3246 | |
Iteration 1: log likelihood = -3360.0198 | |
Iteration 2: log likelihood = -3360.0186 | |
Iteration 3: log likelihood = -3360.0186 | |
Fitting full model: | |
Iteration 0: log likelihood = -3119.4143 | |
Iteration 1: log likelihood = -3092.2597 | |
Iteration 2: log likelihood = -3010.6521 | |
Iteration 3: log likelihood = -3009.4699 | |
Iteration 4: log likelihood = -3009.4695 | |
Iteration 5: log likelihood = -3009.4695 | |
Negative binomial regression Number of obs = 1560 | |
LR chi2(15) = 701.10 | |
Dispersion = mean Prob > chi2 = 0.0000 | |
Log likelihood = -3009.4695 Pseudo R2 = 0.1043 | |
------------------------------------------------------------------------------ | |
sxp_count | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
-------------+---------------------------------------------------------------- | |
1.female | .4677508 .0593528 7.88 0.000 .3514216 .5840801 | |
inc | -.0098874 .00899 -1.10 0.271 -.0275075 .0077327 | |
educ | -.0382383 .0114983 -3.33 0.001 -.0607746 -.015702 | |
size | .0000109 .000025 0.44 0.661 -.000038 .0000598 | |
| | |
wrkstat | | |
2 | -.2188368 .0960093 -2.28 0.023 -.4070116 -.0306621 | |
3 | -.3174595 .2301175 -1.38 0.168 -.7684815 .1335625 | |
4 | .2093893 .1045834 2.00 0.045 .0044096 .4143691 | |
5 | .0796039 .1174751 0.68 0.498 -.1506432 .3098509 | |
6 | -.3848193 .1280587 -3.01 0.003 -.6358097 -.1338289 | |
7 | -.4379 .1063386 -4.12 0.000 -.6463198 -.2294802 | |
| | |
marital | | |
2 | .2248589 .1511692 1.49 0.137 -.0714272 .5211451 | |
3 | .5367478 .0863664 6.21 0.000 .3674727 .7060229 | |
4 | .6616965 .1558389 4.25 0.000 .3562578 .9671352 | |
5 | .574297 .0771317 7.45 0.000 .4231217 .7254723 | |
| | |
age | -.0375803 .0026774 -14.04 0.000 -.0428278 -.0323327 | |
_cons | 2.572448 .1841411 13.97 0.000 2.211538 2.933358 | |
-------------+---------------------------------------------------------------- | |
/lnalpha | -.3374227 .050378 -.4361617 -.2386838 | |
-------------+---------------------------------------------------------------- | |
alpha | .7136071 .0359501 .6465132 .7876639 | |
------------------------------------------------------------------------------ | |
Likelihood-ratio test of alpha=0: chibar2(01) = 3512.08 Prob>=chibar2 = 0.000 | |
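* The negative binomial coefficients above are on the log scale, so they are
* easier to read once exponentiated into incidence-rate ratios. A minimal
* sketch, assuming the nbreg results are still the active estimates. The
* ratio for 1.female is roughly 1.6, meaning an expected count about 60
* percent higher for that group, other things equal.

```stata
* Replay the last model with incidence-rate ratios instead of coefficients.
nbreg, irr

* By hand for 1.female, from the coefficient printed above.
display exp(.4677508)
```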
. | |
. * Compare models. | |
. esttab LIN NBR, b(1) wide compress mti("Lin. reg." "Neg. bin.") | |
-------------------------------------------------------- | |
(1) (2) | |
Lin. reg. Neg. bin. | |
-------------------------------------------------------- | |
main | |
0b.female 0.0 (.) 0.0 (.) | |
1.female 1.2** (3.23) 0.5*** (7.88) | |
inc -0.0 (-0.67) -0.0 (-1.10) | |
educ -0.1 (-1.86) -0.0*** (-3.33) | |
size 0.0 (0.14) 0.0 (0.44) | |
1b.wrkstat 0.0 (.) 0.0 (.) | |
2.wrkstat -0.9 (-1.53) -0.2* (-2.28) | |
3.wrkstat -1.4 (-1.01) -0.3 (-1.38) | |
4.wrkstat 0.7 (0.92) 0.2* (2.00) | |
5.wrkstat 0.1 (0.19) 0.1 (0.68) | |
6.wrkstat -1.3 (-1.48) -0.4** (-3.01) | |
7.wrkstat -1.0 (-1.63) -0.4*** (-4.12) | |
1b.marital 0.0 (.) 0.0 (.) | |
2.marital 1.3 (1.62) 0.2 (1.49) | |
3.marital 0.9 (1.64) 0.5*** (6.21) | |
4.marital 2.2* (1.98) 0.7*** (4.25) | |
5.marital 2.0*** (3.82) 0.6*** (7.45) | |
age -0.1*** (-5.43) -0.0*** (-14.04) | |
_cons 7.5*** (6.34) 2.6*** (13.97) | |
-------------------------------------------------------- | |
lnalpha | |
_cons -0.3*** (-6.70) | |
-------------------------------------------------------- | |
N 1560 1560 | |
-------------------------------------------------------- | |
t statistics in parentheses | |
* p<0.05, ** p<0.01, *** p<0.001 | |
. | |
. * Export in wide format. | |
. esttab LIN NBR using week12_regressions.txt, /// | |
> b(1) wide compress mti("Lin. reg." "Neg. bin.") | |
(output written to week12_regressions.txt) | |
. | |
. | |
. * ======== | |
. * = EXIT = | |
. * ======== | |
. | |
. | |
. * Close log (if opened). | |
. cap log close | |
. | |
. * We are done. Just quit the application, have a nice week, and see you soon :) | |
. * exit, clear | |
. | |
end of do-file |