briatte · December 21, 2015 05:29
diff --git a/demo.log b/demo.log
 ------------------------------------------------------------------------------------
      name:  srqm_demo
       log:  /Users/fr/Documents/Teaching/SRQM/demo.log
  log type:  text
 opened on:  17 Aug 2013, 18:28:28

 . 
 . * Check setup. This line appears in every course do-file. It makes sure that
 . * you have the appropriate files and packages to successfully run the code.
 . run setup/require fre lookfor_all

 . 
 . /* ------------------------------------------ SRQM Session 1 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - Hi! Welcome to your first SRQM do-file.
 > 
 >  - You are probably viewing this file from the Stata do-file editor, after
 >    opening it with the -doedit code/week1- command. If so, you are
 >    doing it right: congratulations.
 >    
 >  - You will be reading through your first do-file in just a minute. It is
 >    essential that you read through each week's do-file to become familiar
 >    with Stata commands.
 > 
 >  - We will start exploring do-files in class, and you get to finish them on your
 >    own as homework, along with reading one chapter from the course handbook and
 >    a few sections from the Stata Guide. These tasks complement each other.
 >    
 >  - Everything that you learn from the course do-files will be put to use in your
 >    research project. Practice with Stata by trying out commands as you learn
 >    them. If things do not work out, try again after checking the command syntax.
 > 
 >    Last updated 2013-05-29.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . 
 . * Comments
 . * --------
 . 
 . * This line is a comment due to the '*' symbol at its beginning. It takes a
 . * green colour in the Stata do-file editor. This do-file is fully commented
 . * to guide you through the basics. In your own code, you should also use
 . * comments to document and section your operations.
 . 
 . // note: lines or chunks of code that start with '//' are also comments, ...
 . 
 . /* and blocks of code that start with that symbol
 >    and end with the reverse one are also comments */
 . 
 . // ... and Stata helps you to detect comments by coloring them in green.
 . 
 . * When you see the words 'uncomment to run', it means 'remove the comment to run
 . * the code'. Remove the asterisk and trailing space on the next line, then run 
 . * it by copy-pasting it into the Command window and pressing Enter:
 . 
 . * di "Hello world."
 . 
 . * When I cite a Stata command in the comments, I cite it -between dashes-, but
 . * the dashes are not part of the command. They are just here to delimit where
 . * the command starts and where it stops.
 . 
 . 
 . * Practice
 . * --------
 . 
 . * Your mission for next week is to replicate this do-file. That means running
 . * it in full, reading the comments along as you execute its commands. Use the
 . * course slides to learn about running do-files and read from the Stata Guide
 . * to understand the commands used.
 . 
 . * There is no substitute for practice to learn statistical software. Code is
 . * like music, you will recognize the tune and notation if you listen to it.
 . * When you learn to code, you learn to play, either for yourself or for the
 . * audience of your programming language. For Stata, the audience is a pretty
 . * wide range of people and institutions.
 . 
 . 
 . * Interface
 . * ---------
 . 
 . * Quickly review the Stata windows. The Command window is where you will enter
 . * all commands, the results of which will show in the Results window. Your
 . * past commands will also show in the Review window. Finally, the Variables
 . * window should be empty at this stage, because no dataset is currently loaded
 . * in Stata. More windows will be opened as we go on.
 . 
 . * Note that we will use windows but not, as you are used to, menus. The menu
 . * interface in Stata offers point-and-click accessibility but is not suited
 . * for programming purposes. Instead, everything we do will be command-based.
 . 
 . 
 . * ====================
 . * = WARM-UP EXERCISE =
 . * ====================
 . 
 . 
 . * Type or copy and paste the following line to the Command window:
 . pwd
 /Users/fr/Documents/Teaching/SRQM

 . 
 . * The previous command returns the path to your working directory. It prints
 . * its output to the Results window, and the command is stored in history as
 . * shown in the Review window.
 . 
 . * Now, load a sample Stata dataset that is included with the software:
 . sysuse lifeexp, clear
 (Life expectancy, 1998)

 . 
 . * The previous command loads data in the background. You can access the data
 . * with the following command. Close the window after taking a look.
 . browse

 . 
 . * Back to the main window, the Variables window shows the list of variables.
 . * We are going to use two of them to build a plot. Type the following:
 . scatter lexp safewater

 . 
 . * This command creates your first Stata graph. Close the Graph window when
 . * you are done inspecting the graph. Finally, type the following command after
 . * uncommenting it (remove the asterisk and trailing space):
 . 
 . * doedit example
 . 
 . * The previous command creates an empty do-file called 'example.do'. The file
 . * is located in your working directory, which should be the SRQM folder.
 . * Stata has also opened the file in the Do-file Editor so that you can edit it
 . * from its programming interface. Copy and paste the following four lines into
 . * that empty do-file window:
 . 
 . // Example do-file.
 . sysuse lifeexp, clear
 (Life expectancy, 1998)

 . sc lexp safewater

 . clear

 . 
 . * Notice that the syntax used for the -scatter- command is different because
 . * it has been abbreviated to -sc-. The first line is a comment that uses an
 . * alternative way to tell Stata that the line is a comment. Save and close
 . * the do-file window when you have copied the full code to it.
 . 
 . * The do-file can now be run with the following command (uncomment to run):
 . * do example
 . 
 . * The do-file can now be erased with the following command (uncomment to run;
 . * on Windows, use -erase- in place of -rm-):
 . * rm example.do
 . 
 . * These commands quickly show you how we are going to use the software: by
 . * running (executing) code from Stata do-files, so that you can write your
 . * own do-file for your research projects.
 . 
 . 
 . * ============
 . * = COMMANDS =
 . * ============
 . 
 . 
 . * Tip (1): Get to learn some syntax
 . * ---------------------------------
 . 
 . * Most Stata commands share an identical syntax that calls one or several
 . * variables as the main argument:
 . 
 . *   command variable
 . 
 . * Most Stata commands will also allow one or more options after a comma.
 . * Optional arguments are shown in brackets in the Stata help pages:
 . 
 . *   command variable [, options]
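 . 
 . * A minimal sketch of both forms, assuming the -lifeexp- dataset from the
 . * warm-up exercise is loaded again. The second line calls -su- (summarize) on
 . * one variable, and the third adds the -detail- option after a comma
 . * (uncomment to run):
 . 
 . * sysuse lifeexp, clear
 . * su lexp
 . * su lexp, detail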
 . 
 . 
 . * Tip (2): Run all lines in sequential order
 . * ------------------------------------------
 . 
 . * You need to execute the lines of a do-file in sequence to avoid execution
 . * errors. The example below illustrates the point:
 . 
 . clear

 . set obs 100
 obs was 0, now 100

 . gen test = 1

 . ren test x // This line will not run if you do not run the previous ones first.

 .            // The command intends to rename the 'test' variable, but 'test' does
 .            // not exist unless you create it first by running the previous line.
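 . 
 . * A minimal sketch of a safer variant, assuming you are unsure whether 'test'
 . * exists: -confirm variable- checks for it, and the rename only runs when the
 . * check succeeds, i.e. when the return code _rc is zero. The -cap- prefix is
 . * explained later in this do-file (uncomment and run both lines together):
 . 
 . * cap confirm variable test
 . * if _rc == 0 ren test x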
 . 
 . 
 . * Tip (3): Keyboard shortcuts for Mac / Win
 . * -----------------------------------------
 . 
 . /* Mac:
 > 
 >    - Cmd-L (Ctrl-L) selects a whole line
 >    - Shift + Up/Down arrows selects or deselects neighbouring lines
 >    - Cmd-Shift-D (Ctrl-D) executes the selection
 >    - Cmd-` (Alt-Tab) switches between application windows
 > 
 >    Cmd is the 'Command' key. The ` ('back accent') key might be hard to
 >    find on non-QWERTY keyboards, so check if you see it on your system.
 > 
 >    Win:
 > 
 >    - Ctrl-L selects a whole line
 >    - Shift + Up/Down arrows selects or deselects neighbouring lines
 >    - Ctrl-D executes the selection
 >    - Alt-Tab switches between application windows */
 .    
 . * Do not confuse Mac and Win keyboard shortcuts, or you might execute the whole
 . * do-file by mistake! If that happens, or if you get lost while replicating a 
 . * do-file, the safest option is to run it again from the top. To do that, make
 . * your life easier with keyboard shortcuts: select the line where you want to
 . * start again by pressing Cmd-L (Win: Ctrl-L), then press Cmd-Shift-UpArrow
 . * (Win: Ctrl-Shift-UpArrow), and finally press Cmd-Shift-D (Win: Ctrl-D) to run
 . * the code again down to your initial line.
 . 
 . * Yes, all this takes a bit of practice. Think of it as music: learning to read
 . * and write code is like learning to read and write music sheets, and learning 
 . * to type and run code is like learning a bit of piano.
 . 
 . 
 . * Tip (4): Command navigation
 . * ---------------------------
 . 
 . * You can navigate through past commands from the Command window by using the
 . * PageUp and PageDown keys. Try running the following command after taking out
 . * the asterisk at the beginning of the line:
 . 
 . * memory6
 . 
 . * You should get an error: the right command is -memory- without the final '6'.
 . * To quickly correct your mistake, press PageUp and Stata will print the command
 . * again to your Command window, allowing you to quickly correct the syntax of
 . * your command and try it again without the final '6'.
 . 
 . 
 . * Tip (5): Run multiple lines together
 . * ------------------------------------
 . 
 . * When you see '///' at the end of a line, you have to select the next line too
 . * and execute the lines together from the do-file: copy-pasting to the Command
 . * window will not work. Use Ctrl-L (Win) or Cmd-L (Mac) and Shift+DownArrow to
 . * select the lines, then run them with Ctrl-D (Win) or Cmd-Shift-D (Mac).
 . 
 . di "This is a test. Select this line, " ///
 >     "and this line too, " _n ///
 >     "and this line too. Now, execute from the keyboard. Well done :)"
 This is a test. Select this line, and this line too, 
 and this line too. Now, execute from the keyboard. Well done :)

 . 
 . * You will have to do the same for code loops, such as 'foreach {}' loops.
 . * You will usually be warned beforehand in the comments. Finally, note that
 . * these multiple-line commands do *not* work if you copy-paste from the do-file
 . * to the Command window. This is why I recommend that you learn keyboard
 . * shortcuts quickly, so as to minimize issues with code execution and focus on
 . * the rest.
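 . 
 . * A minimal sketch of such a loop, assuming the -lifeexp- dataset is loaded
 . * with -sysuse lifeexp, clear-. It summarizes two variables in turn. Uncomment
 . * the three lines, select them together and execute them from the do-file:
 . 
 . * foreach v in lexp safewater {
 . *     su `v'
 . * }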
 . 
 . 
 . * =========
 . * = SETUP =
 . * =========
 . 
 . 
 . * The following steps teach you about setting up Stata on any computer. Start
 . * by making sure that you have nothing stored in Stata memory by wiping off
 . * any data in memory with the -clear- command:
 . clear

 . 
 . * The settings covered in this section of the do-file can be taken care of by
 . * a setup utility written for the course. Please turn to the README file of the
 . * SRQM folder for instructions, or follow the procedure in our first classes.
 . 
 . 
 . * (1) Memory
 . * ----------
 . 
 . * Skip this section if you are running Stata 12+.
 . 
 . * Your first step with Stata consists in allocating enough memory to it. The
 . * default amount of memory that Stata loads at startup is too small to open
 . * large datasets: if you forget to set memory, Stata will reply with an error
 . * message. The basic command to allocate 500MB memory follows:
 . set mem 500m
 set memory ignored.
    Memory no longer needs to be set in modern Statas; memory adjustments are
    performed on the fly automatically.

 . 
 . * You need to repeat that command every time you run Stata. The command works
 . * only if Stata has no data in storage: if you already have a dataset opened,
 . * then Stata will reply with an error message. Fortunately, if you are running
 . * Stata from your own computer, you can set memory permanently:
 . set mem 500m, perm
 set memory ignored.
    Memory no longer needs to be set in modern Statas; memory adjustments are
    performed on the fly automatically.

 . 
 . * There is more to learn about memory size and default settings in Stata, but
 . * for the purpose of this course, this will largely suffice. Furthermore, if
 . * you are running Stata 12 or later, you are spared from setting memory
 . * yourself: Stata will do it automatically.
 . 
 . 
 . * (2) Screen breaks
 . * -----------------
 . 
 . * By default, Stata uses screen breaks. If you forget to disable those, the
 . * 'Results' window will nag you with useless 'more' prompts and you will have
 . * to scroll results manually. Save yourself the hassle by disabling them:
 . set more off

 . 
 . * In fact, let's try to disable them permanently on your computer:
 . set more off, perm
 (set more preference recorded)

 . 
 . 
 . * (3) Additional commands
 . * -----------------------
 . 
 . * Stata can be extended by installing packages, just like you would install a
 . * plugin or an extension for other software. The packages add new commands or
 . * graph schemes to Stata.
 . 
 . * Make sure that you are connected to the Internet before continuing, so that
 . * Stata can connect to the SSC archive and to other online sources. If you are
 . * using a Sciences Po workstation, you will also need to uncomment and run the
 . * following command to avoid an issue with admin privileges:
 . 
 . * sysdir set PLUS "c:\temp"
 . 
 . * This course makes heavy use of the -fre- command to view frequencies. The
 . * course setup should have installed it for you, but let's practice installing
 . * additional Stata commands. Install the -fre- command (again) by uncommenting
 . * and running this command while online:
 . 
 . * ssc install fre
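 . 
 . * To check that the installation worked, -which- should report where the
 . * command file is stored, and -help- should open the help page that ships
 . * with the package (uncomment to run):
 . 
 . * which fre
 . * help fre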
 . 
 . * Now read the package description:
 . ado de fre

 ------------------------------------------------------------------------------------
 [1] package fre from http://fmwww.bc.edu/repec/bocode/f
 ------------------------------------------------------------------------------------

 TITLE
      'FRE': module to display one-way frequency table

 DESCRIPTION/AUTHOR(S)
      
        fre displays for each specified variable a univariate frequency
      table containing counts, percent, and cumulative percent.
      Variables may be string or numeric. Labels, in full length, and
      values are printed. By default, fre only tabulates the smallest
      and largest 10 values (along with all missing values), but this
      can be changed. Furthermore, values with zero observed frequency
      may be included in the  tables. The default for fre is to display
      the frequency  tables in the results window. Alternatively, the
      tables may be written to a file on disk, either tab-delimited or
      LaTeX-formatted.
      
      KW: data management
      KW: frequencies
      KW: frequency table
      KW: tabulation
      
      Requires: Stata version 9.2
      
      Distribution-Date: 20120618
      
      Author: Ben Jann, University of Bern
      Support: email [email protected]
      

 INSTALLATION FILES
      f/fre.ado
      f/fre.hlp

 INSTALLED ON
      17 Aug 2013
 ------------------------------------------------------------------------------------

 . 
 . 
 . * (4) Working directory
 . * ---------------------
 . 
 . * The working directory is where Stata will look to open and save stuff like
 . * datasets or logs. Use the -pwd- command to see where Stata is looking now.
 . pwd
 /Users/fr/Documents/Teaching/SRQM

 . 
 . * Use the -ls- command to list the files where Stata is looking. The -w- option
 . * makes the command print only the filenames, without system information.
 . ls, w

 README.md       backup.log      course/         demo.log        setup/
 admin/          code/           data/           profile.do

 . 
 . * For this course, you need to set the working directory to the SRQM folder.
 . * Use the 'File :: Change Working Directory...' menu item in the Stata graphical
 . * user interface to select the SRQM folder. The path to that folder will show in
 . * the Results window. It might look like this:
 . 
 . * cd ~/Documents/Teaching/SRQM/
 . 
 . * I use Mac OS X, which is why my file path takes that form. Ivaylo uses a PC,
 . * and his own working directory might be set like this:
 . 
 . * cd C:\Users\Ivo\Desktop\SRQM
 . 
 . * You will need to identify that file path on your own computer. Choose a simple
 . * location for the SRQM folder and then keep it there without renaming it or any
 . * of the folders that lead to it. Be careful with that, or you will get errors
 . * when trying to study for the course.
 . 
 . * The -cd- command shown above navigates through your folders. The next example
 . * assumes that you are now in the SRQM folder. It will select the folder that
 . * contains the course do-files. Note that if the path contained spaces, you
 . * would need to add quotes around it.
 . 
 . * cd code
 . 
 . * Uncomment and run the line above, then uncomment and run the next command to
 . * go back one level and return to the SRQM folder:
 . 
 . * cd ..
 . 
 . * Finally, you can list the files without moving to a directory. The following
 . * command shows the contents of the data/ folder:
 . ls data/, w

 ess2008.dta             gss0012_variables.txt   qog2013_variables.txt
 ess2008.zip             nhis2009.dta            world-c.dta
 ess2008_codebook.pdf    nhis2009.zip            world-d.dta
 ess2008_variables.txt   nhis2009_variables.txt  wvs2000.dta
 gss0012.dta             qog2013.dta             wvs2000.zip
 gss0012.zip             qog2013.zip             wvs2000_codebook.pdf
 gss0012_codebook.pdf    qog2013_codebook.pdf    wvs2000_variables.txt

 . 
 . 
 . * (5) Log
 . * -------
 . 
 . * You can save the commands and results from this do-file to a log file, which
 . * will serve as a backup of your work. To log this session, type:
 . log using code/week1.log, replace
 (note: file /Users/fr/Documents/Teaching/SRQM/code/week1.log not found)
 ------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /Users/fr/Documents/Teaching/SRQM/code/week1.log
  log type:  text
 opened on:  17 Aug 2013, 18:28:33

 . 
 . * The log command will now create a history of your work on this do-file. You
 . * should keep it for replication purposes. It will log all your commands and
 . * their results, including commands that returned an error. Refer to the Stata
 . * Guide for further guidance on log files, and do not forget to produce logs in
 . * the .log plain text format rather than in the less handy SMCL default format.
 . * Also make sure that you specify the -replace- option to overwrite any previous
 . * log file that might have been created by running this do-file in the past.
 . * The -name- option can be omitted.
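 . 
 . * A sketch of the same command with the -name- option, assuming the
 . * hypothetical log name 'week1'. Naming a log lets you close that specific
 . * log later, and the -text- option requests the plain text format explicitly
 . * (do not run this now, since a log is already open on that file):
 . 
 . * log using code/week1.log, name(week1) text replace
 . * log close week1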
 . 
 . * Now run these example commands (do not worry about the comments, you can leave
 . * them where they are and 'execute' them too, Stata will just ignore them):
 . 
 . * Loading data from the U.S. National Health Interview Survey (2009).
 . use data/nhis2009, clear
 (U.S. National Health Interview Survey 2009)

 . 
 . * The -clear- option gets rid of any data previously loaded into memory, since
 . * Stata can only open one dataset at a time.
 . 
 . * Describe a few variables.
 . d year sex weight raceb

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 year            int    %8.0g       year_lbl   Survey year
 sex             byte   %8.0g       sex_lbl    Sex
 weight          int    %36.0g      weight_lbl
                                              Weight in pounds without clothes or
                                                shoes
 raceb           float  %9.0g       raceb      Race

 . 
 . * Keep observations only for year 2009.    
 . keep if year == 2009
 (227298 observations deleted)

 . 
 . * Calculate the frequencies for each racial-ethnic group.
 . fre raceb

 raceb -- Race
 ----------------------------------------------------------------
                   |      Freq.    Percent      Valid       Cum.
 -------------------+--------------------------------------------
 Valid   1 White    |      14269      58.74      58.74      58.74
        2 Black    |       3893      16.03      16.03      74.77
        3 Hispanic |       4758      19.59      19.59      94.36
        4 Asian    |       1371       5.64       5.64     100.00
        Total      |      24291     100.00     100.00           
 ----------------------------------------------------------------

 . 
 . * Obtain summary statistics for the weight variable.
 . su weight

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
      weight |     24291    172.5895    37.12779        100        285

 . 
 . * List gender groups from the sex variable.
 . tab sex

        Sex |      Freq.     Percent        Cum.
 ------------+-----------------------------------
       Male |     10,978       45.19       45.19
     Female |     13,313       54.81      100.00
 ------------+-----------------------------------
      Total |     24,291      100.00

 . 
 . * Crosstabulate sex and race.
 . tab sex raceb

           |                    Race
       Sex |     White      Black   Hispanic      Asian |     Total
 -----------+--------------------------------------------+----------
      Male |     6,676      1,532      2,150        620 |    10,978 
    Female |     7,593      2,361      2,608        751 |    13,313 
 -----------+--------------------------------------------+----------
     Total |    14,269      3,893      4,758      1,371 |    24,291 


 . 
 . * Plot average weight by sex and race. You must run both lines below together.
 . gr dot weight, over(raceb) over(sex) ///
 >         name(weight_race_sex, replace)

 . 
 . * To close the log file previously opened, type the following command:
 . cap log close

 . 
 . * You will not be able to run the above command if no log is opened. The -cap-
 . * prefix allows you to run the command and continue even if it returns an error.
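 . 
 . * To see this at work, the sketch below tries to close the log a second time.
 . * Assuming that no unnamed log is open any more, -log close- should fail,
 . * -cap- hides the error, and the return code stored in _rc should be nonzero
 . * (uncomment to run):
 . 
 . * cap log close
 . * di _rc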
 . 
 . * If you now go to your code/ folder and open the week1.log file with
 . * any plain text editor, you will find a copy of everything that was entered
 . * between the -log using- and -log close- commands, including comments, the
 . * example above and its output for each command. You can view the file in Stata:
 . view code/week1.log

 . 
 . * The dot graph will need to be saved separately: this can be done in several
 . * ways that are documented in the course slides and in the Stata Guide. The
 . * Stata help pages also cover each graph command. Have a look at them:
 . help graph
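 . 
 . * A minimal sketch of one way to save the dot graph, assuming the hypothetical
 . * file name 'weight_race_sex.png' and a graphical (non-console) version of
 . * Stata. The -name- option selects the graph created above and -replace-
 . * overwrites any earlier export (uncomment to run):
 . 
 . * graph export weight_race_sex.png, name(weight_race_sex) replace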

 . 
 . * Similarly, there is more about logs in the Stata Guide and in several of
 . * the tutorials included in the course material, but we also recommend that you
 . * use the Stata help pages, as explained below.
 . 
 . 
 . * ============
 . * = DATASETS =
 . * ============
 . 
 . 
 . * (1) List datasets
 . * -----------------
 . 
 . * Show all datasets for this course. The asterisk in the command is a wildcard
 . * character that causes the command to return all matches (here, .dta files).
 . * The -w- option makes the output less verbose.
 . ls "data/*.dta", w

 data/ess2008.dta        data/qog2013.dta        data/wvs2000.dta
 data/gss0012.dta        data/world-c.dta
 data/nhis2009.dta       data/world-d.dta

 . 
 . * Note: the quotes in the command above are optional. Quotes are only required
 . * when the path contains spaces. For example, if the data/ folder were called
 . * 'Course datasets', quotes would be necessary to run -ls "Course datasets"-.
 . * This means that, if the path to your working directory contains spaces, you
 . * must enclose it in quotes if you use -cd- to set your working directory.
 . 
 . * Typical example.
 . * cd "/Users/somestudent/Documents/Sciences Po/4A/Semester 1/Stats stuff/SRQM"
 . 
 . * Now back to the datasets.
 . 
 . * All datasets are in the data/ folder of the SRQM Teaching Pack. The commands
 . * used to load them in the course do-files will work only if you have correctly
 . * set your working directory to the SRQM folder first. The course setup does it
 . * for you, unless you move the SRQM folder, in which case it will stop working.
 . 
 . * The README file of the data/ folder holds links to essential documents for you
 . * to read if you want to use the data for your research project. You can start 
 . * looking for variables of interest by using the -lookfor- command after loading
 . * one of the course datasets.
 . 
 . 
 . * (2) European Social Survey Round 4, 2008
 . * ----------------------------------------
 . 
 . * Load.
 . use data/ess2008, clear
 (European Social Survey 2008)

 . 
 . * Example search.
 . lookfor health immig

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 stfhlth         byte   %2.0f       stfhlth    State of health services in country
                                                nowadays
 imsmetn         byte   %1.0f       imsmetn    Allow many/few immigrants of same
                                                race/ethnic group as majority
 imdfetn         byte   %1.0f       imdfetn    Allow many/few immigrants of different
                                                race/ethnic group from majority
 impcntr         byte   %1.0f       impcntr    Allow many/few immigrants from poorer
                                                countries outside Europe
 imbgeco         byte   %2.0f       imbgeco    Immigration bad or good for country's
                                                economy
 imueclt         byte   %2.0f       imueclt    Country's cultural life undermined or
                                                enriched by immigrants
 imwbcnt         byte   %2.0f       imwbcnt    Immigrants make country worse or
                                                better place to live
 health          byte   %1.0f       health     Subjective general health
 gvhlthc         byte   %2.0f       gvhlthc    Health care for the sick, governments'
                                                responsibility
 hlthcef         byte   %2.0f       hlthcef    Provision of health care, how
                                                efficient
 imsclbn         byte   %1.0f       imsclbn    When should immigrants obtain rights
                                                to social benefits/services
 imrccon         byte   %2.0f       imrccon    Immigrants receive more or less than
                                                they contribute
 lvpbhlt         byte   %1.0f       lvpbhlt    Level of public health care affordable
                                                10 years from now
 lknhlcn         byte   %1.0f       lknhlcn    How likely not receive health care
                                                needed if become ill next 12 months
 p70hltb         byte   %2.0f       p70hltb    People over 70 a burden on health
                                                service these days

 . 
 . 
 . * (3) Quality of Government, 2013
 . * -------------------------------
 . 
 . * Load.
 . use data/qog2013, clear
 (Quality of Government 2013)

 . 
 . * Example search.
 . lookfor devel orig

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 gr_cso          int    %8.0g                  Development Civil Society
                                                Organizations
 ht_colonial     byte   %55.0g      ht_colonial
                                              Colonial Origin
 iag_hd          double %10.0g                 Human Development
 lp_legor        float  %31.0g      lp_legorlabel
                                              Legal origin
 undp_hdi        double %10.0g                 Human Development Index
 wdi_aid         double %10.0g                 Net Development Assistance and Aid
                                                (Constant USD)
 wdi_aidcu       double %10.0g                 Net Development Assistance and Aid
                                                (Current USD)

 . 
 . 
 . * (4) World Values Survey, 2000
 . * -----------------------------
 . 
 . * Load.
 . use data/wvs2000, clear
 (World Values Survey 2000)

 . 
 . * Example search.
 . lookfor army homo

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 v76             byte   %8.0g       v76        neighbors: homosexuals
 v1660           byte   %8.0g       v1660      having the army rule
 v208            byte   %8.0g       v208       justifiable: homosexuality

 . 
 . 
 . * (5) General Social Survey, 2012
 . * -------------------------------
 . 
 . * Load.
 . use data/gss0012, clear
 (U.S. General Social Survey 2000-2012)

 . 
 . * Example search.
 . lookfor army homo

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 spkhomo         byte   %8.0g       LABAE      allow homosexual to speak
 colhomo         byte   %8.0g       LABAF      allow homosexual to teach
 libhomo         byte   %8.0g       LABAG      allow homosexuals book in library
 conarmy         byte   %8.0g       LABBJ      confidence in military
 homosex         byte   %8.0g       LABCC      homosexual sex relations
 marhomo         byte   %8.0g       LABJK      homosexuals should have right to marry
 homosex1        byte   %8.0g       LABPU      is homosexual sex wrong?

 . 
 . * Note that this dataset holds more than one year of data.
 . tab year

   gss year |
   for this |
 respondent |      Freq.     Percent        Cum.
 ------------+-----------------------------------
       2000 |      2,817       14.87       14.87
       2002 |      2,765       14.59       29.46
       2004 |      2,812       14.84       44.31
       2006 |      4,510       23.81       68.11
       2008 |      2,023       10.68       78.79
       2010 |      2,044       10.79       89.58
       2012 |      1,974       10.42      100.00
 ------------+-----------------------------------
      Total |     18,945      100.00

 . 
 . * This means that you will have to reduce it to one year of observations before
 . * analyzing it. More on that next week. For now, back to looking for variables.
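 . 
 . * A minimal sketch of one way to do so, mirroring the -keep if- command used
 . * with the NHIS data above and assuming that you want the 2012 wave. The
 . * second line checks that only that year remains (uncomment to run):
 . 
 . * keep if year == 2012
 . * tab year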
 . 
 . 
 . * (6) Search across datasets
 . * --------------------------
 . 
 . * Tip: an additional package can help you search for variables across datasets.
 . * It should have been installed by the course setup utility. If not, install it
 . * yourself with -ssc install lookfor_all- (requires an Internet connection).
 . lookfor_all health, dir(data)
 Variables in:
 use "/Users/fr/Documents/Teaching/SRQM/data/ess2008.dta" 
 variables: stfhlth health gvhlthc hlthcef lvpbhlt lknhlcn p70hltb

 Variables in:
 use "/Users/fr/Documents/Teaching/SRQM/data/gss0012.dta" 
 variables: natheal nathealy health abhlth mentloth health30 health12 hlthinfo hlthp
 > apr hlthmag1 hlthmag2 hlthdoc hlthfrel hlthtv hlthwww didlessp limitedp treat11 do
 > ccosts safehlth health1 physhlth mntlhlth outsider medsavtx medsymps medaddct medu
 > nacc mhhlpmhp mhgvthlt mhtrtot2 mhclsoth mhseroth mhhlpoth mhreloth mhtrtslf mhsee
 > pub emphlth sphlth richhlth hrdshp6 askmentl

 Variables in:
 use "/Users/fr/Documents/Teaching/SRQM/data/nhis2009.dta" 
 variables: health uninsured

 Variables in:
 use "/Users/fr/Documents/Teaching/SRQM/data/qog2013.dta" 
 variables: wdi_hec wdi_prhe wdi_puhe wdi_the wvs_a009


 File "/Users/fr/Documents/Teaching/SRQM/data/world-c.dta" cannot be open in current 
 > version of Stata

 File "/Users/fr/Documents/Teaching/SRQM/data/world-d.dta" cannot be open in current 
 > version of Stata
 Variables in:
 use "/Users/fr/Documents/Teaching/SRQM/data/wvs2000.dta" 
 variables: v12 v52 v67


 Total 7 out of 7 files checked in  "/Users/fr/Documents/Teaching/SRQM/data/"

 . 
 . * The command above, like all commands that call datasets or do-files,
 . * requires that the SRQM folder has been set as the working directory.
 . 
 . * Because some commands like -lookfor_all- must be installed before you run
 . * the course do-files, the course setup utility installed them during our
 . * first session together. As a precaution, however, I also include a small
 . * loop in all course do-files that automatically detects uninstalled commands
 . * and fetches them online if needed. These loops look like the one below and
 . * require that you select all four lines together and then execute them.
 . foreach p in lookfor_all {
  2.         cap which `p'
  3.         if _rc == 111 cap noi ssc install `p'
  4. }

 . 
 . * The syntax of these loops is typically more complex than anything that you
 . * will have to read or write for this course, so do not panic if they do not
 . * make sense to you. Focus on getting the rest of the code straight.
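 . 
 . * As a rough annotated sketch, the same loop reads as follows. The comments
 . * on the right are ours; the commands are the ones shown above:
 . *
 . * foreach p in lookfor_all {           // repeat for each package name listed
 . *     cap which `p'                    // look for the command, hiding errors
 . *     if _rc == 111 cap noi ssc install `p'   // code 111: command not found
 . * }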
 . 
 . 
 . * ========
 . * = HELP =
 . * ========
 . 
 . 
 . * It is essential to the methods covered by this course that you learn to use
 . * help extensively. The course material includes a lot of help with Stata, but
 . * you should also learn to use internal Stata help pages, accessible with the
 . * -help- command. Suppose that you want to understand the following command:
 . *
 . * su weight if raceb == 1, d
 . 
 . * To understand what -su- means and does, type -help- followed by -su-:
 . help su

 . 
 . * The underline tells you that -su- is shorthand for -summarize-, which returns
 . * a few summary statistics for one or more variables. The -help- command itself
 . * can be abbreviated to simply -h-. The -if- component of the command is also
 . * documented in Stata:
 . h if

 . 
 . * Finally, the -d- option shown in the example is documented on the help page
 . * for -summarize-. It produces more statistics: -d- is shorthand for -detail-.
 . * Do not confuse it with the -d- shorthand for the -describe- command, which
 . * lists the variables in the current dataset.
 . 
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * The course will teach you to write commands like the ones featured in this
 . * do-file. If you combine practice, documentation and a bit of intuition, you
 . * can learn most of the Stata syntax in a few weeks through trial-and-error.
 . * Get ready by practicing as soon as possible! Programming works that way.
 . * Oh, and congratulations on reaching this line.
 . 
 . * Last words: when you leave Stata, DO NOT SAVE YOUR DATASET. Keep it intact as
 . * originally downloaded. Instead, save the do-file that contains the commands
 . * you used to perform your analysis. Stata will automatically save the log file
 . * for you when you shut it down, so this requires no action on your side. For
 . * additional help, please turn again to the Stata Guide.
 . 
 . * Close log (if still open, which it should not be).
 . cap log close

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require fre

 . 
 . * Log results.
 . cap log using code/week2.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 2 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - WHAT:  Body Mass Index in the United States
 > 
 >  - DATA:   U.S. National Health Interview Survey (2009)
 > 
 >  - Hi! Welcome to your second SRQM do-file.
 > 
 >  - All the do-files for this course assume that you have set up Stata first by 
 >    adjusting some parameters, most importantly setting the working directory to
 >    your SRQM folder. Please refer to the do-file from Session 1 for guidance.
 > 
 >  - Welcome again to Stata. Read the comment lines as you go along, and run the
 >    code by executing command lines sequentially. Select lines with Cmd-L (Mac)
 >    or Ctrl-L (Win), and execute them with Cmd-Shift-D (Mac) or Ctrl-D (Win).
 > 
 >  - We will explore the National Health Interview Survey with a few basic Stata
 >    commands. This is to show you how to explore a dataset and its variables. You
 >    need to make a choice of dataset for your project by the end of the week.
 >  
 >  - If you want to study one country or compare two of them, turn to survey data
 >    from the European Social Survey (ESS), U.S. General Social Survey (GSS) or
 >    World Values Survey (WVS).
 > 
 >  - If you want to study country-level data, use the Quality of Government (QOG)
 >    dataset. Your sample should be all world countries: do not restrict the
 >    sample further by subsetting to fewer observations.
 > 
 >    Last updated 2013-02-17.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load NHIS dataset.
 . use data/nhis2009, clear
 (U.S. National Health Interview Survey 2009)

 . 
 . * Once the dataset is loaded, the Variables window will fill up, and you will
 . * be able to look at the actual dataset from the Data Editor. Read from the
 . * course material to make sure that you know how to read through a dataset:
 . * its data structure shows observations in rows and variables in columns.
 . 
 . * List all variables in the dataset.
 . describe

 Contains data from data/nhis2009.dta
  obs:       251,589                          U.S. National Health Interview
                                                Survey 2009
 vars:            32                          16 Aug 2013 05:22
 size:    20,630,298                          (_dta has notes)
 ------------------------------------------------------------------------------------
              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 year            int    %8.0g       year_lbl   Survey year
 serial          double %8.0f                  Sequential Serial Number, Household
                                                Record
 strata          int    %8.0g       strata_lbl
                                              Stratum for variance estimation
 psu             int    %8.0g       psu_lbl    Primary sampling unit (PSU) for
                                                variance estimation
 hhweight        long   %12.0g                 Household weight, final annual
 pernum          byte   %8.0g                  Person number in household
 perweight       double %9.0f                  Final basic annual weight
 sampweight      double %9.0f                  Sample Person Weight
 nhispid         str16  %16s                   NHIS Unique Identifier, person
 age             byte   %31.0g      age_lbl    Age
 marstat         byte   %37.0g      marstat_lbl
                                              Legal marital status
 sex             byte   %8.0g       sex_lbl    Sex
 hispeth         byte   %45.0g      hispeth_lbl
                                              Hispanic ethnicity
 racea           int    %43.0g      racea_lbl
                                              Main Racial Background (Pre-1997
                                                Revised OMB Standards),
                                                self-reported or interv
 regionbr        byte   %42.0g      regionbr_lbl
                                              Global region of birth
 yrsinus         byte   %30.0g      yrsinus_lbl
                                              Number of years spent in the U.S.
 educrec1        byte   %36.0g      educrec1_lbl
                                              Educational attainment recode,
                                                nonintervalled
 earnings        byte   %23.0g      earnings_lbl
                                              Person's total earnings, previous
                                                calendar year
 incimp1         byte   %17.0g      incimp1_lbl
                                              Imputed total combined family income
                                                (1997+ grouping)
 health          byte   %23.0g      health_lbl
                                              Health status
 height          byte   %30.0g      height_lbl
                                              Height in inches without shoes
 weight          int    %36.0g      weight_lbl
                                              Weight in pounds without clothes or
                                                shoes
 visityrno       byte   %19.0g      visityrno_lbl
                                              Total office visits in past 12 months
 ybarcare        byte   %23.0g      ybarcare_lbl
                                              Needed but couldn't afford medical
                                                care, past 12 months
 uninsured       byte   %23.0g      uninsured_lbl
                                              Health Insurance coverage status
 diayrsago       byte   %34.0g      diayrsago_lbl
                                              Years since first diagnosed with
                                                diabetes
 strongfwk       byte   %35.0g      strongfwk_lbl
                                              Frequency of strengthening activity:
                                                Times per week
 vig10fwk        byte   %30.0g      vig10fwk_lbl
                                              Frequency of vigorous activity 10+
                                                minutes: Times per week
 rsweight        float  %9.0g                  Adjusted to original size Sample
                                                Person Weight
 raceb           float  %9.0g       raceb      Race
 vigor           byte   %9.0g                  Frequenciy of vigorous activity 10+
                                                minutes: times per week
 strength        byte   %9.0g                  Frequenciy of strengthening activity:
                                                times per week
 ------------------------------------------------------------------------------------
 Sorted by:  

 . 
 . 
 . * Finding variables
 . * -----------------
 . 
 . * Locate variables of interest by looking for keywords in the variable names
 . * and labels. This is particularly useful when your dataset comes with
 . * variable names that are hard or impossible to understand by themselves,
 . * such as 'v1' or 'epi_epi'. The example below will identify several
 . * variables with either 'height' or 'weight' in their descriptors.
 . lookfor height weight

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 hhweight        long   %12.0g                 Household weight, final annual
 perweight       double %9.0f                  Final basic annual weight
 sampweight      double %9.0f                  Sample Person Weight
 height          byte   %30.0g      height_lbl
                                              Height in inches without shoes
 weight          int    %36.0g      weight_lbl
                                              Weight in pounds without clothes or
                                                shoes
 rsweight        float  %9.0g                  Adjusted to original size Sample
                                                Person Weight

 . 
 . * List their values for the first ten observations.
 . list height weight in 1/10

     +-----------------+
     | height   weight |
     |-----------------|
  1. |     67      185 |
  2. |     68      125 |
  3. |     67      132 |
  4. |     69      150 |
  5. |     62      143 |
     |-----------------|
  6. |     70      160 |
  7. |     71      183 |
  8. |     75      200 |
  9. |     67      125 |
 10. |     69      140 |
     +-----------------+

 . 
 . 
 . * Subsetting to cross-sectional format
 . * ------------------------------------
 . 
 . * Our first step verifies whether the survey is cross-sectional. As we find
 . * that the data contain more than one survey wave and span several years, we
 . * keep only the most recent observations. This step applies only to datasets
 . * that contain multiple survey years, which is generally not the case in this
 . * course.
 . 
 . * Check whether the survey is cross-sectional.
 . lookfor year

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 year            int    %8.0g       year_lbl   Survey year
 yrsinus         byte   %30.0g      yrsinus_lbl
                                              Number of years spent in the U.S.
 earnings        byte   %23.0g      earnings_lbl
                                              Person's total earnings, previous
                                                calendar year
 diayrsago       byte   %34.0g      diayrsago_lbl
                                              Years since first diagnosed with
                                                diabetes

 . tab year

 Survey year |      Freq.     Percent        Cum.
 ------------+-----------------------------------
       2000 |     28,712       11.41       11.41
       2001 |     29,459       11.71       23.12
       2002 |     27,087       10.77       33.89
       2003 |     26,998       10.73       44.62
       2004 |     27,462       10.92       55.53
       2005 |     27,484       10.92       66.46
       2006 |     21,010        8.35       74.81
       2007 |     20,173        8.02       82.83
       2008 |     18,913        7.52       90.34
       2009 |     24,291        9.66      100.00
 ------------+-----------------------------------
      Total |    251,589      100.00

 . 
 . * The data should be cross-sectional for the purpose of this course. However,
 . * the dataset contains observations for more than one year. We will solve that
 . * issue by keeping observations for the 2009 survey year only.
 . 
 . * Delete all observations except for 2009.
 . drop if year != 2009
 (227298 observations deleted)

 . 
 . * The -drop- command deleted all observations for which the variable 'year' is
 . * different (!=) from 2009. An equivalent command would be:
 . *
 . * keep if year == 2009
 . *
 . * This command keeps only observations for which the 'year' variable is equal
 . * (==) to 2009. Notice that the 'equal to' operator in Stata is a double equal
 . * sign (==). Logical operators apply to many commands: read on to find out.
 . * Also note that the spaces around logical operators are optional.
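 . 
 . * As a sketch of how to check the result (these lines are an illustration and
 . * are not part of the original do-file), the -assert- command stops with an
 . * error if any remaining observation violates the condition, and -count-
 . * shows how many observations are left:
 . *
 . * assert year == 2009
 . * count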
 . 
 . * Make sure that you fully understand how cross-sectional data are arranged by
 . * opening the Data Editor or using the -browse- command to take a quick look.
 . 
 . 
 . * Survey weights
 . * --------------
 . 
 . * The command below sets survey weights, which can be used to obtain weighted
 . * estimates at later stages of the analysis. We will not require them much.
 . 
 . * Survey weights (see NHIS documentation).
 . svyset psu [pw = perweight], strata(strata)

      pweight: perweight
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

 . 
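 . * As an illustration only (not part of the original do-file), once -svyset-
 . * has been run, the -svy:- prefix produces weighted estimates, such as the
 . * weighted mean age of respondents:
 . *
 . * svy: mean age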
 . 
 . * =========================
 . * = VARIABLE MANIPULATION =
 . * =========================
 . 
 . 
 . * Dependent variable: Body Mass Index
 . * -----------------------------------
 . 
 . * Our next step is to compute the Body Mass Index (BMI) for each observation
 . * in the dataset (i.e. for each respondent to the survey) from the 'height'
 . * (inches) and 'weight' (pounds) variables: BMI = 703 * weight / height^2.
 . 
 . * Create the Body Mass Index from height and weight. We can write the -generate-
 . * command as its -gen- shorthand. We will later call BMI our dependent variable,
 . * and we will use other (independent) variables to try to predict its values.
 . gen bmi = weight * 703 / height^2

 . 
 . * If something looks wrong later on in your analysis, check your BMI equation.
 . * Also note that Stata is case-sensitive: we will write 'BMI' in the comments,
 . * but the variable itself is called 'bmi' and should be written in lowercase.
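 . 
 . * As a quick check of the formula (an illustration, not part of the original
 . * do-file), the -display- command works as a calculator. The constant 703
 . * converts pounds and inches to the metric definition of BMI (kg/m^2). For a
 . * respondent who is 67 inches tall and weighs 185 pounds, like the first one
 . * listed earlier, the line below should return a BMI of roughly 29:
 . *
 . * display 185 * 703 / 67^2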
 . 
 . 
 . * Labelling a variable
 . * --------------------
 . 
 . * Add a description label to the variable. All label commands start with -label-
 . * (shorthand -la-). The one below labels a variable (shorthand -var-).
 . la var bmi "Body Mass Index"

 . 
 . * List BMI among the variables included in the current dataset.
 . d bmi

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 bmi             float  %9.0g                  Body Mass Index

 . 
 . * The -describe- command (shorthand -d-) shows that the BMI variable is now
 . * part of the NHIS dataset. However, DO NOT SAVE your dataset, even when you
 . * perform a useful operation like this one. Instead, you will run the do-file
 . * to generate the variable again, hence making your calculation of BMI fully
 . * understandable and replicable by an exterior observer, like us, or anyone.
 . 
 . * Take a look at the BMI of a few respondents. Values between 15 and 40 are
 . * expected for human beings as we know them on this planet. We also list a
 . * few other variables to start thinking about possible relationships.
 . li sex age health bmi in 50/60

     +-------------------------------------+
     |    sex   age      health        bmi |
     |-------------------------------------|
 50. |   Male    28        Poor   29.11834 |
 51. |   Male    29   Excellent   26.62286 |
 52. |   Male    21   Very Good   35.86735 |
 53. |   Male    40        Good   29.64641 |
 54. | Female    63        Fair   37.58521 |
     |-------------------------------------|
 55. | Female    38   Very Good   23.40106 |
 56. |   Male    54        Good   33.71531 |
 57. |   Male    47   Very Good   23.49076 |
 58. | Female    38   Excellent   20.89472 |
 59. | Female    81        Good   24.68913 |
     |-------------------------------------|
 60. |   Male    32   Very Good   33.08425 |
     +-------------------------------------+

 . li sex age health bmi in -10/l

       +-------------------------------------+
       |    sex   age      health        bmi |
       |-------------------------------------|
 24282. | Female    26   Excellent   24.12663 |
 24283. |   Male    70        Good   33.77728 |
 24284. | Female    19   Very Good   23.29467 |
 24285. | Female    24        Good   37.49089 |
 24286. | Female    77        Poor   29.85058 |
       |-------------------------------------|
 24287. | Female    57   Very Good   24.20799 |
 24288. |   Male    20   Very Good   24.40488 |
 24289. | Female    67        Good   28.49072 |
 24290. |   Male    62        Poor   33.27811 |
 24291. | Female    55        Good   19.76427 |
       +-------------------------------------+

 . 
 . 
 . * Summary statistics
 . * ------------------
 . 
 . * We now turn to analysing the newly created 'bmi' variable, using the
 . * -summarize- command (shorthand -su-) to obtain its mean, min and max values,
 . * as well as standard deviation, which we will cover later on.
 . su bmi

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |     24291       27.27    5.134197   15.20329   50.48837

 . 
 . * Add the -detail- option (shorthand -d-) for precise statistics that cover
 . * its mean, minimum and maximum values, as well as its percentile distribution.
 . su bmi, d

                       Body Mass Index
 -------------------------------------------------------------
      Percentiles      Smallest
 1%     18.30729       15.20329
 5%     20.11707       15.20329
 10%     21.26276       15.20329       Obs               24291
 25%     23.51343        15.5041       Sum of Wgt.       24291

 50%     26.57845                      Mean              27.27
                        Largest       Std. Dev.      5.134197
 75%     30.22843       49.60056
 90%     34.32617       50.38167       Variance       26.35998
 95%     36.91451       50.48837       Skewness       .7207431
 99%     41.59763       50.48837       Kurtosis       3.463278

 . 
 . * Further sessions will gradually explain how to read each statistic displayed.
 . * For now, just note that the median respondent in the dataset, which is meant
 . * to be representative of the United States adult population in 2009, has a
 . * BMI of about 26.6, which indicates overweight. The mean BMI (27.3) is above
 . * that value, which indicates that high BMI values are more frequent and/or
 . * more extreme than low BMI values. You can also note that the top 1% of
 . * respondents have a BMI between about 41.6 and 50.5, which indicates morbid
 . * obesity.
 . 
 . 
 . * Visualization
 . * -------------
 . 
 . * Visualizing the distribution of BMI values among the observations contained
 . * in the dataset will make these first insights more clear and more complete.
 . * Create a histogram (shorthand -hist-) for the distribution of BMI.
 . hist bmi, freq normal ///
 >         name(bmi, replace)
 (bin=43, start=15.203287, width=.82058321)

 . 
 . * A histogram describes the distribution of the variable in the sample, i.e.
 . * the distribution of different values of BMI among the respondents to the
 . * survey. The -freq- option plots frequencies (counts of respondents) on the
 . * vertical axis, and the -normal- option overlays a normal distribution on the
 . * histogram, a curve to which we will soon come back when we cover essential
 . * statistical theory. The -name- option saves the graph under that name in
 . * Stata temporary memory.
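 . 
 . * As a sketch (not part of the original do-file; the graph name 'bmi_pct' is
 . * ours), the -percent- option can be used in place of -freq- to show the
 . * percentage of respondents in each bin instead of their count:
 . *
 . * hist bmi, percent normal name(bmi_pct, replace)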
 . 
 . * Another visualization is the boxplot, which uses different criteria to shape
 . * the distribution of the variable. Refer to the course material to understand
 . * how quartiles and outliers are constructed to form each element of the plot.
 . * Also note that a boxplot is pretty uninformative if, as in this example, you
 . * decide not to split the visualization over any number of categories.
 . gr hbox bmi, ///
 >         name(bmi_boxplot, replace)

 . 
 . * The next example uses the -over() asyvars- options to produce boxplots of BMI
 . * over gender groups, and then again over insurance status. This method creates
 . * several box plots, one for each category -- a method called 'visualizing over
 . * small multiples'. The result will stay in memory under the name given by the
 . * -name()- option. Note, finally, that you need to select both lines to run the
 . * command properly: if you do not include the final line, nothing will happen.
 . gr hbox bmi if uninsured != 9, over(sex) asyvars over(uninsured) ///
 >         name(bmi_sex_ins, replace)

 . 
 . 
 . * Logical expressions
 . * -------------------
 . 
 . * Note how the 'DK' category for insurance status was removed by using the
 . * -if- qualifier to exclude observations with an insurance status equal to 9
 . * when drawing the plot. This part of the command reads as: draw a boxplot of
 . * all observations with an insurance status not equal to 9.
 . 
 . * Here are more examples of logical expressions.
 . 
 . su bmi if age >= 20 & age < 25

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |      1923     25.3919    4.839532   15.96345   46.63502

 . * This command reads as: 'run the -summarize- command on the 'bmi' variable,
 . * but only for observations for which the 'age' variable takes a value
 . * greater than or equal to 20 and ('&') less than 25.'
 . 
 . su bmi if sex == 1 & age >= 65

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |      1831    27.71574    4.401945   17.62924   45.18825

 . * This command reads as: 'summarize BMI for observations of sex equal to 1
 . * (i.e. males in this dataset) and of age greater than or equal to 65.'
 . 
 . su bmi if raceb == 2 | raceb == 3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |      8651    28.17926    5.218596    15.5041   50.48837

 . * This command uses the 'raceb' variable, which codes Blacks and Hispanics
 . * with values 2 and 3. This command therefore summarises BMI only for these
 . * two ethnic groups: the '|' symbol is the logical operator for 'or'. It
 . * reads as: 'summarize BMI if the respondent is Black or Hispanic.'
 . 
 . * If you have many categories to select, then using the -inlist()- function
 . * might be much quicker. The example below selects income categories that fall
 . * either below the 2009 minimum wage (15,000 dollars/year) or at five times
 . * that amount or more (i.e. earnings == 11, the highest income category in the
 . * dataset).
 . tab earnings if inlist(earnings, 1, 2, 3, 11)

         Person's total |
     earnings, previous |
          calendar year |      Freq.     Percent        Cum.
 ------------------------+-----------------------------------
           $01 to $4999 |      1,081       21.63       21.63
         $5000 to $9999 |        923       18.47       40.10
       $10000 to $14999 |      1,252       25.06       65.16
        $75000 and over |      1,741       34.84      100.00
 ------------------------+-----------------------------------
                  Total |      4,997      100.00

 . 
 . * This function is also practical to select countries, regions and other
 . * nominal variables in country-level data, and it accepts strings, i.e. text.
 . * Examples to follow later. For the moment, simply note that the example above
 . * uses a tabulation command because the earnings variable is categorical. This
 . * difference in the type of variable is crucial, and is illustrated further.
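 . 
 . * As a sketch (not part of the original do-file), two of the commands above
 . * can be rewritten with functions. Because 'age' is stored in whole years,
 . * -inrange(age, 20, 24)- selects the same respondents as the condition
 . * 'age >= 20 & age < 25', and -inlist(raceb, 2, 3)- replaces the '|' operator:
 . *
 . * su bmi if inrange(age, 20, 24)
 . * su bmi if inlist(raceb, 2, 3)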
 . 
 . 
 . * =========================
 . * = INDEPENDENT VARIABLES =
 . * =========================
 . 
 . 
 . * Body Mass Index is our 'dependent variable', i.e. the one that we want to
 . * explain. We have reason to believe that some 'independent' variables like
 . * gender, health status and race could be influencing BMI. In other words,
 . * we assume that BMI can be partially 'predicted' by sex, health and race.
 . lookfor sex health race

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 sex             byte   %8.0g       sex_lbl    Sex
 racea           int    %43.0g      racea_lbl
                                              Main Racial Background (Pre-1997
                                                Revised OMB Standards),
                                                self-reported or interv
 health          byte   %23.0g      health_lbl
                                              Health status
 uninsured       byte   %23.0g      uninsured_lbl
                                              Health Insurance coverage status
 raceb           float  %9.0g       raceb      Race

 . 
 . 
 . * Summarizing over categories
 . * ---------------------------
 . 
 . * Summarize BMI (as well as age and weight) for each value of 'sex'. The
 . * -su- command assumes that you are describing a variable that can take any
 . * numeric value, and shows summary statistics for it. The -bysort- prefix
 . * (shorthand -bys-) takes one categorical variable and repeats the command
 . * over its categories. The entire command thus reads: for each value of the
 . * 'sex' variable, summarize the continuous variables 'bmi', 'age' and
 . * 'weight'.
 . bysort sex: su bmi age weight

 ------------------------------------------------------------------------------------
 -> sex = Male

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |     10978    27.57415    4.430363   17.14956   48.70874
         age |     10978    46.47404    16.93469         18         84
      weight |     10978    190.5036     33.0331        126        285

 ------------------------------------------------------------------------------------
 -> sex = Female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |     13313    27.01919    5.636827   15.20329   50.48837
         age |     13313    47.09419    17.35074         18         84
      weight |     13313    157.8174    33.65398        100        259


 . 
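 . * As an alternative sketch (not part of the original do-file), the -tabstat-
 . * command can show the same statistics for both groups in a single table:
 . *
 . * tabstat bmi age weight, by(sex) s(n mean sd min max)
 . 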
 . * Read the Stata codebook for the 'health' variable.
 . codebook health

 ------------------------------------------------------------------------------------
 health                                                                 Health status
 ------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  health_lbl

                 range:  [1,5]                        units:  1
         unique values:  5                        missing .:  7/24291

            tabulation:  Freq.   Numeric  Label
                          6750         1  Excellent
                          7833         2  Very Good
                          6423         3  Good
                          2496         4  Fair
                           782         5  Poor
                             7         .  

 . 
 . * The codebook shows that the health variable comes in ordered categories.
 . * In that case, the -su- command will not describe the variable properly. You
 . * will instead need to use either the -tab- or the -fre- command to view its
 . * frequencies:
 . fre health

 health -- Health status
 -----------------------------------------------------------------
                    |      Freq.    Percent      Valid       Cum.
 --------------------+--------------------------------------------
 Valid   1 Excellent |       6750      27.79      27.80      27.80
        2 Very Good |       7833      32.25      32.26      60.05
        3 Good      |       6423      26.44      26.45      86.50
        4 Fair      |       2496      10.28      10.28      96.78
        5 Poor      |        782       3.22       3.22     100.00
        Total       |      24284      99.97     100.00           
 Missing .           |          7       0.03                      
 Total               |      24291     100.00                      
 -----------------------------------------------------------------

 . 
 . * Note that health is measured on five levels that come as values (1-5), and
 . * labels attached to them (from 'Excellent' to 'Poor'). We will discuss this
 . * structure in depth when we introduce variable types and value labels. For
 . * the moment, simply note that the health variable holds an ordinal scale
 . * of self-reported health status, and that the values attached to its labels
 . * are merely a way to create an ordinal scale: 'poor' health is not worth 5
 . * points of anything. Refer later to the course material to make sure that
 . * you are familiar with the terminology and notions of variable description.
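 . 
 . * A minimal sketch, not run in this log: the built-in -tab- command gives the
 . * same frequencies as -fre-. Its -missing- option adds a row for the missing
 . * values, and -nolabel- shows the numeric codes instead of the value labels:
 . *
 . *     tab health, missing
 . *     tab health, missing nolabel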
 . 
 . * Summarize BMI (as well as weight) for each value of the health variable.
 . * Note that -bys- is shorthand for the -bysort- prefix.
 . bys health: su bmi weight

 ------------------------------------------------------------------------------------
 -> health = Excellent

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |      6750    25.78944    4.399813   16.13866   49.60056
      weight |      6750    165.2935    34.52845        100        285

 ------------------------------------------------------------------------------------
 -> health = Very Good

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |      7833    27.06963    4.864313   15.20329   50.48837
      weight |      7833    172.0412    36.43623        100        285

 ------------------------------------------------------------------------------------
 -> health = Good

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |      6423    28.21763    5.380641    15.5041   50.48837
      weight |      6423    177.2219    38.11962        100        285

 ------------------------------------------------------------------------------------
 -> health = Fair

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |      2496    28.89986    5.636523   15.20329   48.81944
      weight |      2496    179.5897    38.78594        100        285

 ------------------------------------------------------------------------------------
 -> health = Poor

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |       782    29.08097    6.087322    15.6605   48.70874
      weight |       782    180.7225    40.37895        100        283

 ------------------------------------------------------------------------------------
 -> health = .

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |         7    26.17216    4.125204   19.15614   31.95455
      weight |         7    166.4286    33.44576        126        205


 . 
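 . * A minimal sketch, not run in this log: -tabstat- with the -by()- option is
 . * one way to get the same group summaries in a single compact table. The
 . * choice of statistics below is only an illustration:
 . *
 . *     tabstat bmi weight, by(health) s(n mean sd) missing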
 . 
 . * Visualization over categories
 . * -----------------------------
 . 
 . * Graph the mean BMI of each ethnic group, using a dot plot.
 . gr dot bmi, over(raceb) ytitle("Average Body Mass Index") ///
 >         name(bmi_race, replace)

 . 
 . * Add a new categorical division between men and women to the dot plot.
 . gr dot bmi, over(sex) over(raceb) ytitle("Average Body Mass Index") ///
 >         name(bmi_race, replace)

 . 
 . * Each independent variable might influence BMI, but it can also interact with
 . * another independent variable, which makes the explanation of BMI more complex
 . * and detailed. Visualization allows you to explore that intuition in the same
 . * way that it helped you think about predictors of the dependent variable.
 . 
 . * The graph below explores a relationship between three independent variables.
 . * An additional trick in this graph is that its command runs over three lines.
 . * The '///' indicates that you have to select all three lines to properly run
 . * the graph command. This trick helps to format do-files in short lines.
 . gr dot health, exclude0 yreverse over(sex) over(raceb) ///
 >         ylabel(1 "Excellent" 3 "Good" 5 "Poor") ytitle("Average health status") ///
 >         name(health_sex_race, replace)

 . 
 . * The graph uses several options: due to the numerical coding of the 'health'
 . * variable, we had to remove 0 from the dot plot, and reverse the axis. We also
 . * made the horizontal (y) axis more legible by adding (y)labels and a (y)title.
 . * Note that the visual difference is naturally not sufficient to establish that 
 . * there is a significant difference in mean BMI across racial/ethnic groups.
 . 
 . 
 . * ==========================
 . * = FINALIZING THE DATASET =
 . * ==========================
 . 
 . 
 . * Patterns of missing values
 . * --------------------------
 . 
 . * Finally, let's see how many observations have all variables measured for our
 . * selection of variables. The -misstable- command produces a pattern that shows
 . * the number of observations with no missing values across all listed variables. 
 . misstable pat bmi age sex health raceb earnings uninsured, freq

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Frequency |  1
  ------------+-------------
       24,284 |  1
              |
            7 |  0
  ------------+-------------
       24,291 |

  Variables are  (1) health

 . 
 . * There are only 7 missing values in the selection of variables above. Let's see
 . * what happens if we also want to analyze the 'strength' and 'vigor' variables,
 . * which measure physical activity. We remove the -freq- option to read the share
 . * of the data with no missing values as a percentage. The loss is still trivial.
 . misstable pat bmi age sex health raceb earnings uninsured strength vigor

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1  2  3
  ------------+-------------
       99%    |  1  1  1
              |
       <1     |  1  1  0
       <1     |  1  0  1
       <1     |  1  0  0
       <1     |  0  1  1
       <1     |  0  0  0
  ------------+-------------
      100%    |

  Variables are  (1) health  (2) strength  (3) vigor

 . 
 . 
 . * Subsetting
 . * ----------
 . 
 . * We can now finalize the dataset by deleting observations with missing data in
 . * our selection of variables. The final count is the actual sample size that we
 . * will analyze at later stages of the course.
 . drop if mi(bmi, age, sex, health, raceb, earnings, uninsured, strength, vigor)
 (228 observations deleted)

 . 
 . * Final count.
 . count
 24063

 . 
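 . * A minimal sketch, not run in this log: before dropping anything, the same
 . * -mi()- expression can be passed to -count- to check how many observations
 . * would be lost. Run before the -drop- command above, it should return the
 . * number of deletions that -drop- reports:
 . *
 . *     count if mi(bmi, age, sex, health, raceb, earnings, uninsured, ///
 . *         strength, vigor)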
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * The command above closes the log that we opened when we started this do-file.
 . * Logs are essential to keep records of your analysis. They complement do-files,
 . * which are records of your commands and comments only. Now that you have closed
 . * the log, have a quick look at it with the command below.
 . view code/week2.log

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require fre scheme-burd

 . 
 . * Log results.
 . cap log using code/week3.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 3 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Support for Sharia Law in Nine Countries
 > 
 >  - DATA:   World Values Survey Wave 4 (2000)
 > 
 >  - Welcome again to Stata. This do-file contains the commands used in our third
 >    session. For coursework, practice with Stata code by running the code again,
 >    and read the full comments on the way.
 >    
 >  - This do-file explores the World Values Survey (WVS) dataset and focuses on
 >    support for sharia law among respondents in Arab-speaking countries, several
 >    of which have been in political turmoil over the past few years.
 > 
 >  - The dependent variable (DV) is a 5-point agreement scale with the statement:
 >    "[The government] should implement only the laws of the sharia". The variable
 >    was measured during WVS Wave 4 (1999-2004).
 > 
 >  - Make sure that you understand how to distinguish continuous and categorical
 >    types of variables by the end of this training session. Also make sure that
 >    you know how to encode variables and missing values for analysis in Stata.
 > 
 >  - Select a dataset for analysis. Use the -lookfor- and -lookfor_all- commands
 >    to identify which dataset has variables that match your interests, and use
 >    the -d-, -fre- and -su- commands to describe and inspect the variables.
 > 
 >  - Start writing a draft do-file in which you prepare your dataset for analysis.
 >    Use the course do-files for inspiration: start with a short header, then load
 >    the data and describe the variables, recoding them if needed.
 > 
 >  - When selecting variables, make sure that the dependent variable is continuous
 >    or pseudo-continuous. The dependent variable (DV) is the one that you want to
 >    explain using your selection of independent variables (IVs).
 >    
 >  - Write a draft paragraph that describes the dependent variable in sufficient
 >    detail, and another draft paragraph that lists your independent variables and
 >    offers a general theory on the articulation between your variables.
 > 
 >    Last updated 2013-02-18.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load WVS dataset.
 . use data/wvs2000, clear
 (World Values Survey 2000)

 . 
 . * Survey weights (see WVS documentation).
 . svyset [pw = v245]

      pweight: v245
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

 . 
 . * Inspect the list of included countries.
 . fre v2

 v2 -- country/region
 ---------------------------------------------------------------------
                        |      Freq.    Percent      Valid       Cum.
 ------------------------+--------------------------------------------
 Valid   8  Spain        |       1209       1.98       1.98       1.98
        11 Usa          |       1200       1.97       1.97       3.95
        12 Canada       |       1931       3.16       3.16       7.11
        13 Japan        |       1362       2.23       2.23       9.34
        14 Mexico       |       1535       2.51       2.51      11.85
        15 S Africa     |       3000       4.91       4.91      16.76
        19 Sweden       |       1015       1.66       1.66      18.43
        22 Argentina    |       1280       2.10       2.10      20.52
        24 S Korea      |       1200       1.97       1.97      22.49
        27 Puerto Rico  |        720       1.18       1.18      23.67
        29 Nigeria      |       2022       3.31       3.31      26.98
        30 Chile        |       1200       1.97       1.97      28.94
        32 India        |       2002       3.28       3.28      32.22
        38 Pakistan     |       2000       3.28       3.28      35.50
        39 China        |       1000       1.64       1.64      37.14
        44 Turkey       |       3401       5.57       5.57      42.71
        51 Peru         |       1501       2.46       2.46      45.16
        53 Venezuela    |       1200       1.97       1.97      47.13
        57 Zimbabwe     |       1002       1.64       1.64      48.77
        58 Philippines  |       1200       1.97       1.97      50.74
        59 Israel       |       1199       1.96       1.96      52.70
        60 Tanzania     |       1171       1.92       1.92      54.62
        61 Moldova      |       1008       1.65       1.65      56.27
        67 Saudi Arabia |       1502       2.46       2.46      58.73
        69 Bangladesh   |       1500       2.46       2.46      61.18
        70 Indonesia    |       1004       1.64       1.64      62.83
        71 Vietnam      |       1000       1.64       1.64      64.47
        72 Albania      |       1000       1.64       1.64      66.10
        74 Uganda       |       1002       1.64       1.64      67.74
        77 Singapore    |       1512       2.48       2.48      70.22
        81 Serbia       |       1200       1.97       1.97      72.19
        82 Montenegro   |       1060       1.74       1.74      73.92
        83 Macedonia    |       1055       1.73       1.73      75.65
        89 Egypt        |       3000       4.91       4.91      80.56
        90 Morocco      |       2264       3.71       3.71      84.27
        91 Iran         |       2532       4.15       4.15      88.42
        92 Jordan       |       1223       2.00       2.00      90.42
        93 Bosnia       |       1200       1.97       1.97      92.38
        96 Algeria      |       1282       2.10       2.10      94.48
        97 Iraq         |       2325       3.81       3.81      98.29
        99 Kyrgyzstan   |       1043       1.71       1.71     100.00
        Total           |      61062     100.00     100.00           
 ---------------------------------------------------------------------

 . 
 . * Rename the variable to something understandable.
 . ren v2 country

 . 
 . * Survey years.
 . table country, c(min s020 max s020)

 -------------------------------------
 country/regi |
 on           |  min(s020)   max(s020)
 -------------+-----------------------
       Spain |       2000        2000
         Usa |       1999        1999
      Canada |       2000        2000
       Japan |       2000        2000
      Mexico |       2000        2000
    S Africa |       2001        2001
      Sweden |       1999        1999
   Argentina |       1999        1999
     S Korea |       2001        2001
 Puerto Rico |       2001        2001
     Nigeria |       2000        2000
       Chile |       2000        2000
       India |       2001        2001
    Pakistan |       2001        2001
       China |       2001        2001
      Turkey |       2001        2001
        Peru |       2001        2001
   Venezuela |       2000        2000
    Zimbabwe |       2001        2001
 Philippines |       2001        2001
      Israel |       2001        2001
    Tanzania |       2001        2001
     Moldova |       2002        2002
 Saudi Arabia |       2003        2003
  Bangladesh |       2002        2002
   Indonesia |       2001        2001
     Vietnam |       2001        2001
     Albania |       2002        2002
      Uganda |       2001        2001
   Singapore |       2002        2002
      Serbia |       2001        2001
  Montenegro |       2001        2001
   Macedonia |       2001        2001
       Egypt |       2000        2000
     Morocco |       2001        2001
        Iran |       2000        2000
      Jordan |       2001        2001
      Bosnia |       2001        2001
     Algeria |       2002        2002
        Iraq |       2004        2004
  Kyrgyzstan |       2003        2003
 -------------------------------------

 . 
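 . * A minimal sketch, not run in this log: the -c()- option of -table- belongs to
 . * the syntax used at the time of this log, which Stata 17 and later changed.
 . * Assuming the same variables, -tabstat- is one alternative that reports the
 . * first and last survey year per country:
 . *
 . *     tabstat s020, by(country) s(min max)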
 . 
 . * Dependent variable: Support for sharia law
 . * ------------------------------------------
 . 
 . * Inspect the overall dependent variable.
 . fre iv166

 iv166 -- laws of the shari'a
 ----------------------------------------------------------------------------------
                                     |      Freq.    Percent      Valid       Cum.
 -------------------------------------+--------------------------------------------
 Valid   -4 not asked                 |      45204      74.03      74.03      74.03
        1  agree strongly            |       5499       9.01       9.01      83.04
        2  agree                     |       3572       5.85       5.85      88.89
        3  neither agree or disagree |       2364       3.87       3.87      92.76
        4  disagree                  |       1335       2.19       2.19      94.94
        5  strongly disagree         |        771       1.26       1.26      96.21
        8  na                        |       1476       2.42       2.42      98.62
        9  dk                        |        841       1.38       1.38     100.00
        Total                        |      61062     100.00     100.00           
 ----------------------------------------------------------------------------------

 . 
 . * Clone the nonmissing values of the dependent variable (exclude 'DK/NA' codes).
 . clonevar sharia = iv166 if iv166 > 0 & iv166 < 8
 (47521 missing values generated)

 . 
 . * We use -clonevar- to create a variable with the same coding and labels as the
 . * original one, but exclude missing values from the clone with the -if-
 . * qualifier. The first argument is the name of the new variable that we created.
 . 
 . * This approach to data preparation allows you to rename and recode while
 . * preserving the original variable. The new variable will appear at the end of
 . * the dataset, as the -d- command (for -describe-) would show.
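 . 
 . * A minimal sketch, not run in this log, of a two-step alternative: clone the
 . * full variable first, then set the unwanted codes to missing. The name
 . * 'sharia_alt' is hypothetical, chosen only for this illustration:
 . *
 . *     clonevar sharia_alt = iv166
 . *     replace sharia_alt = . if iv166 < 1 | iv166 > 5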
 . 
 . * Inspect the clean version of the variable.
 . fre sharia

 sharia -- laws of the shari'a
 ---------------------------------------------------------------------------------
                                    |      Freq.    Percent      Valid       Cum.
 ------------------------------------+--------------------------------------------
 Valid   1 agree strongly            |       5499       9.01      40.61      40.61
        2 agree                     |       3572       5.85      26.38      66.99
        3 neither agree or disagree |       2364       3.87      17.46      84.45
        4 disagree                  |       1335       2.19       9.86      94.31
        5 strongly disagree         |        771       1.26       5.69     100.00
        Total                       |      13541      22.18     100.00           
 Missing .                           |      47521      77.82                      
 Total                               |      61062     100.00                      
 ---------------------------------------------------------------------------------

 . 
 . * Find in which countries the variable was measured.
 . fre country if !mi(sharia)

 country -- country/region
 ---------------------------------------------------------------------
                        |      Freq.    Percent      Valid       Cum.
 ------------------------+--------------------------------------------
 Valid   29 Nigeria      |        626       4.62       4.62       4.62
        38 Pakistan     |       1949      14.39      14.39      19.02
        67 Saudi Arabia |       1413      10.43      10.43      29.45
        69 Bangladesh   |       1217       8.99       8.99      38.44
        70 Indonesia    |        929       6.86       6.86      45.30
        89 Egypt        |       2970      21.93      21.93      67.23
        92 Jordan       |       1176       8.68       8.68      75.92
        96 Algeria      |       1177       8.69       8.69      84.61
        97 Iraq         |       2084      15.39      15.39     100.00
        Total           |      13541     100.00     100.00           
 ---------------------------------------------------------------------

 . 
 . * Remove other countries.
 . drop if mi(sharia)
 (47521 observations deleted)

 . 
 . * In the first command, -!mi()- means 'not missing' and therefore produces
 . * the list of countries for which the DV is available. In the second command,
 . * -drop- removes all observations for which the DV is missing.
 . 
 . 
 . * Recoding to dummies
 . * -------------------
 . 
 . * Recall the DV frequencies.
 . fre sharia

 sharia -- laws of the shari'a
 ---------------------------------------------------------------------------------
                                    |      Freq.    Percent      Valid       Cum.
 ------------------------------------+--------------------------------------------
 Valid   1 agree strongly            |       5499      40.61      40.61      40.61
        2 agree                     |       3572      26.38      26.38      66.99
        3 neither agree or disagree |       2364      17.46      17.46      84.45
        4 disagree                  |       1335       9.86       9.86      94.31
        5 strongly disagree         |        771       5.69       5.69     100.00
        Total                       |      13541     100.00     100.00           
 ---------------------------------------------------------------------------------

 . 
 . * Recode the variable to a simpler form: pro-sharia respondents vs others.
 . * The recoded variable is binary: it takes only two values, either 0 or 1.
 . * These variables are affectionately called 'dummies'.
 . recode sharia ///
 >         (1/2 = 1 "Support") ///
 >         (4/5 = 0 "Oppose") ///
 >         (else = .), gen(prosharia)
 (8042 differences between sharia and prosharia)

 . la var prosharia "Legislative enforcement of sharia (0/1)"

 . fre prosharia

 prosharia -- Legislative enforcement of sharia (0/1)
 ---------------------------------------------------------------
                  |      Freq.    Percent      Valid       Cum.
 ------------------+--------------------------------------------
 Valid   0 Oppose  |       2106      15.55      18.84      18.84
        1 Support |       9071      66.99      81.16     100.00
        Total     |      11177      82.54     100.00           
 Missing .         |       2364      17.46                      
 Total             |      13541     100.00                      
 ---------------------------------------------------------------

 . 
 . * Another way to understand a binary variable is to look at its mean: because
 . * the values of that variable are equal to either 0 or 1, its mean reads as the
 . * proportion of positive cases (1) within the total number of observations.
 . su prosharia

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
   prosharia |     11177    .8115773    .3910668          0          1

 . 
 . * Same thing, different command (more flexible; used later).
 . tabstat prosharia, s(n mean) c(s)

    variable |         N      mean
 -------------+--------------------
   prosharia |     11177  .8115773
 ----------------------------------

 . 
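 . * A minimal sketch, not run in this log: the mean can be checked by hand from
 . * the frequency table of 'prosharia', by dividing the number of supporters by
 . * the number of nonmissing observations. The result is roughly 0.81, as above:
 . *
 . *     di 9071 / 11177
 . 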
 . * Finally, you can generate dummies for each value of a variable, which here
 . * means generating five dummies starting with the 'sharia_' prefix:
 . tab sharia, gen(sharia_)

      laws of the shari'a |      Freq.     Percent        Cum.
 --------------------------+-----------------------------------
           agree strongly |      5,499       40.61       40.61
                    agree |      3,572       26.38       66.99
 neither agree or disagree |      2,364       17.46       84.45
                 disagree |      1,335        9.86       94.31
        strongly disagree |        771        5.69      100.00
 --------------------------+-----------------------------------
                    Total |     13,541      100.00

 . 
 . * Show all variables named 'sharia_[whatever]'.
 . codebook sharia_*, c

 Variable     Obs Unique      Mean  Min  Max  Label
 ------------------------------------------------------------------------------------
 sharia_1   13541      2     .4061    0    1  sharia==agree strongly
 sharia_2   13541      2  .2637914    0    1  sharia==agree
 sharia_3   13541      2  .1745809    0    1  sharia==neither agree or disagree
 sharia_4   13541      2  .0985895    0    1  sharia==disagree
 sharia_5   13541      2  .0569382    0    1  sharia==strongly disagree
 ------------------------------------------------------------------------------------

 . 
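 . * A minimal sketch, not run in this log: a single dummy can also be created by
 . * hand from a logical expression, which evaluates to 1 if true and 0 if false.
 . * The name 'sharia_agree' is hypothetical; the -if- qualifier keeps missing
 . * values of 'sharia' missing in the dummy:
 . *
 . *     gen byte sharia_agree = (sharia == 2) if !mi(sharia)
 . *     la var sharia_agree "sharia==agree"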
 . 
 . * Stacked plots with dummies
 . * --------------------------
 . 
 . * One reason to recode is to have a look at simplified versions of the DV in
 . * graphs. Here's a dot plot showing the mean value of the binary DV (that is,
 . * the proportion of supporters) in each country, sorted in descending order:
 . gr dot prosharia, over(country, sort(1)des) ///
 >    name(dv_dot, replace)

 . 
 . * Recode the DV to three groups.
 . recode sharia ///
 >         (1/2 = 1 "Agree") ///
 >         (3 = 2 "Neither") ///
 >         (4/5 = 3 "Disagree") ///
 >         (else = .), gen(sharia3)
 (8042 differences between sharia and sharia3)

 . la var sharia3 "Legislative enforcement of sharia (3 groups)"

 . 
 . * Recode each category to a dummy.
 . tab sharia3, gen(sharia3_)

 Legislative |
 enforcement |
  of sharia |
 (3 groups) |      Freq.     Percent        Cum.
 ------------+-----------------------------------
      Agree |      9,071       66.99       66.99
    Neither |      2,364       17.46       84.45
   Disagree |      2,106       15.55      100.00
 ------------+-----------------------------------
      Total |     13,541      100.00

 . d sharia3_*

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 sharia3_1       byte   %8.0g                  sharia3==Agree
 sharia3_2       byte   %8.0g                  sharia3==Neither
 sharia3_3       byte   %8.0g                  sharia3==Disagree

 . 
 . * Comparative plot at the country level, shown with tons of graphical options
 . * to illustrate a limitation of Stata: it requires some work to produce decent
 . * visualizations, especially with categorical variables.
 . gr bar sharia3_*, over(country, sort(1)des lab(angle(45))) stack percent ///
 >         ti("Support for sharia legislation") yti("% respondents") ///
 >         legend(row(1) order(1 "For" 2 "Neutral" 3 "Against")) ///
 >         note("World Values Survey 1999-2004. {it:N} = 13,541") ///
 >         scheme(burd3) name(dv_bar, replace)

 . 
 . * Identical plot, shown with horizontal bars and fewer options. Some settings
 . * that show up on my end are provided by the burd3 scheme, which is part of
 . * the course material; it will look different with other graph schemes.
 . gr hbar sharia3_*, over(country, sort(1)des) stack percent ///
 >         ti("Support for sharia legislation") yti("% respondents") ///
 >         legend(pos(1) row(1) order(1 "For" 2 "Neutral" 3 "Against")) ///
 >         note("World Values Survey 1999-2004. {it:N} = 13,541") ///
 >         scheme(burd3) name(dv_hbar, replace)

 . 
 . 
 . * =========================
 . * = INDEPENDENT VARIABLES =
 . * =========================
 . 
 . 
 . * Describe independent variables.
 . d v223 v225 v226 v241

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 v223            byte   %8.0g       v223       sex
 v225            byte   %8.0g       v225       age
 v226            byte   %8.0g       v226       highest educational level attained
 v241            byte   %8.0g       v241       size of town

 . 
 . * Overview of variable codes.
 . fre v223 v225 v226 v241

 v223 -- sex
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   1 male   |       6961      51.41      51.41      51.41
        2 female |       6580      48.59      48.59     100.00
        Total    |      13541     100.00     100.00           
 --------------------------------------------------------------

 v225 -- age
 -----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
 --------------+--------------------------------------------
 Valid   15    |         20       0.15       0.15       0.15
        16    |         67       0.49       0.49       0.64
        17    |        100       0.74       0.74       1.38
        18 18 |        286       2.11       2.11       3.49
        19    |        320       2.36       2.36       5.86
        20    |        375       2.77       2.77       8.63
        21    |        331       2.44       2.44      11.07
        22    |        491       3.63       3.63      14.70
        23    |        468       3.46       3.46      18.15
        24    |        503       3.71       3.71      21.87
        25    |        458       3.38       3.38      25.25
        26    |        468       3.46       3.46      28.71
        27    |        383       2.83       2.83      31.53
        28    |        413       3.05       3.05      34.58
        29    |        321       2.37       2.37      36.95
        30    |        492       3.63       3.63      40.59
        31    |        349       2.58       2.58      43.17
        32    |        450       3.32       3.32      46.49
        33    |        342       2.53       2.53      49.01
        34    |        346       2.56       2.56      51.57
        :     |          :          :          :          :
        72    |         32       0.24       0.24      98.71
        73    |         26       0.19       0.19      98.90
        74    |         18       0.13       0.13      99.03
        75    |         29       0.21       0.21      99.25
        76    |         20       0.15       0.15      99.39
        77    |         15       0.11       0.11      99.51
        78    |         11       0.08       0.08      99.59
        79    |          7       0.05       0.05      99.64
        80    |         11       0.08       0.08      99.72
        81    |          6       0.04       0.04      99.76
        82    |         10       0.07       0.07      99.84
        83    |          4       0.03       0.03      99.87
        85    |          2       0.01       0.01      99.88
        86    |          3       0.02       0.02      99.90
        87    |          3       0.02       0.02      99.93
        88    |          1       0.01       0.01      99.93
        90 90 |          1       0.01       0.01      99.94
        92    |          1       0.01       0.01      99.95
        93    |          2       0.01       0.01      99.96
        99 dk |          5       0.04       0.04     100.00
        Total |      13541     100.00     100.00           
 -----------------------------------------------------------

 v226 -- highest educational level attained
 -----------------------------------------------------------------------------------
                                      |      Freq.    Percent      Valid       Cum.
 --------------------------------------+--------------------------------------------
 Valid   1  no formal education        |       2169      16.02      16.02      16.02
        2  incomplete primary school  |       1218       8.99       8.99      25.01
        3  complete primary school    |       1791      13.23      13.23      38.24
        4  incomplete secondary       |        764       5.64       5.64      43.88
           school:                    |                                            
           technical/vocational type  |                                            
        5  complete secondary school: |       1886      13.93      13.93      57.81
           technical/vocational type  |                                            
        6  incomplete secondary:      |        805       5.94       5.94      63.75
           university-preparatory     |                                            
           type                       |                                            
        7  complete secondary:        |       1974      14.58      14.58      78.33
           university-preparatory     |                                            
           type                       |                                            
        8  some university without    |       1004       7.41       7.41      85.75
           degree                     |                                            
        9  university with degree     |       1881      13.89      13.89      99.64
        98 na                         |         14       0.10       0.10      99.74
        99 dk                         |         35       0.26       0.26     100.00
        Total                         |      13541     100.00     100.00           
 -----------------------------------------------------------------------------------

 v241 -- size of town
 -------------------------------------------------------------------------
                            |      Freq.    Percent      Valid       Cum.
 ----------------------------+--------------------------------------------
 Valid   -4 not asked        |       2084      15.39      15.39      15.39
        1  2,000 and less   |        597       4.41       4.41      19.80
        2  2,000-5,000      |       1550      11.45      11.45      31.25
        3  5,000-10,000     |       1777      13.12      13.12      44.37
        4  10,000-20,000    |       1006       7.43       7.43      51.80
        5  20,000-50,000    |       1374      10.15      10.15      61.95
        6  50,000-100,000   |        753       5.56       5.56      67.51
        7  100,000-500,000  |        958       7.07       7.07      74.58
        8  500,000 and more |       3416      25.23      25.23      99.81
        9  dk               |         26       0.19       0.19     100.00
        Total               |      13541     100.00     100.00           
 -------------------------------------------------------------------------

 . 
 . 
 . * IV: Gender
 . * ----------
 . 
 . fre v223

 v223 -- sex
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   1 male   |       6961      51.41      51.41      51.41
        2 female |       6580      48.59      48.59     100.00
        Total    |      13541     100.00     100.00           
 --------------------------------------------------------------

 . 
 . * Recode gender as a meaningful binary (either female or not) using a logical
 . * expression (in parentheses), excluding missing observations from the operation
 . * and applying the 'female' label to the new 'female' dummy variable:
 . gen female:female = (v223 == 2) if !mi(v223)

 . 
 . * Label the values.
 . la def female 0 "Male" 1 "Female", replace

 . 
 . * Label the variable.
 . la var female "Gender"

 . 
 . * Final result.
 . fre female

 female -- Gender
 --------------------------------------------------------------
                  |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   0 Male   |       6961      51.41      51.41      51.41
         1 Female |       6580      48.59      48.59     100.00
         Total    |      13541     100.00     100.00           
 --------------------------------------------------------------

 . 
 . * Compute the average support for sharia law within each gender group. Since the
 . * recoded DV only takes values of 0 or 1, its mean indicates the proportion of
 . * sharia supporters in each gender group.
 . bys female: su prosharia

 ------------------------------------------------------------------------------------
 -> female = Male

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
   prosharia |      5762    .8075321    .3942727          0          1

 ------------------------------------------------------------------------------------
 -> female = Female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
   prosharia |      5415    .8158818    .3876163          0          1


 . 
 . * The same result can be viewed as a frequency by crosstabulating the variables.
 . tab prosharia female, col nof

 Legislativ |
          e |
 enforcemen |
       t of |
    sharia |        Gender
     (0/1) |      Male     Female |     Total
 -----------+----------------------+----------
    Oppose |     19.25      18.41 |     18.84 
   Support |     80.75      81.59 |     81.16 
 -----------+----------------------+----------
     Total |    100.00     100.00 |    100.00 


 . 
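 . * Optional check, not part of the original do-file (a sketch): -assert- stays
 . * silent when its condition holds and stops with an error otherwise. These two
 . * lines verify that the dummy only takes the values 0 and 1, and that it is
 . * missing exactly where v223 is missing.
 . assert inlist(female, 0, 1) if !mi(v223)

 . assert mi(female) == mi(v223)
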
 . 
 . * IV: Age
 . * -------
 . 
 . fre v225

 v225 -- age
 -----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
 --------------+--------------------------------------------
 Valid   15    |         20       0.15       0.15       0.15
        16    |         67       0.49       0.49       0.64
        17    |        100       0.74       0.74       1.38
        18 18 |        286       2.11       2.11       3.49
        19    |        320       2.36       2.36       5.86
        20    |        375       2.77       2.77       8.63
        21    |        331       2.44       2.44      11.07
        22    |        491       3.63       3.63      14.70
        23    |        468       3.46       3.46      18.15
        24    |        503       3.71       3.71      21.87
        25    |        458       3.38       3.38      25.25
        26    |        468       3.46       3.46      28.71
        27    |        383       2.83       2.83      31.53
        28    |        413       3.05       3.05      34.58
        29    |        321       2.37       2.37      36.95
        30    |        492       3.63       3.63      40.59
        31    |        349       2.58       2.58      43.17
        32    |        450       3.32       3.32      46.49
        33    |        342       2.53       2.53      49.01
        34    |        346       2.56       2.56      51.57
        :     |          :          :          :          :
        72    |         32       0.24       0.24      98.71
        73    |         26       0.19       0.19      98.90
        74    |         18       0.13       0.13      99.03
        75    |         29       0.21       0.21      99.25
        76    |         20       0.15       0.15      99.39
        77    |         15       0.11       0.11      99.51
        78    |         11       0.08       0.08      99.59
        79    |          7       0.05       0.05      99.64
        80    |         11       0.08       0.08      99.72
        81    |          6       0.04       0.04      99.76
        82    |         10       0.07       0.07      99.84
        83    |          4       0.03       0.03      99.87
        85    |          2       0.01       0.01      99.88
        86    |          3       0.02       0.02      99.90
        87    |          3       0.02       0.02      99.93
        88    |          1       0.01       0.01      99.93
        90 90 |          1       0.01       0.01      99.94
        92    |          1       0.01       0.01      99.95
        93    |          2       0.01       0.01      99.96
        99 dk |          5       0.04       0.04     100.00
        Total |      13541     100.00     100.00           
 -----------------------------------------------------------

 . 
 . * Strangely enough, '99' is a missing value here, so we replace '99' values with
 . * a missing value code. The -replace- command is the quickest way to do that.
 . replace v225 = . if v225 == 99
 (5 real changes made, 5 to missing)

 . 
 . * We can now clone the variable.
 . clonevar age = v225
 (5 missing values generated)

 . 
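 . * Optional check, not part of the original do-file (a sketch): make sure that
 . * no '99' code survives in the cloned variable and that valid ages fall within
 . * a plausible range (the 15-93 bounds are read off the -fre- table above).
 . assert age != 99

 . assert inrange(age, 15, 93) if !mi(age)

 . 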
 . * Use -summarize- (or simply -su-) to get the summary statistics, as appropriate
 . * for continuous variables where the mean and standard deviation are meaningful.
 . * Do -not- use either -fre- or -tab- to summarize a continuous variable!
 . su age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         age |     13536    36.32159    13.53227         15         93

 . 
 . * Histograms showing the distribution of age in each country.
 . hist age, by(country, note("")) bin(9) percent ///
 >         xti("Age distribution") ///
 >         name(age,replace)

 . 
 . * Recode to quartiles -- shown for demonstration purposes: recoding to groups
 . * makes much more sense here, but recoding to n-quantiles like percentiles or
 . * quartiles is useful in many exploratory situations.
 . xtile age_q4 = age, nq(4)

 . 
 . * Check that the quartiles each capture roughly a quarter of the distribution.
 . fre age_q4

 age_q4 -- 4 quantiles of age
 -----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
 --------------+--------------------------------------------
 Valid   1     |       3419      25.25      25.26      25.26
        2     |       3564      26.32      26.33      51.59
        3     |       3197      23.61      23.62      75.21
        4     |       3356      24.78      24.79     100.00
        Total |      13536      99.96     100.00           
 Missing .     |          5       0.04                      
 Total         |      13541     100.00                      
 -----------------------------------------------------------

 . 
 . * Inspect how age varies within each quartile (e.g. compare top and bottom 25%).
 . tab age_q4, sum(age)

 4 quantiles |           Summary of age
     of age |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
          1 |   21.596666   2.4919297        3419
          2 |   29.857183   2.5735494        3564
          3 |   39.049734   2.8584182        3197
          4 |   55.589094   8.5925202        3356
 ------------+------------------------------------
      Total |   36.321587   13.532266       13536

 . 
 . * As expected, there is more variance in the last, older group. Let's finally get
 . * the range, or lower (min) and upper (max) bounds, of each age quartile.
 . table age_q4, c(min age max age)

 ----------------------------------
 4         |
 quantiles |
 of age    |   min(age)    max(age)
 ----------+-----------------------
        1 |         15          25
        2 |         26          34
        3 |         35          44
        4 |         45          93
 ----------------------------------

 . 
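 . * Optional check, not part of the original do-file (a sketch): the quartiles
 . * can be rebuilt by hand from the cutpoints shown above (25, 34 and 44), since
 . * a logical expression evaluates to 1 when true and 0 when false. The -if-
 . * qualifier matters because missing values count as larger than any number.
 . assert age_q4 == 1 + (age > 25) + (age > 34) + (age > 44) if !mi(age)

 . 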
 . * Recode to four age groups. The irecode() function creates categories based on
 . * continuous intervals: category 0 of age4 will contain observations of age up
 . * to 33, category 1 will contain those from 34 to 49, and so on.
 . gen age4:age4 = irecode(age, 33, 49, 64, .)
 (5 missing values generated)

 . 
 . * Check the results. This is a different -table- command than the -tab- one used
 . * previously, which we will get to use for more flexible crosstabulations.
 . table age4, c(min age max age)

 ----------------------------------
     age4 |   min(age)    max(age)
 ----------+-----------------------
        0 |         15          33
        1 |         34          49
        2 |         50          64
        3 |         65          93
 ----------------------------------

 . 
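 . * Optional check, not part of the original do-file (a sketch): spell out the
 . * intervals that irecode() produced, using the bounds shown in the table above.
 . * Upper bounds are inclusive, and a missing age yields a missing age group.
 . assert age4 == 0 if age <= 33

 . assert age4 == 1 if inrange(age, 34, 49)

 . assert age4 == 2 if inrange(age, 50, 64)

 . assert age4 == 3 if age > 64 & !mi(age)

 . assert mi(age4) if mi(age)

 . 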
 . * And here's yet another way to crosstabulate: the -tab- command with the -sum- 
 . * option returns the average age in each age group, along with the SD and count.
 . * More on the SD (standard deviation) next week.
 . tab age4, sum(age)

            |           Summary of age
       age4 |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
          0 |   25.385867   4.5848243        6637
          1 |   40.285682   4.3926977        4484
          2 |   55.555377    4.016129        1869
          3 |   70.858974   5.3419134         546
 ------------+------------------------------------
      Total |   36.321587   13.532266       13536

 . 
 . * Write the value and variable labels.
 . la def age4 0 "15-33" 1 "34-49" 2 "50-64" 3 "65+", replace

 . la var age4 "Age groups"

 . fre age4

 age4 -- Age groups
 -------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 ----------------+--------------------------------------------
 Valid   0 15-33 |       6637      49.01      49.03      49.03
        1 34-49 |       4484      33.11      33.13      82.16
        2 50-64 |       1869      13.80      13.81      95.97
        3 65+   |        546       4.03       4.03     100.00
        Total   |      13536      99.96     100.00           
 Missing .       |          5       0.04                      
 Total           |      13541     100.00                      
 -------------------------------------------------------------

 . 
 . * Average support for sharia law by gender and age group in each country.
 . gr dot prosharia, over(female) asyvars over(age4) by(country) ///
 >         name(dv_sex_age2, replace)

 .         
 . 
 . * IV: Education
 . * -------------
 . 
 . fre v226

 v226 -- highest educational level attained
 -----------------------------------------------------------------------------------
                                      |      Freq.    Percent      Valid       Cum.
 --------------------------------------+--------------------------------------------
 Valid   1  no formal education        |       2169      16.02      16.02      16.02
        2  incomplete primary school  |       1218       8.99       8.99      25.01
        3  complete primary school    |       1791      13.23      13.23      38.24
        4  incomplete secondary       |        764       5.64       5.64      43.88
           school:                    |                                            
           technical/vocational type  |                                            
        5  complete secondary school: |       1886      13.93      13.93      57.81
           technical/vocational type  |                                            
        6  incomplete secondary:      |        805       5.94       5.94      63.75
           university-preparatory     |                                            
           type                       |                                            
        7  complete secondary:        |       1974      14.58      14.58      78.33
           university-preparatory     |                                            
           type                       |                                            
        8  some university without    |       1004       7.41       7.41      85.75
           degree                     |                                            
        9  university with degree     |       1881      13.89      13.89      99.64
        98 na                         |         14       0.10       0.10      99.74
        99 dk                         |         35       0.26       0.26     100.00
        Total                         |      13541     100.00     100.00           
 -----------------------------------------------------------------------------------

 . 
 . * Recode to simpler educational attainment levels.
 . recode v226 ///
 >         (1/2 = 0 "None") ///
 >         (3/4 = 1 "Primary") ///
 >         (5/8 = 2 "Secondary") ///
 >         (9 = 3 "University") ///
 >         (else = .), gen(edu4)
 (13541 differences between v226 and edu4)

 . la var edu4 "Education"

 . fre edu4

 edu4 -- Education
 ------------------------------------------------------------------
                     |      Freq.    Percent      Valid       Cum.
 ---------------------+--------------------------------------------
 Valid   0 None       |       3387      25.01      25.10      25.10
        1 Primary    |       2555      18.87      18.94      44.04
        2 Secondary  |       5669      41.87      42.02      86.06
        3 University |       1881      13.89      13.94     100.00
        Total        |      13492      99.64     100.00           
 Missing .            |         49       0.36                      
 Total                |      13541     100.00                      
 ------------------------------------------------------------------

 . 
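 . * Optional check, not part of the original do-file (a sketch): the (else = .)
 . * rule should send the 'na' and 'dk' codes (98, 99) to missing and leave every
 . * substantive category (1-9) nonmissing.
 . assert mi(edu4) if inlist(v226, 98, 99)

 . assert !mi(edu4) if inrange(v226, 1, 9)

 . 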
 . * Histograms showing the distribution of education in each country. Because the
 . * variable is categorical, the histograms require the -discrete- option to plot
 . * one frequency bar per category.
 . hist edu4, by(country, note("")) percent discrete xla(0(1)3) ///
 >         name(edu,replace)

 . 
 . 
 . * IV: Employment status
 . * ---------------------
 . 
 . fre v229

 v229 -- are you employed now
 ---------------------------------------------------------------------
                        |      Freq.    Percent      Valid       Cum.
 ------------------------+--------------------------------------------
 Valid   1 full time     |       3628      26.79      26.79      26.79
        2 part time     |        864       6.38       6.38      33.17
        3 self employed |       1661      12.27      12.27      45.44
        4 retired       |        540       3.99       3.99      49.43
        5 housewife     |       4127      30.48      30.48      79.91
        6 students      |       1260       9.31       9.31      89.21
        7 unemployed    |       1137       8.40       8.40      97.61
        8 other         |        232       1.71       1.71      99.32
        9 dk,na         |         92       0.68       0.68     100.00
        Total           |      13541     100.00     100.00           
 ---------------------------------------------------------------------

 . 
 . * Clone the variable, treating 'other' (8) and 'dk,na' (9) as missing.
 . clonevar empl = v229 if v229 < 8
 (324 missing values generated)

 . fre empl

 empl -- are you employed now
 ---------------------------------------------------------------------
                        |      Freq.    Percent      Valid       Cum.
 ------------------------+--------------------------------------------
 Valid   1 full time     |       3628      26.79      27.45      27.45
        2 part time     |        864       6.38       6.54      33.99
        3 self employed |       1661      12.27      12.57      46.55
        4 retired       |        540       3.99       4.09      50.64
        5 housewife     |       4127      30.48      31.22      81.86
        6 students      |       1260       9.31       9.53      91.40
        7 unemployed    |       1137       8.40       8.60     100.00
        Total           |      13217      97.61     100.00           
 Missing .               |        324       2.39                      
 Total                   |      13541     100.00                      
 ---------------------------------------------------------------------

 . 
 . 
 . * IV: Household composition
 . * -------------------------
 . 
 . fre v106 v107

 v106 -- marital status
 ----------------------------------------------------------------------------------
                                     |      Freq.    Percent      Valid       Cum.
 -------------------------------------+--------------------------------------------
 Valid   1 married                    |       8309      61.36      61.36      61.36
        2 living together as married |        695       5.13       5.13      66.49
        3 divorced                   |        148       1.09       1.09      67.59
        4 separated                  |         59       0.44       0.44      68.02
        5 widowed                    |        507       3.74       3.74      71.77
        6 single                     |       3804      28.09      28.09      99.86
        8 na                         |          3       0.02       0.02      99.88
        9 dk                         |         16       0.12       0.12     100.00
        Total                        |      13541     100.00     100.00           
 ----------------------------------------------------------------------------------

 v107 -- have you had any children
 --------------------------------------------------------------------------
                             |      Freq.    Percent      Valid       Cum.
 -----------------------------+--------------------------------------------
 Valid   0 no child           |       4281      31.62      31.62      31.62
        1 1 child            |       1198       8.85       8.85      40.46
        2 2 children         |       1771      13.08      13.08      53.54
        3 3 children         |       1880      13.88      13.88      67.42
        4 4 children         |       1430      10.56      10.56      77.99
        5 5 children         |        975       7.20       7.20      85.19
        6 6 children         |        701       5.18       5.18      90.36
        7 7 children         |        431       3.18       3.18      93.55
        8 8 or more children |        533       3.94       3.94      97.48
        9 na                 |        341       2.52       2.52     100.00
        Total                |      13541     100.00     100.00           
 --------------------------------------------------------------------------

 . 
 . * Married dummy.
 . gen married = (v106 == 1) if v106 < 8
 (19 missing values generated)

 . tab v106 married

                      |        married
       marital status |         0          1 |     Total
 ----------------------+----------------------+----------
              married |         0      8,309 |     8,309 
 living together as ma |       695          0 |       695 
             divorced |       148          0 |       148 
            separated |        59          0 |        59 
              widowed |       507          0 |       507 
               single |     3,804          0 |     3,804 
 ----------------------+----------------------+----------
                Total |     5,213      8,309 |    13,522 


 . 
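 . * Optional check, not part of the original do-file (a sketch): the -if-
 . * qualifier is what keeps 'na' and 'dk' (8, 9) out of the dummy. Without it,
 . * (v106 == 1) would evaluate to 0 for those respondents and silently count
 . * them as unmarried.
 . assert mi(married) if inlist(v106, 8, 9)

 . assert married == (v106 == 1) if v106 < 8

 . 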
 . * Children dummy.
 . gen haskids = (v107 > 0) if v107 < 9
 (341 missing values generated)

 . tab v107 haskids

  have you had any |        haskids
          children |         0          1 |     Total
 -------------------+----------------------+----------
          no child |     4,281          0 |     4,281 
           1 child |         0      1,198 |     1,198 
        2 children |         0      1,771 |     1,771 
        3 children |         0      1,880 |     1,880 
        4 children |         0      1,430 |     1,430 
        5 children |         0        975 |       975 
        6 children |         0        701 |       701 
        7 children |         0        431 |       431 
 8 or more children |         0        533 |       533 
 -------------------+----------------------+----------
             Total |     4,281      8,919 |    13,200 


 . 
 . 
 . * IV: City size
 . * -------------
 . 
 . * Recode to simpler categories.
 . recode v241 ///
 >         (1/3 = 1 "< 10k") ///
 >         (4/6 = 2 "< 100k") ///
 >         (7 = 3 "< 500k") ///
 >         (8 = 4 "> 500k") ///
 >         (else = .), gen(city4)
 (12944 differences between v241 and city4)

 . la var city4 "City size"

 . fre city4

 city4 -- City size
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   1 < 10k  |       3924      28.98      34.33      34.33
        2 < 100k |       3133      23.14      27.41      61.74
        3 < 500k |        958       7.07       8.38      70.12
        4 > 500k |       3416      25.23      29.88     100.00
        Total    |      11431      84.42     100.00           
 Missing .        |       2110      15.58                      
 Total            |      13541     100.00                      
 --------------------------------------------------------------

 . 
 . 
 . * ===================== 
 . * = FINALIZED DATASET =
 . * =====================
 . 
 . 
 . * Finalizing a dataset before analysis involves two tasks. The first one
 . * consists of subsetting to fully measured data, which means dropping all
 . * observations with missing values in the variables selected for analysis.
 . * This restriction is required by the kind of models that we will run later on.
 . * Prior to that, we will need to subset the data to the countries of interest.
 . 
 . * Recall how the country variable is coded.
 . fre country

 country -- country/region
 ---------------------------------------------------------------------
                        |      Freq.    Percent      Valid       Cum.
 ------------------------+--------------------------------------------
 Valid   29 Nigeria      |        626       4.62       4.62       4.62
        38 Pakistan     |       1949      14.39      14.39      19.02
        67 Saudi Arabia |       1413      10.43      10.43      29.45
        69 Bangladesh   |       1217       8.99       8.99      38.44
        70 Indonesia    |        929       6.86       6.86      45.30
        89 Egypt        |       2970      21.93      21.93      67.23
        92 Jordan       |       1176       8.68       8.68      75.92
        96 Algeria      |       1177       8.69       8.69      84.61
        97 Iraq         |       2084      15.39      15.39     100.00
        Total           |      13541     100.00     100.00           
 ---------------------------------------------------------------------

 . 
 . * Subset to two countries of interest.
 . keep if inlist(country, 89, 96)
 (9394 observations deleted)

 . 
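 . * Optional check, not part of the original do-file (a sketch): the expression
 . * inlist(country, 89, 96) is shorthand for country == 89 | country == 96, so
 . * only Egypt and Algeria should remain after the -keep-.
 . assert inlist(country, 89, 96) == (country == 89 | country == 96)

 . assert inlist(country, 89, 96)

 . 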
 . * Pattern of missing values.
 . misstable pat sharia age female edu4 empl married haskids city4

       Missing-value patterns
         (1 means complete)

              |   Pattern
    Percent   |  1  2  3  4    5  6
  ------------+---------------------
       94%    |  1  1  1  1    1  1
              |
        4     |  1  1  1  1    1  0
       <1     |  1  1  1  1    0  1
       <1     |  1  1  1  0    1  0
       <1     |  1  1  0  1    1  1
       <1     |  1  1  1  1    0  0
       <1     |  0  1  1  1    1  1
       <1     |  1  0  1  1    1  1
       <1     |  0  0  0  1    0  1
       <1     |  1  0  1  1    0  0
       <1     |  1  0  1  1    1  0
       <1     |  1  1  0  1    0  1
  ------------+---------------------
      100%    |

  Variables are  (1) age  (2) edu4  (3) city4  (4) married  (5) empl  (6) haskids

 . 
 . * Studying the pattern of missing values is a crucial requirement: dropping
 . * observations with missing values might affect the representativeness of the
 . * data, or even bring it to such a low number of observations that statistical
 . * power (the capacity of your data to discriminate statistically significant
 . * relationships from insignificant ones) will be at risk. Adopt a reasonable
 . * strategy at that stage: find equivalents to variables that damage your sample,
 . * and adjust your research questions to the available data. Whatever choice you
 . * end up making, ensure that you understand how your finalized dataset relates
 . * to the original data with regard to representativeness.
 . 
 . * Subset to nonmissing observations.
 . drop if mi(sharia, age, female, edu4, empl, married, haskids, city4)
 (260 observations deleted)

 . 
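 . * Optional check, not part of the original do-file (a sketch): after the
 . * -drop-, no observation should have a missing value in any of the variables
 . * selected for analysis.
 . assert !mi(sharia, age, female, edu4, empl, married, haskids, city4)

 . 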
 . * The second and last task is to get the final sample size (in each country).
 . bys country: count

 ------------------------------------------------------------------------------------
 -> country = Egypt
 2918
 ------------------------------------------------------------------------------------
 -> country = Algeria
  969

 . 
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require fre

 . 
 . * Log results.
 . cap log using code/week4.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 4 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Social Determinants of Adult Obesity in the United States
 > 
 >  - DATA:   U.S. National Health Interview Survey (2009)
 > 
 >  - By now, you should know which dataset and variables you plan to use for
 >    your research project. Please register your project online by writing your
 >    names, keywords, data source and class ID to the student projects table.
 > 
 >  - This week focuses on inspecting the normality of your dependent variable. The
 >    DV should be continuous for best results, or at least pseudo-continuous like
 >    a 10-point scale measurement.
 >    
 >  - Avoid selecting variables with four categories or fewer as your DV, unless you
 >    can learn to interpret logistic regression in just a few weeks at the end of 
 >    the course. This requires some math and is for the most adventurous only.
 > 
 >  - Assessing the normality of a variable is first and foremost a visual process.
 >    You will need to visualize your DV a lot at that stage of your work. There is
 >    no systematic way to assess normality, but your decision should take skewness
 >    and kurtosis into account.
 >    
 >    Last updated 2013-02-21.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load NHIS dataset.
 . use data/nhis2009, clear
 (U.S. National Health Interview Survey 2009)

 . 
 . * Subset to most recent year.
 . drop if year != 2009
 (227298 observations deleted)

 . 
 . 
 . * Dependent variable: Body Mass Index
 . * -----------------------------------
 . 
 . * Compute the Body Mass Index.
 . gen bmi = weight * 703 / height^2

 . la var bmi "Body Mass Index"

 . 
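 . * Optional check, not part of the original do-file (a sketch, assuming that
 . * weight is measured in pounds and height in inches, as the 703 factor
 . * suggests): 703 converts lb/in^2 to the metric kg/m^2 scale of the BMI, since
 . * 1 lb = 0.45359237 kg and 1 in = 0.0254 m. For example, 150 lb at 65 in gives
 . * a BMI of about 24.96.
 . assert abs(0.45359237 / 0.0254^2 - 703) < 0.1

 . assert abs(150 * 703 / 65^2 - 24.96) < 0.01

 . 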
 . * Weight the data with NHIS individual weights.
 . svyset psu [pw = perweight], strata(strata)

      pweight: perweight
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

 . 
 . 
 . * Independent variables
 . * ---------------------
 . 
 . * Inspect some of the variables.
 . d sex raceb earnings

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 sex             byte   %8.0g       sex_lbl    Sex
 raceb           float  %9.0g       raceb      Race
 earnings        byte   %23.0g      earnings_lbl
                                              Person's total earnings, previous
                                                calendar year

 . 
 . * Low-dimensional, categorical variables.
 . fre sex

 sex -- Sex
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   1 Male   |      10978      45.19      45.19      45.19
        2 Female |      13313      54.81      54.81     100.00
        Total    |      24291     100.00     100.00           
 --------------------------------------------------------------

 . fre raceb

 raceb -- Race
 ----------------------------------------------------------------
                   |      Freq.    Percent      Valid       Cum.
 -------------------+--------------------------------------------
 Valid   1 White    |      14269      58.74      58.74      58.74
        2 Black    |       3893      16.03      16.03      74.77
        3 Hispanic |       4758      19.59      19.59      94.36
        4 Asian    |       1371       5.64       5.64     100.00
        Total      |      24291     100.00     100.00           
 ----------------------------------------------------------------

 . fre earnings

 earnings -- Person's total earnings, previous calendar year
 ---------------------------------------------------------------------------
                              |      Freq.    Percent      Valid       Cum.
 ------------------------------+--------------------------------------------
 Valid   0  NIU                |       7683      31.63      31.63      31.63
        1  $01 to $4999       |       1081       4.45       4.45      36.08
        2  $5000 to $9999     |        923       3.80       3.80      39.88
        3  $10000 to $14999   |       1252       5.15       5.15      45.03
        4  $15000 to $19999   |       1100       4.53       4.53      49.56
        5  $20000 to $24999   |       1235       5.08       5.08      54.65
        6  $25000 to $34999   |       2132       8.78       8.78      63.42
        7  $35000 to $44999   |       1777       7.32       7.32      70.74
        8  $45000 to $54999   |       1397       5.75       5.75      76.49
        9  $55000 to $64999   |        885       3.64       3.64      80.13
        10 $65000 to $74999   |        603       2.48       2.48      82.61
        11 $75000 and over    |       1741       7.17       7.17      89.78
        97 Unknown-refused    |       1292       5.32       5.32      95.10
        99 Unknown-don't know |       1190       4.90       4.90     100.00
        Total                 |      24291     100.00     100.00           
 ---------------------------------------------------------------------------

 . 
 . * The default -tab- command returns similar results, minus the numeric codes.
 . tab sex

        Sex |      Freq.     Percent        Cum.
 ------------+-----------------------------------
       Male |     10,978       45.19       45.19
     Female |     13,313       54.81      100.00
 ------------+-----------------------------------
      Total |     24,291      100.00

 . 
 . * High-dimensional, continuous variables.
 . fre bmi, rows(30)

 bmi -- Body Mass Index
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   15.20329 |          3       0.01       0.01       0.01
        15.5041  |          1       0.00       0.00       0.02
        15.6605  |          1       0.00       0.00       0.02
        15.96345 |          1       0.00       0.00       0.02
        16.13866 |          4       0.02       0.02       0.04
        16.44353 |          1       0.00       0.00       0.05
        16.46143 |          1       0.00       0.00       0.05
        16.60013 |          1       0.00       0.00       0.05
        16.62282 |          2       0.01       0.01       0.06
        16.63905 |          3       0.01       0.01       0.07
        16.72362 |          1       0.00       0.00       0.08
        16.80544 |          1       0.00       0.00       0.08
        16.91334 |          3       0.01       0.01       0.09
        16.94559 |          2       0.01       0.01       0.10
        16.97183 |          2       0.01       0.01       0.11
        :        |          :          :          :          :
        47.4525  |          1       0.00       0.00      99.92
        47.66102 |          1       0.00       0.00      99.92
        47.79871 |          4       0.02       0.02      99.94
        47.84306 |          1       0.00       0.00      99.94
        47.86297 |          1       0.00       0.00      99.95
        47.98764 |          1       0.00       0.00      99.95
        48.42889 |          2       0.01       0.01      99.96
        48.46883 |          1       0.00       0.00      99.96
        48.55442 |          1       0.00       0.00      99.97
        48.70874 |          1       0.00       0.00      99.97
        48.81944 |          2       0.01       0.01      99.98
        49.40528 |          1       0.00       0.00      99.98
        49.60056 |          1       0.00       0.00      99.99
        50.38167 |          1       0.00       0.00      99.99
        50.48837 |          2       0.01       0.01     100.00
        Total    |      24291     100.00     100.00           
 --------------------------------------------------------------

 . 
 . 
 . * ================
 . * = DISTRIBUTION =
 . * ================
 . 
 . 
 . * Obtain summary statistics:
 . su bmi

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |     24291       27.27    5.134197   15.20329   50.48837

 . tabstat bmi, s(n mean sd min max)

    variable |         N      mean        sd       min       max
 -------------+--------------------------------------------------
         bmi |     24291     27.27  5.134197  15.20329  50.48837
 ----------------------------------------------------------------

 . 
 . su bmi, d

                       Body Mass Index
 -------------------------------------------------------------
      Percentiles      Smallest
 1%     18.30729       15.20329
 5%     20.11707       15.20329
 10%     21.26276       15.20329       Obs               24291
 25%     23.51343        15.5041       Sum of Wgt.       24291

 50%     26.57845                      Mean              27.27
                        Largest       Std. Dev.      5.134197
 75%     30.22843       49.60056
 90%     34.32617       50.38167       Variance       26.35998
 95%     36.91451       50.48837       Skewness       .7207431
 99%     41.59763       50.48837       Kurtosis       3.463278

 . tabstat bmi, s(p25 median p75 iqr)

    variable |       p25       p50       p75       iqr
 -------------+----------------------------------------
         bmi |  23.51343  26.57845  30.22843  6.715006
 ------------------------------------------------------

 . 
 . * Visualize the distribution:
 . hist bmi, percent bin(10)
 (bin=10, start=15.203287, width=3.5285078)

 . hist bmi, kdensity
 (bin=43, start=15.203287, width=.82058321)

 . 
 . * Histogram with normal distribution superimposed.
 . hist bmi, percent normal ///
 >         name(hist, replace)
 (bin=43, start=15.203287, width=.82058321)

 . 
 . * Kernel density.
 . kdensity bmi, normal legend(row(1)) title("") note("") ///
 >         name(kdens, replace)

 . 
 . * Box plots.
 . gr hbox bmi, over(raceb) ///
 >         name(bmi_race, replace)

 . 
 . gr hbox bmi, over(sex) asyvars over(raceb) ///
 >         name(bmi_race_sex, replace)

 . 
 . * The next commands use scalars to describe a distribution through its standard 
 . * deviation and outliers. This is a teaching example, not a course requirement.
 . 
 . * Obtain summary statistics.
 . su bmi, d

                       Body Mass Index
 -------------------------------------------------------------
      Percentiles      Smallest
 1%     18.30729       15.20329
 5%     20.11707       15.20329
 10%     21.26276       15.20329       Obs               24291
 25%     23.51343        15.5041       Sum of Wgt.       24291

 50%     26.57845                      Mean              27.27
                        Largest       Std. Dev.      5.134197
 75%     30.22843       49.60056
 90%     34.32617       50.38167       Variance       26.35998
 95%     36.91451       50.48837       Skewness       .7207431
 99%     41.59763       50.48837       Kurtosis       3.463278

 . 
 . * To show the results of a command, Stata saves them first to a temporary space
 . * in its memory, r(). The results of the last command are readable from there:
 . ret li

 scalars:
                  r(N) =  24291
              r(sum_w) =  24291
               r(mean) =  27.26999735254105
                r(Var) =  26.35997953797713
                 r(sd) =  5.134197068478881
           r(skewness) =  .7207431027835997
           r(kurtosis) =  3.46327812408999
                r(sum) =  662415.5056905746
                r(min) =  15.20328712463379
                r(max) =  50.48836517333984
                 r(p1) =  18.30729103088379
                 r(p5) =  20.1170654296875
                r(p10) =  21.26276016235352
                r(p25) =  23.513427734375
                r(p50) =  26.57844924926758
                r(p75) =  30.22843360900879
                r(p90) =  34.326171875
                r(p95) =  36.91451263427734
                r(p99) =  41.59763336181641

 . 
 . * Let's save some of these statistics to scalars, in order to access them later.
 . * Scalars and macros are programming commands that you will not need to learn to
 . * operate Stata at regular user-level. However, they happen to be useful to code
 . * some teaching examples and demonstrations, as shown below.
 . 
 . * Save the mean and standard deviation of the summarized variable.
 . sca de mean = r(mean)

 . sca de sd   = r(sd)

 . 
 . * Save the 25th and 75th percentiles and compute the interquartile range (IQR),
 . * which is the range from the first quartile (Q1) to the third quartile (Q3).
 . sca de q1  = r(p25)

 . sca de q3  = r(p75)

 . sca de iqr = q3 - q1

 . 
 . * List all saved scalars, which are used in the next sections in combination
 . * with the -di- command for quick verifications about the distribution of
 . * the dependent variable (BMI) in the sample.
 . sca li
       iqr =  6.7150059
        q3 =  30.228434
        q1 =  23.513428
        sd =  5.1341971
      mean =  27.269997

 . 
 . 
 . * Standard deviation
 . * ------------------
 . 
 . * We can verify what we learnt about the standard deviation by counting the
 . * number of BMI observations that fall between (mean - 1sd) and (mean + 1sd),
 . * and then by checking if this number comes close to 68% of all observations.
 . count if bmi > mean - sd & bmi < mean + sd
 16847

 . di r(N), "observations out of", _N, "(" 100 * round(r(N) / _N, .01) ///
 >         "% of the sample) are within one standard deviation from the mean."
 16847 observations out of 24291 (69% of the sample) are within one standard deviatio
 > n from the mean.

 . 
 . * The corresponding result is indeed close to 68% of all observations, and the
 . * same verification with the [mean - 2sd, mean + 2sd] range of BMI values is
 . * also satisfactorily close to including 95% of all observations.
 . count if bmi > mean - 2 * sd & bmi < mean + 2 * sd
 23219

 . di r(N), "observations out of", _N, "(" 100 * round(r(N) / _N, .01) ///
 >         "% of the sample) are within 2 standard deviations from the mean."
 23219 observations out of 24291 (96% of the sample) are within 2 standard deviations
 >  from the mean.

 . 
 . * The properties shown here hold for continuous variables that approach a
 . * normal distribution, as discussed below. We could go further and compute
 . * the [mean - 3sd, mean + 3sd] range, but the most extreme values of a
 . * distribution are more conveniently captured by the notion of outliers,
 . * i.e. observations that fall far from the median.
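 . 
 . * A minimal sketch of that further step, not run in this log: it assumes that
 . * the scalars 'mean' and 'sd' saved above are still in memory.
 . *     count if bmi > mean - 3 * sd & bmi < mean + 3 * sd
 . *     di 100 * r(N) / _N
 . * For a normal distribution, the result would be expected to approach 99.7%.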
 . 
 . 
 . * Outliers
 . * --------
 . 
 . * Summarize mild (1.5 IQR) or extreme (3 IQR) outliers below Q1 and above Q3:
 . su bmi if bmi < q1 - 1.5 * iqr | bmi > q3 + 1.5 * iqr

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |       421    42.63485    2.153885   40.31065   50.48837

 . su bmi if bmi < q1 - 3 * iqr   | bmi > q3 + 3 * iqr

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |         3     50.4528    .0616016   50.38167   50.48837

 . 
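 . * The cut-off points used above can be checked by hand from the saved scalars.
 . * A sketch, not run in this log ('lo' and 'hi' are hypothetical scalar names):
 . *     sca de lo = q1 - 1.5 * iqr
 . *     sca de hi = q3 + 1.5 * iqr
 . *     di lo, hi
 . * With the values listed by -sca li-, this is 23.513428 - 1.5 * 6.7150059, or
 . * roughly 13.44, and 30.228434 + 1.5 * 6.7150059, or roughly 40.30. Since the
 . * minimum BMI is 15.20, all 421 mild outliers sit in the upper tail.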
 . 
 . * =============
 . * = NORMALITY =
 . * =============
 . 
 . 
 . * Continuous variables are expected to approach a normal distribution, a result
 . * more easily obtained at higher sample sizes. Let's check if the distribution
 . * of BMI values approaches normality, and if not, let's transform the variable
 . * to bring it closer to normality. We start with visual inspection and complete
 . * the assessment with two statistical measures.
 . 
 . 
 . * Visual assessment
 . * -----------------
 . 
 . * We draw a histogram with three different elements: the actual bins (bars)
 . * of the BMI variable, a superimposed normal curve, and its kernel density,
 . * which we draw as a dashed black line using a few graph options.
 . hist bmi, bin(15) normal kdensity kdenopts(lp(dash) lc(black) bw(1.5)) ///
 >         note("Normal distribution (solid red) and kernel density (dashed black).")
 >  ///
 >         name(bmi, replace)
 (bin=15, start=15.203287, width=2.3523385)

 . 
 . * The histogram shows what we knew from reading the mean and median of the
 . * BMI values: the distribution is skewed to the right (the mean exceeds the
 . * median), implying that more observations lie below the mean than above it.
 . 
 . * As a result, the distribution is asymmetrical, which we can verify using a
 . * particular graphical technique that emphasizes deviations from symmetry.
 . * Perfect symmetry corresponds to the straight red line.
 . symplot bmi, ti("Symmetry plot") ///
 >         name(bmi_sym, replace)

 . 
 . * Another visualization plots the quantiles of the variable against those of the
 . * normal distribution. Perfect correspondence between the two distributions is
 . * observed at the straight red line.
 . qnorm bmi, ti("Normal quantile plot") ///
 >         name(bmi_qnorm, replace)

 . 
 . * The departures observed here are situated at the tails of the distribution,
 . * which means that there is an excess of observations at these values.
 . 
 . 
 . * Formal assessment
 . * -----------------
 . 
 . * Moving to statistical measures of normality, we can measure skewness, which
 . * measures symmetry and approaches 0 in quasi-normal distributions, along with
 . * kurtosis, which measures the size of the distribution tails and approaches 3
 . * in quasi-normal distributions. Use the -summarize- command with the -detail-
 . * option, respectively abbreviated as -su- and -d-.
 . su bmi, d

                       Body Mass Index
 -------------------------------------------------------------
      Percentiles      Smallest
 1%     18.30729       15.20329
 5%     20.11707       15.20329
 10%     21.26276       15.20329       Obs               24291
 25%     23.51343        15.5041       Sum of Wgt.       24291

 50%     26.57845                      Mean              27.27
                        Largest       Std. Dev.      5.134197
 75%     30.22843       49.60056
 90%     34.32617       50.38167       Variance       26.35998
 95%     36.91451       50.48837       Skewness       .7207431
 99%     41.59763       50.48837       Kurtosis       3.463278

 . 
 . * There are more advanced tests to measure normality, but the tests above are
 . * sufficient to observe that we cannot assume the BMI variable to be normally
 . * distributed (i.e. we reject our distributional assumption).
 . 
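 . * One such test, as a sketch not run in this log, is the skewness and kurtosis
 . * test provided by the -sktest- command. With a sample of this size, expect
 . * even small departures from normality to be flagged as significant.
 . *     sktest bmi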
 . 
 . * Variable transformation
 . * -----------------------
 . 
 . * A technique used to approach normality with a continuous variable consists
 . * in 'transforming' the variable with a mathematical operator that modifies
 . * its basic unit of measurement. We learnt that the distribution of BMI in
 . * its standard unit of measurement is not normal, but perhaps the distribution
 . * of the same values is closer to normality if we take a different measure.
 . 
 . * The -gladder- command visualizes several common transformations all at once.
 . gladder bmi, ///
 >         name(gladder, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The logarithm transformation appears to approximate a normal distribution.
 . * We transform the variable accordingly.
 . gen logbmi = ln(bmi)

 . la var logbmi "Body Mass Index (log units)"

 . 
 . * Looking at skewness and kurtosis for the logged variable.
 . tabstat bmi logbmi, s(n sk kurtosis min max) c(s)

    variable |         N  skewness  kurtosis       min       max
 -------------+--------------------------------------------------
         bmi |     24291  .7207431  3.463278  15.20329  50.48837
      logbmi |     24291  .2346392  2.762445  2.721512  3.921743
 ----------------------------------------------------------------

 . 
 . * The log-BMI histogram shows some improvement towards normality.
 . hist logbmi, normal ///
 >     name(logbmi, replace)
 (bin=43, start=2.7215116, width=.02791236)
 (note: scheme burd not found, using s2color)

 . 
 . 
 . * Comparison plot
 . * ---------------
 . 
 . * Running the same graphs with a few options to combine them allows a quick
 . * visual comparison of the transformation.
 . 
 . * Part 1/4.
 . hist bmi, norm xti("") ysc(off) ti("Untransformed (metric)") bin(21) ///
 >         name(bmi1, replace)
 (bin=21, start=15.203287, width=1.6802418)
 (note: scheme burd not found, using s2color)

 . 
 . * Part 2/4.
 . gr hbox bmi, fysize(25) ///
 >         name(bmi2, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Part 3/4.
 . hist logbmi, norm xti("") ysc(off) ti("Transformed (logged)") bin(21) ///
 >         name(bmi3, replace)
 (bin=21, start=2.7215116, width=.05715387)
 (note: scheme burd not found, using s2color)

 . 
 . * Part 4/4.
 . gr hbox logbmi, fysize(25) ///
 >         name(bmi4, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Final combined graph.
 . gr combine bmi1 bmi3 bmi2 bmi4, imargin(small) ysize(3) col(2) ///
 >         name(bmi_comparison, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Drop individual pieces.
 . gr drop bmi1 bmi2 bmi3 bmi4

 . gr di bmi_comparison

 . 
 . 
 . * ==================
 . * = SAMPLING ERROR =
 . * ==================
 . 
 . 
 . * Sort the data by order of survey collection.
 . sort serial

 . 
 . * Now here's a simple issue: if we subsample our data, the average BMI will not
 . * necessarily reflect the sample mean.
 . su bmi in 1/10

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |        10    29.10093     3.84851   21.78719   35.42366

 . 
 . * The problem applies to our entire sample: how can we confirm that it reflects
 . * the true population mean? We cannot, but we can take a precautionary measure,
 . * following the assumption that the data follow a somewhat normal distribution.
 . 
 . 
 . * Confidence intervals with means
 . * -------------------------------
 . 
 . * The confidence interval reflects the standard error of the mean (SEM), itself
 . * a reflection of sample size. We will come back to the SEM equation next week.
 . 
 . * Mean BMI for the full sample with a 95% CI.
 . ci bmi

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |      24291       27.27     .032942        27.20543    27.33457

 . 
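 . * As a preview, the standard error reported above can be reproduced by hand.
 . * A sketch, not run in this log, assuming the usual SEM = sd / sqrt(N):
 . *     qui su bmi
 . *     di r(sd) / sqrt(r(N))
 . *     di r(mean) - invttail(r(N) - 1, .025) * r(sd) / sqrt(r(N))
 . *     di r(mean) + invttail(r(N) - 1, .025) * r(sd) / sqrt(r(N))
 . * With sd = 5.134197 and N = 24291, the first -di- gives roughly .0329, the
 . * standard error shown in the -ci- table.
 . 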
 . * Mean BMI for the full sample with a 99% CI (more confidence, less precision).
 . ci bmi, level(99)

    Variable |        Obs        Mean    Std. Err.       [99% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |      24291       27.27     .032942        27.18514    27.35486

 . 
 . * Mean BMI for the full sample with survey weights (better representativeness).
 . svy: mean bmi
 (running mean on estimation sample)

 Survey: Mean estimation

 Number of strata =     300        Number of obs    =     24291
 Number of PSUs   =     600        Population size  =  88553487
                                  Design df        =       300

 --------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
 -------------+------------------------------------------------
         bmi |   27.17653   .0430405      27.09183    27.26123
 --------------------------------------------------------------

 . 
 . * The confidence intervals for the full sample show a high precision, both at
 . * the 95% (alpha = 0.05) and 99% (alpha = 0.01) levels. This is due to the high
 . * number of observations provided for the BMI variable.
 . 
 . * If we compute the average BMI for subsamples of the population, such as one
 . * category of the population, the total number of observations will drop and
 . * the confidence interval will widen, as shown here with smaller subsamples:
 . ci bmi in 1/10

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |         10    29.10093    1.217006        26.34787    31.85399

 . ci bmi in 1/100

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |        100    28.42492    .5618381        27.31011    29.53973

 . ci bmi in 1/1000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1000    27.28449    .1658582        26.95902    27.60996

 . ci bmi in 1/10000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |      10000    27.24431    .0513402        27.14367    27.34495

 . 
 . * Confidence bands can become useful to detect spurious relationships. Let's
 . * take a look, for instance, at the number of years spent in the U.S.
 . fre yrsinus

 yrsinus -- Number of years spent in the U.S.
 -----------------------------------------------------------------------------------
                                      |      Freq.    Percent      Valid       Cum.
 --------------------------------------+--------------------------------------------
 Valid   0 NIU                         |      19456      80.10      80.10      80.10
        1 Less than 1 year            |         63       0.26       0.26      80.35
        2 1 year to less than 5 years |        480       1.98       1.98      82.33
        3 5 years to less than 10     |        723       2.98       2.98      85.31
          years                       |                                            
        4 10 years to less than 15    |        657       2.70       2.70      88.01
          years                       |                                            
        5 15 years or more            |       2912      11.99      11.99     100.00
        Total                         |      24291     100.00     100.00           
 -----------------------------------------------------------------------------------

 . replace yrsinus = . if yrsinus == 0
 (19456 real changes made, 19456 to missing)

 . 
 . * We know from previous analysis that BMI varies by gender and ethnicity.
 . * We now look for the effect of the number of years spent in the U.S. within
 . * each gender and ethnic category.
 . gr dot bmi, over(sex) over(yrsinus) over(raceb) asyvars scale(.7) ///
 >         ti("Body Mass Index by sex, race and number of years in the U.S.") ///
 >         yti("Mean BMI") ///
 >         name(bmi_sex_yrs, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The average BMI of Blacks who spent less than one year in the U.S. shows
 . * an outstanding difference between males and females, but this category holds
 . * so few observations that the difference should not be considered.
 . bys sex: ci bmi if raceb == 2 & yrsinus == 1

 ------------------------------------------------------------------------------------
 -> sex = Male

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |          2    21.01948    2.712192       -13.44218    55.48114

 ------------------------------------------------------------------------------------
 -> sex = Female

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |          1    30.34068           .               .           .

 . 
 . * Identically, the seemingly clean pattern among male and female Asians is
 . * calculated on a low number of observations and requires verification of
 . * the confidence intervals. The pattern appears to be rather robust.
 . bys yrsinus: ci bmi if raceb == 4

 ------------------------------------------------------------------------------------
 -> yrsinus = Less than 1 year

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |         18    22.80192    .5404272        21.66172    23.94212

 ------------------------------------------------------------------------------------
 -> yrsinus = 1 year to less than 5 years

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |        161    23.10644    .2474861        22.61768     23.5952

 ------------------------------------------------------------------------------------
 -> yrsinus = 5 years to less than 10 years

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |        142    23.40691     .308459        22.79711    24.01672

 ------------------------------------------------------------------------------------
 -> yrsinus = 10 years to less than 15 years

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |        123    23.54123    .3113926         22.9248    24.15767

 ------------------------------------------------------------------------------------
 -> yrsinus = 15 years or more

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |        591    24.60317    .1529371        24.30281    24.90354

 ------------------------------------------------------------------------------------
 -> yrsinus = .

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |        336    25.17433    .2357523        24.71058    25.63807

 . 
 . 
 . * Confidence intervals with proportions
 . * -------------------------------------
 . 
 . * A few things about confidence intervals with proportions, for which confidence
 . * bands follow a different method of calculation. Basically, categorical data is
 . * just dummies for a bunch of categories, and the distribution of binary data
 . * can hardly be normal. The binomial distribution applies instead.
 . ci sex, binomial

                                                         -- Binomial Exact --
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------

 . 
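 . * Note that the table above is empty, presumably because -ci- with the
 . * -binomial- option expects a variable coded 0/1, whereas sex is coded 1/2.
 . * A sketch of a workaround, not run in this log ('female' is a hypothetical
 . * variable name):
 . *     gen female = (sex == 2) if !mi(sex)
 . *     ci female, binomial
 . * In Stata 14 and later, the equivalent syntax is -ci proportions female-.
 . 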
 . * Categorical variables, which can be described through proportions, also
 . * come with confidence intervals that reflect the range of values that each
 . * category might take in the true population. The proportions of ethnic groups
 . * in the U.S., for instance, are somewhere in these intervals:
 . prop raceb

 Proportion estimation               Number of obs    =   24291

 --------------------------------------------------------------
             | Proportion   Std. Err.     [95% Conf. Interval]
 -------------+------------------------------------------------
 raceb        |
       White |   .5874192   .0031587      .5812279    .5936105
       Black |   .1602651   .0023538      .1556514    .1648788
    Hispanic |    .195875   .0025465      .1908838    .2008662
       Asian |   .0564407   .0014807      .0535384    .0593429
 --------------------------------------------------------------

 . 
 . * Actually, if you want to be completely correct, you need to weight the data
 . * with the svy: prefix to use the weight settings specified earlier. This will
 . * have a tremendous effect on your estimates in this case, shifting the proportion
 . * of White respondents from roughly 60% to roughly 70% of all U.S. adults, the
 . * reason being that other racial-ethnic groups are oversampled in NHIS data.
 . svy: prop raceb
 (running proportion on estimation sample)

 Survey: Proportion estimation

 Number of strata =     300        Number of obs    =     24291
 Number of PSUs   =     600        Population size  =  88553487
                                  Design df        =       300

 --------------------------------------------------------------
             |             Linearized
             | Proportion   Std. Err.     [95% Conf. Interval]
 -------------+------------------------------------------------
 raceb        |
       White |   .7101575   .0049232      .7004691     .719846
       Black |   .1264287   .0035906      .1193627    .1334947
    Hispanic |   .1253064   .0030643      .1192762    .1313367
       Asian |   .0381073   .0015406      .0350757     .041139
 --------------------------------------------------------------

 . 
 . * Identically to continuous variables, confidence intervals for categorical
 . * data will widen when the total number of observations decreases. The
 . * 95% CI for ethnicity among morbidly obese respondents illustrates that issue.
 . prop raceb if bmi > 40

 Proportion estimation               Number of obs    =     464

 --------------------------------------------------------------
             | Proportion   Std. Err.     [95% Conf. Interval]
 -------------+------------------------------------------------
 raceb        |
       White |   .5344828   .0231816      .4889285     .580037
       Black |   .2586207   .0203499      .2186312    .2986102
    Hispanic |   .2047414   .0187528      .1678902    .2415926
       Asian |   .0021552   .0021552       -.00208    .0063903
 --------------------------------------------------------------

 . 
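 . * Note the negative lower bound for Asian respondents: -prop- appears to rely
 . * here on a normal approximation, which breaks down when a category holds very
 . * few observations. A sketch of an exact alternative, not run in this log
 . * ('asian' is a hypothetical variable name):
 . *     gen asian = (raceb == 4) if !mi(raceb)
 . *     ci asian if bmi > 40, binomial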
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require fre scheme-burd spineplot

 . 
 . * Log results.
 . cap log using code/week5.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 5 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Social Determinants of Adult Obesity in the United States
 > 
 >  - DATA:   U.S. National Health Interview Survey (2009)
 > 
 >    We study variations in the Body Mass Index (BMI) of insured and uninsured
 >    American adults, in order to show how differences observed between racial
 >    backgrounds echo socioeconomic inequalities in education and health care.
 >  
 >  - (H1): We first expect to observe larger numbers of overweight and obese
 >    respondents among non-White males and among older age groups.
 >    
 >  - (H2): We then expect education to be negatively associated with obesity, as
 >    higher attainment indicates access to prevention and higher income.
 > 
 >  - (H3): We finally expect health insurance coverage to limit health consumption
 >    in poorer households, possibly affecting BMI across the life course.
 >    
 >    Our data come from the most recent year of the National Health Interview
 >    Survey (NHIS). The sample used in the analysis contains N = 21,770 individuals
 >    selected through state-level stratified probability sampling.
 > 
 >  - The lines above are a quick example of what you should be planning to write
 >    up in your first draft: a description of your data, followed by a list of
 >    clearly worded and substantively informed hypotheses.
 > 
 >  - Please make sure that your do-file is named like 'Briatte_Petev_1.do' (use
 >    your own family names, in alphabetical order). Name your paper the same way
 >    and print it to PDF format: do not circulate your work in editable formats.
 >  
 >  - To simplify your workflow, the course uses a paper template that you will
 >    share with your research partner(s) using Google Documents. This template  
 >    contains more instructions on the first draft.
 >    
 >  - Your first draft must inform the reader about simple things: What is your
 >    research question? Where does your data come from, how large is the sample
 >    and how was it designed? Include references to the data source and codebook.
 >    
 >  - Your paper also explains what choice of variables you have made, and with
 >    what theory to support that choice. You have to substantiate your decisions:
 >    providing a mere description of the measurements is insufficient.
 >     
 >  - In line with that idea, do NOT write your paper as a technical summary of
 >    what your code accomplishes: refer to variables not by names but by what they
 >    actually measure, and explain how they fit in your general reasoning.
 > 
 >  - Remember that you have been provided with example papers: use them to learn
 >    about the writing style and scientific tone to adopt in your own work. This
 >    requirement is covered at more length in the rest of the course material.
 > 
 >  - Your first do-file can imitate the course do-files in its structure. Your
 >    code should assess DV normality and explore differences in the DV with graphs
 >    and confidence intervals over categorical IVs. Analyze results in your paper.
 >  
 >  - Importantly, do NOT produce results in your do-file if you are not going to
 >    interpret them at a later stage: produce meaningful code that leads you to
 >    learn, understand and analyze the data.
 >   
 >  - Use the -stab- command at the end of this do-file to export summary stats
 >    to a simple table. The result will be a plain text file that you can copy
 >    and paste into Google Documents, or import into any other text editor.
 > 
 >    Last updated 2012-11-13.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load NHIS data.
 . use data/nhis2009, clear
 (U.S. National Health Interview Survey 2009)

 . 
 . * Individual survey weights.
 . svyset psu [pw = perweight], strata(strata)

      pweight: perweight
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

 . 
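 . * Sketch (not run in this log): -svyset- only declares the survey design, and
 . * estimates use it only when prefixed with -svy:-. The -tab, su()- and -ci-
 . * results further down are therefore unweighted. A design-based equivalent,
 . * assuming that bmi and age4 have been created, might look like:
 . *   svy: mean bmi, over(age4)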
 . 
 . * Dependent variable: Body Mass Index
 . * -----------------------------------
 . 
 . gen bmi = weight * 703 / height^2

 . la var bmi "Body Mass Index"

 . 
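 . * Sketch (not run in this log): the factor 703 is the usual conversion for BMI
 . * when weight is in pounds and height in inches, which is assumed to be the
 . * case for these variables. For a hypothetical respondent of 150 lb and 65 in:
 . *   di 150 * 703 / 65^2
 . 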
 . * Detailed summary statistics.
 . su bmi, d

                       Body Mass Index
 -------------------------------------------------------------
      Percentiles      Smallest
 1%     18.30296       14.63388
 5%     19.96686       14.92082
 10%     21.03148       15.05125       Obs              251589
 25%     23.22465       15.06112       Sum of Wgt.      251589

 50%     26.07836                      Mean            26.8551
                        Largest       Std. Dev.      5.001464
 75%     29.75496       51.49813
 90%     33.71531       51.70008       Variance       25.01464
 95%     36.32167       51.90204       Skewness       .7805844
 99%     41.19141       52.10399       Kurtosis       3.619894

 . 
 . 
 . * Breakdowns
 . * ----------
 . 
 . * Recoding BMI to 6 groups (best method: cutting the data to intervals).
 . gen bmi6:bmi6 = irecode(bmi, 0, 18.5, 25, 30, 35, 40, .)

 . la var bmi6 "Body Mass Index (categories)"

 . 
 . * Define the category labels.
 . la def bmi6 ///
 >         1 "Underweight" 2 "Normal" 3 "Overweight" ///
 >         4 "Obese" 5 "Severely obese" 6 "Morbidly obese", replace

 . 
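 . * Sketch (not run in this log): -irecode- returns k when the value lies in the
 . * k-th interval, each interval being open on the left and closed on the right.
 . * A hypothetical BMI of 27.3 falls in (25, 30], i.e. group 3 ("Overweight"):
 . *   di irecode(27.3, 0, 18.5, 25, 30, 35, 40, .)
 . * Note that a BMI of exactly 25 is therefore coded 2 ("Normal").
 . 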
 . * Breakdown of mean BMI by groups.
 . d bmi bmi6

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 bmi             float  %9.0g                  Body Mass Index
 bmi6            float  %14.0g      bmi6       Body Mass Index (categories)

 . tab bmi6, su(bmi)

  Body Mass |
      Index |
 (categories |     Summary of Body Mass Index
          ) |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
  Underweig |   17.758399   .62717744        3100
     Normal |   22.444485   1.6585607       97083
  Overweigh |    27.24205    1.436102       92316
      Obese |   32.126483   1.4160294       41238
  Severely  |   37.045664   1.3697775       13874
  Morbidly  |   42.417655   2.0628916        3978
 ------------+------------------------------------
      Total |   26.855097   5.0014639      251589

 . 
 . * Progression of BMI groups over years.
 . spineplot bmi6 year, scheme(burd6) ///
 >     name(bmi6, replace)

 . 
 . * Breakdown of BMI to percentiles.
 . xtile bmi_qt = bmi, nq(100)

 . 
 . * Verify the BMI of, e.g., the 90th percentile band.
 . su bmi if bmi_qt == 90

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         bmi |      2510    33.52061    .1231341    33.2846   33.71531

 . 
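 . * Sketch (not run in this log): bmi_qt == 90 selects a single percentile band.
 . * To summarise the whole top 10% of the BMI distribution, one option is:
 . *   su bmi if bmi_qt > 90 & !mi(bmi_qt)
 . 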
 . * Compute the mean BMI for each percentile.
 . bys bmi_qt: egen bmi_qm = mean(bmi)

 . 
 . * Plot mean BMI by percentile (an empirical quantile plot, i.e. the inverse ECDF).
 . sc bmi_qm bmi_qt, m(o) c(l) xla(0(10)100) ///
 >         yti("Body Mass Index") xti("Percentiles") ///
 >         name(bmi_ecdf, replace)

 . 
 . 
 . * Independent variables
 . * ---------------------
 . 
 . fre age sex raceb educrec1 earnings health uninsured ybarcare, r(10)

 age -- Age
 -----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
 --------------+--------------------------------------------
 Valid   18 18 |       2993       1.19       1.19       1.19
        19 19 |       3487       1.39       1.39       2.58
        20 20 |       3749       1.49       1.49       4.07
        21 21 |       3953       1.57       1.57       5.64
        22 22 |       4102       1.63       1.63       7.27
        :     |          :          :          :          :
        80 80 |       1845       0.73       0.73      97.62
        81 81 |       1717       0.68       0.68      98.31
        82 82 |       1571       0.62       0.62      98.93
        83 83 |       1390       0.55       0.55      99.48
        84 84 |       1298       0.52       0.52     100.00
        Total |     251589     100.00     100.00           
 -----------------------------------------------------------

 sex -- Sex
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   1 Male   |     113182      44.99      44.99      44.99
        2 Female |     138407      55.01      55.01     100.00
        Total    |     251589     100.00     100.00           
 --------------------------------------------------------------

 raceb -- Race
 ----------------------------------------------------------------
                   |      Freq.    Percent      Valid       Cum.
 -------------------+--------------------------------------------
 Valid   1 White    |     160581      63.83      63.83      63.83
        2 Black    |      36030      14.32      14.32      78.15
        3 Hispanic |      45842      18.22      18.22      96.37
        4 Asian    |       9136       3.63       3.63     100.00
        Total      |     251589     100.00     100.00           
 ----------------------------------------------------------------

 educrec1 -- Educational attainment recode, nonintervalled
 ----------------------------------------------------------------------------------
                                     |      Freq.    Percent      Valid       Cum.
 -------------------------------------+--------------------------------------------
 Valid   13 Grade 12                  |     117721      46.79      46.79      46.79
        14 1 to 3 years of college   |      72298      28.74      28.74      75.53
        15 4 years of                |      40548      16.12      16.12      91.64
           college/Bachelor's degree |                                            
        16 5+ years of college       |      21022       8.36       8.36     100.00
        Total                        |     251589     100.00     100.00           
 ----------------------------------------------------------------------------------

 earnings -- Person's total earnings, previous calendar year
 --------------------------------------------------------------------------------
                                   |      Freq.    Percent      Valid       Cum.
 -----------------------------------+--------------------------------------------
 Valid   0  NIU                     |      77313      30.73      30.73      30.73
        1  $01 to $4999            |      11154       4.43       4.43      35.16
        2  $5000 to $9999          |      10468       4.16       4.16      39.32
        3  $10000 to $14999        |      13077       5.20       5.20      44.52
        4  $15000 to $19999        |      12189       4.84       4.84      49.37
        :                          |          :          :          :          :
        10 $65000 to $74999        |       4992       1.98       1.98      80.75
        11 $75000 and over         |      12179       4.84       4.84      85.59
        97 Unknown-refused         |      21877       8.70       8.70      94.29
        98 Unknown-not ascertained |         24       0.01       0.01      94.29
        99 Unknown-don't know      |      14354       5.71       5.71     100.00
        Total                      |     251589     100.00     100.00           
 --------------------------------------------------------------------------------

 health -- Health status
 -----------------------------------------------------------------
                    |      Freq.    Percent      Valid       Cum.
 --------------------+--------------------------------------------
 Valid   1 Excellent |      73004      29.02      29.04      29.04
        2 Very Good |      80816      32.12      32.14      61.18
        3 Good      |      65089      25.87      25.89      87.07
        4 Fair      |      24564       9.76       9.77      96.84
        5 Poor      |       7951       3.16       3.16     100.00
        Total       |     251424      99.93     100.00           
 Missing .           |        165       0.07                      
 Total               |     251589     100.00                      
 -----------------------------------------------------------------

 uninsured -- Health Insurance coverage status
 --------------------------------------------------------------------------
                             |      Freq.    Percent      Valid       Cum.
 -----------------------------+--------------------------------------------
 Valid   1 Not covered        |      43206      17.17      17.17      17.17
        2 Covered            |     207537      82.49      82.49      99.66
        9 Unknown-don't know |        846       0.34       0.34     100.00
        Total                |     251589     100.00     100.00           
 --------------------------------------------------------------------------

 ybarcare -- Needed but couldn't afford medical care, past 12 months
 --------------------------------------------------------------------------
                             |      Freq.    Percent      Valid       Cum.
 -----------------------------+--------------------------------------------
 Valid   1 No                 |     231191      91.89      91.89      91.89
        2 Yes                |      20246       8.05       8.05      99.94
        7 Unknown-refused    |         52       0.02       0.02      99.96
        9 Unknown-don't know |        100       0.04       0.04     100.00
        Total                |     251589     100.00     100.00           
 --------------------------------------------------------------------------

 . 
 . * Recode age to four groups (slow and risky method: using manual categories).
 . recode age ///
 >         (18/44 = 1 "18-44") ///
 >         (45/64 = 2 "45-64") ///
 >         (65/74 = 3 "65-74") ///
 >         (75/max = 4 "75+") (else = .), gen(age4)
 (251589 differences between age and age4)

 . la var age4 "Age groups (4)"

 . 
 . * Recode age to eight groups (nifty method: using decades, 10-19, 20-29, etc.).
 . gen age8 = 10 * floor(age / 10) if !mi(age)

 . la var age8 "Age groups (8)"

 . 
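 . * Sketch (not run in this log): the decade recode truncates age to its tens,
 . * e.g. a hypothetical respondent aged 47 is assigned to the 40-49 group:
 . *   di 10 * floor(47 / 10)
 . 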
 . * Recode sex to dummy.
 . gen female:female = (sex == 2) if !mi(sex)

 . la def female 0 "Male" 1 "Female", replace

 . 
 . * Recode missing values of income.
 . replace earnings = . if inlist(earnings, 97, 99)
 (36231 real changes made, 36231 to missing)

 . 
 . * Recode missing values of insurance and medical care.
 . mvdecode ybarcare uninsured, mv(9)
    ybarcare: 100 missing values generated
   uninsured: 846 missing values generated

 . 
 . 
 . * Subsetting
 . * ----------
 . 
 . * Select observations from most recent year.
 . keep if year == 2009
 (227298 observations deleted)

 . 
 . * Patterns of missing values.
 . misstable pat bmi age female raceb educrec1 earnings health uninsured ybarcare

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1  2  3  4
  ------------+-------------
       90%    |  1  1  1  1
              |
       10     |  1  1  1  0
       <1     |  1  1  0  1
       <1     |  1  1  0  0
       <1     |  0  1  1  1
       <1     |  1  0  1  0
       <1     |  1  0  1  1
       <1     |  1  0  0  1
  ------------+-------------
      100%    |

  Variables are  (1) ybarcare  (2) health  (3) uninsured  (4) earnings

 . 
 . * Delete incomplete observations.
 . drop if mi(bmi, age, female, raceb, educrec1, earnings, uninsured, ybarcare)
 (2521 observations deleted)

 . 
 . * Final data, showing final sample size.
 . codebook bmi age female raceb educrec1 earnings health uninsured ybarcare, c

 Variable     Obs Unique      Mean       Min       Max  Label
 ------------------------------------------------------------------------------------
 bmi        21770   2091  27.32691  15.20329  50.48837  Body Mass Index
 age        21770     67  47.13036        18        84  Age
 female     21770      2  .5550299         0         1  
 raceb      21770      4  1.708406         1         4  Race
 educrec1   21770      4  13.91746        13        16  Educational attainment rec...
 earnings   21770     12  3.984153         0        11  Person's total earnings, p...
 health     21767      5  2.312675         1         5  Health status
 uninsured  21770      2  1.817685         1         2  Health Insurance coverage ...
 ybarcare   21770      2  1.103904         1         2  Needed but couldn't afford...
 ------------------------------------------------------------------------------------

 . 
 . 
 . * Normality
 . * ---------
 . 
 . hist bmi, bin(20) normal normopts(lp(dash)) ///
 >     kdensity kdenopts(k(biweight) bw(3) lc(black)) ///
 >     name(dv, replace)
 (bin=20, start=15.203287, width=1.7642539)

 . 
 . * Transformations (add 'g' to make the command -gladder- for a graphical check).
 . ladder bmi

 Transformation         formula               chi2(2)       P(chi2)
 ------------------------------------------------------------------
 cubic                  bmi^3                      .            .
 square                 bmi^2                      .            .
 identity               bmi                        .            .
 square root            sqrt(bmi)                  .        0.000
 log                    log(bmi)                   .        0.000
 1/(square root)        1/sqrt(bmi)                .        0.000
 inverse                1/bmi                      .        0.000
 1/square               1/(bmi^2)                  .            .
 1/cubic                1/(bmi^3)                  .            .

 . 
 . * Log-BMI transformation.
 . gen logbmi = ln(bmi)

 . la var logbmi "log(BMI)"

 . 
 . * Inspect improvement in normality.
 . tabstat bmi logbmi, s(skewness kurtosis) c(s)

    variable |  skewness  kurtosis
 -------------+--------------------
         bmi |  .7112319  3.426024
      logbmi |  .2275015  2.748605
 ----------------------------------

 . 
 . 
 . * ========================
 . * = CONFIDENCE INTERVALS =
 . * ========================
 . 
 . 
 . * IV: Age
 . * -------
 . 
 . * Plot BMI groups for each age decade.
 . spineplot bmi6 age8, scheme(burd6) ///
 >      name(age, replace)

 . 
 . * 95% CI estimates:
 . tab age4, su(bmi) // mean BMI in each age group

 Age groups |     Summary of Body Mass Index
        (4) |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
      18-44 |   26.814459   5.1771495       10209
      45-64 |   28.023031   5.1792507        7428
      65-74 |   27.827794   5.0723462        2464
        75+ |   26.623902   4.6287582        1669
 ------------+------------------------------------
      Total |   27.326912   5.1602158       21770

 . bys age4: ci bmi  // confidence bands

 ------------------------------------------------------------------------------------
 -> age4 = 18-44

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |      10209    26.81446    .0512388        26.71402     26.9149

 ------------------------------------------------------------------------------------
 -> age4 = 45-64

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       7428    28.02303     .060094        27.90523    28.14083

 ------------------------------------------------------------------------------------
 -> age4 = 65-74

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       2464    27.82779    .1021853        27.62742    28.02817

 ------------------------------------------------------------------------------------
 -> age4 = 75+

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1669     26.6239    .1133017        26.40167    26.84613

 . 
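 . * Sketch (not run in this log): -ci- reports mean +/- t * Std. Err., where
 . * Std. Err. = Std. Dev. / sqrt(Obs). Using the 18-44 figures printed above:
 . *   di 5.1771495 / sqrt(10209)
 . *   di 26.814459 - invttail(10208, 0.025) * 5.1771495 / sqrt(10209)
 . *   di 26.814459 + invttail(10208, 0.025) * 5.1771495 / sqrt(10209)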
 . 
 . * IV: Gender
 . * ----------
 . 
 . * Plot mean BMI for each gender group, within each age decade.
 . gr bar bmi, over(female) asyvars over(age8) yline(27) ///
 >     note("Horizontal line at sample mean.") ///
 >     name(sex_age, replace)

 . 
 . * 95% CI estimates:
 . tab female, su(bmi) // mean BMI in each gender group

            |     Summary of Body Mass Index
     female |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
       Male |   27.616417   4.4580387        9687
     Female |   27.094814    5.650074       12083
 ------------+------------------------------------
      Total |   27.326912   5.1602158       21770

 . bys female: ci bmi  // confidence bands

 ------------------------------------------------------------------------------------
 -> female = Male

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       9687    27.61642    .0452949        27.52763     27.7052

 ------------------------------------------------------------------------------------
 -> female = Female

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |      12083    27.09481    .0514004        26.99406    27.19557

 . 
 . 
 . * IV: Race
 . * --------
 . 
 . * Plot BMI groups for each racial background:
 . spineplot bmi6 raceb, scheme(burd6) ///
 >     name(race, replace)

 . 
 . * Histogram by race and gender groups.
 . hist bmi, bin(10) xline(27) ///
 >         by(raceb female, cols(2) ///
 >         note("Vertical line at sample mean.") legend(off)) ///
 >         name(race_sex, replace)

 . 
 . * 95% CI estimates:
 . tab raceb, su(bmi) // mean BMI in each racial group

            |     Summary of Body Mass Index
       Race |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
      White |   27.045805   5.0759617       12885
      Black |   28.590384   5.5785555        3509
   Hispanic |   27.952797   4.9386991        4215
      Asian |   24.355698   3.8536771        1161
 ------------+------------------------------------
      Total |   27.326912   5.1602158       21770

 . bys raceb: ci bmi  // confidence bands

 ------------------------------------------------------------------------------------
 -> raceb = White

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |      12885    27.04581    .0447174        26.95815    27.13346

 ------------------------------------------------------------------------------------
 -> raceb = Black

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       3509    28.59038    .0941738        28.40574    28.77503

 ------------------------------------------------------------------------------------
 -> raceb = Hispanic

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       4215     27.9528    .0760701        27.80366    28.10193

 ------------------------------------------------------------------------------------
 -> raceb = Asian

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1161     24.3557    .1130991         24.1338     24.5776

 . 
 . 
 . * IV: Education
 . * -------------
 . 
 . * Shorter labels for a cleaner graph.
 . la def edu 13 "Grade 12" 14 "Coll 1-3 yrs" 15 "Coll 4" 16 "Coll 5+"

 . la val educrec1 edu

 . 
 . * (Reminder on labels: the first command, -la def-, creates new labels for the
 . * values of a variable; the second command, -la val-, assigns the value label
 . * to the target variable, which is educrec1 in this example.)
 . 
 . * Plot BMI groups for each educational level.
 . spineplot bmi6 educrec1, scheme(burd6) ///
 >     name(edu, replace)

 . 
 . * Plot racial backgrounds for each educational level.
 . spineplot raceb educrec1, ///
 >         name(edu_race, replace)

 . 
 . * 95% CI estimates:
 . tab educrec1, su(bmi) // mean BMI at each education level

 Educational |
 attainment |
    recode, |
 noninterval |     Summary of Body Mass Index
        led |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
   Grade 12 |   27.852907   5.2562117        9491
  Coll 1-3  |   27.445587   5.2305751        6550
     Coll 4 |   26.388879   4.7885939        3764
    Coll 5+ |   26.187578    4.702525        1965
 ------------+------------------------------------
      Total |   27.326912   5.1602158       21770

 . bys educrec1: ci bmi  // confidence bands

 ------------------------------------------------------------------------------------
 -> educrec1 = Grade 12

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       9491    27.85291    .0539532        27.74715    27.95867

 ------------------------------------------------------------------------------------
 -> educrec1 = Coll 1-3 yrs

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       6550    27.44559    .0646292        27.31889    27.57228

 ------------------------------------------------------------------------------------
 -> educrec1 = Coll 4

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       3764    26.38888    .0780519        26.23585    26.54191

 ------------------------------------------------------------------------------------
 -> educrec1 = Coll 5+

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1965    26.18758     .106084        25.97953    26.39563

 . 
 . 
 . * IV: Income
 . * ----------
 . 
 . * Generate variable defined by the income ceiling of each category.
 . gen inc = 5000 * earnings + 5000 * (earnings - 5) * (earnings > 5)

 . la var inc "Total earnings ($)"

 . 
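 . * Sketch (not run in this log): the second term adds a further $5,000 for each
 . * category above 5, which implies $10,000-wide bands from category 6 upwards.
 . * For hypothetical values earnings == 3 and earnings == 8:
 . *   di 5000 * 3 + 5000 * (3 - 5) * (3 > 5)
 . *   di 5000 * 8 + 5000 * (8 - 5) * (8 > 5)
 . * The top category ("$75000 and over") is open-ended, so its value of 85000
 . * is a placeholder rather than a true ceiling.
 . 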
 . * Plot racial backgrounds for each income band.
 . spineplot raceb inc if inc > 0, xla(,alt axis(2)) ///
 >         name(inc_race, replace)

 . 
 . * Plot educational levels for each income band.
 . spineplot educrec1 inc if inc > 0, scheme(burd4) xla(, alt axis(2)) ///
 >         name(inc_edu, replace)

 . 
 . * Plot income quartiles for each BMI group.
 . gr box inc if inc > 0, over(bmi6) ///
 >         name(inc, replace)

 . 
 . * Plot BMI quartiles for each income band (excluding outliers).
 . gr box bmi if inc > 0, over(inc) noout ///
 >         name(bmi_inc, replace)

 . 
 . * 95% CI estimates:
 . tab inc, su(bmi) // mean BMI in each income band

      Total |
   earnings |     Summary of Body Mass Index
        ($) |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
          0 |   27.361109   5.3342393        7662
       5000 |   26.298885   5.3198548        1075
      10000 |   26.830443   5.1843042         923
      15000 |   27.235495   5.4874418        1252
      20000 |   27.618259   5.4052868        1097
      25000 |   27.440014   5.1785359        1232
      35000 |   27.595926   5.0842359        2129
      45000 |   27.465086   5.0802835        1775
      55000 |   27.467011    4.895058        1397
      65000 |   27.582522   4.8332839         885
      75000 |   27.516439   4.7213988         603
      85000 |   27.098543   4.3958045        1740
 ------------+------------------------------------
      Total |   27.326912   5.1602158       21770

 . bys inc: ci bmi  // confidence bands

 ------------------------------------------------------------------------------------
 -> inc = 0

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       7662    27.36111    .0609399        27.24165    27.48057

 ------------------------------------------------------------------------------------
 -> inc = 5000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1075    26.29888    .1622541        25.98051    26.61726

 ------------------------------------------------------------------------------------
 -> inc = 10000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |        923    26.83044    .1706435        26.49555    27.16534

 ------------------------------------------------------------------------------------
 -> inc = 15000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1252     27.2355    .1550843        26.93124    27.53975

 ------------------------------------------------------------------------------------
 -> inc = 20000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1097    27.61826    .1631982        27.29804    27.93848

 ------------------------------------------------------------------------------------
 -> inc = 25000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1232    27.44001    .1475372        27.15056    27.72947

 ------------------------------------------------------------------------------------
 -> inc = 35000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       2129    27.59593    .1101889        27.37984    27.81201

 ------------------------------------------------------------------------------------
 -> inc = 45000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1775    27.46509    .1205837        27.22858    27.70159

 ------------------------------------------------------------------------------------
 -> inc = 55000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1397    27.46701    .1309663         27.2101    27.72392

 ------------------------------------------------------------------------------------
 -> inc = 65000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |        885    27.58252    .1624691        27.26365    27.90139

 ------------------------------------------------------------------------------------
 -> inc = 75000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |        603    27.51644    .1922702        27.13884    27.89404

 ------------------------------------------------------------------------------------
 -> inc = 85000

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       1740    27.09854    .1053813        26.89186    27.30523

 . 
 . 
 . * IV: Health insurance
 . * --------------------
 . 
 . * Plot BMI distribution for groups who have or do not have health coverage.
 . kdensity bmi if uninsured == 1, addplot(kdensity bmi if uninsured == 2) ///
 >         legend(order(1 "Not covered" 2 "Covered") row(1)) ///
 >         name(uninsured, replace)

 . 
 . * Exploration:
 . tab uninsured, su(bmi) // mean BMI by insurance coverage status

     Health |
  Insurance |
   coverage |     Summary of Body Mass Index
     status |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
  Not cover |   27.298409   5.1020606        3969
    Covered |   27.333267   5.1732139       17801
 ------------+------------------------------------
      Total |   27.326912   5.1602158       21770

 . bys uninsured: ci bmi  // confidence bands

 ------------------------------------------------------------------------------------
 -> uninsured = Not covered

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       3969    27.29841    .0809851        27.13963    27.45719

 ------------------------------------------------------------------------------------
 -> uninsured = Covered

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |      17801    27.33327    .0387738        27.25727    27.40927

 . 
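 . * Side note (not run in this log): the confidence intervals above follow
 . * from mean +/- t * standard error, where the standard error is the standard
 . * deviation divided by the square root of the group size. A minimal sketch
 . * for the "Covered" group, typing in the values printed above:
 . *
 . *   di 5.1732139 / sqrt(17801)                     // standard error
 . *   di 27.33327 - invttail(17800, .025) * .0387738 // lower bound
 . *   di 27.33327 + invttail(17800, .025) * .0387738 // upper bound
 . *
 . * The results should match the -ci- output up to rounding.
 . 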
 . 
 . * IV: Health affordability
 . * ------------------------
 . 
 . * Plot BMI distribution for groups who could or could not afford medical care.
 . kdensity bmi if ybarcare == 1, addplot(kdensity bmi if ybarcare == 2) ///
 >         legend(order(1 "Could afford medical care" 2 "Could not") row(1)) ///
 >         name(ybarcare, replace)

 . 
 . * Exploration:
 . tab ybarcare, su(bmi) // mean BMI by ability to afford medical care

 Needed but |
   couldn't |
     afford |
    medical |
 care, past |     Summary of Body Mass Index
  12 months |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
         No |   27.254224    5.097839       19508
        Yes |   27.953789   5.6321726        2262
 ------------+------------------------------------
      Total |   27.326912   5.1602158       21770

 . bys ybarcare: ci bmi  // confidence bands

 ------------------------------------------------------------------------------------
 -> ybarcare = No

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |      19508    27.25422    .0364989        27.18268    27.32576

 ------------------------------------------------------------------------------------
 -> ybarcare = Yes

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
 -------------+---------------------------------------------------------------
         bmi |       2262    27.95379    .1184213        27.72156    28.18601

 . 
 . 
 . * =============================
 . * = EXPORT SUMMARY STATISTICS =
 . * =============================
 . 
 . 
 . * The reader of your research does not know your data. A solution at that stage
 . * is therefore to produce a table that holds descriptive (summary) statistics
 . * for the variables that you have selected for analysis. This requires using a
 . * command that was written especially for the course, to make it very easy.
 . 
 . * The next command is part of the SRQM folder. If Stata returns an error when
 . * you run it, set the folder as your working directory and type -run profile-
 . * to run the course setup, then try the command again. If you still experience
 . * problems with the -stab- command, please send a detailed email on the issue.
 . 
 . stab using week5_stats.txt, replace ///
 >         mean(bmi age) ///
 >         prop(female raceb educrec1 earnings uninsured ybarcare)
 installing estout first...
 checking estout consistency and verifying not already installed...
 installing into /Users/fr/Library/Application Support/Stata/ado/stbplus/...
 installation complete.
 (note: file week5_stats.txt not found)

 Variable                     mean           sd          min          max         mea
 > n           sd          min          max         mean           sd          min   
 >        max         mean           sd          min          max         mean       
 >     sd          min          max         mean           sd          min          m
 > ax         mean           sd          min          max         mean           sd  
 >         min          max         mean           sd          min          max      
 >    mean           sd          min          max

                                %            %            %            %            
 > %            %            %            %            %            %

 Race                            %            %            %            %            
 > %            %            %            %            %            %

 Educational attain~n            %            %            %            %            
 > %            %            %            %            %            %

 Person's total ear~u            %            %            %            %            
 > %            %            %            %            %            %

 Health Insurance c~s            %            %            %            %            
 > %            %            %            %            %            %

 Needed but couldn'~c            %            %            %            %            
 > %            %            %            %            %            %

 N = 217700
 File: week5_stats.txt

 . 
 . /* Syntax of the -stab- command:
 > 
 >  - using FILE  - name of the exported file; plain text (.txt) recommended
 >  - replace     - overwrite any previously existing file
 >  - mean()      - summarizes a list of continuous variables (mean, sd, min, max)
 >  - prop()      - summarizes a list of categorical variables (frequencies)
 > 
 >   In the example above, the -stab- command will export a single file to the
 >   working directory (week5_stats.txt) containing summary statistics for the
 >   final sample, as a plain text file of tab-separated values. */
 . 
 . * Last reminder: your code is the technical document, whereas your paper is the 
 . * substantive document. Make sure that the paper is not a descriptive write-up
 . * of what happens in your code: you need to produce analytical value-added by
 . * explaining what you are hypothesizing about the relationships in the data.
 . 
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Thanks for following!
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require fre renvars scheme-burd spineplot tab_chi

 . 
 . * Log results.
 . cap log using code/week6.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 6 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Opposition to Torture in Israel
 >  
 >  - DATA:   European Social Survey Round 4 (2008)
 >  
 >    This do-file introduces the topic of significance tests, i.e. statistical
 >    tools to assess whether an association that shows up in the data is different
 >    from the kind of arrangement that might be observed in random data.
 >    
 >    Associations are relationships between two of your variables. They correspond
 >    to real-world relationships, like the association between income and gender.
 >    Significance tests are helpful to observe and measure these phenomena.
 >    
 >    The null hypothesis, which is the kind of hypothesis that gets tested in a 
 >    significance test, is different from the substantive hypotheses that you 
 >    previously formulated about your data. It is usually denoted "H_0".
 >    
 >    The null hypothesis states that the association that you observe in the
 >    data is a statistical accident, i.e. that it is consistent with randomness.
 >    The test measures how compatible your data are with that hypothesis.
 >    
 >    A significance test never proves anything. It can only reject the possibility
 >    that an association in your data is consistent with accidental situations.
 >    The aim of a significance test is therefore to reject the null hypothesis.
 >    
 >    To obtain that kind of proof by contradiction, the significance test will
 >    estimate how likely it is to reach the same kind of association that you
 >    observe from random data. This likelihood is called the p-value of the test.
 >    
 >    A small p-value means that it is highly unlikely to produce the same
 >    association as the one you observe out of randomness. Note how far that
 >    result is from an assessment of whether your hypothesis is right or wrong!
 >    
 >    The notions covered in the paragraphs above cannot be introduced technically,
 >    as short comments accompanying Stata commands. They require that you actually
 >    open your textbooks and read at length about statistical estimation.
 >    
 >    There are many different kinds of hypothesis tests: we will cover the t-test,
 >    the proportions test, the Chi-squared test and finally linear correlation.
 >    The Stata Guide also covers these tests. Make sure to read what you need to!
 > 
 >    Last updated 2013-05-29.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load ESS dataset.
 . use data/ess2008, clear
 (European Social Survey 2008)

 . 
 . * Survey weights.
 . svyset [pw = dweight] // weighting scheme set to country-specific population

      pweight: dweight
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

 . 
 . * Rename variables to short handles.
 . renvars agea gndr hinctnta eduyrs \ age sex income edu // socio-demographics

 . renvars rlgdnm lrscale tvpol \ denom pol tv            // religion, politics

 . 
 . * Have a quick look.
 . codebook cntry age sex income edu denom pol tv, c

 Variable     Obs Unique      Mean  Min  Max  Label
 ------------------------------------------------------------------------------------
 cntry      56752     29         .    .    .  Country
 age        56544     87  47.53717   15  123  Age of respondent, calculated
 sex        56722      2  1.545379    1    2  Gender
 income     41120     10   5.26177    1   10  Household's total net income, all so...
 edu        56238     41  11.93741    0   50  Years of full-time education completed
 denom      37067      8  2.498907    1    8  Religion or denomination belonging t...
 pol        47569     11    5.1991    0   10  Placement on left right scale
 tv         54265      8  1.976191    0    7  TV watching, news/politics/current a...
 ------------------------------------------------------------------------------------

 . 
 . 
 . * Subsetting
 . * ----------
 . 
 . * Delete incomplete observations.
 . drop if mi(age, sex, income, edu, denom, pol, tv)
 (36034 observations deleted)

 . 
 . 
 . * Dependent variable: Justifiability of torture in event of preventing terrorism
 . * ------------------------------------------------------------------------------
 . 
 . fre trrtort

 trrtort -- Torture in country never justified even to prevent terrorist attack
 -----------------------------------------------------------------------------------
                                      |      Freq.    Percent      Valid       Cum.
 --------------------------------------+--------------------------------------------
 Valid   1  Agree strongly             |       5668      27.36      28.16      28.16
        2  Agree                      |       6639      32.04      32.98      61.13
        3  Neither agree nor disagree |       3276      15.81      16.27      77.41
        4  Disagree                   |       3183      15.36      15.81      93.22
        5  Disagree strongly          |       1365       6.59       6.78     100.00
        Total                         |      20131      97.17     100.00           
 Missing .a                            |         24       0.12                      
        .b                            |        546       2.64                      
        .c                            |         17       0.08                      
        Total                         |        587       2.83                      
 Total                                 |      20718     100.00                      
 -----------------------------------------------------------------------------------

 . 
 . * Generate dummies called 'torture_1 torture_2' etc. for each DV category.
 . tab trrtort, gen(torture_)

  Torture in country never |
 justified even to prevent |
          terrorist attack |      Freq.     Percent        Cum.
 ---------------------------+-----------------------------------
            Agree strongly |      5,668       28.16       28.16
                     Agree |      6,639       32.98       61.13
 Neither agree nor disagree |      3,276       16.27       77.41
                  Disagree |      3,183       15.81       93.22
         Disagree strongly |      1,365        6.78      100.00
 ---------------------------+-----------------------------------
                     Total |     20,131      100.00

 . 
 . * Country-level breakdown using stacked bars and 5-pt scale graph scheme.
 . gr hbar torture_? [aw = dweight], stack ///
 >         over(cntry, sort(1)des lab(labsize(*.8))) ///
 >     yti("Torture is never justified even to prevent terrorism") ///
 >     legend(rows(1) ///
 >     order(1 "Strongly agree" 2 "" 3 "Neither" 4 "" 5 "Strongly disagree")) ///
 >     name(torture1, replace) scheme(burd5)

 . 
 . * Binary recoding (1 = torture is never justifiable; undecideds removed).
 . recode trrtort ///
 >     (1/2 = 1 "Never justifiable") ///
 >     (4/5 = 0 "Sometimes justifiable") ///
 >     (3 = .) (else = .), gen(torture)
 (15050 differences between trrtort and torture)

 . la var torture "Opposition to torture"

 . 
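 . * Side note (not run in this log): the number of differences reported by
 . * -recode- can be checked against the frequencies shown by -fre trrtort-.
 . * Every category except "Agree strongly" (already coded 1) changes value,
 . * and the extended missing values (.a, .b, .c) appear to count as well,
 . * since they become '.':
 . *
 . *   di 6639 + 3276 + 3183 + 1365 + 587    // 15050, as reported above
 . *
 . * A cross-tabulation such as -tab trrtort torture, mi- is another quick way
 . * to verify a recode.
 . 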
 . * Average opposition to torture in Europe.
 . fre torture

 torture -- Opposition to torture
 -----------------------------------------------------------------------------
                                |      Freq.    Percent      Valid       Cum.
 --------------------------------+--------------------------------------------
 Valid   0 Sometimes justifiable |       4548      21.95      26.98      26.98
        1 Never justifiable     |      12307      59.40      73.02     100.00
        Total                   |      16855      81.35     100.00           
 Missing .                       |       3863      18.65                      
 Total                           |      20718     100.00                      
 -----------------------------------------------------------------------------

 . tab torture [aw = dweight * pweight] // weighted by overall European population

 Opposition to torture |      Freq.     Percent        Cum.
 ----------------------+-----------------------------------
 Sometimes justifiable | 4,577.4821       27.16       27.16
    Never justifiable | 12,277.518       72.84      100.00
 ----------------------+-----------------------------------
                Total |     16,855      100.00

 . 
 . * Average opposition to torture in each country.
 . gr dot torture [aw = dweight], over(cntry, sort(1) des) scale(.75) ///
 >     name(torture2, replace)

 . 
 . * Create a dummy for Israel vs. other European countries.
 . gen israel:israel = (cntry == "IL")

 . la def israel 1 "Israel" 0 "Other EU"

 . 
 . * Estimate DV proportions in Israel.
 . prop torture if israel

 Proportion estimation               Number of obs    =    1039

      _prop_1: torture = Sometimes justifiable
      _prop_2: torture = Never justifiable

 --------------------------------------------------------------
             | Proportion   Std. Err.     [95% Conf. Interval]
 -------------+------------------------------------------------
 torture      |
     _prop_1 |   .4513956   .0154458      .4210871    .4817041
     _prop_2 |   .5486044   .0154458      .5182959    .5789129
 --------------------------------------------------------------

 . 
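 . * Side note (not run in this log): the standard error of a proportion is
 . * based on p * (1 - p) divided by the sample size. A minimal sketch with
 . * the values printed above; dividing by n - 1 comes out close to the
 . * -prop- result, whereas -prtest- below divides by n:
 . *
 . *   di sqrt(.5486044 * (1 - .5486044) / (1039 - 1)) // close to .0154458
 . *   di sqrt(.5486044 * (1 - .5486044) / 1039)       // close to .0154383
 . 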
 . * Compare average opposition to torture inside and outside Israel.
 . prtest torture, by(israel)

 Two-sample test of proportions              Other EU: Number of obs =    15816
                                              Israel: Number of obs =     1039
 ------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    Other EU |   .7420966   .0034786                      .7352786    .7489146
      Israel |   .5486044   .0154383                      .5183458     .578863
 -------------+----------------------------------------------------------------
        diff |   .1934922   .0158254                       .162475    .2245094
             |  under Ho:   .0142156    13.61   0.000
 ------------------------------------------------------------------------------
        diff = prop(Other EU) - prop(Israel)                      z =  13.6112
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 1.0000         Pr(|Z| < |z|) = 0.0000          Pr(Z > z) = 0.0000

 . 
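 . * Side note (not run in this log): the z statistic above is the observed
 . * difference divided by its standard error under Ho, which is computed
 . * from the pooled proportion of both groups. A minimal sketch, using the
 . * counts shown earlier (12307 "Never justifiable" out of 16855):
 . *
 . *   scalar p = 12307 / 16855                        // pooled proportion
 . *   scalar se0 = sqrt(p * (1 - p) * (1/15816 + 1/1039))
 . *   di se0                                          // close to .0142156
 . *   di (.7420966 - .5486044) / se0                  // close to 13.61
 . *   di 2 * (1 - normal(13.6112))                    // two-sided p-value
 . 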
 . * Subset to Israel only (drop all other countries).
 . keep if israel
 (19351 observations deleted)

 . 
 . * Final sample size.
 . count
 1367

 . 
 . 
 . * ======================
 . * = SIGNIFICANCE TESTS =
 . * ======================
 . 
 . 
 . * IV: Age
 . * -------
 . 
 . su age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         age |      1367    48.46818    18.43854         15         97

 . 
 . * Check normality.
 . hist age, bin(15) normal ///
 >     name(age, replace)
 (bin=15, start=15, width=5.4666667)

 . 
 . * Recoding to 4 age groups:
 . gen age4:age4 = irecode(age, 24, 44, 64)          // quick recode

 . table age4, c(min age max age n age)              // check result

 ----------------------------------------------
     age4 |   min(age)    max(age)      N(age)
 ----------+-----------------------------------
        0 |         15          24         151
        1 |         25          44         439
        2 |         45          64         491
        3 |         65          97         286
 ----------------------------------------------

 . la def age4 0 "15-24" 1 "25-44" 2 "45-64" 3 "65+" // value labels

 . la var age4 "Age (4 groups)"                      // label result

 . fre age4                                          // final result

 age4 -- Age (4 groups)
 -------------------------------------------------------------
                |      Freq.    Percent      Valid       Cum.
 ----------------+--------------------------------------------
 Valid   0 15-24 |        151      11.05      11.05      11.05
        1 25-44 |        439      32.11      32.11      43.16
        2 45-64 |        491      35.92      35.92      79.08
        3 65+   |        286      20.92      20.92     100.00
        Total   |       1367     100.00     100.00           
 -------------------------------------------------------------

 . 
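 . * Side note (not run in this log): -irecode(x, x1, x2, ...)- returns 0 when
 . * x <= x1, 1 when x1 < x <= x2, and so on, which is why the cutpoints above
 . * are the upper bounds of the first three age groups. A minimal sketch:
 . *
 . *   di irecode(24, 24, 44, 64)    // 0, i.e. "15-24"
 . *   di irecode(25, 24, 44, 64)    // 1, i.e. "25-44"
 . *   di irecode(70, 24, 44, 64)    // 3, i.e. "65+"
 . 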
 . * Spineplot.
 . spineplot torture age4, ///
 >         name(dv_age, replace)

 . 
 . * Comparison of average age in each category.
 . ttest age, by(torture)

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
 Sometime |     469    46.07036    .8408054    18.20882    44.41814    47.72258
 Never ju |     570     49.4193    .7527722    17.97219    47.94075    50.89785
 ---------+--------------------------------------------------------------------
 combined |    1039     47.9076    .5629982    18.14741    46.80286    49.01235
 ---------+--------------------------------------------------------------------
    diff |           -3.348936    1.127112               -5.560616   -1.137255
 ------------------------------------------------------------------------------
    diff = mean(Sometime) - mean(Never ju)                        t =  -2.9713
 Ho: diff = 0                                     degrees of freedom =     1037

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0015         Pr(|T| > |t|) = 0.0030          Pr(T > t) = 0.9985

 . 
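 . * Side note (not run in this log): the t statistic above is the difference
 . * in means divided by its standard error, with n1 + n2 - 2 degrees of
 . * freedom. A minimal sketch, typing in the values printed above:
 . *
 . *   di -3.348936 / 1.127112        // t, close to -2.9713
 . *   di 469 + 570 - 2               // degrees of freedom: 1037
 . *   di 2 * ttail(1037, 2.9713)     // two-sided p-value, close to 0.0030
 . 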
 . 
 . * IV: Gender
 . * ----------
 . 
 . fre sex

 sex -- Gender
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   1 Male   |        647      47.33      47.33      47.33
        2 Female |        720      52.67      52.67     100.00
        Total    |       1367     100.00     100.00           
 --------------------------------------------------------------

 . 
 . gen female:female = (sex==2) if !mi(sex) // dummify

 . la def female 0 "Male" 1 "Female"

 . la var female "Gender"

 . 
 . * Conditional probabilities:
 . tab torture female, col nof // column percentages

                      |        Gender
 Opposition to torture |      Male     Female |     Total
 ----------------------+----------------------+----------
 Sometimes justifiable |     47.58      42.72 |     45.14 
    Never justifiable |     52.42      57.28 |     54.86 
 ----------------------+----------------------+----------
                Total |    100.00     100.00 |    100.00 


 . tab torture female, row nof // row percentages

                      |        Gender
 Opposition to torture |      Male     Female |     Total
 ----------------------+----------------------+----------
 Sometimes justifiable |     52.45      47.55 |    100.00 
    Never justifiable |     47.54      52.46 |    100.00 
 ----------------------+----------------------+----------
                Total |     49.76      50.24 |    100.00 


 . 
 . * Comparison of proportions in each category.
 . prtest female, by(torture)

 Two-sample test of proportions          Sometimes ju: Number of obs =      469
                                        Never justif: Number of obs =      570
 ------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
 Sometimes ju |   .4754797   .0230601                      .4302828    .5206767
 Never justif |   .5245614   .0209174                       .483564    .5655588
 -------------+----------------------------------------------------------------
        diff |  -.0490817   .0311337                     -.1101025    .0119392
             |  under Ho:   .0311709    -1.57   0.115
 ------------------------------------------------------------------------------
        diff = prop(Sometimes ju) - prop(Never justif)            z =  -1.5746
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.0577         Pr(|Z| < |z|) = 0.1153          Pr(Z > z) = 0.9423

 . 
 . 
 . * IV: Income deciles
 . * ------------------
 . 
 . fre income

 income -- Household's total net income, all sources
 ------------------------------------------------------------------------
                           |      Freq.    Percent      Valid       Cum.
 ---------------------------+--------------------------------------------
 Valid   1  J - 1st decile  |        109       7.97       7.97       7.97
        2  R - 2nd decile  |        157      11.49      11.49      19.46
        3  C - 3rd decile  |        238      17.41      17.41      36.87
        4  M - 4th decile  |        173      12.66      12.66      49.52
        5  F - 5th decile  |        155      11.34      11.34      60.86
        6  S - 6th decile  |        129       9.44       9.44      70.30
        7  K - 7th decile  |        109       7.97       7.97      78.27
        8  P - 8th decile  |        101       7.39       7.39      85.66
        9  D - 9th decile  |         99       7.24       7.24      92.90
        10 H - 10th decile |         97       7.10       7.10     100.00
        Total              |       1367     100.00     100.00           
 ------------------------------------------------------------------------

 . 
 . * Simpler coding (no value labels).
 . gen inc = income

 . 
 . * Spineplot.
 . spineplot torture inc

 . 
 . * Chi-squared test.
 . tab inc torture, row nof  // row percentages

           | Opposition to torture
       inc | Sometimes  Never jus |     Total
 -----------+----------------------+----------
         1 |     44.44      55.56 |    100.00 
         2 |     45.90      54.10 |    100.00 
         3 |     51.69      48.31 |    100.00 
         4 |     41.67      58.33 |    100.00 
         5 |     42.48      57.52 |    100.00 
         6 |     43.14      56.86 |    100.00 
         7 |     43.37      56.63 |    100.00 
         8 |     49.37      50.63 |    100.00 
         9 |     48.24      51.76 |    100.00 
        10 |     35.44      64.56 |    100.00 
 -----------+----------------------+----------
     Total |     45.14      54.86 |    100.00 


 . tab inc torture, col nof  // column percentages

           | Opposition to torture
       inc | Sometimes  Never jus |     Total
 -----------+----------------------+----------
         1 |      8.53       8.77 |      8.66 
         2 |     11.94      11.58 |     11.74 
         3 |     19.62      15.09 |     17.13 
         4 |      9.59      11.05 |     10.39 
         5 |     10.23      11.40 |     10.88 
         6 |      9.38      10.18 |      9.82 
         7 |      7.68       8.25 |      7.99 
         8 |      8.32       7.02 |      7.60 
         9 |      8.74       7.72 |      8.18 
        10 |      5.97       8.95 |      7.60 
 -----------+----------------------+----------
     Total |    100.00     100.00 |    100.00 


 . tab inc torture, cell nof // cell percentages

           | Opposition to torture
       inc | Sometimes  Never jus |     Total
 -----------+----------------------+----------
         1 |      3.85       4.81 |      8.66 
         2 |      5.39       6.35 |     11.74 
         3 |      8.85       8.28 |     17.13 
         4 |      4.33       6.06 |     10.39 
         5 |      4.62       6.26 |     10.88 
         6 |      4.23       5.58 |      9.82 
         7 |      3.46       4.52 |      7.99 
         8 |      3.75       3.85 |      7.60 
         9 |      3.95       4.23 |      8.18 
        10 |      2.69       4.91 |      7.60 
 -----------+----------------------+----------
     Total |     45.14      54.86 |    100.00 


 . tab inc torture, chi2     // Chi-squared test

           | Opposition to torture
       inc | Sometimes  Never jus |     Total
 -----------+----------------------+----------
         1 |        40         50 |        90 
         2 |        56         66 |       122 
         3 |        92         86 |       178 
         4 |        45         63 |       108 
         5 |        48         65 |       113 
         6 |        44         58 |       102 
         7 |        36         47 |        83 
         8 |        39         40 |        79 
         9 |        41         44 |        85 
        10 |        28         51 |        79 
 -----------+----------------------+----------
     Total |       469        570 |     1,039 

          Pearson chi2(9) =   8.1436   Pr = 0.520

 . 
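 . * Side note (not run in this log): under Ho (independence), the expected
 . * count of a cell is its row total times its column total, divided by N.
 . * The test has (rows - 1) * (columns - 1) degrees of freedom. A minimal
 . * sketch, typing in the values printed above:
 . *
 . *   di 90 * 469 / 1039             // expected count, income 1, "Sometimes"
 . *   di (10 - 1) * (2 - 1)          // degrees of freedom: 9
 . *   di chi2tail(9, 8.1436)         // p-value, close to 0.520
 . *
 . * Adding the -exp- option to -tab- prints all expected counts.
 . 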
 . 
 . * IV: Education
 . * -------------
 . 
 . fre edu

 edu -- Years of full-time education completed
 -----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
 --------------+--------------------------------------------
 Valid   0     |          7       0.51       0.51       0.51
        3     |          5       0.37       0.37       0.88
        4     |         10       0.73       0.73       1.61
        5     |          9       0.66       0.66       2.27
        6     |         13       0.95       0.95       3.22
        7     |         13       0.95       0.95       4.17
        8     |         88       6.44       6.44      10.61
        9     |         35       2.56       2.56      13.17
        10    |         71       5.19       5.19      18.36
        11    |         65       4.75       4.75      23.12
        12    |        501      36.65      36.65      59.77
        13    |         40       2.93       2.93      62.69
        14    |         86       6.29       6.29      68.98
        15    |        111       8.12       8.12      77.10
        16    |        157      11.49      11.49      88.59
        17    |         50       3.66       3.66      92.25
        18    |         46       3.37       3.37      95.61
        19    |         23       1.68       1.68      97.29
        20    |         18       1.32       1.32      98.61
        21    |          6       0.44       0.44      99.05
        22    |          4       0.29       0.29      99.34
        23    |          1       0.07       0.07      99.41
        24    |          2       0.15       0.15      99.56
        25    |          4       0.29       0.29      99.85
        26    |          2       0.15       0.15     100.00
        Total |       1367     100.00     100.00           
 -----------------------------------------------------------

 . 
 . * Verify normality.
 . hist edu, bin(10) normal ///
 >     name(edu, replace)
 (bin=10, start=0, width=2.6)

 . 
 . * Comparison of average educational attainment in each category.
 . ttest edu, by(torture)

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
 Sometime |     469    13.02559    .1578023     3.41743     12.7155    13.33568
 Never ju |     570    12.51579    .1412224    3.371638    12.23841    12.79317
 ---------+--------------------------------------------------------------------
 combined |    1039    12.74591    .1054875    3.400232    12.53892     12.9529
 ---------+--------------------------------------------------------------------
    diff |            .5097969    .2114893                 .094801    .9247927
 ------------------------------------------------------------------------------
    diff = mean(Sometime) - mean(Never ju)                        t =   2.4105
 Ho: diff = 0                                     degrees of freedom =     1037

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9919         Pr(|T| > |t|) = 0.0161          Pr(T > t) = 0.0081

 . 
 . 
 . * IV: Religious faith
 . * -------------------
 . 
 . fre denom

 denom -- Religion or denomination belonging to at present
 ---------------------------------------------------------------------------
                              |      Freq.    Percent      Valid       Cum.
 ------------------------------+--------------------------------------------
 Valid   1 Roman Catholic      |         26       1.90       1.90       1.90
        2 Protestant          |          1       0.07       0.07       1.98
        3 Eastern Orthodox    |         15       1.10       1.10       3.07
        4 Other Christian     |          2       0.15       0.15       3.22
          denomination        |                                            
        5 Jewish              |       1132      82.81      82.81      86.03
        6 Islamic             |        186      13.61      13.61      99.63
        7 Eastern religions   |          3       0.22       0.22      99.85
        8 Other non-Christian |          2       0.15       0.15     100.00
          religions           |                                            
        Total                 |       1367     100.00     100.00           
 ---------------------------------------------------------------------------

 . 
 . * Recoding to simpler groups.
 . recode denom (1/4 = 1 "Christian") ///
 >     (5 = 2 "Jewish") (6 = 3 "Muslim") (else = .), gen(faith3)
 (1341 differences between denom and faith3)

 . la var faith3 "Religious faith"

 . 
 . * Conditional probabilities:
 . tab torture faith3, col nof    // column percentages

                      |         Religious faith
 Opposition to torture | Christian     Jewish     Muslim |     Total
 ----------------------+---------------------------------+----------
 Sometimes justifiable |     60.61      42.42      56.95 |     45.12 
    Never justifiable |     39.39      57.58      43.05 |     54.88 
 ----------------------+---------------------------------+----------
                Total |    100.00     100.00     100.00 |    100.00 


 . tab torture faith3, row nof    // row percentages

                      |         Religious faith
 Opposition to torture | Christian     Jewish     Muslim |     Total
 ----------------------+---------------------------------+----------
 Sometimes justifiable |      4.28      77.30      18.42 |    100.00 
    Never justifiable |      2.29      86.27      11.44 |    100.00 
 ----------------------+---------------------------------+----------
                Total |      3.19      82.22      14.59 |    100.00 


 . 
 . * Chi-squared test:
 . tab torture faith3, exp chi2 // expected frequencies

 +--------------------+
 | Key                |
 |--------------------|
 |     frequency      |
 | expected frequency |
 +--------------------+

                      |         Religious faith
 Opposition to torture | Christian     Jewish     Muslim |     Total
 ----------------------+---------------------------------+----------
 Sometimes justifiable |        20        361         86 |       467 
                      |      14.9      384.0       68.1 |     467.0 
 ----------------------+---------------------------------+----------
    Never justifiable |        13        490         65 |       568 
                      |      18.1      467.0       82.9 |     568.0 
 ----------------------+---------------------------------+----------
                Total |        33        851        151 |     1,035 
                      |      33.0      851.0      151.0 |   1,035.0 

          Pearson chi2(2) =  14.2396   Pr = 0.001

 . tabchi torture faith3, noe p // Pearson residuals

          observed frequency
          Pearson residual

 -------------------------------------------------------
                      |         Religious faith        
 Opposition to torture | Christian     Jewish     Muslim
 ----------------------+--------------------------------
 Sometimes justifiable |        20        361         86
                      |     1.324     -1.173      2.165
                      | 
    Never justifiable |        13        490         65
                      |    -1.201      1.063     -1.963
 -------------------------------------------------------

          Pearson chi2(2) =  14.2396   Pr = 0.001
 likelihood-ratio chi2(2) =  14.1847   Pr = 0.001

 . 
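 . * Sketch (not part of the logged run): the Pearson statistic depends only on the
 . * observed counts, so the test above can be re-derived from the printed table
 . * with the immediate command -tabi-. Each Pearson residual is the difference
 . * between the observed and expected counts, divided by the square root of the
 . * expected count. The second line computes it for Christian respondents in the
 . * "Sometimes justifiable" row, and should come close to the 1.324 shown above.
 . *     tabi 20 361 86 \ 13 490 65, chi2 expected
 . *     di (20 - 467 * 33 / 1035) / sqrt(467 * 33 / 1035)
 . 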
 . * Create a binary variable for each category.
 . tab faith3, gen(faith_)

  Religious |
      faith |      Freq.     Percent        Cum.
 ------------+-----------------------------------
  Christian |         44        3.23        3.23
     Jewish |      1,132       83.11       86.34
     Muslim |        186       13.66      100.00
 ------------+-----------------------------------
      Total |      1,362      100.00

 . d faith_?

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 faith_1         byte   %8.0g                  faith3==Christian
 faith_2         byte   %8.0g                  faith3==Jewish
 faith_3         byte   %8.0g                  faith3==Muslim

 . codebook faith_?, c

 Variable    Obs Unique      Mean  Min  Max  Label
 ------------------------------------------------------------------------------------
 faith_1    1362      2  .0323054    0    1  faith3==Christian
 faith_2    1362      2  .8311307    0    1  faith3==Jewish
 faith_3    1362      2  .1365639    0    1  faith3==Muslim
 ------------------------------------------------------------------------------------

 . 
 . * Inspect underlying distribution by country.
 . tab cntry faith3

           |         Religious faith
   Country | Christian     Jewish     Muslim |     Total
 -----------+---------------------------------+----------
        IL |        44      1,132        186 |     1,362 
 -----------+---------------------------------+----------
     Total |        44      1,132        186 |     1,362 


 . 
 . * Comparing Christian respondents to all others.
 . prtest torture, by(faith_1)

 Two-sample test of proportions                     0: Number of obs =     1002
                                                   1: Number of obs =       33
 ------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
           0 |   .5538922   .0157036                      .5231138    .5846707
           1 |   .3939394   .0850581                      .2272285    .5606502
 -------------+----------------------------------------------------------------
        diff |   .1599528   .0864956                     -.0095754     .329481
             |  under Ho:   .0880383     1.82   0.069
 ------------------------------------------------------------------------------
        diff = prop(0) - prop(1)                                  z =   1.8169
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.9654         Pr(|Z| < |z|) = 0.0692          Pr(Z > z) = 0.0346

 . 
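 . * Sketch (not part of the logged run): the z statistic above is the difference
 . * in proportions divided by its standard error under Ho, which pools both groups.
 . * The counts come from the crosstabulation above (568 "Never justifiable" answers
 . * out of 1,035); p_pool is a name chosen for this illustration. The two results
 . * should match the standard error under Ho and the z value printed above.
 . *     scalar p_pool = 568 / 1035
 . *     di sqrt(p_pool * (1 - p_pool) * (1 / 1002 + 1 / 33))
 . *     di .1599528 / sqrt(p_pool * (1 - p_pool) * (1 / 1002 + 1 / 33))
 . 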
 . * Comparing Jewish respondents to all others.
 . prtest torture, by(faith_2)

 Two-sample test of proportions                     0: Number of obs =      184
                                                   1: Number of obs =      851
 ------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
           0 |    .423913   .0364312                      .3525092    .4953169
           1 |   .5757932   .0169417                       .542588    .6089983
 -------------+----------------------------------------------------------------
        diff |  -.1518801   .0401778                     -.2306271   -.0731331
             |  under Ho:   .0404565    -3.75   0.000
 ------------------------------------------------------------------------------
        diff = prop(0) - prop(1)                                  z =  -3.7542
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.0001         Pr(|Z| < |z|) = 0.0002          Pr(Z > z) = 0.9999

 . 
 . * Comparing Muslim respondents to all others.
 . prtest torture, by(faith_3)

 Two-sample test of proportions                     0: Number of obs =      884
                                                   1: Number of obs =      151
 ------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
           0 |   .5690045   .0166559                      .5363596    .6016495
           1 |   .4304636    .040294                      .3514888    .5094384
 -------------+----------------------------------------------------------------
        diff |   .1385409   .0436008                       .053085    .2239969
             |  under Ho:   .0438175     3.16   0.002
 ------------------------------------------------------------------------------
        diff = prop(0) - prop(1)                                  z =   3.1618
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.9992         Pr(|Z| < |z|) = 0.0016          Pr(Z > z) = 0.0008

 . 
 . 
 . * IV: Political positioning
 . * -------------------------
 . 
 . fre pol

 pol -- Placement on left right scale
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   0  Left  |         29       2.12       2.12       2.12
        1  1     |         49       3.58       3.58       5.71
        2  2     |        100       7.32       7.32      13.02
        3  3     |         96       7.02       7.02      20.04
        4  4     |        103       7.53       7.53      27.58
        5  5     |        303      22.17      22.17      49.74
        6  6     |        180      13.17      13.17      62.91
        7  7     |        160      11.70      11.70      74.62
        8  8     |        153      11.19      11.19      85.81
        9  9     |         96       7.02       7.02      92.83
        10 Right |         98       7.17       7.17     100.00
        Total    |       1367     100.00     100.00           
 --------------------------------------------------------------

 . 
 . * Verifying normality.
 . hist pol, discrete percent addl
 (start=0, width=1)

 . 
 . * Recoding to simpler categories
 . recode pol (0/4 = 1 "Left") (5 = 2 "Centre") (6/10 = 3 "Right"), gen(pol3)
 (1318 differences between pol and pol3)

 . la var pol3 "Political positioning"

 . 
 . * Conditional probabilities:
 . tab torture pol3, col nof    // column percentages

                      |      Political positioning
 Opposition to torture |      Left     Centre      Right |     Total
 ----------------------+---------------------------------+----------
 Sometimes justifiable |     50.70      42.47      43.26 |     45.14 
    Never justifiable |     49.30      57.53      56.74 |     54.86 
 ----------------------+---------------------------------+----------
                Total |    100.00     100.00     100.00 |    100.00 


 . tab torture pol3, row nof    // row percentages

                      |      Political positioning
 Opposition to torture |      Left     Centre      Right |     Total
 ----------------------+---------------------------------+----------
 Sometimes justifiable |     30.92      19.83      49.25 |    100.00 
    Never justifiable |     24.74      22.11      53.16 |    100.00 
 ----------------------+---------------------------------+----------
                Total |     27.53      21.08      51.40 |    100.00 


 . 
 . * Chi-squared test:
 . tab torture pol3, exp chi2   // expected frequencies

 +--------------------+
 | Key                |
 |--------------------|
 |     frequency      |
 | expected frequency |
 +--------------------+

                      |      Political positioning
 Opposition to torture |      Left     Centre      Right |     Total
 ----------------------+---------------------------------+----------
 Sometimes justifiable |       145         93        231 |       469 
                      |     129.1       98.9      241.0 |     469.0 
 ----------------------+---------------------------------+----------
    Never justifiable |       141        126        303 |       570 
                      |     156.9      120.1      293.0 |     570.0 
 ----------------------+---------------------------------+----------
                Total |       286        219        534 |     1,039 
                      |     286.0      219.0      534.0 |   1,039.0 

          Pearson chi2(2) =   4.9652   Pr = 0.084

 . 
 . 
 . * IV: Media exposure
 . * ------------------
 . 
 . fre tv

 tv -- TV watching, news/politics/current affairs on average weekday
 -----------------------------------------------------------------------------------
                                      |      Freq.    Percent      Valid       Cum.
 --------------------------------------+--------------------------------------------
 Valid   0 No time at all              |        147      10.75      10.75      10.75
        1 Less than 0,5 hour          |        323      23.63      23.63      34.38
        2 0,5 hour to 1 hour          |        325      23.77      23.77      58.16
        3 More than 1 hour, up to 1,5 |        264      19.31      19.31      77.47
          hours                       |                                            
        4 More than 1,5 hours, up to  |         84       6.14       6.14      83.61
          2 hours                     |                                            
        5 More than 2 hours, up to    |         86       6.29       6.29      89.90
          2,5 hours                   |                                            
        6 More than 2,5 hours, up to  |         31       2.27       2.27      92.17
          3 hours                     |                                            
        7 More than 3 hours           |        107       7.83       7.83     100.00
        Total                         |       1367     100.00     100.00           
 -----------------------------------------------------------------------------------

 . 
 . * Alternative reading (binary mean). The nolabel (nol) option gets rid of the
 . * value labels and makes the output table a tad softer on the reader's eye.
 . tab tv, summ(torture) nol

         TV |
  watching, |
 news/politi |
 cs/current |
 affairs on |
    average |  Summary of Opposition to torture
    weekday |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
          0 |   .57943925   .49597214         107
          1 |   .52610442   .50032377         249
          2 |   .47058824   .50018612         238
          3 |   .54404145   .49935191         193
          4 |   .63333333    .4859611          60
          5 |   .66216216   .47620149          74
          6 |   .46428571    .5078745          28
          7 |   .66666667   .47404546          90
 ------------+------------------------------------
      Total |   .54860443   .49787165        1039

 . 
 . * Alternative reading (plot).
 . tab tv, plot

 TV watching, news/politics/current |
        affairs on average weekday |      Freq.
 -----------------------------------+------------+------------------------------
                    No time at all |        147 |**************
                Less than 0,5 hour |        323 |******************************
                0,5 hour to 1 hour |        325 |******************************
 More than 1 hour, up to 1,5 hours |        264 |************************
 More than 1,5 hours, up to 2 hours |         84 |********
 More than 2 hours, up to 2,5 hours |         86 |********
 More than 2,5 hours, up to 3 hours |         31 |***
                 More than 3 hours |        107 |**********
 -----------------------------------+------------+------------------------------
                             Total |      1,367 

 . 
 . * Recoding to binary.
 . recode tv (0/3 = 0 "Low") (4/7 = 1 "High"), gen(media)
 (1220 differences between tv and media)

 . la var media "Media exposure"

 . 
 . * Chi-squared test:
 . tab torture media, exp chi2  // expected frequencies

 +--------------------+
 | Key                |
 |--------------------|
 |     frequency      |
 | expected frequency |
 +--------------------+

                      |    Media exposure
 Opposition to torture |       Low       High |     Total
 ----------------------+----------------------+----------
 Sometimes justifiable |       377         92 |       469 
                      |     355.2      113.8 |     469.0 
 ----------------------+----------------------+----------
    Never justifiable |       410        160 |       570 
                      |     431.8      138.2 |     570.0 
 ----------------------+----------------------+----------
                Total |       787        252 |     1,039 
                      |     787.0      252.0 |   1,039.0 

          Pearson chi2(1) =  10.0094   Pr = 0.002

 . tabchi torture media, noe p  // Pearson residuals

          observed frequency
          Pearson residual

 --------------------------------------
                      | Media exposure
 Opposition to torture |    Low    High
 ----------------------+---------------
 Sometimes justifiable |    377      92
                      |  1.154  -2.039
                      | 
    Never justifiable |    410     160
                      | -1.047   1.850
 --------------------------------------

          Pearson chi2(1) =  10.0094   Pr = 0.002
 likelihood-ratio chi2(1) =  10.1292   Pr = 0.001

 . 
 . * Comparing respondents with high TV exposure to others.
 . prtest torture, by(media)

 Two-sample test of proportions                   Low: Number of obs =      787
                                                High: Number of obs =      252
 ------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         Low |   .5209657   .0178074                      .4860638    .5558676
        High |   .6349206   .0303287                      .5754776    .6943637
 -------------+----------------------------------------------------------------
        diff |  -.1139549     .03517                     -.1828869    -.045023
             |  under Ho:   .0360187    -3.16   0.002
 ------------------------------------------------------------------------------
        diff = prop(Low) - prop(High)                             z =  -3.1638
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.0008         Pr(|Z| < |z|) = 0.0016          Pr(Z > z) = 0.9992

 . 
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require mkcorr renvars

 . 
 . * Log results.
 . cap log using code/week7.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 7 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Fertility and Education, Part 1
 > 
 >  - DATA:   Quality of Government (2013)
 > 
 >    This do-file is the last one that we will run on the topic of association.
 >    You are expected to submit the second draft of your work very soon: the draft
 >    paper that you will be submitting will be mostly significance tests, so make
 >    sure that you have done all the necessary readings and practice by then.
 >    
 >    Note that significance tests should not be used blindly: run them only when
 >    you observe a particular association that you want to quantify, such as a
 >    difference in means or proportions. Also remember that a significance test
 >    is not a means to test a substantive hypothesis.
 >    
 >    At that stage, it will become indispensable that you have caught up with the
 >    textbook readings, and that you understand enough about Stata syntax to focus
 >    on interpreting rather than coding. Use the course material to bring yourself
 >    up to speed with both Stata and essential statistical theory.
 > 
 >    Last updated 2013-05-28.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load QOG dataset.
 . use data/qog2013, clear
 (Quality of Government 2013)

 . 
 . * Rename variables to short handles.
 . renvars wdi_fr bl_asy25mf undp_hdi ti_cpi gid_wip \ births schooling hdi corruptio
 > n femparl

 . 
 . * Compute GDP per capita.
 . gen gdpc = unna_gdp / unna_pop
 (2 missing values generated)

 . la var gdpc "Real GDP per capita (constant USD)"

 . 
 . * Recode to fewer categories with shorter labels.
 . recode ht_region (6/10 = 6), gen(region)
 (44 differences between ht_region and region)

 . la var region "Geographical region"

 . la val region region

 . la def region 1 "E. Europe and PSU" 2 "Lat. America" ///
 >         3 "N. Africa and M. East" 4 "Sub-Sah. Africa" ///
 >         5 "W. Europe and N. America" 6 "Asia, Pacific and Caribbean" ///
 >         , replace

 . 
 . 
 . * Finalized sample
 . * ----------------
 . 
 . * Have a quick look.
 . codebook births schooling gdpc hdi corruption femparl region, c

 Variable    Obs Unique      Mean       Min       Max  Label
 ------------------------------------------------------------------------------------
 births      187    179  2.900285     1.149     7.115  Fertility Rate (Births per ...
 schooling   143    143  7.813079  1.202597  13.27008  Average Schooling Years, Fe...
 gdpc        191    191  10927.42   137.082  129959.4  Real GDP per capita (consta...
 hdi         185    162  .6554973      .277      .941  Human Development Index
 corruption  181     68  3.982868  1.009626       9.4  Corruption Perceptions Index
 femparl     116     93  16.33103         0      56.3  Women in Parliament (%)
 region      193      6  3.911917         1         6  Geographical region
 ------------------------------------------------------------------------------------

 . 
 . * Check missing values.
 . misstable pat births schooling gdpc hdi corruption femparl region ccodewb, freq

         Missing-value patterns
           (1 means complete)

              |   Pattern
    Frequency |  1  2  3  4    5  6  7
  ------------+------------------------
           91 |  1  1  1  1    1  1  1
              |
           48 |  1  1  1  1    1  1  0
           21 |  1  1  1  1    1  0  1
           15 |  1  1  1  1    1  0  0
            5 |  1  1  1  1    0  0  0
            4 |  1  1  0  0    0  0  0
            2 |  1  1  1  0    1  0  1
            1 |  0  1  1  1    1  0  0
            1 |  0  1  1  1    1  1  0
            1 |  1  0  0  0    1  1  0
            1 |  1  0  1  1    1  1  1
            1 |  1  1  0  1    0  0  0
            1 |  1  1  1  0    0  0  0
            1 |  1  1  1  1    0  1  1
  ------------+------------------------
          193 |

  Variables are  (1) ccodewb  (2) gdpc  (3) births  (4) hdi  (5) corruption
                 (6) schooling  (7) femparl

 . 
 . * You would usually delete incomplete observations at this stage, and then count
 . * the number of observations in your finalized sample. We exceptionally keep the
 . * missing values here to illustrate how pairwise and listwise correlations work.
 . 
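 . * Sketch (not part of the logged run): listwise deletion would drop every country
 . * with a missing value on any of the variables, then count what is left. Going
 . * by the first row of the pattern table above, 91 countries should remain.
 . *     drop if mi(births, schooling, gdpc, hdi, corruption, femparl)
 . *     count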
 . 
 . 
 . * ===============
 . * = CORRELATION =
 . * ===============
 . 
 . 
 . * (1) Fertility rates and schooling years
 . * ---------------------------------------
 . 
 . scatter births schooling, ///
 >         name(fert_edu, replace)

 . 
 . pwcorr births schooling, obs sig

             |   births school~g
 -------------+------------------
      births |   1.0000 
             |
             |      187
             |
   schooling |  -0.7394   1.0000 
             |   0.0000
             |      142      143
             |

 . 
 . 
 . * (2) Schooling years and (log) Gross Domestic Product
 . * ----------------------------------------------------
 . 
 . sc gdpc schooling, ///
 >         name(gdpc_edu, replace)

 . 
 . * A first look at the scatterplot shows no clear linear pattern, but we know
 . * from a previous session that the logarithmic variable transformation can be
 . * used to visualize exponential relationships differently. Consequently, we
 . * try to visualize the same variables with a logarithmic scale for GDP per capita.
 . sc gdpc schooling, ysc(log) ///
 >         name(gdpc_edu, replace)

 . 
 . * In this classical case, log units are more informative than metric ones to
 . * identify the relationship between the dependent and independent variables.
 . gen log_gdpc = ln(gdpc)
 (2 missing values generated)

 . la var log_gdpc "Real GDP per capita (log)"

 . 
 . * Verify the transformation.
 . sc log_gdpc schooling, ///
 >         name(gdpc_edu, replace)

 . 
 . * Obtain summary statistics.
 . su log_gdpc schooling

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
    log_gdpc |       191    8.159699    1.592255   4.920579   11.77498
   schooling |       143    7.813079    2.904687   1.202597   13.27008

 . 
 . * Visual inspection of the relationship within the mean-mean quadrants.
 . * The reference lines are hard-coded and only approximate the means shown above.
 . sc log_gdpc schooling, yline(7.5) xline(6) ///
 >         name(log_gdpc_schooling, replace)

 . 
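 . * Sketch (not part of the logged run): the quadrant lines above are hard-coded
 . * (7.5 and 6). To place them exactly at the sample means, the value that
 . * -summarize- leaves behind in r(mean) can be stored and reused. The macro names
 . * ybar and xbar and the graph name are chosen for this illustration only, and
 . * all lines must be run together because local macros do not persist.
 . *     su log_gdpc, meanonly
 . *     local ybar = r(mean)
 . *     su schooling, meanonly
 . *     local xbar = r(mean)
 . *     sc log_gdpc schooling, yline(`ybar') xline(`xbar') ///
 . *         name(log_gdpc_schooling_means, replace)
 . 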
 . * Verify inspection computationally.
 . pwcorr gdpc log_gdpc schooling, obs sig

             |     gdpc log_gdpc school~g
 -------------+---------------------------
        gdpc |   1.0000 
             |
             |      191
             |
    log_gdpc |   0.7657   1.0000 
             |   0.0000
             |      191      191
             |
   schooling |   0.5537   0.7732   1.0000 
             |   0.0000   0.0000
             |      141      141      143
             |

 . 
 . 
 . * (3) Corruption and human development
 . * ------------------------------------
 . 
 . * Before graphing the variables, we need to pass a few graph options, because
 . * the Corruption Perceptions Index is reverse-coded (0 marks high corruption,
 . * and 10 marks very low corruption). To enhance visual interpretation, we
 . * therefore use an inverted axis scale, and add horizontal axis labels to it.
 . sc corruption hdi, ysc(rev) ///
 >         xla(0 "Low" 1 "High") yla(0 "Highly corrupt" 10 "Lowly corrupt", angle(h))
 >  ///
 >         name(corruption_hdi, replace)

 . 
 . * The pattern that appears graphically is not linear: corruption stays roughly
 . * constant for low to medium values of HDI, and then drops rapidly at high values.
 . * Given its shape, this relationship is more likely to follow a power function,
 . * i.e. of the form y = x^n where y is corruption, x is HDI and n > 1 is a power
 . * (n = 2 being the quadratic case). If the correlation coefficient is
 . * statistically significant, we might treat the relationship between corruption
 . * and HDI as approximately linear, but we will lose some of the information
 . * observed visually by doing so.
 . pwcorr corruption hdi, obs sig

             | corrup~n      hdi
 -------------+------------------
  corruption |   1.0000 
             |
             |      181
             |
         hdi |   0.7244   1.0000 
             |   0.0000
             |      178      185
             |

 . 
 . 
 . * (4) Female members of parliament and corruption
 . * -----------------------------------------------
 . 
 . * Obtain summary statistics.
 . su femparl corruption

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
     femparl |       116    16.33103    10.11587          0       56.3
  corruption |       181    3.982868    2.089537   1.009626        9.4

 . 
 . * Visual inspection of the relationship within the mean-mean quadrants.
 . sc femparl corruption, yline(15) xline(4) ///
 >         name(femparl_corruption, replace)

 . 
 . * No clear pattern emerges from the scatterplot above. Never force a pattern
 . * onto the data: relationships should be apparent, not constructed. If there is
 . * no straightforward relationship, disregard it. Likewise, never include a
 . * graph in your work if the relationship that it intends to show will not
 . * strike the reader between the eyes (i.e. run an intra-ocular trauma test).
 . * Inconclusive visual inspection can sometimes come with significant
 . * correlations, but that is not the case here: the correlation matrix further
 . * below shows a coefficient close to zero (r = 0.03) that is not significant at
 . * the 5% level, so there is no substantive justification for the relationship.
 . 
 . 
 . * ================
 . * = SCATTERPLOTS =
 . * ================
 . 
 . 
 . * Scatterplot matrices
 . * --------------------
 . 
 . * Start with visual inspection of the data organized as a scatterplot matrix.
 . * A scatterplot matrix contains all possible bivariate relationships between
 . * any number of variables. Building a matrix of your DV and IVs lets you spot
 . * relationships between IVs, which will be useful later on in your analysis.
 . * Note that the example below shows the log-transformed measure of GDP per capita.
 . gr mat births schooling log_gdpc corruption femparl, ///
 >         name(gr_matrix, replace)

 . 
 . * You could also look at a sparser version of the matrix that shows only half of
 . * all plots for a subset of geographical regions.
 . gr mat births schooling log_gdpc corruption femparl if inlist(region, 4, 5), half 
 > ///
 >         name(gr_matrix_regions4_5, replace)

 . 
 . * The most practical way to consider all possible correlations in a list of
 . * predictors (or independent variables) is to build a correlation matrix out
 . * of their respective pairwise correlations. "Pairwise" indicates that each
 . * coefficient uses all observations that are nonmissing for that particular pair
 . * of variables, whereas listwise deletion disregards all observations where any
 . * of the variables in the list is missing.
 . pwcorr births schooling log_gdpc corruption femparl

             |   births school~g log_gdpc corrup~n  femparl
 -------------+---------------------------------------------
      births |   1.0000 
   schooling |  -0.7394   1.0000 
    log_gdpc |  -0.7001   0.7732   1.0000 
  corruption |  -0.5175   0.6494   0.8033   1.0000 
     femparl |   0.0066  -0.0436  -0.0674   0.0315   1.0000 

 . 
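 . * Sketch (not part of the logged run): to see how much the treatment of missing
 . * values matters, the same matrix can be computed with -correlate-, which uses
 . * listwise deletion and keeps only the countries that have valid values on all
 . * five variables.
 . *     corr births schooling log_gdpc corruption femparl
 . 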
 . * The most common way to indicate statistically significant correlations in
 . * a correlation matrix is to use asterisks (stars) to mark them when their
 . * p-value is below the level of statistical significance.
 . pwcorr births schooling log_gdpc corruption femparl, star(.05)

             |   births school~g log_gdpc corrup~n  femparl
 -------------+---------------------------------------------
      births |   1.0000 
   schooling |  -0.7394*  1.0000 
    log_gdpc |  -0.7001*  0.7732*  1.0000 
  corruption |  -0.5175*  0.6494*  0.8033*  1.0000 
     femparl |   0.0066  -0.0436  -0.0674   0.0315   1.0000 

 . 
 . * For exploratory purposes, another option can be used to print out only the
 . * statistically significant correlations, which comes in handy especially in
 . * very large matrices with mostly insignificant correlation coefficients.
 . pwcorr births schooling log_gdpc corruption femparl, print(.05)

             |   births school~g log_gdpc corrup~n  femparl
 -------------+---------------------------------------------
      births |   1.0000 
   schooling |  -0.7394   1.0000 
    log_gdpc |  -0.7001   0.7732   1.0000 
  corruption |  -0.5175   0.6494   0.8033   1.0000 
     femparl |                                       1.0000 

 . 
 . * Export a correlation matrix.
 . mkcorr births schooling gdpc corruption femparl, ///
 >         lab num sig log("week7_correlations.txt") replace
 (note: file week7_correlations.txt not found)

 . 
 . 
 . * Scatterplots with marker labels
 . * -------------------------------
 . 
 . * Stata requires passing a lot of options to produce informative graphs. If you
 . * are using a set of consistent options on several graphs, you can store these
 . * in a global macro and apply them by calling the macro with a dollar sign ($).
 . * The following global macro is a list of graph options to make scatterplots
 . * more informative by showing country codes instead of anonymous data points:
 . global ccode "ms(i) mlabpos(0) mlab(ccodewb) legend(off)"

 . 
 . * The options contained in the global macro make the marker symbol invisible,
 . * then center the marker label and fill it with the ccodewb variable (holding
 . * country codes from the World Bank) in place of the usual dot markers.
 . * In the following plots, passing the $ccode option will result in actually
 . * passing these graph options, stored in the ccode ("country codes") macro.
 . * Note that this is a hack, and that you would not normally fiddle with global
 . * macros if you were programming Stata at a more advanced level: you would use
 . * local macros, which are more complex in usage and therefore avoided here.
 . 
 . * Improve previous example.
 . sc births schooling, $ccode ///
 >         name(fert_edu1, replace)

 . 
 . * Add a color difference to Western states by overlaying multiple scatterplots.
 . sc births schooling, $ccode || ///
 >         sc births schooling if region == 5, $ccode ///
 >         name(fert_edu2, replace)

 . 
 . * Add a tone and color difference to sub-Saharan African states (more options!).
 . sc births schooling, mlabc(gs10) $ccode || ///
 >         sc births schooling if region == 4, $ccode ///
 >         name(fert_edu3, replace)

 . 
 . * There are binders full of Stata graph options like these. Have a look at the
 . * help pages for two-way graphs (h tw) for a list that applies to scatterplots.
 . 
 . 
 . * Scatterplots with histograms
 . * ----------------------------
 . 
 . * Or, how to combine graphs with insane axis options.
 . sc births schooling, ///
 >         yti("") xti("") ysc(alt) yla(none, angle(v)) xsc(alt) xla(none, grid gmax)
 >  ///
 >         name(plot2, replace) plotregion(style(none))

 . 
 . * Plot 1 is top-left.
 . tw hist births, ///
 >         xsc(alt rev) xla(none) xti("") horiz fxsize(25) ///
 >         name(plot1, replace) plotregion(style(none))

 . 
 . * Plot 3 is bottom-right.
 . tw hist schooling, ///
 >         ysc(alt rev) yla(none, nogrid angle(v)) yti("") xla(,grid gmax) fysize(25)
 >  ///
 >         name(plot3, replace) plotregion(style(none))

 . 
 . * Combined plots with square ratio (y-size = x-size).
 . gr combine plot1 plot2 plot3, ///
 >         imargin(0 0 0 0) hole(3) ysiz(5) xsiz(5) ///
 >         name(fert_edu4, replace)
 (note: scheme burd not found, using s2color)
 (note: scheme burd not found, using s2color)
 (note: scheme burd not found, using s2color)
 (note: scheme burd not found, using s2color)
 (note: scheme burd not found, using s2color)
 (note: scheme burd not found, using s2color)

 . 
 . * Cleanup, focus on result.
 . gr drop plot1 plot2 plot3

 . gr di fert_edu4

 . 
 . 
 . * Scatterplots with smoothed lines
 . * --------------------------------
 . 
 . * Another way to visualize the quality of a linear fit is to plot a smoothed fit
 . * with the -lowess- command, to show departures from linearity in the IV effect:
 . lowess births schooling, ///
 >         name(fert_edu_lowess, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The LOWESS smoother available with -lowess- in Stata can operate as a moving 
 . * average (running mean) or as a least squares estimator, which is the default.
 . * The core mechanics of a least squares estimator are on next week's menu.
 . 
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file

 . 
 . * Log results.
 . cap log using code/week8.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 8 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Fertility and Education, Part 2
 > 
 >  - DATA:   Quality of Government (2013)
 > 
 >    This do-file is a continuation from last week's do-file, which we start by
 >    running in the background. This will prepare the data by renaming variables,
 >    logging GDP per capita and recoding geographical regions to fewer categories
 >    and shorter labels.
 >    
 >    We then explore simple linear regression using a similar set of variables as
 >    the one used last week. Some variables are interpreted on non-linear scales.
 >    Dummies (and categorical variables generally) can also be passed to a simple
 >    linear regression equation, with another slight adjustment in interpretation.
 > 
 >    Our next two sessions will move from these fundamentals about regression to
 >    multiple linear regression, and then to logistic models for binary dependent
 >    variables. Make sure that you understand the logic of ordinary least squares
 >    (OLS) in order to include simple linear regression models in your next draft.
 > 
 >    Last updated 2013-05-28.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Replicate last week and clear graphs. The data left in memory is a modified
 . * version of the Quality of Government dataset, with all necessary recodes and
 . * renames already performed. It is very common to use different do-files for
 . * different tasks. In this example, the previous do-file is used for data
 . * management and the current do-file is used for analysis.
 . do code/week7.do

 . 
 . * Check setup.
 . run setup/require mkcorr renvars

 . 
 . * Log results.
 . cap log using code/week7.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 7 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Fertility and Education, Part 1
 > 
 >  - DATA:   Quality of Government (2013)
 > 
 >    This do-file is the last one that we will run on the topic of association.
 >    You are expected to submit the second draft of your work very soon: the draft
 >    paper that you will be submitting will be mostly significance tests, so make
 >    sure that you have done all the necessary readings and practice by then.
 >    
 >    Note that significance tests should not be used blindly: run them only when
 >    you observe a particular association that you want to quantify, such as a
 >    difference in means or proportions. Also remember that a significance test
 >    is not a means to test a substantive hypothesis.
 >    
 >    At that stage, it will become indispensable that you have caught up with the
 >    textbook readings, and that you understand enough about Stata syntax to focus
 >    on interpreting rather than coding. Use the course material to bring yourself
 >    up to speed with both Stata and essential statistical theory.
 > 
 >    Last updated 2013-05-28.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load QOG dataset.
 . use data/qog2013, clear
 (Quality of Government 2013)

 . 
 . * Rename variables to short handles.
 . renvars wdi_fr bl_asy25mf undp_hdi ti_cpi gid_wip \ births schooling hdi corruptio
 > n femparl

 . 
 . * Compute GDP per capita.
 . gen gdpc = unna_gdp / unna_pop
 (2 missing values generated)

 . la var gdpc "Real GDP per capita (constant USD)"

 . 
 . * Recode to fewer, shorter labels.
 . recode ht_region (6/10 = 6), gen(region)
 (44 differences between ht_region and region)

 . la var region "Geographical region"

 . la val region region

 . la def region 1 "E. Europe and PSU" 2 "Lat. America" ///
 >         3 "N. Africa and M. East" 4 "Sub-Sah. Africa" ///
 >         5 "W. Europe and N. America" 6 "Asia, Pacific and Carribean" ///
 >         , replace

 . 
 . 
 . * Finalized sample
 . * ----------------
 . 
 . * Have a quick look.
 . codebook births schooling gdpc hdi corruption femparl region, c

 Variable    Obs Unique      Mean       Min       Max  Label
 ------------------------------------------------------------------------------------
 births      187    179  2.900285     1.149     7.115  Fertility Rate (Births per ...
 schooling   143    143  7.813079  1.202597  13.27008  Average Schooling Years, Fe...
 gdpc        191    191  10927.42   137.082  129959.4  Real GDP per capita (consta...
 hdi         185    162  .6554973      .277      .941  Human Development Index
 corruption  181     68  3.982868  1.009626       9.4  Corruption Perceptions Index
 femparl     116     93  16.33103         0      56.3  Women in Parliament (%)
 region      193      6  3.911917         1         6  Geographical region
 ------------------------------------------------------------------------------------

 . 
 . * Check missing values.
 . misstable pat births schooling gdpc hdi corruption femparl region ccodewb, freq

         Missing-value patterns
           (1 means complete)

              |   Pattern
    Frequency |  1  2  3  4    5  6  7
  ------------+------------------------
           91 |  1  1  1  1    1  1  1
              |
           48 |  1  1  1  1    1  1  0
           21 |  1  1  1  1    1  0  1
           15 |  1  1  1  1    1  0  0
            5 |  1  1  1  1    0  0  0
            4 |  1  1  0  0    0  0  0
            2 |  1  1  1  0    1  0  1
            1 |  0  1  1  1    1  0  0
            1 |  0  1  1  1    1  1  0
            1 |  1  0  0  0    1  1  0
            1 |  1  0  1  1    1  1  1
            1 |  1  1  0  1    0  0  0
            1 |  1  1  1  0    0  0  0
            1 |  1  1  1  1    0  1  1
  ------------+------------------------
          193 |

  Variables are  (1) ccodewb  (2) gdpc  (3) births  (4) hdi  (5) corruption
                 (6) schooling  (7) femparl

 . 
 . * You would usually delete incomplete observations at this stage, and then count
 . * the number of observations in your finalized sample. We exceptionally keep the
 . * missing values here to illustrate how pairwise and listwise correlations work.
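 . 
 . * A minimal sketch of that listwise deletion step, left commented out because
 . * this do-file deliberately keeps incomplete observations. The list of variables
 . * passed to mi() is an assumption; adapt it to your own model (uncomment to run):
 . * drop if mi(births, schooling, gdpc, hdi, corruption, femparl)
 . * count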
 . 
 . 
 . 
 . * ===============
 . * = CORRELATION =
 . * ===============
 . 
 . 
 . * (1) Fertility rates and schooling years
 . * ---------------------------------------
 . 
 . scatter births schooling, ///
 >         name(fert_edu, replace)
 (note: scheme burd not found, using s2color)

 . 
 . pwcorr births schooling, obs sig

             |   births school~g
 -------------+------------------
      births |   1.0000 
             |
             |      187
             |
   schooling |  -0.7394   1.0000 
             |   0.0000
             |      142      143
             |

 . 
 . 
 . * (2) Schooling years and (log) Gross Domestic Product
 . * ----------------------------------------------------
 . 
 . sc gdpc schooling, ///
 >         name(gdpc_edu, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * A first look at the scatterplot shows no clear linear pattern, but we know
 . * from a previous session that the logarithmic variable transformation can be
 . * used to visualize exponential relationships differently. Consequently, we
 . * try to visualize the same variables with a logarithmic scale for GDP per capita.
 . sc gdpc schooling, ysc(log) ///
 >         name(gdpc_edu, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * In this classical case, log units are more informative than metric ones to
 . * identify the relationship between the dependent and independent variables.
 . gen log_gdpc = ln(gdpc)
 (2 missing values generated)

 . la var log_gdpc "Real GDP per capita (log)"

 . 
 . * Verify the transformation.
 . sc log_gdpc schooling, ///
 >         name(gdpc_edu, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Obtain summary statistics.
 . su log_gdpc schooling

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
    log_gdpc |       191    8.159699    1.592255   4.920579   11.77498
   schooling |       143    7.813079    2.904687   1.202597   13.27008

 . 
 . * Visual inspection of the relationship within the mean-mean quadrants.
 . sc log_gdpc schooling, yline(7.5) xline(6) ///
 >         name(log_gdpc_schooling, replace)
 (note: scheme burd not found, using s2color)

 . 
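 . * The reference lines above are typed in by hand (7.5 and 6), whereas the
 . * summary statistics report means of about 8.16 and 7.81. A possible sketch
 . * that reads each mean from the r(mean) result of -summarize- instead. Run the
 . * lines together in a do-file, since local macros do not persist otherwise;
 . * the macro names and the graph name are illustrative:
 . * su log_gdpc
 . * local my = r(mean)
 . * su schooling
 . * local mx = r(mean)
 . * sc log_gdpc schooling, yline(`my') xline(`mx') name(quadrants_means, replace)
 . 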
 . * Verify inspection computationally.
 . pwcorr gdpc log_gdpc schooling, obs sig

             |     gdpc log_gdpc school~g
 -------------+---------------------------
        gdpc |   1.0000 
             |
             |      191
             |
    log_gdpc |   0.7657   1.0000 
             |   0.0000
             |      191      191
             |
   schooling |   0.5537   0.7732   1.0000 
             |   0.0000   0.0000
             |      141      141      143
             |

 . 
 . 
 . * (3) Corruption and human development
 . * ------------------------------------
 . 
 . * Before graphing the variables, we need to pass a few graph options, because
 . * the Corruption Perceptions Index is reverse-coded (0 marks high corruption,
 . * and 10 marks very low corruption). To enhance visual interpretation, we
 . * therefore use an inverted axis scale, and add horizontal axis labels to it.
 . sc corruption hdi, ysc(rev) ///
 >         xla(0 "Low" 1 "High") yla(0 "Highly corrupt" 10 "Lowly corrupt", angle(h))
 >  ///
 >         name(corruption_hdi, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The pattern that appears graphically is not linear: corruption stays flat
 . * for low to medium values of HDI, and then rapidly drops towards high values.
 . * Given its shape, this relationship is thus more likely to be polynomial, i.e.
 . * of the form y = x^n where y is corruption, x is HDI and n > 1 is a power
 . * (n = 2 in the quadratic case).
 . * If the correlation coefficient is statistically significant, we might treat
 . * the relationship between corruption and HDI as approximately linear, but we
 . * will lose some of the information observed visually by doing so.
 . pwcorr corruption hdi, obs sig

             | corrup~n      hdi
 -------------+------------------
  corruption |   1.0000 
             |
             |      181
             |
         hdi |   0.7244   1.0000 
             |   0.0000
             |      178      185
             |

 . 
 . 
 . * (4) Women in parliament and corruption
 . * --------------------------------------
 . 
 . * Obtain summary statistics.
 . su femparl corruption

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
     femparl |       116    16.33103    10.11587          0       56.3
  corruption |       181    3.982868    2.089537   1.009626        9.4

 . 
 . * Visual inspection of the relationship within the mean-mean quadrants.
 . sc femparl corruption, yline(15) xline(4) ///
 >         name(femparl_corruption, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * No clear pattern emerges from the scatterplot above. Never force a pattern
 . * onto the data: relationships should be apparent, not constructed. If there is
 . * no straightforward relationship, disregard it. Likewise, never include a
 . * graph in your work if the relationship that it intends to show will not
 . * strike the reader between the eyes (i.e. run an intra-ocular trauma test).
 . * Inconclusive visual inspection can sometimes come with significant
 . * correlations. That is not the case here: the coefficient reported in the
 . * correlation matrices of this do-file is close to zero and not significant.
 . * Either way, a coefficient needs substantive justification from visual
 . * inspection and theoretical elaboration.
 . 
 . 
 . * ================
 . * = SCATTERPLOTS =
 . * ================
 . 
 . 
 . * Scatterplot matrixes
 . * --------------------
 . 
 . * Start with visual inspection of the data organized as a scatterplot matrix.
 . * A scatterplot matrix contains all possible bivariate relationships between
 . * any number of variables. Building a matrix of your DV and IVs allows you to
 . * spot relationships between IVs, which will be useful later on in your analysis.
 . * Note that the example below shows the log-transformed measure of GDP per capita.
 . gr mat births schooling log_gdpc corruption femparl, ///
 >         name(gr_matrix, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * You could also look at a sparser version of the matrix that shows only half of
 . * all plots for a subset of geographical regions.
 . gr mat births schooling log_gdpc corruption femparl if inlist(region, 4, 5), half 
 > ///
 >         name(gr_matrix_regions4_5, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The most practical way to consider all possible correlations in a list of
 . * predictors (or independent variables) is to build a correlation matrix out
 . * of their respective pairwise correlations. "Pairwise" indicates that each
 . * correlation coefficient uses all observations that are valid (nonmissing)
 . * for that pair of variables, even if other variables in the list are missing.
 . pwcorr births schooling log_gdpc corruption femparl

             |   births school~g log_gdpc corrup~n  femparl
 -------------+---------------------------------------------
      births |   1.0000 
   schooling |  -0.7394   1.0000 
    log_gdpc |  -0.7001   0.7732   1.0000 
  corruption |  -0.5175   0.6494   0.8033   1.0000 
     femparl |   0.0066  -0.0436  -0.0674   0.0315   1.0000 

 . 
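 . * For contrast, a sketch of the listwise alternative: -correlate- (corr) drops
 . * every observation that is missing on any of the listed variables, so all of
 . * its coefficients are computed on one common sample (uncomment to compare):
 . * corr births schooling log_gdpc corruption femparl
 . 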
 . * The most common way to indicate statistically significant correlations in
 . * a correlation matrix is to use asterisks (stars) to mark them when their
 . * p-value is below the level of statistical significance.
 . pwcorr births schooling log_gdpc corruption femparl, star(.05)

             |   births school~g log_gdpc corrup~n  femparl
 -------------+---------------------------------------------
      births |   1.0000 
   schooling |  -0.7394*  1.0000 
    log_gdpc |  -0.7001*  0.7732*  1.0000 
  corruption |  -0.5175*  0.6494*  0.8033*  1.0000 
     femparl |   0.0066  -0.0436  -0.0674   0.0315   1.0000 

 . 
 . * For exploratory purposes, another option can be used to print out only the
 . * statistically significant correlations, which comes in handy especially in
 . * very large matrices with mostly insignificant correlation coefficients.
 . pwcorr births schooling log_gdpc corruption femparl, print(.05)

             |   births school~g log_gdpc corrup~n  femparl
 -------------+---------------------------------------------
      births |   1.0000 
   schooling |  -0.7394   1.0000 
    log_gdpc |  -0.7001   0.7732   1.0000 
  corruption |  -0.5175   0.6494   0.8033   1.0000 
     femparl |                                       1.0000 

 . 
 . * Export a correlation matrix.
 . mkcorr births schooling gdpc corruption femparl, ///
 >         lab num sig log("week7_correlations.txt") replace

 . 
 . 
 . * Scatterplots with marker labels
 . * -------------------------------
 . 
 . * Stata requires passing a lot of options to produce informative graphs. If you
 . * are using a set of consistent options on several graphs, you can store these
 . * in a global macro and apply them by calling the macro with a dollar sign ($).
 . * The following global macro is a list of graph options to make scatterplots
 . * more informative by showing country codes instead of anonymous data points:
 . global ccode "ms(i) mlabpos(0) mlab(ccodewb) legend(off)"

 . 
 . * The options contained in the global macro make the marker symbol invisible,
 . * then center the marker label and fill it with the ccodewb variable (holding
 . * country codes from the World Bank) in place of the usual dot markers.
 . * In the following plots, passing the $ccode option will result in actually
 . * passing these graph options, stored in the ccode ("country codes") macro.
 . * Note that this is a hack, and that you would not normally fiddle with global
 . * macros if you were programming Stata at a more advanced level: you would use
 . * local macros, which are more complex in usage and therefore avoided here.
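 . 
 . * For reference only, a sketch of the same idea with a local macro. A local is
 . * called with `name' rather than $name and only exists while the do-file (or
 . * the selected block of lines) runs, so both lines must be executed together.
 . * The graph name below is illustrative:
 . * local ccode "ms(i) mlabpos(0) mlab(ccodewb) legend(off)"
 . * sc births schooling, `ccode' name(fert_edu_local, replace)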
 . 
 . * Improve previous example.
 . sc births schooling, $ccode ///
 >         name(fert_edu1, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Add a color difference to Western states by overlaying multiple scatterplots.
 . sc births schooling, $ccode || ///
 >         sc births schooling if region == 5, $ccode ///
 >         name(fert_edu2, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Add a tone and color difference to sub-Saharan African states (more options!).
 . sc births schooling, mlabc(gs10) $ccode || ///
 >         sc births schooling if region == 4, $ccode ///
 >         name(fert_edu3, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * There are binders full of Stata graph options like these. Have a look at the
 . * help pages for two-way graphs (h tw) for a list that applies to scatterplots.
 . 
 . 
 . * Scatterplots with histograms
 . * ----------------------------
 . 
 . * Or, how to combine graphs with insane axis options.
 . sc births schooling, ///
 >         yti("") xti("") ysc(alt) yla(none, angle(v)) xsc(alt) xla(none, grid gmax)
 >  ///
 >         name(plot2, replace) plotregion(style(none))
 (note: scheme burd not found, using s2color)

 . 
 . * Plot 1 is top-left.
 . tw hist births, ///
 >         xsc(alt rev) xla(none) xti("") horiz fxsize(25) ///
 >         name(plot1, replace) plotregion(style(none))
 (note: scheme burd not found, using s2color)

 . 
 . * Plot 3 is bottom-right.
 . tw hist schooling, ///
 >         ysc(alt rev) yla(none, nogrid angle(v)) yti("") xla(,grid gmax) fysize(25)
 >  ///
 >         name(plot3, replace) plotregion(style(none))
 (note: scheme burd not found, using s2color)

 . 
 . * Combined plots with square ratio (y-size = x-size).
 . gr combine plot1 plot2 plot3, ///
 >         imargin(0 0 0 0) hole(3) ysiz(5) xsiz(5) ///
 >         name(fert_edu4, replace)
 (note: scheme burd not found, using s2color)
 (note: scheme burd not found, using s2color)
 (note: scheme burd not found, using s2color)
 (note: scheme burd not found, using s2color)

 . 
 . * Cleanup, focus on result.
 . gr drop plot1 plot2 plot3

 . gr di fert_edu4

 . 
 . 
 . * Scatterplots with smoothed lines
 . * --------------------------------
 . 
 . * Another way to visualize the quality of a linear fit is to plot a smoothed fit
 . * with the -lowess- command, to show departures from linearity in the IV effect:
 . lowess births schooling, ///
 >         name(fert_edu_lowess, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The LOWESS smoother available with -lowess- in Stata can operate as a moving 
 . * average (running mean) or as a least squares estimator, which is the default.
 . * The core mechanics of a least squares estimator are on next week's menu.
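 . 
 . * A sketch of two variants, left commented out. The -mean- option requests the
 . * running-mean smoother, and bwidth() sets the share of observations used
 . * around each point (0.8 by default); the value .4 and the graph names are
 . * arbitrary choices made for illustration:
 . * lowess births schooling, mean name(fert_edu_lowess_mean, replace)
 . * lowess births schooling, bwidth(.4) name(fert_edu_lowess_bw4, replace)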
 . 
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file

 . gr drop _all

 . 
 . * Graph macro. If you remember what we did last week, we used a macro to label
 . * the data points with country codes instead of using anonymous dots. Since we
 . * have executed last week's do-file in the background, this macro is available
 . * in memory, so we will be able to use '$ccode' to produce better scatterplots
 . * in this do-file too. We will also be able to use the following macro, which
 . * will remove the legend and dash the regression line of our linear fits.
 . global ci "legend(off) lp(dash)"

 . 
 . 
 . * =====================
 . * = REGRESSION MODELS =
 . * =====================
 . 
 . 
 . * (1) Fertility Rates and Schooling Years
 . * ---------------------------------------
 . 
 . * We are looking again at the relationship between fertility and education that
 . * we already observed in our previous do-file. At this stage, we assume that you
 . * have a substantive model to explain the relationship that you are studying, or
 . * the results of the model will land nowhere and serve no analytical purpose.
 . 
 . * Visual fit.
 . sc births schooling, $ccode ///
 >     legend(off) yti("Fertility rate (births per woman)") ///
 >     name(fert_edu1, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Linear fit.
 . tw (sc births schooling, $ccode) (lfit births schooling, $ci), ///
 >     yti("Fertility rate (births per woman)") ///
 >     name(fert_edu2, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Add 95% CI.
 . tw (sc births schooling, $ccode) (lfitci births schooling, $ci), ///
 >     yti("Fertility rate (births per woman)") ///
 >     name(fert_edu3, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Estimate the predicted effect of the education level on the fertility rate.
 . * Function: number of births = _cons (alpha) + Coef (beta) * schooling years.
 . * Equation: Y (DV) = alpha + beta X (IV) + epsilon (error term); the predicted
 . * value of Y is alpha + beta X, without the error term.
 . reg births schooling

      Source |       SS       df       MS              Number of obs =     142
 -------------+------------------------------           F(  1,   140) =  168.81
       Model |  148.247435     1  148.247435           Prob > F      =  0.0000
    Residual |  122.945776   140  .878184113           R-squared     =  0.5466
 -------------+------------------------------           Adj R-squared =  0.5434
       Total |  271.193211   141   1.9233561           Root MSE      =  .93711

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.3533024   .0271923   -12.99   0.000    -.4070631   -.2995418
       _cons |   5.528379   .2259655    24.47   0.000     5.081633    5.975125
 ------------------------------------------------------------------------------

 . 
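 . * Worked example with the coefficients above: for a country with 10 years of
 . * schooling, predicted births = 5.528379 - .3533024 * 10, or roughly 2 births
 . * per woman. Right after -reg-, the same arithmetic can be run from the stored
 . * coefficients (the value of 10 years is an arbitrary illustration):
 . * di _b[_cons] + _b[schooling] * 10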
 . 
 . * Plotting regression results
 . * ---------------------------
 . 
 . * Simple residuals-versus-fitted plot.
 . rvfplot, yline(0) ///
 >     name(rvfplot, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Get fitted values.
 . cap drop yhat

 . predict yhat
 (option xb assumed; fitted values)
 (50 missing values generated)

 . 
 . * Get residuals.
 . cap drop r

 . predict r, resid
 (51 missing values generated)

 . 
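 . * Note the counts above: fitted values exist wherever schooling is observed
 . * (143 countries), while residuals also require births (142 countries), hence
 . * 50 versus 51 missing values out of 193. A sketch to restrict predictions to
 . * the estimation sample (the variable name is illustrative):
 . * predict yhat_sample if e(sample)
 . 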
 . * Plot residuals against predicted (fitted) values of the DV.
 . sc r yhat, yline(0) $ccode ///
 >     name(rvfplot2, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Plot observed and predicted values of the DV against the IV.
 . sc births schooling || conn yhat schooling, ///
 >     name(dv_yhat, replace)
 (note: scheme burd not found, using s2color)

 . 
 . 
 . * Small multiples
 . * ---------------
 . 
 . * Draw scatterplots and linear fits for each region. Visualizing small multiples
 . * requires using an independent variable with a limited number of categories and
 . * might reveal additional strengths or weaknesses of your model.
 . sc births schooling || lfit births schooling, by(region) ///
 >      name(lfit_region, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Run the linear regression models for each region. Observe how the standard
 . * errors and p-values of the regression coefficients tend to increase when the
 . * regional sample size falls to lower numbers of observations.
 . bys region: reg births schooling

 ------------------------------------------------------------------------------------
 -> region = E. Europe and PSU

      Source |       SS       df       MS              Number of obs =      20
 -------------+------------------------------           F(  1,    18) =    2.45
       Model |  .711741135     1  .711741135           Prob > F      =  0.1349
    Residual |  5.22767201    18  .290426223           R-squared     =  0.1198
 -------------+------------------------------           Adj R-squared =  0.0709
       Total |  5.93941314    19  .312600692           Root MSE      =  .53891

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.1997857   .1276207    -1.57   0.135    -.4679069    .0683355
       _cons |   3.833607   1.363468     2.81   0.012     .9690679    6.698146
 ------------------------------------------------------------------------------

 ------------------------------------------------------------------------------------
 -> region = Lat. America

      Source |       SS       df       MS              Number of obs =      20
 -------------+------------------------------           F(  1,    18) =   14.09
       Model |  3.27708082     1  3.27708082           Prob > F      =  0.0015
    Residual |   4.1868556    18  .232603089           R-squared     =  0.4391
 -------------+------------------------------           Adj R-squared =  0.4079
       Total |  7.46393642    19  .392838759           Root MSE      =  .48229

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.2559176   .0681812    -3.75   0.001    -.3991609   -.1126743
       _cons |    4.50041   .5331816     8.44   0.000     3.380237    5.620583
 ------------------------------------------------------------------------------

 ------------------------------------------------------------------------------------
 -> region = N. Africa and M. East

      Source |       SS       df       MS              Number of obs =      18
 -------------+------------------------------           F(  1,    16) =    3.87
       Model |  3.34242722     1  3.34242722           Prob > F      =  0.0668
    Residual |  13.8214319    16  .863839496           R-squared     =  0.1947
 -------------+------------------------------           Adj R-squared =  0.1444
       Total |  17.1638592    17  1.00963877           Root MSE      =  .92943

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.2038483   .1036317    -1.97   0.067    -.4235377    .0158411
       _cons |   4.182561   .7720988     5.42   0.000     2.545785    5.819337
 ------------------------------------------------------------------------------

 ------------------------------------------------------------------------------------
 -> region = Sub-Sah. Africa

      Source |       SS       df       MS              Number of obs =      32
 -------------+------------------------------           F(  1,    30) =   29.16
       Model |  23.3848885     1  23.3848885           Prob > F      =  0.0000
    Residual |  24.0580149    30   .80193383           R-squared     =  0.4929
 -------------+------------------------------           Adj R-squared =  0.4760
       Total |  47.4429034    31  1.53041624           Root MSE      =  .89551

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.4246033   .0786294    -5.40   0.000    -.5851859   -.2640206
       _cons |   6.665125   .4098737    16.26   0.000     5.828051    7.502198
 ------------------------------------------------------------------------------

 ------------------------------------------------------------------------------------
 -> region = W. Europe and N. America

      Source |       SS       df       MS              Number of obs =      23
 -------------+------------------------------           F(  1,    21) =    6.13
       Model |  .395655645     1  .395655645           Prob > F      =  0.0219
    Residual |  1.35538605    21  .064542193           R-squared     =  0.2260
 -------------+------------------------------           Adj R-squared =  0.1891
       Total |  1.75104169    22  .079592804           Root MSE      =  .25405

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |   .1026995   .0414793     2.48   0.022     .0164386    .1889605
       _cons |   .6359472   .4511737     1.41   0.173      -.30232    1.574214
 ------------------------------------------------------------------------------

 ------------------------------------------------------------------------------------
 -> region = Asia, Pacific and Caribbean

      Source |       SS       df       MS              Number of obs =      29
 -------------+------------------------------           F(  1,    27) =    5.72
       Model |  5.47598716     1  5.47598716           Prob > F      =  0.0240
    Residual |  25.8508772    27  .957439897           R-squared     =  0.1748
 -------------+------------------------------           Adj R-squared =  0.1442
       Total |  31.3268644    28  1.11881658           Root MSE      =  .97849

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.1658877   .0693647    -2.39   0.024    -.3082123    -.023563
       _cons |   3.682463   .5326461     6.91   0.000     2.589564    4.775363
 ------------------------------------------------------------------------------

 . 
 . * Detailed residuals-versus-fitted plots.
 . sc r yhat, yline(0) by(region, total) $ccode ///
 >     name(rvfplot2, replace)
 (note: scheme burd not found, using s2color)

 . 
 . 
 . * Fitting a transformed IV
 . * ------------------------
 . 
 . * The -qfit- command shows that a more advanced model might better explain the
 . * DV-IV relationship, as it looks less linear than quadratic: Y = a + bX could
 . * be replaced with Y = a + b1 X + b2 X^2 to obtain a better fit.
 . tw (sc births schooling, $ccode) (qfit births schooling, $ci), ///
 >     name(fert_edu_qfit, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * In this case, using the square root of the independent variable might provide
 . * better estimates of its actual effect on the dependent variable. We could have
 . * diagnosed that earlier by looking at the normality of the schooling variable,
 . * for which a square root transformation is recommended by the ladder commands.
 . 
 . * Variable transformation.
 . gen sqrt_schooling = sqrt(schooling)
 (50 missing values generated)

 . la var sqrt_schooling "Average schooling years (sqrt)"

 . 
 . * Visual inspection.
 . tw (sc births sqrt_schooling, $ccode) (lfit births sqrt_schooling, $ci), ///
 >     name(fert_edu_qfit, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Regression model of the form Y = alpha + beta sqrt(X).
 . reg births sqrt_schooling

      Source |       SS       df       MS              Number of obs =     142
 -------------+------------------------------           F(  1,   140) =  190.62
       Model |   156.35743     1   156.35743           Prob > F      =  0.0000
    Residual |  114.835781   140  .820255576           R-squared     =  0.5766
 -------------+------------------------------           Adj R-squared =  0.5735
       Total |  271.193211   141   1.9233561           Root MSE      =  .90568

 --------------------------------------------------------------------------------
        births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 ---------------+----------------------------------------------------------------
 sqrt_schooling |  -1.857605   .1345453   -13.81   0.000    -2.123608   -1.591601
         _cons |    7.85353    .375534    20.91   0.000     7.111079    8.595981
 --------------------------------------------------------------------------------

 . 
 . * Reading the regression coefficient for schooling is less intuitive when it is
 . * computed on the square root of the variable: it requires a short equation to
 . * produce real-world examples of what the model means. However, more variance
 . * in the data is explained when the model is written in this more complex form.
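
The "short equation" mentioned above can be sketched as follows. This is a minimal illustration that copies the two coefficients by hand from the regression table above; the schooling values of 4 and 9 years are arbitrary examples chosen because their square roots are whole numbers. The model predicts roughly 4.1 births per woman at 4 years of schooling and roughly 2.3 at 9 years.

```stata
* Coefficients copied by hand from the -reg births sqrt_schooling- output.
scalar alpha = 7.85353
scalar beta  = -1.857605

* Predicted births per woman at 4 average years of schooling (sqrt = 2).
display scalar(alpha) + scalar(beta) * sqrt(4)

* Predicted births per woman at 9 average years of schooling (sqrt = 3).
display scalar(alpha) + scalar(beta) * sqrt(9)
```

Because the IV enters as a square root, the same five-year gap lower on the scale (say, from 1 to 6 years) corresponds to a larger predicted drop in fertility than the gap from 4 to 9 years.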
 . 
 . * Visualization with solved square root units.
 . tw (sc births sqrt_schooling, $ccode) (lfit births sqrt_schooling, $ci), ///
 >     xla(1 "1" 1.5 "2.25" 2 "4" 2.5 "6.25" 3 "9" 3.5 "12.25") ///
 >     xti("Average schooling years") note("Horizontal axis in squared units.") ///
 >     name(fert_edu_sqrt, replace)
 (note: scheme burd not found, using s2color)

 . 
 . 
 . * (2) Fertility Rates and (Log) Gross Domestic Product
 . * ----------------------------------------------------
 . 
 . * As always, start with a visual inspection of the relationship.
 . tw (sc births log_gdpc, $ccode) (lfit births log_gdpc, $ci), ///
 >     name(fert_gdpc, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The interpretation of the coefficient for GDP per capita is going to be less
 . * intuitive due to its logarithmic units, but the transformation was necessary
 . * to identify the linear relationship between the two variables.
 . 
 . * Regression model of the form Y = alpha + beta ln(X).
 . reg births log_gdpc

      Source |       SS       df       MS              Number of obs =     186
 -------------+------------------------------           F(  1,   184) =  176.93
       Model |  186.917353     1  186.917353           Prob > F      =  0.0000
    Residual |  194.390066   184  1.05646775           R-squared     =  0.4902
 -------------+------------------------------           Adj R-squared =  0.4874
       Total |  381.307418   185  2.06112118           Root MSE      =  1.0278

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    log_gdpc |  -.6375227   .0479291   -13.30   0.000    -.7320839   -.5429615
       _cons |   8.065113    .396594    20.34   0.000     7.282656    8.847569
 ------------------------------------------------------------------------------

 . 
 . 
 . * Fitting 'lin-log' equations
 . * ---------------------------
 . 
 . * The relationship is a 'lin-log' equation, such that a 1% increase in X (IV) is
 . * associated with a 0.01 * beta unit change in Y (DV). In this model, it means
 . * that a 15% increase in GDP per cap. is associated with -.64 * log(1.15) = -.09
 . * births per woman. For GDP per capita to reduce fertility by 1 birth per woman,
 . * this model would require exp(1/.64) = 4.8, a 380% increase in GDP per capita.
 . * This is easy to observe from the reverse equation: -.64 * log(4.8) = -1.
 . 
 . * Why is that number so high? Recall how linear regression works: by computing
 . * the average marginal change that occurs in the DV (the coefficient) for each
 . * unit of the IV. This is the average marginal effect, computed over the whole
 . * sample. If GDP per capita expresses decreasing returns on fertility, then the
 . * average effect is bound to be higher than what is actually required at lower 
 . * levels of GDP per capita. What an econometrician would do in that case is to
 . * compute semi-elasticities (because the model is semi-logarithmic), but if you
 . * only need to quantify the average relationship, converting by hand is enough.
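
Converting a lin-log coefficient by hand can be sketched with a few -display- calls. This is an illustration only: the coefficient is copied from the -reg births log_gdpc- output above, and the 15% increase is an arbitrary example. With that coefficient, the first result is close to -.09 births per woman and the second is a multiplicative factor close to 4.8.

```stata
* Coefficient on log_gdpc, copied by hand from the regression output above.
scalar beta = -.6375227

* Change in births per woman associated with a 15% increase in GDP per capita.
display scalar(beta) * ln(1.15)

* Factor by which GDP per capita must be multiplied for the model to predict
* one fewer birth per woman: solve beta * ln(k) = -1 for k.
display exp(-1 / scalar(beta))
```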
 . 
 . 
 . * (3) Corruption and Human Development
 . * ------------------------------------
 . 
 . * Visualizing a nonlinear, quadratic fit with corruption as the DV.
 . tw (sc corruption hdi, $ccode) (qfit corruption hdi, $ci), ///
 >     ysc(rev) yla(0 "High" 10 "Low") yti("Level of corruption") ///
 >     name(cpi_hdi, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Before interpreting the model, deal with the reverse-coding issue.
 . gen corrupt = 10 - corruption
 (12 missing values generated)

 . la var corrupt "Corruption Perception Index"

 . 
 . * Regression model in first approximation (linear form).
 . reg corrupt hdi

      Source |       SS       df       MS              Number of obs =     178
 -------------+------------------------------           F(  1,   176) =  194.29
       Model |  401.935273     1  401.935273           Prob > F      =  0.0000
    Residual |  364.107271   176  2.06879131           R-squared     =  0.5247
 -------------+------------------------------           Adj R-squared =  0.5220
       Total |  766.042544   177  4.32792398           Root MSE      =  1.4383

 ------------------------------------------------------------------------------
     corrupt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         hdi |  -8.630701   .6191935   -13.94   0.000    -9.852701   -7.408702
       _cons |   11.61443   .4174372    27.82   0.000      10.7906    12.43825
 ------------------------------------------------------------------------------

 . 
 . 
 . * Fitting a quadratic term
 . * ------------------------
 . 
 . * A more thorough exploration of residuals will be covered in later sessions
 . * on regression diagnostics, but here is a snapshot of what we can do and
 . * understand by studying residuals in a bit more depth.
 . cap drop yhat

 . predict yhat
 (option xb assumed; fitted values)
 (8 missing values generated)

 . 
 . * Plot of linear fitted values.
 . sc corrupt yhat hdi, yla(0 "Lowly corrupt" 10 "Highly corrupt") ///
 >     connect(i l) sort(yhat) ///
 >     name(r_linear, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The curvilinearity approaches the function f: y = x^2 and can be taken care
 . * of by squaring HDI and fitting the model again with the quadratic term. The
 . * final model is therefore the equation Y = alpha + beta_1 X + beta_2 X^2.
 . gen hdi2 = hdi^2
 (8 missing values generated)

 . 
 . * Regression model in second approximation (added quadratic term).
 . reg corrupt hdi hdi2

      Source |       SS       df       MS              Number of obs =     178
 -------------+------------------------------           F(  2,   175) =  204.26
       Model |  536.306972     2  268.153486           Prob > F      =  0.0000
    Residual |  229.735572   175   1.3127747           R-squared     =  0.7001
 -------------+------------------------------           Adj R-squared =  0.6967
       Total |  766.042544   177  4.32792398           Root MSE      =  1.1458

 ------------------------------------------------------------------------------
     corrupt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         hdi |   27.96508   3.650673     7.66   0.000     20.76007     35.1701
        hdi2 |  -29.54748    2.92053   -10.12   0.000    -35.31148   -23.78349
       _cons |   1.209076   1.080905     1.12   0.265    -.9242118    3.342363
 ------------------------------------------------------------------------------
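
One way to read the quadratic model is to locate its turning point, where the slope beta_1 + 2 * beta_2 * X equals zero. The sketch below copies both coefficients by hand from the table above. Since beta_2 is negative, the turning point is a maximum: predicted corruption peaks at an HDI of about .47 and declines at higher levels of human development.

```stata
* Coefficients copied by hand from the -reg corrupt hdi hdi2- output.
scalar b1 = 27.96508
scalar b2 = -29.54748

* HDI value at which the fitted curve reaches its maximum: X = -b1 / (2 * b2).
display -scalar(b1) / (2 * scalar(b2))
```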

 . 
 . * Fitted values of the quadratic model.
 . cap drop yhat2

 . predict yhat2
 (option xb assumed; fitted values)
 (8 missing values generated)

 . 
 . * Comparison of both fits.
 . sc corrupt yhat2 hdi, yla(0 "Lowly corrupt" 10 "Highly corrupt") ///
 >     c(i l) sort(yhat) || sc yhat hdi, c(l) legend(order(2 3) ///
 >     lab(2 "Quadratic fit") lab(3 "Linear fit")) ///
 >     name(r_curvilinear, replace)
 (note: scheme burd not found, using s2color)

 . 
 . 
 . * (4) Fertility and Democracy
 . * ---------------------------
 . 
 . * Create dummy.
 . gen democracy:democracy = (chga_hinst < 3) if !mi(chga_hinst)
 (1 missing value generated)

 . la def democracy 0 "Dictatorship" 1 "Democracy", replace

 . 
 . * Visualization of the difference in mean of the DV.
 . gr bar births, over(democracy) asyvars over(region, lab(alt)) ///
 >     name(fert_democ, replace)
 (note: scheme burd not found, using s2color)

 . 
 . 
 . * Fitting a dummy predictor
 . * -------------------------
 . 
 . * Visualization of the "linear" fit using the dummy.
 . sc births democracy || lfit births democracy, $ci ///
 >     xsc(r(-.5 1.5)) xla(0 "Dictatorship" 1 "Democracy") xti("") ///
 >     name(fert_democ, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * You actually know this result in a different form:
 . ttest births, by(democracy)

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
 Dictator |      74    3.427811     .171837    1.478198     3.08534    3.770282
 Democrac |     113    2.554826    .1240722    1.318906    2.308992    2.800659
 ---------+--------------------------------------------------------------------
 combined |     187    2.900285    .1056745    1.445077     2.69181     3.10876
 ---------+--------------------------------------------------------------------
    diff |            .8729851    .2069604                .4646792    1.281291
 ------------------------------------------------------------------------------
    diff = mean(Dictator) - mean(Democrac)                        t =   4.2181
 Ho: diff = 0                                     degrees of freedom =      185

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

 . 
 . * This is actually identical to the following model:
 . reg births i.democracy

      Source |       SS       df       MS              Number of obs =     187
 -------------+------------------------------           F(  1,   185) =   17.79
       Model |  34.0786397     1  34.0786397           Prob > F      =  0.0000
    Residual |  354.335548   185  1.91532729           R-squared     =  0.0877
 -------------+------------------------------           Adj R-squared =  0.0828
       Total |  388.414188   186  2.08824832           Root MSE      =   1.384

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
 1.democracy |  -.8729851   .2069604    -4.22   0.000    -1.281291   -.4646792
       _cons |   3.427811   .1608813    21.31   0.000     3.110413    3.745209
 ------------------------------------------------------------------------------

 . 
 . * In this model, democracy is understood as a categorical variable because we
 . * added the "i." prefix to it. The coefficient reveals that the fertility rate
 . * of democracies is, on average, significantly lower than in non-democracies.
 . * There is no regression coefficient for dictatorships: since democracy is a
 . * dummy, it takes only two values, 0 or 1. The coefficient is therefore null
 . * when democracy equals 0. Let's look at null models (Y = alpha) for a proof.
 . 
 . * Y = alpha + beta (democracy = 0) = alpha.
 . reg births if !democracy

      Source |       SS       df       MS              Number of obs =      74
 -------------+------------------------------           F(  0,    73) =    0.00
       Model |           0     0           .           Prob > F      =       .
    Residual |   159.51009    73  2.18506972           R-squared     =  0.0000
 -------------+------------------------------           Adj R-squared =  0.0000
       Total |   159.51009    73  2.18506972           Root MSE      =  1.4782

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
       _cons |   3.427811    .171837    19.95   0.000      3.08534    3.770282
 ------------------------------------------------------------------------------

 . 
 . * Y = alpha + beta (democracy = 1) = alpha + beta.
 . reg births if democracy

      Source |       SS       df       MS              Number of obs =     113
 -------------+------------------------------           F(  0,   112) =    0.00
       Model |           0     0           .           Prob > F      =       .
    Residual |  194.825458   112  1.73951302           R-squared     =  0.0000
 -------------+------------------------------           Adj R-squared =  0.0000
       Total |  194.825458   112  1.73951302           Root MSE      =  1.3189

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
       _cons |   2.554826   .1240722    20.59   0.000     2.308992    2.800659
 ------------------------------------------------------------------------------
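
The link between the dummy regression and the two null models can be checked by hand. The sketch below copies the constant and the coefficient from the -reg births i.democracy- output; their sum should reproduce the constant of the null model fitted on democracies (about 2.55), while the constant alone matches the null model fitted on dictatorships.

```stata
* Constant and dummy coefficient, copied by hand from -reg births i.democracy-.
scalar alpha = 3.427811
scalar beta  = -.8729851

* Mean fertility in dictatorships (democracy = 0): Y = alpha.
display scalar(alpha)

* Mean fertility in democracies (democracy = 1): Y = alpha + beta.
display scalar(alpha) + scalar(beta)
```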

 . 
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require estout fre leanout mkcorr renvars

 . 
 . * Log results.
 . cap log using code/week9.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 9 -------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Fertility and Education, Part 3
 > 
 >  - DATA:   Quality of Government (2013)
 >  
 >    This is our final do-file with the Quality of Government example that we have
 >    been running over three sessions. It explains how to build on correlation and
 >    simple linear regression to produce complete linear regression models.
 >    
 >    The code contains details on several aspects of multiple linear regression.
 >    It also shows how to use the -estout- command to store and export the results
 >    of regression models.
 >    
 >    For your second draft, go as far as possible with multiple linear regression.
 >    Start with correlations if applicable, then go forward with simple linear
 >    regressions (add scatterplots if your predictors are continuous).
 > 
 >    Follow the instructions from the draft paper template. If you manage to go as
 >    far as diagnosing your model, discuss the diagnostics and add interaction
 >    terms if you detect issues of multicollinearity.
 >    
 >    The next sessions will provide another way to model the data for dependent
 >    variables that are (or are closer to being) categorical in nature, and will
 >    go deeper into the core mechanics of regression modelling.
 > 
 >    Last updated 2013-05-28.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load QOG dataset.
 . use data/qog2013, clear
 (Quality of Government 2013)

 . 
 . * Rename variables to short handles.
 . renvars wdi_fr bl_asy25mf wdi_hiv ciri_wosoc \ births schooling hiv womenrights

 . 
 . * Transformation of real GDP per capita to logged units.
 . gen log_gdpc = ln(unna_gdp / unna_pop)
 (2 missing values generated)

 . la var log_gdpc "Real GDP/capita (constant USD, logged)"

 . 
 . * Dummy for the highest quartile of HIV/AIDS prevalence.
 . su hiv, d

       Prevalence of HIV (% of Population Aged 15-49)
 -------------------------------------------------------------
      Percentiles      Smallest
 1%           .1             .1
 5%           .1             .1
 10%           .1             .1       Obs                 147
 25%           .2             .1       Sum of Wgt.         147

 50%           .4                      Mean           1.922449
                        Largest       Std. Dev.       4.32927
 75%          1.3           17.2
 90%          4.8             23       Variance       18.74257
 95%         11.3           24.1       Skewness       3.769518
 99%         24.1           25.8       Kurtosis       17.76868

 . gen aids = (hiv > 1.5) if !mi(hiv)
 (46 missing values generated)

 . la var aids "Highest HIV/AIDS prevalence quartile"

 . 
 . * Recode regions to fewer, shorter labels.
 . recode ht_region (6/10 = 6), gen(region)
 (44 differences between ht_region and region)

 . la var region "Geographical region"

 . la val region region

 . la def region 1 "E. Europe and PSU" 2 "Lat. America" ///
 >     3 "N. Africa and M. East" 4 "Sub-Sah. Africa" ///
 >     5 "W. Europe and N. America" 6 "Asia, Pacific and Caribbean" ///
 >     , replace

 . 
 . 
 . * Subsetting
 . * ----------
 . 
 . * Check missing values.
 . misstable pat births schooling log_gdpc aids, freq

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Frequency |  1  2  3  4
  ------------+-------------
          124 |  1  1  1  1
              |
           23 |  1  1  0  0
           22 |  1  1  1  0
           17 |  1  1  0  1
            5 |  1  0  0  0
            1 |  0  0  0  1
            1 |  0  1  1  1
  ------------+-------------
          193 |

  Variables are  (1) log_gdpc  (2) births  (3) aids  (4) schooling

 . 
 . * Check sampling bias due to low availability of schooling years.
 . gen mi = mi(schooling)

 . gr hbar (count) schooling (count) mi, over(region, sort(2)des) stack ///
 >     legend(order(1 "N(schooling)" 2 "Missing data")) ///
 >     name(mi, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Delete incomplete observations.
 . drop if mi(births, schooling, log_gdpc, aids, womenrights)
 (74 observations deleted)

 . 
 . * Final sample size.
 . count
  119

 . 
 . 
 . * Export summary statistics
 . * -------------------------
 . 
 . * The next command is part of the SRQM folder. If Stata returns an error when
 . * you run it, set the folder as your working directory and type -run profile-
 . * to run the course setup, then try the command again. If you still experience
 . * problems with the -stab- command, please send a detailed email on the issue.
 . 
 . stab using week9_stats.txt, replace ///
 >     mean(births schooling log_gdpc) ///
 >     prop(aids region)
 (note: file week9_stats.txt not found)

 Variable                     mean           sd          min          max

 Highest HIV/AIDS p~r            %

 Geographical region             %

 N = 1190
 File: week9_stats.txt

 . 
 . /* Syntax of the -stab- command:
 > 
 >  - using FILE  - name of the exported file; plain text (.txt) recommended
 >  - replace     - overwrite any previously existing file
 >  - mean()      - summarizes a list of continuous variables (mean, sd, min, max)
 >  - prop()      - summarizes a list of categorical variables (frequencies)
 > 
 >   In the example above, the -stab- command exports summary statistics to the
 >   working directory (week9_stats.txt). The -corr()- argument, not used here,
 >   would also export a correlation matrix (week9_correlations.txt). */
 . 
 . 
 . * =====================
 . * = ASSOCIATION TESTS =
 . * =====================
 . 
 . 
 . * Correlation matrix.
 . corr births schooling log_gdpc
 (obs=119)

             |   births school~g log_gdpc
 -------------+---------------------------
      births |   1.0000
   schooling |  -0.7473   1.0000
    log_gdpc |  -0.7311   0.8013   1.0000


 . 
 . * Scatterplot matrix.
 . gr mat births schooling log_gdpc, half ///
 >     name(mat, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Export method using -mkcorr-.
 . mkcorr births schooling log_gdpc, ///
 >         lab num sig log("week9_mkcorr.txt") replace
 (note: file week9_mkcorr.txt not found)

 . 
 . * Export method using -estout-.
 . eststo clear

 . qui estpost correlate births schooling log_gdpc, matrix listwise

 . esttab using "week9_estpost.txt", unstack not compress label replace
 (note: file week9_estpost.txt not found)
 (output written to week9_estpost.txt)

 . 
 . 
 . * =====================
 . * = REGRESSION MODELS =
 . * =====================
 . 
 . 
 . * Simple linear regressions
 . * -------------------------
 . 
 . * We covered simple linear regression last week, and we briefly mentioned
 . * 'lin-log' equations then. There are more situations to cover in theory, so
 . * let us review both notions together. Recall, first, the regression equation
 . * in the simplest case, where all variables are linear: Y = a + BX.
 . 
 . * IV: Education.
 . sc births schooling || lfit births schooling, ///
 >         name(simplereg1, replace)
 (note: scheme burd not found, using s2color)

 . reg births schooling

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  1,   117) =  147.96
       Model |   132.04738     1   132.04738           Prob > F      =  0.0000
    Residual |  104.415906   117  .892443638           R-squared     =  0.5584
 -------------+------------------------------           Adj R-squared =  0.5547
       Total |  236.463285   118  2.00392615           Root MSE      =  .94469

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.3511008   .0288641   -12.16   0.000    -.4082646   -.2939371
       _cons |   5.503641   .2408313    22.85   0.000     5.026687    5.980595
 ------------------------------------------------------------------------------

 . 
 . * An increase of one unit of schooling (one year) is associated with a decrease
 . * of about .35 births per woman: put differently, roughly 3 additional years of
 . * schooling are associated with birth rates that are one child lower on average.
 . * When the IV is logged, things get complex because the association rule changes.
 . 
 . * IV: Real GDP per capita.
 . sc births log_gdpc || lfit births log_gdpc, ///
 >         name(simplereg2, replace)
 (note: scheme burd not found, using s2color)

 . reg births log_gdpc

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  1,   117) =  134.38
       Model |   126.40484     1   126.40484           Prob > F      =  0.0000
    Residual |  110.058446   117  .940670476           R-squared     =  0.5346
 -------------+------------------------------           Adj R-squared =  0.5306
       Total |  236.463285   118  2.00392615           Root MSE      =  .96988

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    log_gdpc |  -.6289964   .0542607   -11.59   0.000    -.7364567    -.521536
       _cons |   7.916398   .4527607    17.48   0.000      7.01973    8.813067
 ------------------------------------------------------------------------------

 . 
 . * In this 'lin-log' equation, a 1% increase in GDP per capita is associated with
 . * a 0.01 * -.63 variation in the birth rate, or more exactly, -.63 * log(1.01).
 . * The mathematical trick is now to reverse the equation to see the mechanism:
 . 
 . * Inverting the terms.
 . reg log_gdpc births

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  1,   117) =  134.38
       Model |   170.79196     1   170.79196           Prob > F      =  0.0000
    Residual |  148.705522   117  1.27098737           R-squared     =  0.5346
 -------------+------------------------------           Adj R-squared =  0.5306
       Total |  319.497482   118  2.70760578           Root MSE      =  1.1274

 ------------------------------------------------------------------------------
    log_gdpc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
      births |  -.8498687   .0733143   -11.59   0.000    -.9950639   -.7046736
       _cons |   10.53596   .2278731    46.24   0.000     10.08467    10.98725
 ------------------------------------------------------------------------------

 . 
 . * The equation is now log-linear ('log-lin') instead of being 'lin-log'. The
 . * interpretation is: an increase of one child per woman is associated with GDP
 . * per capita that is 100 * (exp(-.85) - 1) = 57% lower (remember: on average).
 . * The 100 * beta shortcut only approximates this when the coefficient is small.
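
For a log-lin coefficient, the exact percentage change in Y for a one-unit change in X is 100 * (exp(beta) - 1), and 100 * beta is only a shortcut that works when beta is close to zero. The sketch below, with the coefficient copied by hand from the -reg log_gdpc births- output, prints both so that the gap between them is visible for a coefficient of this size.

```stata
* Coefficient on births, copied by hand from the -reg log_gdpc births- output.
scalar beta = -.8498687

* Exact percentage difference in GDP per capita for one more child per woman.
display 100 * (exp(scalar(beta)) - 1)

* Shortcut approximation, reasonable only when beta is small in absolute value.
display 100 * scalar(beta)
```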
 . 
 . * Illustrate with two regions that differ by about .7 children per woman.
 . tab region if region > 4, su(births)

            |  Summary of Fertility Rate (Births
 Geographica |             per Woman)
   l region |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
  W. Europe |   1.7452913   .28212197          23
  Asia, Pac |       2.465   1.0945649          24
 ------------+------------------------------------
      Total |   2.1128021   .87712753          47

 . tab region if region > 4, su(wdi_gdpc)

            |   Summary of GDP per Capita, PPP
 Geographica |    (Constant International USD)
   l region |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
  W. Europe |   33611.455   9581.3219          23
  Asia, Pac |   8293.5917   10993.016          22
 ------------+------------------------------------
      Total |   21233.833   16351.978          45

 . 
 . * In 'lin-log' and 'log-lin' equations, changes are proportionate rather than
 . * absolute. In a 'log-log' model, interpretation is proportionate on both sides
 . * of the equation: a 1% change in X is associated with a B% change in Y.
 . 
 . * Association between the two IVs.
 . sc schooling log_gdpc || lfit schooling log_gdpc, ///
 >         name(simplereg3, replace)
 (note: scheme burd not found, using s2color)

 . reg schooling log_gdpc

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  1,   117) =  209.83
       Model |  687.719811     1  687.719811           Prob > F      =  0.0000
    Residual |  383.469122   117  3.27751387           R-squared     =  0.6420
 -------------+------------------------------           Adj R-squared =  0.6390
       Total |  1071.18893   118  9.07787231           Root MSE      =  1.8104

 ------------------------------------------------------------------------------
   schooling |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    log_gdpc |   1.467142   .1012835    14.49   0.000     1.266555    1.667728
       _cons |  -4.218189   .8451274    -4.99   0.000     -5.89192   -2.544458
 ------------------------------------------------------------------------------

 . 
 . 
 . * Multiple linear regression
 . * --------------------------
 . 
 . * With schooling in metric units.
 . reg births schooling log_gdpc

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  2,   116) =   89.72
       Model |  143.622115     2  71.8110573           Prob > F      =  0.0000
    Residual |  92.8411708   116   .80035492           R-squared     =  0.6074
 -------------+------------------------------           Adj R-squared =  0.6006
       Total |  236.463285   118  2.00392615           Root MSE      =  .89463

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.2118932   .0456853    -4.64   0.000    -.3023786   -.1214078
    log_gdpc |   -.318119   .0836518    -3.80   0.000     -.483802    -.152436
       _cons |   7.022593    .459947    15.27   0.000      6.11161    7.933576
 ------------------------------------------------------------------------------

 . 
 . * Recall the last model.
 . reg

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  2,   116) =   89.72
       Model |  143.622115     2  71.8110573           Prob > F      =  0.0000
    Residual |  92.8411708   116   .80035492           R-squared     =  0.6074
 -------------+------------------------------           Adj R-squared =  0.6006
       Total |  236.463285   118  2.00392615           Root MSE      =  .89463

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.2118932   .0456853    -4.64   0.000    -.3023786   -.1214078
    log_gdpc |   -.318119   .0836518    -3.80   0.000     -.483802    -.152436
       _cons |   7.022593    .459947    15.27   0.000      6.11161    7.933576
 ------------------------------------------------------------------------------

 . 
 . * Recall the last model, with cleaner output.
 . leanout:
 
 Dependent variable: births

          Variable    Coef     SE      95%  CI
  -----------------------------------------------
         schooling   -0.2    0.0   ( -0.3, -0.1)
          log_gdpc   -0.3    0.1   ( -0.5, -0.2)
             _cons    7.0    0.5   (  6.1,  7.9)
  -----------------------------------------------
 Number of observations = 119
 Root Mean Squared Error =   0.9

 . 
 . 
 . * Standardised ('beta') coefficients
 . * ----------------------------------
 . 
 . * With standardised, or 'beta', coefficients (abbreviated to -b- hereinafter).
 . reg births schooling log_gdpc, beta

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  2,   116) =   89.72
       Model |  143.622115     2  71.8110573           Prob > F      =  0.0000
    Residual |  92.8411708   116   .80035492           R-squared     =  0.6074
 -------------+------------------------------           Adj R-squared =  0.6006
       Total |  236.463285   118  2.00392615           Root MSE      =  .89463

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|                     Beta
 -------------+----------------------------------------------------------------
   schooling |  -.2118932   .0456853    -4.64   0.000                -.4509913
    log_gdpc |   -.318119   .0836518    -3.80   0.000                -.3697784
       _cons |   7.022593    .459947    15.27   0.000                        .
 ------------------------------------------------------------------------------

 . 
 . * Proof of concept: Each variable in the equation has a different distribution
 . * and therefore a different standard deviation. As such, regression with metric
 . * coefficients cannot inform us of how variables perform against each other in
 . * explaining variance, because different metrics make coefficients incomparable.
 . * One unit of births, for instance, is one child, while one log-unit of GDP per
 . * capita multiplies GDP per capita by about 2.72 (e, for a natural log): the
 . * coefficients are produced in these units and are therefore incommensurable.
 . su births schooling log_gdpc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
      births |       119    2.770129    1.415601      1.149      7.115
   schooling |       119    7.785548    3.012951   1.202597   13.27008
    log_gdpc |       119    8.181716     1.64548   5.194377   11.30008

 . 
 . * If each variable had a mean of 0 and a variance of 1, then the coefficients
 . * would become comparable because they would all be expressed in the same
 . * metric: standard deviation units. Standardising is the name of that process.
 . * It loses the sensible, metric units of variables (without making them
 . * normally distributed) to create a fictional view of coefficients that hints
 . * at which predictor has the largest relative association with the DV.
 . egen std_births    = std(births)

 . egen std_schooling = std(schooling)

 . egen std_log_gdpc  = std(log_gdpc)

 . su std_*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
  std_births |       119   -6.38e-10           1  -1.145187   3.069277
 std_school~g |       119    4.76e-09           1  -2.184885   1.820319
 std_log_gdpc |       119   -2.85e-09           1  -1.815482   1.895109

 . 
 . * Compare both regression outputs. The coefficients of the first one, a linear
 . * regression on standardised variables, are identical to the 'Beta' column of
 . * the second one.
 . reg std_*

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  2,   116) =   89.72
       Model |  71.6703624     2  35.8351812           Prob > F      =  0.0000
    Residual |  46.3296378   116   .39939343           R-squared     =  0.6074
 -------------+------------------------------           Adj R-squared =  0.6006
       Total |         118   118           1           Root MSE      =  .63198

 -------------------------------------------------------------------------------
   std_births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 --------------+----------------------------------------------------------------
 std_schooling |  -.4509913    .097236    -4.64   0.000    -.6435796   -.2584031
 std_log_gdpc |  -.3697784    .097236    -3.80   0.000    -.5623666   -.1771901
        _cons |   4.53e-10   .0579331     0.00   1.000    -.1147439    .1147439
 -------------------------------------------------------------------------------

 . reg births schooling log_gdpc, b

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  2,   116) =   89.72
       Model |  143.622115     2  71.8110573           Prob > F      =  0.0000
    Residual |  92.8411708   116   .80035492           R-squared     =  0.6074
 -------------+------------------------------           Adj R-squared =  0.6006
       Total |  236.463285   118  2.00392615           Root MSE      =  .89463

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|                     Beta
 -------------+----------------------------------------------------------------
   schooling |  -.2118932   .0456853    -4.64   0.000                -.4509913
    log_gdpc |   -.318119   .0836518    -3.80   0.000                -.3697784
       _cons |   7.022593    .459947    15.27   0.000                        .
 ------------------------------------------------------------------------------

 . 
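 . * Worked check (a sketch using the summary statistics printed above): a beta
 . * coefficient is the metric coefficient rescaled by the ratio of standard
 . * deviations, beta = b * sd(X) / sd(Y). For schooling and log_gdpc:
 . *   -.2118932 * 3.012951 / 1.415601 = about -.451
 . *   -.318119  * 1.64548  / 1.415601 = about -.370
 . * which matches the 'Beta' column up to rounding. To check by hand, type:
 . *   display -.2118932 * 3.012951 / 1.415601
 . 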
 . * Using the second command shown above is much quicker than using the 'std_*'
 . * trick that is featured here only as a teaching example. Note, finally, that
 . * you should NOT report standardized coefficients: their use is controversial,
 . * and their interpretation is less substantive than that of unstandardized
 . * ones. Your focus should always be on results expressed in meaningful units.
 . 
 . 
 . * Dummies (categorical variables)
 . * -------------------------------
 . 
 . * Visualizing two categories (Asia and Africa) within the sample.
 . tw (sc births schooling if region == 4, ms(O)) ///
 >     (sc births schooling if region == 6, ms(O)) ///
 >     (sc births schooling if !inlist(region,4,6), mc(gs10)) ///
 >     (lfit births schooling, lc(gs10)), ///
 >     legend(order(1 "African countries" 3 "Rest of sample" ///
 >     2 "Asian countries" 4 "Fitted values") row(2)) yti("Fertility rate") ///
 >     name(reg_geo1, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Previous regression model.
 . reg births schooling log_gdpc

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  2,   116) =   89.72
       Model |  143.622115     2  71.8110573           Prob > F      =  0.0000
    Residual |  92.8411708   116   .80035492           R-squared     =  0.6074
 -------------+------------------------------           Adj R-squared =  0.6006
       Total |  236.463285   118  2.00392615           Root MSE      =  .89463

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.2118932   .0456853    -4.64   0.000    -.3023786   -.1214078
    log_gdpc |   -.318119   .0836518    -3.80   0.000     -.483802    -.152436
       _cons |   7.022593    .459947    15.27   0.000      6.11161    7.933576
 ------------------------------------------------------------------------------

 . 
 . * Previous regression model with geographical region dummies.
 . reg births schooling log_gdpc i.region

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  7,   111) =   49.80
       Model |   179.35545     7  25.6222071           Prob > F      =  0.0000
    Residual |  57.1078356   111  .514485006           R-squared     =  0.7585
 -------------+------------------------------           Adj R-squared =  0.7433
       Total |  236.463285   118  2.00392615           Root MSE      =  .71728

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.0848665   .0461435    -1.84   0.069    -.1763029    .0065699
    log_gdpc |  -.4523074   .0892088    -5.07   0.000    -.6290805   -.2755342
             |
      region |
          2  |   .3142904    .269067     1.17   0.245    -.2188838    .8474646
          3  |   .6165616   .3584902     1.72   0.088    -.0938108    1.326934
          4  |   1.499294   .2951472     5.08   0.000     .9144403    2.084148
          5  |   .9249256   .2878394     3.21   0.002     .3545527    1.495298
          6  |  -.0055409   .2628438    -0.02   0.983    -.5263834    .5153016
             |
       _cons |    6.48954   .5903739    10.99   0.000     5.319675    7.659405
 ------------------------------------------------------------------------------

 . 
 . * Proof of concept: A dummy simply codes for a particular category against all
 . * others. Running a dummy in a regression model adds a component to the linear
 . * equation for which the variable equals either 0 or 1. Consequently, its
 . * coefficient indicates how each category performs in relation to the baseline.
 . * The baseline is, by default, the first category in the variable. Looking at
 . * predicted values, we can draw parallel regression lines for dummies.
 . 
 . * Reduced regression model (schooling and region only), for demonstration.
 . reg births schooling i.region

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  6,   112) =   44.09
       Model |  166.129565     6  27.6882609           Prob > F      =  0.0000
    Residual |  70.3337198   112  .627979641           R-squared     =  0.7026
 -------------+------------------------------           Adj R-squared =  0.6866
       Total |  236.463285   118  2.00392615           Root MSE      =  .79245

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.2413827   .0378915    -6.37   0.000    -.3164598   -.1663056
             |
      region |
          2  |   .0798961   .2928465     0.27   0.785    -.5003418    .6601339
          3  |  -.0090852   .3718599    -0.02   0.981     -.745878    .7277076
          4  |   1.413288   .3255417     4.34   0.000     .7682689    2.058307
          5  |   .0439706   .2535334     0.17   0.863    -.4583734    .5463145
          6  |  -.1728805   .2880932    -0.60   0.550    -.7437003    .3979393
             |
       _cons |   4.308699   .4467725     9.64   0.000     3.423477    5.193922
 ------------------------------------------------------------------------------

 . 
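 . * Worked example (computed by hand from the coefficients above): at a
 . * schooling value of 5, the model predicts
 . *   baseline region (1): 4.308699 - .2413827 * 5            = about 3.10
 . *   region 4 (Africa):   4.308699 - .2413827 * 5 + 1.413288 = about 4.52
 . *   region 6 (Asia):     4.308699 - .2413827 * 5 - .1728805 = about 2.93
 . * The slope is the same in every region: only the intercept shifts, which
 . * is why the regression lines for dummies are parallel.
 . 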
 . * Storing fitted (predicted) values.
 . cap drop yhat

 . predict yhat
 (option xb assumed; fitted values)

 . 
 . * Regression lines for the predicted values of Asia and Africa.
 . tw (sc births schooling if region == 4, mc(blue) ms(O)) ///
 >     (sc births schooling if region == 6, mc(red) ms(O)) ///
 >     (sc births schooling if !inlist(region,4,6), mc(gs10)) ///
 >     (rcap yhat births schooling if region == 4, ///
 >         c(l) lc(blue) lp(dash) msize(tiny)) ///
 >     (rcap yhat births schooling if region == 6, ///
 >         c(l) lc(red) lp(dash) msize(tiny)) ///
 >     (sc yhat schooling if region == 4, c(l) ms(i) mc(blue) lc(blue)) ///
 >     (sc yhat schooling if region == 6, c(l) ms(i) mc(red) lc(red)), ///
 >     legend(order(1 "African countries" 6 "Fitted values (Africa)" ///
 >         4 "Residuals (Africa)" ///
 >         2 "Asian countries" 7 "Fitted values (Asia)" ///
 >         5 "Residuals (Asia)") row(2)) ///
 >     name(reg_geo2, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The example above is just a teaching demonstration: geographical continents
 . * are not appropriate as predictors. Let's now run some substantive examples,
 . * using a dummy and a 4-level categorical predictor.
 . 
 . * Visualizing HIV/AIDS dummy within the sample.
 . tw (sc births schooling if !aids, ms(O)) ///
 >     (sc births schooling if aids, ms(O)) ///
 >     (lfit births schooling, lc(gs10)), ///
 >     legend(order(2 "High AIDS prevalence" 1 "Rest of sample") row(1)) ///
 >     yti("Fertility rate") ///
 >     name(reg_aids1, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Regression line for the HIV/AIDS dummy.
 . tw (sc yhat aids) (lfit yhat aids), xlab(0 "Low" 1 "High") ///
 >         name(reg_aids2, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Comparison of t-test and regression results for a single dummy.
 . ttest yhat, by(aids)

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
       0 |      96       2.417    .1005316    .9850044     2.21742    2.616581
       1 |      23    4.244056    .1541236    .7391509    3.924423    4.563689
 ---------+--------------------------------------------------------------------
 combined |     119    2.770129      .10877     1.18654    2.554734    2.985523
 ---------+--------------------------------------------------------------------
    diff |           -1.827056    .2190775               -2.260927   -1.393184
 ------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -8.3398
 Ho: diff = 0                                     degrees of freedom =      117

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

 . reg yhat i.aids

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  1,   117) =   69.55
       Model |   61.937793     1   61.937793           Prob > F      =  0.0000
    Residual |  104.191771   117  .890527959           R-squared     =  0.3728
 -------------+------------------------------           Adj R-squared =  0.3675
       Total |  166.129564   118  1.40787766           Root MSE      =  .94368

 ------------------------------------------------------------------------------
        yhat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
      1.aids |   1.827056   .2190775     8.34   0.000     1.393184    2.260927
       _cons |      2.417   .0963137    25.10   0.000     2.226256    2.607744
 ------------------------------------------------------------------------------

 . 
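 . * Reading both outputs together (values copied from the tables above): the
 . * constant, 2.417, is the mean of the aids = 0 group, and the constant plus
 . * the dummy coefficient, 2.417 + 1.827056 = 4.244056, is the mean of the
 . * aids = 1 group. The coefficient is the t-test difference with its sign
 . * flipped, because -ttest- computes mean(0) - mean(1). The t statistics
 . * agree in absolute value (8.34), and F = t^2: 8.3398^2 = about 69.55.
 . 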
 . * Switching to fertility and women's rights.
 . fre womenrights

 womenrights -- Women's Social Rights
 -----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
 --------------+--------------------------------------------
 Valid   0     |         27      22.69      22.69      22.69
        1     |         44      36.97      36.97      59.66
        2     |         27      22.69      22.69      82.35
        3     |         21      17.65      17.65     100.00
        Total |        119     100.00     100.00           
 -----------------------------------------------------------

 . 
 . * We start by visualizing the average fertility for each level of rights. The 
 . * plot contains a LOWESS smoothed trend to show the DV mean at each level.
 . sc births womenrights, yti("Fertility rate") || lowess births womenrights, ///
 >         name(fert_womenrights, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Regression model.
 . reg births i.womenrights

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  3,   115) =   13.41
       Model |  61.2711874     3  20.4237291           Prob > F      =  0.0000
    Residual |  175.192098   115  1.52340955           R-squared     =  0.2591
 -------------+------------------------------           Adj R-squared =  0.2398
       Total |  236.463285   118  2.00392615           Root MSE      =  1.2343

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
 womenrights |
          1  |  -.5677339   .3017375    -1.88   0.062    -1.165418      .02995
          2  |  -1.565604   .3359243    -4.66   0.000    -2.231005   -.9002023
          3  |  -1.937111   .3591182    -5.39   0.000    -2.648455   -1.225767
             |
       _cons |   3.677111   .2375344    15.48   0.000     3.206601    4.147621
 ------------------------------------------------------------------------------

 . 
 . * The baseline category here is womenrights = 0 (no women's rights). Compared
 . * to countries in this category, other countries have lower mean fertility rates
 . * and the gap widens as women's rights increase from categories 1 to 3.
 . 
 . * Change the baseline category to level "1" of women's rights with the -ib1.-
 . * prefix (use -ib3.- for the highest level). This is convenient when you need
 . * to compare from a reference category that is not the first one in the coding
 . * of the variable, which is the Stata default.
 . reg births ib1.womenrights

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  3,   115) =   13.41
       Model |  61.2711874     3  20.4237291           Prob > F      =  0.0000
    Residual |  175.192098   115  1.52340955           R-squared     =  0.2591
 -------------+------------------------------           Adj R-squared =  0.2398
       Total |  236.463285   118  2.00392615           Root MSE      =  1.2343

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
 womenrights |
          0  |   .5677339   .3017375     1.88   0.062      -.02995    1.165418
          2  |  -.9978699   .3017375    -3.31   0.001    -1.595554   -.4001859
          3  |  -1.369377   .3273626    -4.18   0.000     -2.01782    -.720935
             |
       _cons |   3.109377   .1860724    16.71   0.000     2.740804    3.477951
 ------------------------------------------------------------------------------

 . 
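 . * Worked check (by hand, from both tables above): changing the baseline does
 . * not change the model, only the comparisons. Up to rounding, group means are
 . * recovered as the constant plus the coefficient of each level:
 . *   level   baseline 0                      baseline 1
 . *   0       3.677111                        3.109377 + .5677339 = 3.677111
 . *   1       3.677111 - .5677339 = 3.109377  3.109377
 . *   2       3.677111 - 1.565604 = 2.111507  3.109377 - .9978699 = 2.111507
 . *   3       3.677111 - 1.937111 = 1.740000  3.109377 - 1.369377 = 1.740000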
 . 
 . * ==========================
 . * = REGRESSION DIAGNOSTICS =
 . * ==========================
 . 
 . 
 . * Rerun regression model. Note that the "i." prefix is optional for dummies.
 . reg births schooling log_gdpc aids i.womenrights

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  6,   112) =   38.25
       Model |  158.915569     6  26.4859282           Prob > F      =  0.0000
    Residual |   77.547716   112  .692390321           R-squared     =  0.6721
 -------------+------------------------------           Adj R-squared =  0.6545
       Total |  236.463285   118  2.00392615           Root MSE      =   .8321

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.2113981   .0454762    -4.65   0.000    -.3015035   -.1212928
    log_gdpc |   -.307614   .0913113    -3.37   0.001    -.4885357   -.1266923
        aids |   .8945646    .217741     4.11   0.000     .4631387     1.32599
             |
 womenrights |
          1  |    .268684   .2215173     1.21   0.228    -.1702241    .7075921
          2  |   .1263621   .2696793     0.47   0.640    -.4079728     .660697
          3  |   .6720761    .331005     2.03   0.045     .0162321     1.32792
             |
       _cons |   6.513273   .5711178    11.40   0.000     5.381676    7.644869
 ------------------------------------------------------------------------------

 . 
 . * Storing fitted (predicted) values.
 . cap drop yhat

 . predict yhat
 (option xb assumed; fitted values)

 . 
 . 
 . * (1) Standardized residuals
 . * --------------------------
 . 
 . * Store the unstandardized (metric) residuals.
 . cap drop r

 . predict r, resid

 . 
 . * Assess the normality of residuals.
 . kdensity r, norm legend(off) ti("") ///
 >     name(diag_kdens, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Homoskedasticity of the residuals versus fitted values (DV).
 . rvfplot, yline(0) ms(i) mlab(ccodewb) name(diag_rvf, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * Store the standardized residuals.
 . cap drop rsta

 . predict rsta, rsta

 . 
 . * Identify outliers beyond 2 standard deviation units.
 . sc rsta yhat, yline(-2 2) || sc rsta yhat if abs(rsta) > 2, ///
 >     ylab(-3(1)3) mlab(ccodewb) legend(lab(2 "Outliers")) ///
 >     name(diag_rsta, replace)
 (note: scheme burd not found, using s2color)

 . 
 . 
 . * (2) Heteroskedasticity
 . * ----------------------
 . 
 . * Homoskedasticity of the residuals versus one predictor (IV). This is a more
 . * complex diagnostic that shows how one variable relates to the error term in
 . * the background of the main regression equation. If the spread of the
 . * residuals changes with the values of the predictor, the error variance is
 . * not constant and the model fits some ranges of that predictor better than
 . * others.
 . sc r schooling, ///
 >         yline(0) mlab(ccodewb) legend(lab(2 "Outliers")) ///
 >         name(diag_edu1, replace)
 (note: scheme burd not found, using s2color)

 . 
 . * The trend in the error term can be visualized as a LOWESS curve to show when
 . * and how the residuals depart from their expected mean of zero throughout the
 . * sample, as a function of the predictor. The trend reflects the influence of
 . * outliers with reference to that particular predictor: if the residuals show
 . * a pattern, the LOWESS curve will reveal it by deviating from the zero line
 . * (y = 0) at values of the IV where the residuals are "clustered" above or
 . * below zero. The spread of the points around the curve, in turn, indicates
 . * whether the variance is homogeneous (homoskedasticity).
 . lowess rsta schooling, bw(.5) yline(0) ///
 >         name(diag_edu2, replace)
 (note: scheme burd not found, using s2color)

 . 
 . 
 . * (3) Variance inflation and interaction terms
 . * --------------------------------------------
 . 
 . * The Variance Inflation Factor (VIF) diagnoses an issue with 'kitchen sink'
 . * models that use high numbers of correlated variables together in the model,
 . * which measure the same effect several times and create multicollinearity.
 . * This problem inflates standard errors and makes the coefficients unstable.
 . * Common rule-of-thumb cut-off points for variance inflation are VIF > 10 or
 . * 1/VIF < .1 (tolerance). Each VIF is computed as 1/(1-R^2), where R^2 is
 . * the R-squared obtained by regressing that predictor on all other predictors
 . * in the model.
 . vif

    Variable |       VIF       1/VIF  
 -------------+----------------------
   schooling |      3.20    0.312547
    log_gdpc |      3.85    0.259917
        aids |      1.27    0.787079
 womenrights |
          1  |      1.97    0.508825
          2  |      2.19    0.456091
          3  |      2.74    0.365413
 -------------+----------------------
    Mean VIF |      2.54

 . 
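 . * Worked check (by hand, from the table above): the tolerance 1/VIF equals
 . * 1 - R^2 of each predictor regressed on the others. For schooling,
 . * R^2 = 1 - 0.312547 = about 0.687 and VIF = 1/0.312547 = about 3.20; for
 . * log_gdpc, R^2 = 1 - 0.259917 = about 0.740 and VIF = 1/0.259917 = about
 . * 3.85. Both are well below the VIF > 10 cut-off. The mean VIF is the
 . * average of the six values: 15.22 / 6 = about 2.54.
 . 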
 . * Adding an interaction term is a technique to let the effect of one variable
 . * depend on the level of another. The term is calculated by multiplying the
 . * two variables together and adding that product to the regression model.
 . * The regression coefficient for this product is the interaction effect. When
 . * it is included, the coefficients of the two component variables become
 . * conditional: each one is the effect of that variable when the other is zero.
 . gen schoolingXlog_gdpc = schooling * log_gdpc

 . la var schoolingXlog_gdpc "GDP * Education"

 . 
 . * Regression model.
 . reg births schooling log_gdpc aids

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  3,   115) =   72.95
       Model |  155.009323     3  51.6697745           Prob > F      =  0.0000
    Residual |  81.4539618   115   .70829532           R-squared     =  0.6555
 -------------+------------------------------           Adj R-squared =  0.6465
       Total |  236.463285   118  2.00392615           Root MSE      =   .8416

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
   schooling |  -.2028221   .0430371    -4.71   0.000    -.2880703   -.1175739
    log_gdpc |  -.2464903   .0806962    -3.05   0.003    -.4063338   -.0866467
        aids |   .8600229   .2144907     4.01   0.000      .435158    1.284888
       _cons |     6.1997   .4788918    12.95   0.000     5.251108    7.148293
 ------------------------------------------------------------------------------

 . 
 . * Regression model with an interaction term.
 . reg births schooling log_gdpc schoolingXlog_gdpc aids

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  4,   114) =   74.29
       Model |  170.899523     4  42.7248808           Prob > F      =  0.0000
    Residual |  65.5637622   114  .575120721           R-squared     =  0.7227
 -------------+------------------------------           Adj R-squared =  0.7130
       Total |  236.463285   118  2.00392615           Root MSE      =  .75837

 -----------------------------------------------------------------------------------
           births |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 ------------------+----------------------------------------------------------------
        schooling |  -.8492374   .1289475    -6.59   0.000    -1.104681   -.5937934
         log_gdpc |  -1.012535   .1628702    -6.22   0.000     -1.33518   -.6898907
 schoolingXlog_g~c |    .087058   .0165624     5.26   0.000      .054248     .119868
             aids |   .8010199    .193603     4.14   0.000     .4174938    1.184546
            _cons |   11.62292   1.118353    10.39   0.000     9.407471    13.83837
 -----------------------------------------------------------------------------------

 . 
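 . * Worked reading (by hand, using the coefficients above and the range of
 . * log_gdpc from the earlier -su- output): with the product term in the
 . * model, the effect of one more unit of schooling depends on log_gdpc:
 . *   effect of schooling = -.8492374 + .087058 * log_gdpc
 . *   at the minimum (5.194377):  about -.40
 . *   at the mean    (8.181716):  about -.14
 . *   at the maximum (11.30008):  about +.13
 . * The effect reaches zero at log_gdpc = .8492374 / .087058 = about 9.75.
 . * A possible way to obtain these with the factor-variable syntax shown
 . * further below is -margins, dydx(schooling) at(log_gdpc = (5 8 11))-
 . * (an untested suggestion, not part of the original do-file).
 . 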
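 . * Standardised coefficients give a rough sense of how much the interaction
 . * weighs in the model, in comparison to all other included predictors (IVs).
 . * Read them with caution: a product term is strongly correlated with its
 . * components, which is why some 'Beta' values below exceed 1 in absolute value.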
 . reg births schooling log_gdpc schoolingXlog_gdpc aids, b

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  4,   114) =   74.29
       Model |  170.899523     4  42.7248808           Prob > F      =  0.0000
    Residual |  65.5637622   114  .575120721           R-squared     =  0.7227
 -------------+------------------------------           Adj R-squared =  0.7130
       Total |  236.463285   118  2.00392615           Root MSE      =  .75837

 -----------------------------------------------------------------------------------
           births |      Coef.   Std. Err.      t    P>|t|                     Beta
 ------------------+----------------------------------------------------------------
        schooling |  -.8492374   .1289475    -6.59   0.000                -1.807508
         log_gdpc |  -1.012535   .1628702    -6.22   0.000                -1.176961
 schoolingXlog_g~c |    .087058   .0165624     5.26   0.000                 2.165221
             aids |   .8010199    .193603     4.14   0.000                 .2243817
            _cons |   11.62292   1.118353    10.39   0.000                        .
 -----------------------------------------------------------------------------------

 . 
 . * Last, a shorter way to write up an interaction for two continuous predictors.
 . reg births c.schooling##c.log_gdpc aids, b

      Source |       SS       df       MS              Number of obs =     119
 -------------+------------------------------           F(  4,   114) =   74.29
       Model |  170.899526     4  42.7248815           Prob > F      =  0.0000
    Residual |  65.5637592   114  .575120695           R-squared     =  0.7227
 -------------+------------------------------           Adj R-squared =  0.7130
       Total |  236.463285   118  2.00392615           Root MSE      =  .75837

 ------------------------------------------------------------------------------
      births |      Coef.   Std. Err.      t    P>|t|                     Beta
 -------------+----------------------------------------------------------------
   schooling |  -.8492375   .1289475    -6.59   0.000                -1.807508
    log_gdpc |  -1.012536   .1628702    -6.22   0.000                -1.176961
             |
 c.schooling#|
  c.log_gdpc |    .087058   .0165624     5.26   0.000                 2.165222
             |
        aids |   .8010198    .193603     4.14   0.000                 .2243817
       _cons |   11.62292   1.118353    10.39   0.000                        .
 ------------------------------------------------------------------------------

 . 
 . 
 . * ========================
 . * = EXPORT MODEL RESULTS =
 . * ========================
 . 
 . 
 . * This section shows how to export regression results, in order to avoid having
 . * to copy out the results by hand, copy-paste or any other risky (non)technique
 . * that you might come up with at that stage. Exporting regression results also
 . * makes it easier to build several regression models based on varying sets of
 . * covariates (independent variables), in order to compare their coefficients.
 . 
 . * The next commands require that you install the -estout- package first. Another
 . * frequently used command for the same task is the -outreg- or -outreg2- command
 . * that can be downloaded with -ssc install-.
 . 
 . * Wipe any previous regression estimates.
 . eststo clear

 . 
 . * Model 1: 'Baseline model'.
 . eststo M1: qui reg births schooling log_gdpc

 . 
 . * Re-read, in simplified form.
 . leanout:
 
 Dependent variable: births

          Variable    Coef     SE      95%  CI
  -----------------------------------------------
         schooling   -0.2    0.0   ( -0.3, -0.1)
          log_gdpc   -0.3    0.1   ( -0.5, -0.2)
             _cons    7.0    0.5   (  6.1,  7.9)
  -----------------------------------------------
 Number of observations = 119
 Root Mean Squared Error =   0.9

 . 
 . * Model 2: Adding the HIV/AIDS dummy.
 . eststo M2: qui reg births schooling log_gdpc aids

 . 
 . * Re-read, in simplified form.
 . leanout:
 
 Dependent variable: births

          Variable    Coef     SE      95%  CI
  -----------------------------------------------
         schooling   -0.2    0.0   ( -0.3, -0.1)
          log_gdpc   -0.2    0.1   ( -0.4, -0.1)
              aids    0.9    0.2   (  0.4,  1.3)
             _cons    6.2    0.5   (  5.3,  7.1)
  -----------------------------------------------
 Number of observations = 119
 Root Mean Squared Error =   0.8

 . 
 . * Model 3: Adding the interaction between education and wealth.
 . eststo M3: qui reg births c.schooling##c.log_gdpc aids

 . 
 . * Re-read, in simplified form.
 . leanout:
 
 Dependent variable: births

          Variable    Coef     SE      95%  CI
  -----------------------------------------------
         schooling   -0.8    0.1   ( -1.1, -0.6)
          log_gdpc   -1.0    0.2   ( -1.3, -0.7)
                   
       c.schooling#
        c.log_gdpc    0.1    0.0   (  0.1,  0.1)
                   
              aids    0.8    0.2   (  0.4,  1.2)
             _cons   11.6    1.1   (  9.4, 13.8)
  -----------------------------------------------
 Number of observations = 119
 Root Mean Squared Error =   0.8

 . 
 . * Compare all models on screen.
 . esttab M1 M2 M3, lab b(1) se(1) sca(rmse) ///
 >     mti("Baseline" "Control" "Interaction")

 --------------------------------------------------------------------
                              (1)             (2)             (3)   
                         Baseline         Control     Interaction   
 --------------------------------------------------------------------
 Average Schooling ~          -0.2***         -0.2***         -0.8***
                            (0.0)           (0.0)           (0.1)   

 Real GDP/capita (c~l         -0.3***         -0.2**          -1.0***
                            (0.1)           (0.1)           (0.2)   

 Highest HIV/AIDS p~r                          0.9***          0.8***
                                            (0.2)           (0.2)   

 c.schooling#c.log_~c                                          0.1***
                                                            (0.0)   

 Constant                      7.0***          6.2***         11.6***
                            (0.5)           (0.5)           (1.1)   
 --------------------------------------------------------------------
 Observations                  119             119             119   
 rmse                          0.9             0.8             0.8   
 --------------------------------------------------------------------
 Standard errors in parentheses
 * p<0.05, ** p<0.01, *** p<0.001

 . 
 . * Export all models for comparison and reporting.
 . esttab M1 M2 M3 using week9_regressions.txt, replace /// 
 >         lab b(1) se(1) sca(rmse) ///
 >     mti("Baseline" "Controls" "Interactions")
 (note: file week9_regressions.txt not found)
 (output written to week9_regressions.txt)

 . 
 . /* Basic usage of -estout- commands:
 >   
 >  - The -estout- commands work by storing model estimates with -eststo- and then
 >    putting them into tables with -esttab-. Use these commands at the end of your
 >    models: start with -reg- and -leanout-, then use -eststo- and -esttab-.
 >    
 >  - The -estout- command is especially practical when you run many models, as
 >    shown here when we compare the baseline model to the models that add the
 >    HIV/AIDS control and the education-wealth interaction.
 > 
 >  - Check the -estout- online documentation for more examples. */
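 . 
 . /* Minimal sketch of the same workflow on other data. This is an illustration
 >    that is not part of the course files: it assumes the -estout- package is
 >    installed and uses the 'auto' dataset that ships with Stata; the model
 >    names A and B are arbitrary.
 > 
 >      sysuse auto, clear
 >      eststo clear
 >      eststo A: qui reg price mpg
 >      eststo B: qui reg price mpg weight
 >      esttab A B, b(1) se(1) sca(rmse) mti("Baseline" "Control")
 > 
 >    Loading 'auto' replaces the data in memory, so try this in a separate
 >    session rather than in the middle of a course do-file. */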
 . 
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require estout fre tab_chi renvars scheme-burd

 . 
 . * Log results.
 . cap log using code/week10.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 10 ------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Attitudes Towards Immigration in Europe
 >  
 >  - DATA:   European Social Survey Round 4 (2008)
 > 
 >    This do-file complements the series that we finished running last week using
 >    the Quality of Government dataset. It shows how multiple regression can apply
 >    to survey data, and introduces a different form of regression model.
 >    
 >    Survey data commonly feature response items that are discrete rather than 
 >    continuous. This means that linear regression models will be of limited use
 >    with this type of data.
 >    
 >    When the dependent variable cannot be normally distributed, a solution is to
 >    simplify it to a dummy and to estimate a logistic regression model, which is
 >    a generalization of the linear model.
 >    
 >    This do-file introduces logistic models. For your own work, decide whether a
 >    logistic estimator is more appropriate than a linear one, and include draft
 >    models in your revised draft.
 >    
 >    Last updated 2013-05-31.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load ESS dataset.
 . use data/ess2008, clear
 (European Social Survey 2008)

 . 
 . * Subsetting to respondents age 25+ with full data.
 . drop if agea < 25 | mi(imdfetn, agea, gndr, brncntr, eduyrs, hinctnta, lrscale)
 (25884 observations deleted)

 . 
 . * Survey weights (design weight by country, multiplied by population weight).
 . gen dpw = dweight * pweight

 . la var dpw "Survey weight (population*design)"

 . 
 . * Numeric country identifier (used for clustered standard errors).
 . encode cntry, gen(cid)

 . 
 . 
 . * DV: Allow many/few immigrants of different race/ethnic group from majority
 . * --------------------------------------------------------------------------
 . 
 . fre imdfetn

 imdfetn -- Allow many/few immigrants of different race/ethnic group from majority
 -----------------------------------------------------------------------------------
                                      |      Freq.    Percent      Valid       Cum.
 --------------------------------------+--------------------------------------------
 Valid   1 Allow many to come and live |       3959      12.83      12.83      12.83
          here                        |                                            
        2 Allow some                  |      11913      38.59      38.59      51.42
        3 Allow a few                 |      10386      33.65      33.65      85.07
        4 Allow none                  |       4610      14.93      14.93     100.00
        Total                         |      30868     100.00     100.00           
 -----------------------------------------------------------------------------------

 . 
 . * Relabel for concise legends in graphs.
 . la def imdfetn 1 "Many" 2 "Some" 3 "Few" 4 "None", replace

 . 
 . * Normality: distribution shows symmetry but the reduced number of items
 . * on a 4-point scale limits variability and will create postestimation issues.
 . hist imdfetn, discrete percent addl ///
 >         name(dv, replace)
 (start=1, width=1)

 . 
 . * Dummy: 1 = allow many/some immigrants.
 . gen diff = (imdfetn < 3)

 . la var diff "Allow many/some migrants of different race/ethnicity from majority"

 . 
 . 
 . * IVs: age, gender, country of birth, education, income, left-right scale
 . * -----------------------------------------------------------------------
 . 
 . d agea gndr brncntr eduyrs hinctnta lrscale

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 agea            int    %3.0f       agea       Age of respondent, calculated
 gndr            byte   %1.0f       gndr       Gender
 brncntr         byte   %1.0f       brncntr    Born in country
 eduyrs          byte   %2.0f       eduyrs     Years of full-time education completed
 hinctnta        byte   %2.0f       hinctnta   Household's total net income, all
                                                sources
 lrscale         byte   %2.0f       lrscale    Placement on left right scale

 . 
 . * Renaming.
 . renvars agea hinctnta lrscale \ age income rightwing

 . 
 . * Create age groups.
 . gen cohort = irecode(age, 24, 34, 44, 54, 64, 74)

 . replace cohort = 15 + 10 * cohort
 (30868 real changes made)

 . 
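 . * Worked example of the recoding above: irecode() returns 0 if age <= 24,
 . * 1 if 24 < age <= 34, and so on up to 6 if age > 74. With ages 25+ only,
 . * age 30 gives 1 and hence cohort 15 + 10 * 1 = 25; age 80 gives 6 and hence
 . * cohort 75. Each cohort value is the lower bound of its age group. To check
 . * the bounds yourself (a sketch, not run in this log):
 . *     tabstat age, by(cohort) s(min max)
 . 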
 . * Dummify sex.
 . gen female:sex = (gndr == 2)

 . la def sex 0 "Male" 1 "Female", replace

 . 
 . * Dummify country of birth.
 . gen born:born = (brncntr == 1)

 . la def born 0 "Foreign-born" 1 "Born in country", replace

 . 
 . * Recode education years.
 . su eduyrs, d

           Years of full-time education completed
 -------------------------------------------------------------
      Percentiles      Smallest
 1%            1              0
 5%            5              0
 10%            7              0       Obs               30868
 25%           10              0       Sum of Wgt.       30868

 50%           12                      Mean           12.41982
                        Largest       Std. Dev.      4.257545
 75%           15             39
 90%           18             40       Variance       18.12669
 95%           19             43       Skewness      -.0218349
 99%           22             48       Kurtosis       3.770817

 . xtile edu3 = eduyrs if eduyrs < 22, nq(3)

 . la var edu3 "Education level"

 . la def edu3 1 "Low" 2 "Medium" 3 "High"

 . la val edu3 edu3

 . 
 . 
 . * Export summary statistics
 . * -------------------------
 . 
 . * The next command is part of the SRQM folder. If Stata returns an error when
 . * you run it, set the folder as your working directory and type -run profile-
 . * to run the course setup, then try the command again. If you still experience
 . * problems with the -stab- command, please send a detailed email on the issue.
 . 
 . stab using week10_stats.txt, replace ///
 >         mean(age rightwing) ///
 >         prop(imdfetn female born edu3 income)
 (note: file week10_stats.txt not found)

 Variable                     mean           sd          min          max         mea
 > n           sd          min          max         mean           sd          min   
 >        max

 Allow many/few imm~f            %            %            %

                                %            %            %

                                %            %            %

 Education level                 %            %            %

 Household's total ~l            %            %            %

 N = 30399 (excluding 469 incomplete observations)
 File: week10_stats.txt

 . 
 . /* Syntax of the -stab- command:
 > 
 >  - using FILE  - name of the exported file; plain text (.txt) recommended
 >  - replace     - overwrite any previously existing file
 >  - mean()      - summarizes a list of continuous variables (mean, sd, min, max)
 >  - prop()      - summarizes a list of categorical variables (frequencies)
 > 
 >   In the example above, the -stab- command will export one file to the working
 >   directory, containing summary statistics for the full European sample. */
 . 
 . 
 . * =====================
 . * = ASSOCIATION TESTS =
 . * =====================
 . 
 . 
 . * Dummify the DV categories.
 . tab imdfetn, gen(immig_)

      Allow |
   many/few |
 immigrants |
         of |
  different |
 race/ethnic |
 group from |
   majority |      Freq.     Percent        Cum.
 ------------+-----------------------------------
       Many |      3,959       12.83       12.83
       Some |     11,913       38.59       51.42
        Few |     10,386       33.65       85.07
       None |      4,610       14.93      100.00
 ------------+-----------------------------------
      Total |     30,868      100.00

 . 
 . * Crossvisualize DV with basic demographics.
 . gr bar immig_*, stack percent over(cohort) by(female born, note("")) yti("") ///
 >         legend(order(1 "Many" 2 "Some" 3 "Few" 4 "None") row(1)) ///
 >         scheme(burd4) name(demog, replace)

 . 
 . * Crosstabulation: DV by gender.
 . tab female imdfetn, row nof chi2 // Chi-squared test

           |   Allow many/few immigrants of different
           |       race/ethnic group from majority
    female |      Many       Some        Few       None |     Total
 -----------+--------------------------------------------+----------
      Male |     13.28      38.76      33.34      14.63 |    100.00 
    Female |     12.41      38.44      33.93      15.21 |    100.00 
 -----------+--------------------------------------------+----------
     Total |     12.83      38.59      33.65      14.93 |    100.00 

          Pearson chi2(3) =   7.2476   Pr = 0.064

 . tabchi female imdfetn, p noo noe // Pearson residuals

          Pearson residual

 ------------------------------------------
          |  Allow many/few immigrants of 
          |  different race/ethnic group  
          |         from majority         
   female |   Many    Some     Few    None
 ----------+-------------------------------
     Male |  1.526   0.328  -0.650  -0.965
   Female | -1.457  -0.313   0.621   0.922
 ------------------------------------------

          Pearson chi2(3) =   7.2476   Pr = 0.064
 likelihood-ratio chi2(3) =   7.2453   Pr = 0.064

 . 
 . * Crosstabulation: DV by country of birth.
 . tab born imdfetn, row nof chi2

                |   Allow many/few immigrants of different
                |       race/ethnic group from majority
           born |      Many       Some        Few       None |     Total
 ----------------+--------------------------------------------+----------
   Foreign-born |     17.48      42.11      30.28      10.13 |    100.00 
 Born in country |     12.33      38.22      34.01      15.45 |    100.00 
 ----------------+--------------------------------------------+----------
          Total |     12.83      38.59      33.65      14.93 |    100.00 

          Pearson chi2(3) = 129.0198   Pr = 0.000

 . tabchi born imdfetn, p noo noe

          Pearson residual

 ------------------------------------------------
                |  Allow many/few immigrants of 
                |  different race/ethnic group  
                |         from majority         
           born |   Many    Some     Few    None
 ----------------+-------------------------------
   Foreign-born |  7.109   3.098  -3.174  -6.805
 Born in country | -2.329  -1.015   1.040   2.229
 ------------------------------------------------

          Pearson chi2(3) = 129.0198   Pr = 0.000
 likelihood-ratio chi2(3) = 129.8765   Pr = 0.000

 . 
 . * Crosstabulation: DV by age cohort.
 . tab cohort imdfetn, row nof chi2

           |   Allow many/few immigrants of different
           |       race/ethnic group from majority
    cohort |      Many       Some        Few       None |     Total
 -----------+--------------------------------------------+----------
        25 |     15.39      42.12      30.57      11.92 |    100.00 
        35 |     14.60      40.67      31.45      13.28 |    100.00 
        45 |     14.15      39.81      32.06      13.98 |    100.00 
        55 |     11.77      37.80      34.72      15.71 |    100.00 
        65 |      9.45      34.73      37.81      18.01 |    100.00 
        75 |      7.93      31.45      39.89      20.73 |    100.00 
 -----------+--------------------------------------------+----------
     Total |     12.83      38.59      33.65      14.93 |    100.00 

         Pearson chi2(15) = 456.2647   Pr = 0.000

 . tabchi cohort imdfetn, p noo noe

          Pearson residual

 ------------------------------------------
          |  Allow many/few immigrants of 
          |  different race/ethnic group  
          |         from majority         
   cohort |   Many    Some     Few    None
 ----------+-------------------------------
       25 |  5.421   4.300  -4.014  -5.910
       35 |  3.929   2.649  -3.000  -3.396
       45 |  2.884   1.529  -2.134  -1.928
       55 | -2.218  -0.966   1.392   1.520
       65 | -6.177  -4.069   4.702   5.207
       75 | -7.173  -6.026   5.645   7.861
 ------------------------------------------

          Pearson chi2(15) = 456.2647   Pr = 0.000
 likelihood-ratio chi2(15) = 460.9260   Pr = 0.000

 . 
 . * Dummify educational attainment.
 . tab edu3, gen(edu_)

  Education |
      level |      Freq.     Percent        Cum.
 ------------+-----------------------------------
        Low |     12,017       39.53       39.53
     Medium |      9,199       30.26       69.79
       High |      9,183       30.21      100.00
 ------------+-----------------------------------
      Total |     30,399      100.00

 . 
 . * Clarify x-axis by dropping labels on income deciles.
 . la def inc10 1 "D1" 10 "D10", replace

 . la val income inc10

 . 
 . * Visualization of education with income, sex and country of birth.
 . gr bar edu_*, stack percent over(income) by(female born, note("")) yti("") ///
 >         legend(order(1 "Low" 2 "Medium" 3 "High") row(1) pos(11)) ///
 >         scheme(burd3) name(edu_inc, replace)

 . 
 . * Simplified political scale.
 . recode rightwing ///
 >         (0/4  = 1 "Left-wing")  ///
 >         (5    = 2 "Centre")     ///
 >         (6/11 = 3 "Right-wing") ///
 >         (else = .), gen(wing)
 (30123 differences between rightwing and wing)

 . tab wing, gen(wing_)

  RECODE of |
  rightwing |
 (Placement |
    on left |
      right |
     scale) |      Freq.     Percent        Cum.
 ------------+-----------------------------------
  Left-wing |      9,633       31.21       31.21
     Centre |      9,860       31.94       63.15
 Right-wing |     11,375       36.85      100.00
 ------------+-----------------------------------
      Total |     30,868      100.00

 . 
 . * Visualization of left-right political leaning by income decile and age cohort.
 . gr bar wing_*, stack percent over(income) by(cohort, note("")) yti("") ///
 >         legend(order(1 "Left-wing" 2 "Centre" 3 "Right-wing") row(1)) ///
 >     scheme(burd3) name(pol_inc, replace)

 . 
 . * Crosstabulation.
 . tab income wing, row nof chi2

 Household' |
   s total |
       net |
   income, |  RECODE of rightwing (Placement
       all |       on left right scale)
   sources | Left-wing     Centre  Right-win |     Total
 -----------+---------------------------------+----------
        D1 |     31.69      36.71      31.60 |    100.00 
         2 |     29.92      35.66      34.42 |    100.00 
         3 |     31.89      33.83      34.27 |    100.00 
         4 |     32.42      33.58      33.99 |    100.00 
         5 |     31.03      33.11      35.85 |    100.00 
         6 |     30.87      32.43      36.70 |    100.00 
         7 |     30.96      32.39      36.65 |    100.00 
         8 |     33.42      28.13      38.45 |    100.00 
         9 |     30.66      28.27      41.07 |    100.00 
       D10 |     28.85      24.41      46.73 |    100.00 
 -----------+---------------------------------+----------
     Total |     31.21      31.94      36.85 |    100.00 

         Pearson chi2(18) = 252.3559   Pr = 0.000

 . tabchi income wing, p noo noe

          Pearson residual

 ----------------------------------------------
 Household |
 's total  |
 net       |
 income,   | RECODE of rightwing (Placement on 
 all       |         left right scale)         
 sources   |  Left-wing      Centre  Right-wing
 ----------+-----------------------------------
       D1 |      0.404       3.950      -4.049
        2 |     -1.323       3.777      -2.298
        3 |      0.744       2.025      -2.570
        4 |      1.314       1.748      -2.836
        5 |     -0.179       1.193      -0.946
        6 |     -0.335       0.480      -0.139
        7 |     -0.248       0.444      -0.185
        8 |      2.167      -3.685       1.436
        9 |     -0.527      -3.469       3.714
      D10 |     -2.198      -6.954       8.496
 ----------------------------------------------

          Pearson chi2(18) = 252.3559   Pr = 0.000
 likelihood-ratio chi2(18) = 251.4758   Pr = 0.000

 . 
 . 
 . * =====================
 . * = REGRESSION MODELS =
 . * =====================
 . 
 . 
 . * Linear regression
 . * -----------------
 . 
 . global bl "age i.female i.born i.edu3 income rightwing" // store IV names

 . 
 . * Baseline OLS model.
 . reg imdfetn $bl [pw = dpw]
 (sum of wgt is   3.1650e+04)

 Linear regression                                      Number of obs =   30399
                                                       F(  7, 30391) =  111.42
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.0746
                                                       Root MSE      =  .86365

 ------------------------------------------------------------------------------
             |               Robust
     imdfetn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   .0006248   .0005947     1.05   0.293    -.0005408    .0017905
    1.female |   .0455134   .0175361     2.60   0.009     .0111419    .0798848
      1.born |     .24353   .0313883     7.76   0.000     .1820076    .3050524
             |
        edu3 |
          2  |  -.1804094   .0221801    -8.13   0.000    -.2238834   -.1369353
          3  |  -.3767914   .0225044   -16.74   0.000    -.4209009   -.3326819
             |
      income |   -.033461   .0033486    -9.99   0.000    -.0400244   -.0268976
   rightwing |   .0374404   .0042896     8.73   0.000     .0290325    .0458483
       _cons |   2.388615   .0542555    44.03   0.000     2.282272    2.494958
 ------------------------------------------------------------------------------

 . 
 . * Adjusted OLS model: observations clustered by country.
 . reg imdfetn $bl [pw = dpw], vce(cluster cid)
 (sum of wgt is   3.1650e+04)

 Linear regression                                      Number of obs =   30399
                                                       F(  7,    25) =   94.48
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.0746
                                                       Root MSE      =  .86365

                                   (Std. Err. adjusted for 26 clusters in cid)
 ------------------------------------------------------------------------------
             |               Robust
     imdfetn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   .0006248   .0012625     0.49   0.625    -.0019754    .0032251
    1.female |   .0455134   .0161814     2.81   0.009     .0121871    .0788397
      1.born |     .24353   .0505925     4.81   0.000     .1393329    .3477271
             |
        edu3 |
          2  |  -.1804094   .0738232    -2.44   0.022     -.332451   -.0283677
          3  |  -.3767914   .0747884    -5.04   0.000     -.530821   -.2227617
             |
      income |   -.033461   .0053758    -6.22   0.000    -.0445326   -.0223894
   rightwing |   .0374404   .0106904     3.50   0.002     .0154231    .0594577
       _cons |   2.388615   .1149658    20.78   0.000     2.151838    2.625391
 ------------------------------------------------------------------------------

 . 
 . * The last option reads as 'variance-covariance estimation is clustered by cid'.
 . * This specification applies cluster-robust standard errors to the model. It
 . * uses the respondents' country of residence as a cluster variable: the
 . * coefficients are unchanged, but their standard errors are re-estimated.
 . * Cluster variables are variables at which level we might observe some form of
 . * within-sample clustering, which violates the assumption that the error term
 . * is independently distributed across the observations.
 . 
 . * Variance inflation.
 . vif

    Variable |       VIF       1/VIF  
 -------------+----------------------
         age |      1.09    0.914717
    1.female |      1.01    0.993207
      1.born |      1.01    0.994724
        edu3 |
          2  |      1.32    0.754937
          3  |      1.50    0.668159
      income |      1.19    0.837507
   rightwing |      1.01    0.989781
 -------------+----------------------
    Mean VIF |      1.16

 . 
 . * Inspect residuals.
 . predict r, resid
 (469 missing values generated)

 . 
 . * Diagnostic plots.
 . hist r, normal ///
 >         name(r, replace)   // distribution of residuals
 (bin=44, start=-2.0704632, width=.09725008)

 . rvfplot, yli(0) ///
 >         name(rvf, replace) // residuals vs. fitted values

 . 
 . * Export.
 . eststo clear

 . eststo lin_1: reg imdfetn $bl [pw = dpw]
 (sum of wgt is   3.1650e+04)

 Linear regression                                      Number of obs =   30399
                                                       F(  7, 30391) =  111.42
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.0746
                                                       Root MSE      =  .86365

 ------------------------------------------------------------------------------
             |               Robust
     imdfetn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   .0006248   .0005947     1.05   0.293    -.0005408    .0017905
    1.female |   .0455134   .0175361     2.60   0.009     .0111419    .0798848
      1.born |     .24353   .0313883     7.76   0.000     .1820076    .3050524
             |
        edu3 |
          2  |  -.1804094   .0221801    -8.13   0.000    -.2238834   -.1369353
          3  |  -.3767914   .0225044   -16.74   0.000    -.4209009   -.3326819
             |
      income |   -.033461   .0033486    -9.99   0.000    -.0400244   -.0268976
   rightwing |   .0374404   .0042896     8.73   0.000     .0290325    .0458483
       _cons |   2.388615   .0542555    44.03   0.000     2.282272    2.494958
 ------------------------------------------------------------------------------

 . eststo lin_2: reg imdfetn $bl [pw = dpw], vce(cluster cid)
 (sum of wgt is   3.1650e+04)

 Linear regression                                      Number of obs =   30399
                                                       F(  7,    25) =   94.48
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.0746
                                                       Root MSE      =  .86365

                                   (Std. Err. adjusted for 26 clusters in cid)
 ------------------------------------------------------------------------------
             |               Robust
     imdfetn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   .0006248   .0012625     0.49   0.625    -.0019754    .0032251
    1.female |   .0455134   .0161814     2.81   0.009     .0121871    .0788397
      1.born |     .24353   .0505925     4.81   0.000     .1393329    .3477271
             |
        edu3 |
          2  |  -.1804094   .0738232    -2.44   0.022     -.332451   -.0283677
          3  |  -.3767914   .0747884    -5.04   0.000     -.530821   -.2227617
             |
      income |   -.033461   .0053758    -6.22   0.000    -.0445326   -.0223894
   rightwing |   .0374404   .0106904     3.50   0.002     .0154231    .0594577
       _cons |   2.388615   .1149658    20.78   0.000     2.151838    2.625391
 ------------------------------------------------------------------------------

 . esttab lin_? using week10_regressions.txt, mti("OLS" "Adj. OLS") replace
 (note: file week10_regressions.txt not found)
 (output written to week10_regressions.txt)

 . 
 . * The diagnostics clearly identify the issue here: the limited number of levels
 . * in the DV is causing residuals to follow a low-dimensional pattern that does
 . * not approximate a normal distribution. The residuals, for instance, follow a
 . * quadrimodal distribution that reflects the number of levels in the DV. The data
 . * therefore fail to fit the assumptions of the model by design.
 . 
 . * We turn to a logistic regression (logit) model, which accepts only dichotomous
 . * outcomes. The binary/dummy recoding of the DV was computed earlier as follows:
 . tab diff imdfetn

     Allow |
 many/some |
  migrants |
        of |
 different |
 race/ethni |   Allow many/few immigrants of different
 city from |       race/ethnic group from majority
  majority |      Many       Some        Few       None |     Total
 -----------+--------------------------------------------+----------
         0 |         0          0     10,386      4,610 |    14,996 
         1 |     3,959     11,913          0          0 |    15,872 
 -----------+--------------------------------------------+----------
     Total |     3,959     11,913     10,386      4,610 |    30,868 


 . 
 . * You are very welcome to consult the UCLA Stata FAQ pages to learn how logistic
 . * regression works if you are interested in estimating a logit model. Otherwise,
 . * just follow the code and comments below to get some basic ideas. The following
 . * is a very short demo: it would take a full course to explain logistic models
 . * properly, and you are very welcome to ask for one :)
 . 
 . 
 . * Logistic regression
 . * -------------------
 . 
 . * Binarize the DV again to have 1 = allow few/no immigrants.
 . gen nomigrants = (imdfetn > 2)

 . 
 . * Column percentages (conditional probabilities).
 . tab cohort nomigrants, col nof

           |      nomigrants
    cohort |         0          1 |     Total
 -----------+----------------------+----------
        25 |     20.77      16.24 |     18.57 
        35 |     21.92      18.78 |     20.39 
        45 |     20.81      18.80 |     19.83 
        55 |     17.75      19.11 |     18.41 
        65 |     11.93      15.96 |     13.89 
        75 |      6.82      11.12 |      8.91 
 -----------+----------------------+----------
     Total |    100.00     100.00 |    100.00 


 . 
 . * Odds of Y = 1 (cases) against Y = 0 (controls) in each cohort.
 . tabodds nomigrants cohort

 --------------------------------------------------------------------------
    cohort  |      cases     controls       odds      [95% Conf. Interval]
 ------------+-------------------------------------------------------------
         25 |       2435         3296    0.73877        0.70108   0.77850
         35 |       2816         3479    0.80943        0.77020   0.85066
         45 |       2819         3303    0.85347        0.81163   0.89745
         55 |       2866         2817    1.01739        0.96584   1.07170
         65 |       2393         1894    1.26346        1.18955   1.34197
         75 |       1667         1083    1.53924        1.42589   1.66161
 --------------------------------------------------------------------------
 Test of homogeneity (equal odds): chi2(5)  =   395.42
                                  Pr>chi2  =   0.0000

 Score test for trend of odds:     chi2(1)  =   372.18
                                  Pr>chi2  =   0.0000

 . 
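 . * Check: the odds are the ratio of cases to controls, e.g. 2435 / 3296 = .73877
 . * for the first cohort. As a sketch, the same figure can be computed by hand:
 . *     di 2435 / 3296
 . 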
 . * Odds ratios: odds of Y = 1 in each cohort relative to the first cohort (25).
 . tabodds nomigrants cohort, or

 ---------------------------------------------------------------------------
      cohort |  Odds Ratio       chi2       P>chi2     [95% Conf. Interval]
 -------------+-------------------------------------------------------------
          25 |    1.000000          .           .              .          .
          35 |    1.095636       6.15       0.0131      1.019308   1.177681
          45 |    1.155247      15.19       0.0001      1.074309   1.242282
          55 |    1.377138      72.37       0.0000      1.278854   1.482976
          65 |    1.710216     174.57       0.0000      1.577840   1.853698
          75 |    2.083509     244.56       0.0000      1.896433   2.289040
 ---------------------------------------------------------------------------
 Test of homogeneity (equal odds): chi2(5)  =   395.42
                                  Pr>chi2  =   0.0000

 Score test for trend of odds:     chi2(1)  =   372.18
                                  Pr>chi2  =   0.0000

 . 
 . * Logistic regression with log-odds.
 . logit nomigrants i.cohort

 Iteration 0:   log likelihood = -21383.636  
 Iteration 1:   log likelihood = -21185.227  
 Iteration 2:   log likelihood =  -21185.21  
 Iteration 3:   log likelihood =  -21185.21  

 Logistic regression                               Number of obs   =      30868
                                                  LR chi2(5)      =     396.85
                                                  Prob > chi2     =     0.0000
 Log likelihood =  -21185.21                       Pseudo R2       =     0.0093

 ------------------------------------------------------------------------------
  nomigrants |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
      cohort |
         35  |   .0913354   .0368324     2.48   0.013     .0191452    .1635256
         45  |   .1443139   .0370347     3.90   0.000     .0717273    .2169005
         55  |   .3200077   .0376561     8.50   0.000     .2462031    .3938123
         65  |   .5366197   .0407424    13.17   0.000      .456766    .6164733
         75  |   .7340535   .0473003    15.52   0.000     .6413466    .8267603
             |
       _cons |  -.3027629   .0267222   -11.33   0.000    -.3551374   -.2503883
 ------------------------------------------------------------------------------

 . 
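 . * Check: exponentiating a log-odds coefficient gives the matching odds ratio,
 . * e.g. exp(.0913354) = 1.0956 for cohort 35, as reported by -tabodds, or- above.
 . * The constant is the log of the odds in the reference cohort: exp(-.3027629)
 . * = .7388, the odds shown for cohort 25. A sketch, to be run right after the
 . * -logit- command:
 . *     di exp(_b[35.cohort])
 . *     di exp(_b[_cons])
 . 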
 . * Logistic regression with odds ratios.
 . logit nomigrants i.cohort, or

 Iteration 0:   log likelihood = -21383.636  
 Iteration 1:   log likelihood = -21185.227  
 Iteration 2:   log likelihood =  -21185.21  
 Iteration 3:   log likelihood =  -21185.21  

 Logistic regression                               Number of obs   =      30868
                                                  LR chi2(5)      =     396.85
                                                  Prob > chi2     =     0.0000
 Log likelihood =  -21185.21                       Pseudo R2       =     0.0093

 ------------------------------------------------------------------------------
  nomigrants | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
      cohort |
         35  |   1.095636    .040355     2.48   0.013      1.01933    1.177656
         45  |   1.155247   .0427842     3.90   0.000     1.074362    1.242221
         55  |   1.377138   .0518577     8.50   0.000     1.279159    1.482622
         65  |   1.710216   .0696783    13.17   0.000     1.578959    1.852384
         75  |   2.083509   .0985506    15.52   0.000     1.899036    2.285901
             |
       _cons |   .7387743   .0197417   -11.33   0.000     .7010771    .7784984
 ------------------------------------------------------------------------------

 . 
 . * Baseline model.
 . logit nomigrants $bl [pw = dpw] // coefficients are log-odds

 Iteration 0:   log pseudolikelihood = -21922.411  
 Iteration 1:   log pseudolikelihood =  -20962.96  
 Iteration 2:   log pseudolikelihood = -20960.853  
 Iteration 3:   log pseudolikelihood = -20960.853  

 Logistic regression                               Number of obs   =      30399
                                                  Wald chi2(7)    =     572.04
                                                  Prob > chi2     =     0.0000
 Log pseudolikelihood = -20960.853                 Pseudo R2       =     0.0439

 ------------------------------------------------------------------------------
             |               Robust
  nomigrants |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   .0020519   .0013714     1.50   0.135    -.0006359    .0047398
    1.female |   .1106589   .0407043     2.72   0.007     .0308799    .1904379
      1.born |   .4735991   .0783621     6.04   0.000     .3200122    .6271859
             |
        edu3 |
          2  |   -.345609   .0500825    -6.90   0.000    -.4437688   -.2474492
          3  |  -.7982442   .0537806   -14.84   0.000    -.9036522   -.6928362
             |
      income |  -.0678676   .0079785    -8.51   0.000    -.0835052     -.05223
   rightwing |   .0752383   .0094785     7.94   0.000     .0566609    .0938158
       _cons |  -.3394748   .1285284    -2.64   0.008    -.5913859   -.0875636
 ------------------------------------------------------------------------------

 . 
 . * The coefficients are log-odds: changes in the log of the odds that the DV is
 . * equal to 1. A negative coefficient implies that an increase in the IV, or the
 . * presence of it, reduces the probability of the DV being equal to 1. Log-odds
 . * can be compared by magnitude, but at that stage, it is usually simpler to read
 . * only the sign of the coefficient and its significance level (p-value, and
 . * whether the confidence interval excludes zero).
 . 
 . * Odds ratios.
 . logit nomigrants $bl [pw = dpw], or

 Iteration 0:   log pseudolikelihood = -21922.411  
 Iteration 1:   log pseudolikelihood =  -20962.96  
 Iteration 2:   log pseudolikelihood = -20960.853  
 Iteration 3:   log pseudolikelihood = -20960.853  

 Logistic regression                               Number of obs   =      30399
                                                  Wald chi2(7)    =     572.04
                                                  Prob > chi2     =     0.0000
 Log pseudolikelihood = -20960.853                 Pseudo R2       =     0.0439

 ------------------------------------------------------------------------------
             |               Robust
  nomigrants | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   1.002054   .0013742     1.50   0.135     .9993643    1.004751
    1.female |   1.117014   .0454673     2.72   0.007     1.031362    1.209779
      1.born |   1.605763   .1258309     6.04   0.000     1.377145    1.872334
             |
        edu3 |
          2  |   .7077892   .0354478    -6.90   0.000     .6416137    .7807899
          3  |   .4501186   .0242076   -14.84   0.000     .4050875    .5001555
             |
      income |   .9343842    .007455    -8.51   0.000     .9198863    .9491106
   rightwing |   1.078141   .0102191     7.94   0.000     1.058297    1.098357
       _cons |   .7121443   .0915308    -2.64   0.008     .5535596    .9161606
 ------------------------------------------------------------------------------

 . 
 . * Odds ratios provide an easier means of comparison between coefficients: for
 . * example, in this model, being in the highest education category (edu3 = 3)
 . * multiplies the odds of answering "Few" or "None" by 0.45 compared to the
 . * lowest category. Put differently, the odds that these respondents answered
 . * "Some" or "Many" to the original question are 1 / 0.45 = 2.2 times higher.
 . 
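 . * An odds ratio can also be read as a percentage change in the odds, computed
 . * as 100 * (OR - 1): each additional point on the right-wing scale raises the
 . * odds of answering "Few" or "None" by about 7.8%, and each additional income
 . * unit lowers them by about 6.6%. A sketch, to be run after the model:
 . *     di 100 * (exp(_b[rightwing]) - 1)
 . *     di 100 * (exp(_b[income]) - 1)
 . 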
 . * Adjusted model: standard errors clustered on cid; coefficients are unchanged.
 . logit nomigrants $bl [pw = dpw], vce(cluster cid)

 Iteration 0:   log pseudolikelihood = -21922.411  
 Iteration 1:   log pseudolikelihood =  -20962.96  
 Iteration 2:   log pseudolikelihood = -20960.853  
 Iteration 3:   log pseudolikelihood = -20960.853  

 Logistic regression                               Number of obs   =      30399
                                                  Wald chi2(7)    =     329.05
                                                  Prob > chi2     =     0.0000
 Log pseudolikelihood = -20960.853                 Pseudo R2       =     0.0439

                                   (Std. Err. adjusted for 26 clusters in cid)
 ------------------------------------------------------------------------------
             |               Robust
  nomigrants |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   .0020519   .0030196     0.68   0.497    -.0038664    .0079703
    1.female |   .1106589   .0423726     2.61   0.009     .0276101    .1937077
      1.born |   .4735991   .1093272     4.33   0.000     .2593217    .6878765
             |
        edu3 |
          2  |   -.345609   .1412374    -2.45   0.014    -.6224291   -.0687888
          3  |  -.7982442   .1598086    -5.00   0.000    -1.111463   -.4850251
             |
      income |  -.0678676   .0123987    -5.47   0.000    -.0921686   -.0435666
   rightwing |   .0752383   .0258223     2.91   0.004     .0246275    .1258492
       _cons |  -.3394748   .2798738    -1.21   0.225    -.8880173    .2090677
 ------------------------------------------------------------------------------

 . 
 . * Odds ratios.
 . logit nomigrants $bl [pw = dpw], vce(cluster cid) or

 Iteration 0:   log pseudolikelihood = -21922.411  
 Iteration 1:   log pseudolikelihood =  -20962.96  
 Iteration 2:   log pseudolikelihood = -20960.853  
 Iteration 3:   log pseudolikelihood = -20960.853  

 Logistic regression                               Number of obs   =      30399
                                                  Wald chi2(7)    =     329.05
                                                  Prob > chi2     =     0.0000
 Log pseudolikelihood = -20960.853                 Pseudo R2       =     0.0439

                                   (Std. Err. adjusted for 26 clusters in cid)
 ------------------------------------------------------------------------------
             |               Robust
  nomigrants | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   1.002054   .0030258     0.68   0.497     .9961411    1.008002
    1.female |   1.117014   .0473308     2.61   0.009     1.027995    1.213741
      1.born |   1.605763   .1755536     4.33   0.000     1.296051    1.989486
             |
        edu3 |
          2  |   .7077892   .0999663    -2.45   0.014     .5366393    .9335238
          3  |   .4501186   .0719328    -5.00   0.000      .329077    .6156818
             |
      income |   .9343842   .0115851    -5.47   0.000     .9119514    .9573688
   rightwing |   1.078141   .0278401     2.91   0.004     1.024933    1.134111
       _cons |   .7121443   .1993105    -1.21   0.225     .4114708    1.232528
 ------------------------------------------------------------------------------

 . 
 . * Export.
 . eststo clear

 . eststo log_1: logit nomigrants $bl [pw = dpw]

 Iteration 0:   log pseudolikelihood = -21922.411  
 Iteration 1:   log pseudolikelihood =  -20962.96  
 Iteration 2:   log pseudolikelihood = -20960.853  
 Iteration 3:   log pseudolikelihood = -20960.853  

 Logistic regression                               Number of obs   =      30399
                                                  Wald chi2(7)    =     572.04
                                                  Prob > chi2     =     0.0000
 Log pseudolikelihood = -20960.853                 Pseudo R2       =     0.0439

 ------------------------------------------------------------------------------
             |               Robust
  nomigrants |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   .0020519   .0013714     1.50   0.135    -.0006359    .0047398
    1.female |   .1106589   .0407043     2.72   0.007     .0308799    .1904379
      1.born |   .4735991   .0783621     6.04   0.000     .3200122    .6271859
             |
        edu3 |
          2  |   -.345609   .0500825    -6.90   0.000    -.4437688   -.2474492
          3  |  -.7982442   .0537806   -14.84   0.000    -.9036522   -.6928362
             |
      income |  -.0678676   .0079785    -8.51   0.000    -.0835052     -.05223
   rightwing |   .0752383   .0094785     7.94   0.000     .0566609    .0938158
       _cons |  -.3394748   .1285284    -2.64   0.008    -.5913859   -.0875636
 ------------------------------------------------------------------------------

 . eststo log_2: logit nomigrants $bl [pw = dpw], vce(cluster cid)

 Iteration 0:   log pseudolikelihood = -21922.411  
 Iteration 1:   log pseudolikelihood =  -20962.96  
 Iteration 2:   log pseudolikelihood = -20960.853  
 Iteration 3:   log pseudolikelihood = -20960.853  

 Logistic regression                               Number of obs   =      30399
                                                  Wald chi2(7)    =     329.05
                                                  Prob > chi2     =     0.0000
 Log pseudolikelihood = -20960.853                 Pseudo R2       =     0.0439

                                   (Std. Err. adjusted for 26 clusters in cid)
 ------------------------------------------------------------------------------
             |               Robust
  nomigrants |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   .0020519   .0030196     0.68   0.497    -.0038664    .0079703
    1.female |   .1106589   .0423726     2.61   0.009     .0276101    .1937077
      1.born |   .4735991   .1093272     4.33   0.000     .2593217    .6878765
             |
        edu3 |
          2  |   -.345609   .1412374    -2.45   0.014    -.6224291   -.0687888
          3  |  -.7982442   .1598086    -5.00   0.000    -1.111463   -.4850251
             |
      income |  -.0678676   .0123987    -5.47   0.000    -.0921686   -.0435666
   rightwing |   .0752383   .0258223     2.91   0.004     .0246275    .1258492
       _cons |  -.3394748   .2798738    -1.21   0.225    -.8880173    .2090677
 ------------------------------------------------------------------------------

 . esttab log_? using week10_logits.txt, mti("Logit" "Adj. logit") replace
 (note: file week10_logits.txt not found)
 (output written to week10_logits.txt)

 . 
 . 
 . * Marginal effects
 . * ----------------
 . 
 . * Marginal effects of political attitude: estimated probability of DV at each
 . * level of the 11-point (0-10) left/right scale used in the model, averaged
 . * over the observed values of all other factors (demographics, education and
 . * income).
 . margins, at(rightwing = (0(1)10))

 Predictive margins                                Number of obs   =      30399
 Model VCE    : Robust

 Expression   : Pr(nomigrants), predict()

 1._at        : rightwing       =           0

 2._at        : rightwing       =           1

 3._at        : rightwing       =           2

 4._at        : rightwing       =           3

 5._at        : rightwing       =           4

 6._at        : rightwing       =           5

 7._at        : rightwing       =           6

 8._at        : rightwing       =           7

 9._at        : rightwing       =           8

 10._at       : rightwing       =           9

 11._at       : rightwing       =          10

 ------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         _at |
          1  |   .3941882   .0538393     7.32   0.000     .2886651    .4997112
          2  |   .4113932   .0494438     8.32   0.000     .3144852    .5083012
          3  |   .4288029   .0449954     9.53   0.000     .3406135    .5169923
          4  |   .4463777   .0405952    11.00   0.000     .3668126    .5259428
          5  |   .4640767   .0363763    12.76   0.000     .3927805    .5353728
          6  |   .4818579   .0325163    14.82   0.000     .4181271    .5455887
          7  |   .4996788   .0292476    17.08   0.000     .4423546     .557003
          8  |   .5174964   .0268496    19.27   0.000     .4648722    .5701205
          9  |   .5352678   .0255948    20.91   0.000     .4851029    .5854327
         10  |   .5529507   .0256391    21.57   0.000      .502699    .6032025
         11  |   .5705037   .0269283    21.19   0.000     .5177252    .6232822
 ------------------------------------------------------------------------------

 . marginsplot, xla(minmax) recast(line) recastci(rarea) ciopts(col(*.6)) ///
 >         name(mfx_right, replace)

  Variables that uniquely identify margins: rightwing

 . 
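 . * The predicted probability rises from .39 at rightwing = 0 to .57 at rightwing
 . * = 10, i.e. by (.5705 - .3942) / 10 = .018, or almost two percentage points per
 . * scale point on average. A sketch of how to estimate the average marginal
 . * effect directly rather than from the endpoints, assuming that the logit model
 . * is still the last one estimated:
 . *     margins, dydx(rightwing)
 . 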
 . * Marginal effects of educational attainment, by gender and country of birth.
 . * The margins command will generate estimates for all possible combinations of
 . * the IV list provided, and marginsplot then plots them with confidence intervals.
 . margins born#female, at(edu3 = (1(1)3))

 Predictive margins                                Number of obs   =      30399
 Model VCE    : Robust

 Expression   : Pr(nomigrants), predict()

 1._at        : edu3            =           1

 2._at        : edu3            =           2

 3._at        : edu3            =           3

 ---------------------------------------------------------------------------------
                |            Delta-method
                |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
 ----------------+----------------------------------------------------------------
 _at#born#female |
         1 0 0  |     .44634   .0258202    17.29   0.000     .3957333    .4969468
         1 0 1  |   .4733903   .0293437    16.13   0.000     .4158777    .5309029
         1 1 0  |   .5623557   .0273294    20.58   0.000      .508791    .6159204
         1 1 1  |   .5889661   .0317194    18.57   0.000     .5267972    .6511349
         2 0 0  |    .364512   .0412211     8.84   0.000     .2837202    .4453038
         2 0 1  |   .3901174   .0459254     8.49   0.000     .3001052    .4801296
         2 1 0  |    .477645   .0412908    11.57   0.000     .3967165    .5585734
         2 1 1  |   .5048604   .0466906    10.81   0.000     .4133486    .5963722
         3 0 0  |   .2684944   .0343446     7.82   0.000     .2011804    .3358085
         3 0 1  |   .2904831   .0391938     7.41   0.000     .2136647    .3673016
         3 1 0  |     .36931   .0379644     9.73   0.000     .2949012    .4437188
         3 1 1  |   .3950411   .0440971     8.96   0.000     .3086122    .4814699
 ---------------------------------------------------------------------------------

 . marginsplot, xla(minmax) by(female born) ///
 >         name(mfx_demog, replace)

  Variables that uniquely identify margins: edu3 born female

 . 
 . * Effect of increasing age on the probability of the DV being equal to 1, by sex
 . * and country of birth. The overlap in confidence intervals illustrates the weak
 . * value of age as a predictor for the DV: the marginal effect of age is negligible
 . * in the model, at least in comparison to other predictors.
 . margins born#female, at(age=(25(5)85))

 Predictive margins                                Number of obs   =      30399
 Model VCE    : Robust

 Expression   : Pr(nomigrants), predict()

 1._at        : age             =          25

 2._at        : age             =          30

 3._at        : age             =          35

 4._at        : age             =          40

 5._at        : age             =          45

 6._at        : age             =          50

 7._at        : age             =          55

 8._at        : age             =          60

 9._at        : age             =          65

 10._at       : age             =          70

 11._at       : age             =          75

 12._at       : age             =          80

 13._at       : age             =          85

 ---------------------------------------------------------------------------------
                |            Delta-method
                |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
 ----------------+----------------------------------------------------------------
 _at#born#female |
         1 0 0  |   .3584472   .0367548     9.75   0.000     .2864091    .4304852
         1 0 1  |   .3830138   .0395727     9.68   0.000     .3054528    .4605748
         1 1 0  |   .4670019    .039842    11.72   0.000     .3889129    .5450908
         1 1 1  |   .4931773   .0432812    11.39   0.000     .4083478    .5780069
         2 0 0  |   .3606976   .0347461    10.38   0.000     .2925966    .4287987
         2 0 1  |   .3853225   .0378633    10.18   0.000     .3111117    .4595332
         2 1 0  |   .4694238   .0374101    12.55   0.000     .3961014    .5427462
         2 1 1  |   .4956078   .0412816    12.01   0.000     .4146974    .5765182
         3 0 0  |   .3629539   .0329269    11.02   0.000     .2984183    .4274895
         3 0 1  |    .387636   .0363767    10.66   0.000      .316339     .458933
         3 1 0  |   .4718471   .0351654    13.42   0.000     .4029242      .54077
         3 1 1  |   .4980384   .0395035    12.61   0.000      .420613    .5754638
         4 0 0  |   .3652159   .0313362    11.65   0.000     .3037982    .4266337
         4 0 1  |   .3899543   .0351455    11.10   0.000     .3210704    .4588383
         4 1 0  |   .4742716   .0331478    14.31   0.000      .409303    .5392401
         4 1 1  |   .5004691   .0379786    13.18   0.000     .4260324    .5749057
         5 0 0  |   .3674836    .030016    12.24   0.000     .3086534    .4263138
         5 0 1  |   .3922774   .0342019    11.47   0.000     .3252429    .4593119
         5 1 0  |   .4766972   .0314029    15.18   0.000     .4151485    .5382458
         5 1 1  |   .5028997   .0367388    13.69   0.000     .4308929    .5749065
         6 0 0  |   .3697568   .0290093    12.75   0.000     .3128996    .4266139
         6 0 1  |    .394605   .0335745    11.75   0.000     .3288002    .4604098
         6 1 0  |   .4791237     .02998    15.98   0.000      .420364    .5378835
         6 1 1  |   .5053301   .0358141    14.11   0.000     .4351357    .5755245
         7 0 0  |   .3720355   .0283555    13.12   0.000     .3164597    .4276112
         7 0 1  |   .3969372   .0332854    11.93   0.000     .3316989    .4621754
         7 1 0  |   .4815512   .0289281    16.65   0.000     .4248533    .5382492
         7 1 1  |   .5077602   .0352293    14.41   0.000      .438712    .5768084
         8 0 0  |   .3743195   .0280851    13.33   0.000     .3192737    .4293653
         8 0 1  |   .3992738   .0333477    11.97   0.000     .3339135     .464634
         8 1 0  |   .4839795   .0282897    17.11   0.000     .4285327    .5394262
         8 1 1  |     .51019   .0350013    14.58   0.000     .4415887    .5787912
         9 0 0  |   .3766089   .0282149    13.35   0.000     .3213088     .431909
         9 0 1  |   .4016147   .0337632    11.90   0.000       .33544    .4677893
         9 1 0  |   .4864084   .0280941    17.31   0.000      .431345    .5414718
         9 1 1  |   .5126192   .0351367    14.59   0.000     .4437526    .5814857
        10 0 0  |   .3789035   .0287447    13.18   0.000     .3225649    .4352422
        10 0 1  |   .4039598    .034523    11.70   0.000     .3362961    .4716236
        10 1 0  |   .4888379   .0283512    17.24   0.000     .4332706    .5444051
        10 1 1  |   .5150478   .0356307    14.46   0.000     .4452128    .5848828
        11 0 0  |   .3812033   .0296584    12.85   0.000     .3230738    .4393327
        11 0 1  |   .4063091   .0356083    11.41   0.000     .3365181    .4761001
        11 1 0  |   .4912678   .0290494    16.91   0.000      .434332    .5482036
        11 1 1  |   .5174757   .0364683    14.19   0.000     .4459992    .5889522
        12 0 0  |    .383508   .0309267    12.40   0.000     .3228929    .4441232
        12 0 1  |   .4086624   .0369938    11.05   0.000     .3361559    .4811689
        12 1 0  |   .4936981   .0301585    16.37   0.000     .4345886    .5528076
        12 1 1  |   .5199027   .0376254    13.82   0.000     .4461582    .5936473
        13 0 0  |   .3858177   .0325124    11.87   0.000     .3220947    .4495408
        13 0 1  |   .4110196     .03865    10.63   0.000     .3352671    .4867722
        13 1 0  |   .4961286   .0316351    15.68   0.000     .4341249    .5581323
        13 1 1  |   .5223289   .0390729    13.37   0.000     .4457474    .5989103
 ---------------------------------------------------------------------------------

 . marginsplot, by(female) recast(line) recastci(rarea) ciopts(col(*.6)) ///
 >         name(mfx_age, replace)

  Variables that uniquely identify margins: age born female

 . 
 . 
 . * Sensitivity analysis
 . * --------------------
 . 
 . * Ordered logistic regression, to test the cut point that we chose when recoding
 . * the DV to a dummy. The results should show identical signs on the coefficients
 . * and their order of magnitude should also stay stable. If not, then the model
 . * is sensitive to the choice of cutoff point that we made earlier. Note that in
 . * our example, the signs of the coefficients should be the same for the OLS
 . * (linear regression), the logit and the ordered logit, because the nomigrants
 . * dummy runs in the same direction as the original variable (higher values =
 . * fewer immigrants allowed). Only the earlier 'diff' dummy was coded in reverse.
 . ologit imdfetn $bl [pw = dpw], vce(cluster cid)

 Iteration 0:   log pseudolikelihood = -40526.845  
 Iteration 1:   log pseudolikelihood = -39309.792  
 Iteration 2:   log pseudolikelihood = -39302.192  
 Iteration 3:   log pseudolikelihood = -39302.188  
 Iteration 4:   log pseudolikelihood = -39302.188  

 Ordered logistic regression                       Number of obs   =      30399
                                                  Wald chi2(7)    =     443.81
                                                  Prob > chi2     =     0.0000
 Log pseudolikelihood = -39302.188                 Pseudo R2       =     0.0302

                                   (Std. Err. adjusted for 26 clusters in cid)
 ------------------------------------------------------------------------------
             |               Robust
     imdfetn |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
         age |   .0012223   .0027681     0.44   0.659    -.0042031    .0066477
    1.female |    .100056   .0331588     3.02   0.003     .0350661     .165046
      1.born |   .4944239   .1045674     4.73   0.000     .2894756    .6993722
             |
        edu3 |
          2  |  -.3890632   .1585103    -2.45   0.014    -.6997377   -.0783887
          3  |  -.8055187   .1636851    -4.92   0.000    -1.126336   -.4847017
             |
      income |  -.0710706   .0116204    -6.12   0.000    -.0938461   -.0482951
   rightwing |   .0839352   .0273557     3.07   0.002     .0303189    .1375515
 -------------+----------------------------------------------------------------
       /cut1 |  -1.794026   .1873578                     -2.161241   -1.426812
       /cut2 |   .3108741   .2636075                     -.2057871    .8275354
       /cut3 |   2.053481   .3593456                      1.349176    2.757785
 ------------------------------------------------------------------------------

 . 
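 . * The cut points play the role of the constant. In the ordered logit,
 . * Pr(imdfetn <= j) = invlogit(cut_j - xb), so that the log-odds of answering "Few"
 . * or "None" (imdfetn > 2) are xb - cut2. The implied constant, -.3109, is close
 . * to the constant of the adjusted logit (-.3395). A sketch of how to obtain the
 . * probability of "Few" or "None" implied by the ordered logit, for comparison
 . * with the logit (the variable names p1-p4 and p_fewnone are hypothetical):
 . *     predict p1 p2 p3 p4, pr
 . *     gen p_fewnone = p3 + p4
 . 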
 . 
 . * Export model results
 . * --------------------
 . 
 . eststo clear

 . eststo lin_1: qui reg imdfetn $bl [pw = dpw], b

 . eststo lin_2: qui reg imdfetn $bl [pw = dpw], vce(cluster cid)

 . eststo log_1: qui logit nomigrants $bl [pw = dpw]

 . eststo log_2: qui logit nomigrants $bl [pw = dpw], vce(cluster cid)

 . eststo log_3: qui ologit imdfetn $bl [pw = dpw], vce(cluster cid)

 . esttab lin_* log_* using week10_models.txt, constant label beta(2) se(2) r2(2) ///
 >         mti("OLS" "Adj. OLS" "Logit" "Adj. logit" "Ord. logit") replace
 (note: file week10_models.txt not found)
 (output written to week10_models.txt)

 . 
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Thanks for following! And all the best for the future.
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require estout fre leanout renvars scheme-burd spineplot

 . 
 . * Log results.
 . cap log using code/week11.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 11 ------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Satisfaction with Health Services in Britain and France
 >  
 >  - DATA:   European Social Survey Round 4 (2008)
 > 
 >    We explore patterns of satisfaction with the state of health services in
 >    the UK and France, two countries with extensive public healthcare systems
 >    and where health services play different roles in political competition.
 > 
 >  - (H1): We expect to observe high satisfaction on average, except among those
 >    in ill health, who we expect to report lower satisfaction regardless of age,
 >    sex, income or political views.
 > 
 >  - (H2): We also expect respondents in political opposition to the government to
 >    report less satisfaction with the state of health services in the country,
 >    independently of all other characteristics.
 > 
 >  - (H3): We finally expect to find lower patterns of satisfaction among those
 >    who report financial difficulties, as evidence of an income effect that
 >    we expect to exist in isolation of all others.
 > 
 >    We use data from the European Social Survey (ESS) Round 4. The sample used in
 >    the analysis contains N = 1,942 French and N = 2,079 UK individuals selected
 >    through stratified probability sampling and interviewed face-to-face in 2008.
 > 
 >    We run linear regressions for each country to assess whether satisfaction
 >    with health services can be predicted from political views, independently
 >    of age, sex, health status and financial situation.
 >    
 >    Last updated 2013-05-31.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load ESS dataset.
 . use data/ess2008, clear
 (European Social Survey 2008)

 . 
 . * Country-specific design weight, multiplied by country-level population weight.
 . gen dpw = dweight * pweight

 . la var dpw "Survey weight (population * design)"

 . 
 . * Survey weights.
 . svyset [pw = dpw]

      pweight: dpw
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

 . 
 . * Numeric country identifier (used for clustered standard errors).
 . encode cntry, gen(cid)

 . 
 . 
 . * Dependent variable
 . * ------------------
 . 
 . d stf*

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 stflife         byte   %2.0f       stflife    How satisfied with life as a whole
 stfeco          byte   %2.0f       stfeco     How satisfied with present state of
                                                economy in country
 stfgov          byte   %2.0f       stfgov     How satisfied with the national
                                                government
 stfdem          byte   %2.0f       stfdem     How satisfied with the way democracy
                                                works in country
 stfedu          byte   %2.0f       stfedu     State of education in country nowadays
 stfhlth         byte   %2.0f       stfhlth    State of health services in country
                                                nowadays

 . 
 . * Rename the DV and two other satisfaction items.
 . renvars stfhlth stfedu stfgov \ hsat esat gsat

 . 
 . * Country-specific distributions.
 . tab cntry, su(hsat)

            | Summary of State of health services
            |         in country nowadays
    Country |        Mean   Std. Dev.       Freq.
 ------------+------------------------------------
         BE |           7           2        1758
         BG |           3           2        2163
         CH |           7           2        1811
         CY |           6           2        1184
         CZ |           5           2        1999
         DE |           5           2        2723
         DK |           6           2        1593
         EE |           5           2        1631
         ES |           6           2        2552
         FI |           7           2        2190
         FR |           6           2        2065
         GB |           6           2        2343
         GR |           3           2        2057
         HR |           4           2        1467
         HU |           4           2        1519
         IE |           4           2        1758
         IL |           6           2        2421
         LV |           4           2        1948
         NL |           6           2        1764
         NO |           6           2        1548
         PL |           4           2        1601
         PT |           4           2        2334
         RO |           4           3        2105
         RU |           4           2        2461
         SE |           6           2        1812
         SI |           5           2        1271
         SK |           4           2        1796
         TR |           5           3        2363
         UA |           2           2        1789
 ------------+------------------------------------
      Total |           5           3       56026

 . hist hsat, discrete by(cntry, note("")) ///
 >         name(dv_bins, replace)

 . 
 . * Detailed summary statistics.
 . su hsat, d

        State of health services in country nowadays
 -------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
 10%            1              0       Obs               56026
 25%            3              0       Sum of Wgt.       56026

 50%            5                      Mean           4.999893
                        Largest       Std. Dev.      2.607701
 75%            7             10
 90%            8             10       Variance       6.800107
 95%            9             10       Skewness      -.1756378
 99%           10             10       Kurtosis       2.165036

 . 
 . 
 . * Cross-country comparisons
 . * -------------------------
 . 
 . * Cross-country visualization (mean).
 . gr dot hsat, over(cntry, sort(1)des) yla(0 "Min" 10 "Max") ///
 >     yti("Satisfaction in health services") ///
 >     name(dv_dots, replace)

 . 
 . * Cross-country visualization (median).
 . gr box hsat, noout over(cntry, sort(1)des) yla(0 "Min" 10 "Max") ///
 >     yti("Satisfaction in health services") ///
 >     name(dv_boxes, replace)

 . 
 . * Generate dummies for the full 11-pt scale DV.
 . cap drop hsat11_*

 . tab hsat, gen(hsat11_)

      State of |
        health |
   services in |
       country |
      nowadays |      Freq.     Percent        Cum.
 ---------------+-----------------------------------
 Extremely bad |      3,298        5.89        5.89
             1 |      3,022        5.39       11.28
             2 |      4,618        8.24       19.52
             3 |      6,050       10.80       30.32
             4 |      5,767       10.29       40.62
             5 |      8,318       14.85       55.46
             6 |      6,326       11.29       66.75
             7 |      7,571       13.51       80.27
             8 |      6,865       12.25       92.52
             9 |      2,725        4.86       97.38
 Extremely good |      1,466        2.62      100.00
 ---------------+-----------------------------------
         Total |     56,026      100.00

 . 
 . * Cross-country visualization (proportions).
 . gr hbar hsat11_*, over(cntry, sort(1)des) stack legend(off) ///
 >     yti("Satisfaction in health services") ///
 >     scheme(burd11) name(dv_bars, replace)

 . 
 . 
 . * Independent variables
 . * ---------------------
 . 
 . fre agea gndr health hincfel lrscale, r(10)

 agea -- Age of respondent, calculated
 -----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
 --------------+--------------------------------------------
 Valid   15    |        356       0.63       0.63       0.63
        16    |        623       1.10       1.10       1.73
        17    |        729       1.28       1.29       3.02
        18    |        674       1.19       1.19       4.21
        19    |        824       1.45       1.46       5.67
        :     |          :          :          :          :
        97    |          4       0.01       0.01      99.99
        98    |          2       0.00       0.00      99.99
        99    |          2       0.00       0.00      99.99
        105   |          2       0.00       0.00     100.00
        123   |          1       0.00       0.00     100.00
        Total |      56544      99.63     100.00           
 Missing .a    |        208       0.37                      
 Total         |      56752     100.00                      
 -----------------------------------------------------------

 gndr -- Gender
 ---------------------------------------------------------------
                  |      Freq.    Percent      Valid       Cum.
 ------------------+--------------------------------------------
 Valid   1  Male   |      25787      45.44      45.46      45.46
        2  Female |      30935      54.51      54.54     100.00
        Total     |      56722      99.95     100.00           
 Missing .a        |         30       0.05                      
 Total             |      56752     100.00                      
 ---------------------------------------------------------------

 health -- Subjective general health
 ------------------------------------------------------------------
                     |      Freq.    Percent      Valid       Cum.
 ---------------------+--------------------------------------------
 Valid   1  Very good |      12245      21.58      21.61      21.61
        2  Good      |      22949      40.44      40.49      62.10
        3  Fair      |      15863      27.95      27.99      90.09
        4  Bad       |       4633       8.16       8.17      98.27
        5  Very bad  |        983       1.73       1.73     100.00
        Total        |      56673      99.86     100.00           
 Missing .a           |         10       0.02                      
        .b           |         53       0.09                      
        .c           |         16       0.03                      
        Total        |         79       0.14                      
 Total                |      56752     100.00                      
 ------------------------------------------------------------------

 hincfel -- Feeling about household's income nowadays
 ----------------------------------------------------------------------------------
                                     |      Freq.    Percent      Valid       Cum.
 -------------------------------------+--------------------------------------------
 Valid   1  Living comfortably on     |      13135      23.14      23.41      23.41
           present income            |                                            
        2  Coping on present income  |      24544      43.25      43.74      67.15
        3  Difficult on present      |      12681      22.34      22.60      89.76
           income                    |                                            
        4  Very difficult on present |       5748      10.13      10.24     100.00
           income                    |                                            
        Total                        |      56108      98.87     100.00           
 Missing .a                           |         97       0.17                      
        .b                           |        442       0.78                      
        .c                           |        105       0.19                      
        Total                        |        644       1.13                      
 Total                                |      56752     100.00                      
 ----------------------------------------------------------------------------------

 lrscale -- Placement on left right scale
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   0  Left  |       1589       2.80       3.34       3.34
        1  1     |       1261       2.22       2.65       5.99
        2  2     |       2590       4.56       5.44      11.44
        3  3     |       4644       8.18       9.76      21.20
        4  4     |       4695       8.27       9.87      31.07
        5  5     |      15412      27.16      32.40      63.47
        6  6     |       4498       7.93       9.46      72.92
        7  7     |       4960       8.74      10.43      83.35
        8  8     |       4102       7.23       8.62      91.97
        9  9     |       1601       2.82       3.37      95.34
        10 Right |       2217       3.91       4.66     100.00
        Total    |      47569      83.82     100.00           
 Missing .a       |        856       1.51                      
        .b       |       8191      14.43                      
        .c       |        136       0.24                      
        Total    |       9183      16.18                      
 Total            |      56752     100.00                      
 --------------------------------------------------------------

 . 
 . * Recode sex to dummy.
 . gen female:female = (gndr == 2) if !mi(gndr)
 (30 missing values generated)

 . la def female 0 "Male" 1 "Female", replace

 . la var female "Gender"

 . 
 . * Fix age variable name.
 . ren agea age

 . 
 . * Generate six age groups (15-24, 25-34, ..., 65+).
 . gen age6:age6 = irecode(age, 24, 34, 44, 54, 64, .)
 (208 missing values generated)

 . replace age6 = 10 * age6 + 15
 (56544 real changes made)

 . la def age6 15 "15-24" 25 "25-34" 35 "35-44" ///
 >         45 "45-54" 55 "55-64" 65 "65+", replace

 . la var age6 "Age groups"
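
The two-step age recode above can be checked on a few boundary ages. The sketch below uses made-up ages (an assumption, not ESS data): irecode() returns 0 for ages up to 24, 1 for 25-34, and so on up to 5 for ages above 64, and the replace step converts those codes into the lower bound of each group.

```stata
* Toy data: boundary ages plus one missing value (hypothetical, not ESS data).
clear
input age
15
24
25
44
45
64
65
90
.
end

* Same two steps as in the do-file.
gen age6 = irecode(age, 24, 34, 44, 54, 64, .)
replace age6 = 10 * age6 + 15

* Each age should fall in the group labelled by its lower bound;
* the missing age stays missing.
list age age6
```

Ending the cutpoint list with a missing value is what makes the last group open-ended, since Stata treats missing as larger than any number.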

 . 
 . * Subjective low income dummy.
 . gen lowinc = (hincfel > 2) if !mi(hincfel)
 (644 missing values generated)

 . la var lowinc "Subjective low income"
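
The if !mi(hincfel) qualifier matters because Stata treats missing values as larger than any number, so the expression (hincfel > 2) on its own would code nonrespondents as low income. A minimal sketch on toy values (hypothetical, not ESS data) compares the two versions:

```stata
* Toy data: the four valid answers plus one missing value.
clear
input hincfel
1
2
3
4
.
end

* Without the qualifier, the missing answer is coded 1.
gen naive = (hincfel > 2)

* With the qualifier, as in the do-file, it stays missing.
gen lowinc = (hincfel > 2) if !mi(hincfel)

list hincfel naive lowinc
```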

 . 
 . * Recode left-right scale.
 . recode lrscale (0/4 = 1 "Left") (5 = 2 "Centre") (6/10 = 3 "Right"), gen(pol3)
 (46308 differences between lrscale and pol3)

 . la var pol3 "Political views (left-right)"

 . 
 . 
 . * Subsetting
 . * ----------
 . 
 . * Check missing values.
 . misstable pat hsat age6 female health pol3 lowinc if cntry == "FR"

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1  2  3
  ------------+-------------
       94%    |  1  1  1
              |
        6     |  1  1  0
       <1     |  0  1  1
       <1     |  1  0  0
       <1     |  1  0  1
       <1     |  0  1  0
       <1     |  0  0  0
  ------------+-------------
      100%    |

  Variables are  (1) lowinc  (2) hsat  (3) pol3

 . misstable pat hsat age6 female health pol3 lowinc if cntry == "GB"

       Missing-value patterns
         (1 means complete)

              |   Pattern
    Percent   |  1  2  3  4    5  6
  ------------+---------------------
       88%    |  1  1  1  1    1  1
              |
       10     |  1  1  1  1    1  0
       <1     |  1  1  1  1    0  1
       <1     |  1  1  0  0    0  1
       <1     |  1  0  1  1    1  1
       <1     |  1  1  1  0    0  1
       <1     |  1  1  1  0    1  0
       <1     |  1  1  1  0    1  1
       <1     |  1  1  1  0    0  0
       <1     |  0  1  1  1    1  1
       <1     |  1  0  1  1    1  0
       <1     |  1  1  0  0    0  0
       <1     |  1  1  1  1    0  0
  ------------+---------------------
      100%    |

  Variables are  (1) health  (2) hsat  (3) female  (4) lowinc  (5) age6  (6) pol3

 . 
 . * Select case studies.
 . keep if inlist(cntry, "FR", "GB")
 (52327 observations deleted)

 . 
 . * Delete incomplete observations.
 . drop if mi(hsat, age6, female, health, pol3, lowinc)
 (404 observations deleted)

 . 
 . * Final sample sizes.
 . bys cntry: count

 ------------------------------------------------------------------------------------
 -> cntry = FR
 1942
 ------------------------------------------------------------------------------------
 -> cntry = GB
 2079

 . 
 . 
 . * Normality
 . * ---------
 . 
 . * Distribution of the DV in the case studies.
 . hist hsat, discrete normal xla(0 10) by(cntry, legend(off) note("")) ///
 >     name(dv_histograms, replace)

 . 
 . * Generate strictly positive DV recode.
 . gen hsat1 = hsat + 1

 . 
 . * Visual check of common transformations.
 . gladder hsat1, bin(11) ///
 >    name(gladder, replace)

 . 
 . /* Notes:
 > 
 >  - There are more missing observations for Britain than for France, and this
 >    might distort the results if the non-respondents come, for example, from the
 >    same end of the political spectrum. We'll be careful.
 > 
 >  - The distribution of the DV leans towards the upper end of the scale in both
 >    case studies, which is consistent with the hypothesis that extensive
 >    healthcare states like the ones found in Britain and France enjoy higher
 >    popular support.
 > 
 >  - To allow for a log-transformation, the variable should be strictly positive
 >    since the function f: y = log(x) is undefined for x = 0. We use a recode of
 >    the DV of strictly positive range to test for transformations.
 > 
 >  - The square root comes only marginally closer to a normal distribution. With
 >    little improvement in normality, transforming the DV would be overkill. It is
 >    reasonable to carry on with the untransformed DV. */
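
The point about the log-transformation can be illustrated on a toy 0-10 scale (an assumption for illustration, not the ESS data): the log of the raw scale is missing at zero, while the log of the shifted scale is defined everywhere.

```stata
* Toy data: one observation per point of a 0-10 scale.
clear
set obs 11
gen hsat = _n - 1

* The log of the raw scale is undefined (missing) where hsat == 0.
gen ln_raw = ln(hsat)

* Shifting the scale by one, as in the do-file, avoids the problem.
gen hsat1 = hsat + 1
gen ln_shift = ln(hsat1)

list hsat ln_raw hsat1 ln_shift
```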
 . 
 . 
 . * Export summary statistics
 . * -------------------------
 . 
 . * The next command is part of the SRQM folder. If Stata returns an error when
 . * you run it, set the folder as your working directory and type -run profile-
 . * to run the course setup, then try the command again. If you still experience
 . * problems with the -stab- command, please send a detailed email on the issue.
 . 
 . stab using week11_stats_FR.txt if cntry == "FR", replace ///
 >   mean(hsat) ///
 >   prop(female age6 health lowinc pol3)
 (note: file week11_stats_FR.txt not found)

 Variable                     mean           sd          min          max         mea
 > n           sd          min          max         mean           sd          min   
 >        max         mean           sd          min          max         mean       
 >     sd          min          max

 Gender                          %            %            %            %            
 > %

 Age groups                      %            %            %            %            
 > %

 Subjective general~h            %            %            %            %            
 > %

 Subjective low inc~e            %            %            %            %            
 > %

 Political views (l~)            %            %            %            %            
 > %

 N = 19420
 File: week11_stats_FR.txt

 . 
 . stab using week11_stats_GB.txt if cntry == "GB", replace ///
 >   mean(hsat) ///
 >   prop(female age6 health lowinc pol3)
 (note: file week11_stats_GB.txt not found)

 Variable                     mean           sd          min          max         mea
 > n           sd          min          max         mean           sd          min   
 >        max         mean           sd          min          max         mean       
 >     sd          min          max

 Gender                          %            %            %            %            
 > %

 Age groups                      %            %            %            %            
 > %

 Subjective general~h            %            %            %            %            
 > %

 Subjective low inc~e            %            %            %            %            
 > %

 Political views (l~)            %            %            %            %            
 > %

 N = 20790
 File: week11_stats_GB.txt

 . 
 . /* Syntax of the -stab- command:
 > 
 >  - using FILE  - name of the exported file; plain text (.txt) recommended
 >  - replace     - overwrite any previously existing file
 >  - mean()      - summarizes a list of continuous variables (mean, sd, min, max)
 >  - prop()      - summarizes a list of categorical variables (frequencies)
 > 
 >   In the example above, the -stab- command will export two files to the working
 >   directory, containing summary statistics for France (week11_stats_FR.txt) and
 >   Britain (week11_stats_GB.txt). */
 . 
 . 
 . * =====================
 . * = ASSOCIATION TESTS =
 . * =====================
 . 
 . 
 . * Relationships with socio-demographics
 . * -------------------------------------
 . 
 . * Line graph using DV means computed for each age and gender group.
 . cap drop msat_?

 . bys cntry age6: egen msat_1 = mean(hsat) if female
 (1880 missing values generated)

 . bys cntry age6: egen msat_2 = mean(hsat) if !female
 (2141 missing values generated)

 . tw conn msat_? age6, by(cntry, note("")) ///
 >     xti("Age") yti("Mean level of satisfaction") ///
 >     legend(row(1) order(1 "Female" 2 "Male")) ///
 >     name(hsat_age_sex, replace)

 . 
 . * Association between DV and gender.
 . by cntry: ttest hsat, by(female)

 ------------------------------------------------------------------------------------
 -> cntry = FR

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
    Male |     898    6.203786    .0707298    2.119535    6.064971    6.342601
  Female |    1044    5.822797    .0637708    2.060499    5.697663    5.947931
 ---------+--------------------------------------------------------------------
 combined |    1942     5.99897    .0475649    2.096094    5.905687    6.092254
 ---------+--------------------------------------------------------------------
    diff |            .3809893    .0950314                .1946148    .5673637
 ------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =   4.0091
 Ho: diff = 0                                     degrees of freedom =     1940

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000

 ------------------------------------------------------------------------------------
 -> cntry = GB

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
    Male |     982    6.255601    .0679263    2.128599    6.122303    6.388898
  Female |    1097    5.717411    .0676622     2.24104    5.584649    5.850173
 ---------+--------------------------------------------------------------------
 combined |    2079    5.971621      .04835    2.204568    5.876802     6.06644
 ---------+--------------------------------------------------------------------
    diff |            .5381897     .096149                .3496312    .7267482
 ------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =   5.5975
 Ho: diff = 0                                     degrees of freedom =     2077

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
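
The equal-variance t statistic depends only on the group sizes, means and standard deviations, so it can be recomputed from the summary rows alone. A minimal sketch using the immediate command -ttesti- with the FR figures reported above; the scalar names are introduced here for illustration only.

```stata
* Immediate form of -ttest-: obs, mean and SD for each group, taken from
* the FR output above (Male first, then Female).
ttesti 898 6.203786 2.119535 1044 5.822797 2.060499

* Keep the test statistic and degrees of freedom for inspection.
scalar t_fr  = r(t)
scalar df_fr = r(df_t)
display "t = " %6.4f t_fr ", df = " df_fr
```

The result should agree with the FR test above (t = 4.0091 on 1940 degrees of freedom), up to rounding of the inputs.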

 . 
 . * Correlation between DV and age.
 . by cntry: pwcorr hsat age, obs star(.01)

 ------------------------------------------------------------------------------------
 -> cntry = FR

             |     hsat      age
 -------------+------------------
        hsat |   1.0000 
             |     1942
             |
         age |  -0.0655*  1.0000 
             |     1942     1942
             |

 ------------------------------------------------------------------------------------
 -> cntry = GB

             |     hsat      age
 -------------+------------------
        hsat |   1.0000 
             |     2079
             |
         age |   0.2042*  1.0000 
             |     2079     2079
             |

 . 
 . * Generate a dummy from extreme categories of age.
 . cap drop agex

 . gen agex:agex = .
 (4021 missing values generated)

 . replace agex = 0 if age6 == 15
 (378 real changes made)

 . replace agex = 1 if age6 == 65
 (915 real changes made)

 . la def agex 0 "15-24" 1 "65+", replace

 . 
 . * Difference between age extremes.
 . bys cntry: ttest hsat, by(agex)

 ------------------------------------------------------------------------------------
 -> cntry = FR

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
   15-24 |     202    6.475248    .1298296    1.845225    6.219245     6.73125
     65+ |     420    5.961905    .1027907    2.106582    5.759855    6.163954
 ---------+--------------------------------------------------------------------
 combined |     622    6.128617     .081723    2.038167     5.96813    6.289104
 ---------+--------------------------------------------------------------------
    diff |            .5133428    .1734354                .1727508    .8539347
 ------------------------------------------------------------------------------
    diff = mean(15-24) - mean(65+)                                t =   2.9599
 Ho: diff = 0                                     degrees of freedom =      620

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9984         Pr(|T| > |t|) = 0.0032          Pr(T > t) = 0.0016

 ------------------------------------------------------------------------------------
 -> cntry = GB

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
   15-24 |     176    5.784091    .1443521    1.915046    5.499196    6.068986
     65+ |     495    6.856566    .0958556    2.132653    6.668231    7.044901
 ---------+--------------------------------------------------------------------
 combined |     671    6.575261    .0822037    2.129378    6.413853    6.736669
 ---------+--------------------------------------------------------------------
    diff |           -1.072475    .1823618               -1.430545   -.7144044
 ------------------------------------------------------------------------------
    diff = mean(15-24) - mean(65+)                                t =  -5.8810
 Ho: diff = 0                                     degrees of freedom =      669

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

 . 
 . 
 . * Relationship to health status
 . * -----------------------------
 . 
 . * DV by health.
 . gr dot hsat, over(health) over(cntry) ///
 >     yti("Satisfaction in health services") ///
 >     name(dv_health, replace)

 . 
 . * Line graph using DV means computed for each health status and gender group.
 . cap drop mu_hsat_*

 . bys health female: egen mu_hsat_FR = mean(hsat) if cntry == "FR"
 (2079 missing values generated)

 . bys health female: egen mu_hsat_GB = mean(hsat) if cntry == "GB"
 (1942 missing values generated)

 . tw conn mu_hsat_* health, by(female, note("")) ///
 >     xti("Health status") yti("Mean level of satisfaction") ///
 >     xlab(1 "Good" 5 "Bad") ///
 >     legend(row(1) order(1 "FR" 2 "GB")) ///
 >     name(hsat_health, replace)

 . 
 . * Generate a dummy from health status (bad/very bad = 0, good/very good = 1,
 . * fair = missing).
 . cap drop health01

 . recode health (1/2 = 1 "Good") (4/5 = 0 "Poor") (else = .), gen(health01)
 (2988 differences between health and health01)
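
Because of the (else = .) rule, respondents in "Fair" health are set to missing and drop out of any test that uses health01. A minimal sketch on the five valid codes (toy data, not the ESS sample) shows the mapping:

```stata
* Toy data: one observation per health category.
clear
input health
1
2
3
4
5
end

* Same recode as in the do-file.
recode health (1/2 = 1 "Good") (4/5 = 0 "Poor") (else = .), gen(health01)

* Codes 1-2 become 1, codes 4-5 become 0, code 3 becomes missing.
list health health01
```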

 . 
 . * Association between DV and health status.
 . bys cntry: ttest hsat, by(health01)

 ------------------------------------------------------------------------------------
 -> cntry = FR

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
    Poor |     143    5.545455    .2064476    2.468755    5.137347    5.953563
    Good |    1243    6.132743    .0577299    2.035335    6.019485    6.246002
 ---------+--------------------------------------------------------------------
 combined |    1386     6.07215     .056162    2.090858    5.961978    6.182322
 ---------+--------------------------------------------------------------------
    diff |           -.5872888    .1840209               -.9482789   -.2262988
 ------------------------------------------------------------------------------
    diff = mean(Poor) - mean(Good)                                t =  -3.1914
 Ho: diff = 0                                     degrees of freedom =     1384

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0007         Pr(|T| > |t|) = 0.0014          Pr(T > t) = 0.9993

 ------------------------------------------------------------------------------------
 -> cntry = GB

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
    Poor |     141    5.929078    .2209048    2.623099    5.492337    6.365819
    Good |    1501    5.965356    .0541266    2.097014    5.859185    6.071528
 ---------+--------------------------------------------------------------------
 combined |    1642    5.962241    .0529676    2.146332     5.85835    6.066132
 ---------+--------------------------------------------------------------------
    diff |           -.0362784    .1891085               -.4071979    .3346411
 ------------------------------------------------------------------------------
    diff = mean(Poor) - mean(Good)                                t =  -0.1918
 Ho: diff = 0                                     degrees of freedom =     1640

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.4239         Pr(|T| > |t|) = 0.8479          Pr(T > t) = 0.5761

 . 
 . 
 . * Relationship to low income status
 . * ---------------------------------
 . 
 . * DV by income.
 . bys cntry: ttest hsat, by(lowinc)

 ------------------------------------------------------------------------------------
 -> cntry = FR

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
       0 |    1647    6.061324    .0505408     2.05111    5.962193    6.160455
       1 |     295    5.650847    .1341597    2.304268    5.386812    5.914882
 ---------+--------------------------------------------------------------------
 combined |    1942     5.99897    .0475649    2.096094    5.905687    6.092254
 ---------+--------------------------------------------------------------------
    diff |            .4104762     .132225                .1511582    .6697941
 ------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   3.1044
 Ho: diff = 0                                     degrees of freedom =     1940

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9990         Pr(|T| > |t|) = 0.0019          Pr(T > t) = 0.0010

 ------------------------------------------------------------------------------------
 -> cntry = GB

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
       0 |    1730    6.064162    .0516962    2.150212    5.962768    6.165555
       1 |     349    5.512894    .1288758    2.407598    5.259421    5.766367
 ---------+--------------------------------------------------------------------
 combined |    2079    5.971621      .04835    2.204568    5.876802     6.06644
 ---------+--------------------------------------------------------------------
    diff |            .5512679     .128829                .2986205    .8039152
 ------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   4.2791
 Ho: diff = 0                                     degrees of freedom =     2077

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

 . 
 . * Association between low income (IV) and political views.
 . bys cntry: tab lowinc pol3, col chi2 nokey

 ------------------------------------------------------------------------------------
 -> cntry = FR

 Subjective |   Political views (left-right)
 low income |      Left     Centre      Right |     Total
 -----------+---------------------------------+----------
         0 |       605        449        593 |     1,647 
           |     81.54      82.54      90.40 |     84.81 
 -----------+---------------------------------+----------
         1 |       137         95         63 |       295 
           |     18.46      17.46       9.60 |     15.19 
 -----------+---------------------------------+----------
     Total |       742        544        656 |     1,942 
           |    100.00     100.00     100.00 |    100.00 

          Pearson chi2(2) =  24.2449   Pr = 0.000

 ------------------------------------------------------------------------------------
 -> cntry = GB

 Subjective |   Political views (left-right)
 low income |      Left     Centre      Right |     Total
 -----------+---------------------------------+----------
         0 |       500        709        521 |     1,730 
           |     83.47      80.94      86.26 |     83.21 
 -----------+---------------------------------+----------
         1 |        99        167         83 |       349 
           |     16.53      19.06      13.74 |     16.79 
 -----------+---------------------------------+----------
     Total |       599        876        604 |     2,079 
           |    100.00     100.00     100.00 |    100.00 

          Pearson chi2(2) =   7.2899   Pr = 0.026


 . 
 . * Proportions test (the mean of the lowinc dummy is a sample proportion).
 . bys cntry: prtest lowinc if pol3 != 2, by(pol3)

 ------------------------------------------------------------------------------------
 -> cntry = FR

 Two-sample test of proportions                  Left: Number of obs =      742
                                               Right: Number of obs =      656
 ------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        Left |   .1846361    .014244                      .1567184    .2125539
       Right |   .0960366   .0115038                      .0734895    .1185836
 -------------+----------------------------------------------------------------
        diff |   .0885995   .0183093                       .052714     .124485
             |  under Ho:   .0187645     4.72   0.000
 ------------------------------------------------------------------------------
        diff = prop(Left) - prop(Right)                           z =   4.7217
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 1.0000         Pr(|Z| < |z|) = 0.0000          Pr(Z > z) = 0.0000

 ------------------------------------------------------------------------------------
 -> cntry = GB

 Two-sample test of proportions                  Left: Number of obs =      599
                                               Right: Number of obs =      604
 ------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        Left |   .1652755   .0151762                      .1355307    .1950202
       Right |   .1374172   .0140089                      .1099604    .1648741
 -------------+----------------------------------------------------------------
        diff |   .0278582   .0206534                     -.0126217    .0683382
             |  under Ho:   .0206625     1.35   0.178
 ------------------------------------------------------------------------------
        diff = prop(Left) - prop(Right)                           z =   1.3482
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.9112         Pr(|Z| < |z|) = 0.1776          Pr(Z > z) = 0.0888

 . 
 . 
 . * Relationship to left-right attitude
 . * -----------------------------------
 . 
 . * Correlation between DV and political attitude (left 1-10 right).
 . by cntry: pwcorr hsat lrscale, obs sig

 ------------------------------------------------------------------------------------
 -> cntry = FR

             |     hsat  lrscale
 -------------+------------------
        hsat |   1.0000 
             |
             |     1942
             |
     lrscale |   0.1998   1.0000 
             |   0.0000
             |     1942     1942
             |

 ------------------------------------------------------------------------------------
 -> cntry = GB

             |     hsat  lrscale
 -------------+------------------
        hsat |   1.0000 
             |
             |     2079
             |
     lrscale |   0.0153   1.0000 
             |   0.4853
             |     2079     2079
             |

 . 
 . * Association between DV and political attitude (left, centre, right).
 . gr box hsat, noout note("") over(pol3) asyvars over(cntry) legend(row(1)) ///
 >     scheme(burd4) name(dv_pol3, replace)

 . 
 . * Comparison with covariates
 . * --------------------------
 . 
 . d hsat esat gsat

              storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------------------
 hsat            byte   %2.0f       stfhlth    State of health services in country
                                                nowadays
 esat            byte   %2.0f       stfedu     State of education in country nowadays
 gsat            byte   %2.0f       stfgov     How satisfied with the national
                                                government

 . 
 . * DV and other ESS satisfaction items (edu = education, gov = government).
 . cap drop msat*

 . bys cntry lrscale: egen msat1 = mean(hsat)

 . bys cntry lrscale: egen msat2 = mean(esat)

 . bys cntry lrscale: egen msat3 = mean(gsat)

 . 
 . * Line graph, using the means computed above for each left-right group.
 . tw conn msat? lrscale, by(cntry, note("")) ///
 >     xla(0 "Left" 10 "Right") xti("") yti("Mean level of satisfaction") ///
 >     legend(row(1) order(1 "Health services" 2 "Education" 3 "Government")) ///
 >     name(stf_lrscale, replace)

 . 
 . /* Notes:
 > 
 >  - The significance tests are expectedly highly significant given the large N.
 >    The risk here is to make Type I errors, even though the variations between
 >    age groups in each country seem statistically robust.
 > 
 >  - Health status seems important in France but not in Britain, whereas old age
 >    seems important in Britain but not in France. It will be interesting to see
 >    if any of these effects persist after controlling for income.
 > 
 >  - The relationship between financial difficulties and political leaning shows
 >    how your independent variables are interacting with each other.
 >          
 >  - Other measures of satisfaction (which are not part of the model itself) show
 >    how satisfaction with health services correlates with other measures of
 >    public sector performance when examined by left-right positioning. Politics
 >    matter. */
 .  
 . 
 . * =====================
 . * = REGRESSION MODELS =
 . * =====================
 . 
 . 
 . * Multiple linear regression model for each country case.
 . bys cntry: reg hsat ib45.age6 female i.health lowinc ib2.pol3

 ------------------------------------------------------------------------------------
 -> cntry = FR

      Source |       SS       df       MS              Number of obs =    1942
 -------------+------------------------------           F( 13,  1928) =   10.15
       Model |  546.017183    13  42.0013217           Prob > F      =  0.0000
    Residual |  7981.98076  1928  4.14003151           R-squared     =  0.0640
 -------------+------------------------------           Adj R-squared =  0.0577
       Total |  8527.99794  1941  4.39361048           Root MSE      =  2.0347

 ------------------------------------------------------------------------------
        hsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .5757433   .1831204     3.14   0.002     .2166085    .9348781
         25  |   .4181367   .1620445     2.58   0.010     .1003357    .7359377
         35  |   .2737408   .1555479     1.76   0.079     -.031319    .5788006
         55  |   .0178386   .1577826     0.11   0.910    -.2916038     .327281
         65  |   .1721822   .1509785     1.14   0.254    -.1239161    .4682806
             |
      female |  -.3929954   .0930814    -4.22   0.000    -.5755461   -.2104446
             |
      health |
          2  |   -.367027   .1261025    -2.91   0.004    -.6143386   -.1197154
          3  |  -.4762367   .1408647    -3.38   0.001    -.7524999   -.1999735
          4  |  -.5536348   .2210454    -2.50   0.012    -.9871479   -.1201216
          5  |  -1.020825   .4543408    -2.25   0.025    -1.911876   -.1297743
             |
      lowinc |  -.2263043   .1330334    -1.70   0.089    -.4872088    .0346003
             |
        pol3 |
          1  |  -.5218848   .1154915    -4.52   0.000    -.7483861   -.2953834
          3  |   .3431802   .1195869     2.87   0.004     .1086469    .5777134
             |
       _cons |   6.458039   .1756112    36.77   0.000     6.113631    6.802447
 ------------------------------------------------------------------------------

 ------------------------------------------------------------------------------------
 -> cntry = GB

      Source |       SS       df       MS              Number of obs =    2079
 -------------+------------------------------           F( 13,  2065) =   13.25
       Model |  777.733419    13  59.8256476           Prob > F      =  0.0000
    Residual |  9321.59222  2065  4.51408824           R-squared     =  0.0770
 -------------+------------------------------           Adj R-squared =  0.0712
       Total |  10099.3256  2078  4.86011821           Root MSE      =  2.1246

 ------------------------------------------------------------------------------
        hsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .1577784   .1958508     0.81   0.421    -.2263073     .541864
         25  |  -.0492595   .1659278    -0.30   0.767    -.3746627    .2761437
         35  |  -.0429892   .1537828    -0.28   0.780    -.3445747    .2585963
         55  |   .3753688      .1624     2.31   0.021      .056884    .6938537
         65  |   1.240436   .1496894     8.29   0.000      .946878    1.533994
             |
      female |  -.5118235   .0939119    -5.45   0.000    -.6959954   -.3276516
             |
      health |
          2  |   -.293325   .1115856    -2.63   0.009    -.5121571   -.0744928
          3  |  -.3068507   .1356245    -2.26   0.024    -.5728256   -.0408758
          4  |   -.450338   .2229726    -2.02   0.044    -.8876125   -.0130635
          5  |  -.0378434   .4049108    -0.09   0.926    -.8319194    .7562325
             |
      lowinc |  -.3094155    .129387    -2.39   0.017     -.563158    -.055673
             |
        pol3 |
          1  |   .1433662   .1130665     1.27   0.205      -.07837    .3651024
          3  |   .0290743   .1146745     0.25   0.800    -.1958155    .2539641
             |
       _cons |   6.101753   .1548257    39.41   0.000     5.798122    6.405383
 ------------------------------------------------------------------------------

 . 
 . * Cleaner output with the -leanout- command.
 . leanout: reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "FR"
 
 Dependent variable: hsat

          Variable    Coef     SE      95%  CI
  -----------------------------------------------
              age6 
               15     0.6    0.2   (  0.2,  0.9)
               25     0.4    0.2   (  0.1,  0.7)
               35     0.3    0.2   ( -0.0,  0.6)
               55     0.0    0.2   ( -0.3,  0.3)
               65     0.2    0.2   ( -0.1,  0.5)
                   
            female   -0.4    0.1   ( -0.6, -0.2)
                   
            health 
                2    -0.4    0.1   ( -0.6, -0.1)
                3    -0.5    0.1   ( -0.8, -0.2)
                4    -0.6    0.2   ( -1.0, -0.1)
                5    -1.0    0.5   ( -1.9, -0.1)
                   
            lowinc   -0.2    0.1   ( -0.5,  0.0)
                   
              pol3 
                1    -0.5    0.1   ( -0.7, -0.3)
                3     0.3    0.1   (  0.1,  0.6)
                   
             _cons    6.5    0.2   (  6.1,  6.8)
  -----------------------------------------------
 Number of observations = 1942
 Root Mean Squared Error =   2.0

 . leanout: reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "GB"
 
 Dependent variable: hsat

          Variable    Coef     SE      95%  CI
  -----------------------------------------------
              age6 
               15     0.2    0.2   ( -0.2,  0.5)
               25    -0.0    0.2   ( -0.4,  0.3)
               35    -0.0    0.2   ( -0.3,  0.3)
               55     0.4    0.2   (  0.1,  0.7)
               65     1.2    0.1   (  0.9,  1.5)
                   
            female   -0.5    0.1   ( -0.7, -0.3)
                   
            health 
                2    -0.3    0.1   ( -0.5, -0.1)
                3    -0.3    0.1   ( -0.6, -0.0)
                4    -0.5    0.2   ( -0.9, -0.0)
                5    -0.0    0.4   ( -0.8,  0.8)
                   
            lowinc   -0.3    0.1   ( -0.6, -0.1)
                   
              pol3 
                1     0.1    0.1   ( -0.1,  0.4)
                3     0.0    0.1   ( -0.2,  0.3)
                   
             _cons    6.1    0.2   (  5.8,  6.4)
  -----------------------------------------------
 Number of observations = 2079
 Root Mean Squared Error =   2.1

 . 
 . /* Notes:
 > 
 >  - This model is specified as a multiple linear regression. It captures linear
 >    relationships by estimating the partial derivative of the DV with respect
 >    to each variable, which is the effect of that variable on the DV when all
 >    other variables are held constant.
 > 
 >    We will therefore read the coefficient of an IV as its net effect on the DV,
 >    independently of all other variables in the model. This interpretation gives
 >    its meaning to the idiom of 'all other things being equal' (ceteris paribus).
 > 
 >  - The baseline age category is set to the category that contains the average
 >    population age (45-54 years-old) and is coded 'ib45' because the categories
 >    of 'age6' are coded 15, 25, 35 etc.
 > 
 >  - The baseline health status is set to default reference category 1 = very good.
 >    Categories 2-5 code for 2 = good to 5 = poor health.
 > 
 >  - The baseline political attitude is the modal (and central) category 2 = centre
 >    so that 1 = leftwing and 3 = rightwing.
 >    
 >    The baseline model, given by the constant, is therefore the predicted mean of
 >    the DV for respondents who are males, aged 45-54, in very good health, at the
 >    centre politically and who did not report financial difficulties ('lowinc').
 >    
 >  - Let's manually check whether the model does a good job at predicting the
 >    constant (the baseline model) in the second country case:
 >    
 >    su hsat if cntry == "GB" & age6 == 45 & !female & health == 1 & ///
 >        !lowinc & pol3 == 2
 >    
 >  - For the same country case, the model predicts a higher value for respondents
 >    aged 65+, keeping all other variables equal. Let's check that too:
 >    
 >    su hsat if cntry == "GB" & age6 == 65 & !female & health == 1 & ///
 >        !lowinc & pol3 == 2
 >    
 >    Not so bad for a model explaining only about 7% of the variance, but remember that
 >    the predicted values are only means, that they are significant only for some
 >    coefficients, and that they apply only to a fraction of all observations.
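 >    
 >  - A quick arithmetic check of the second prediction, using the GB estimates
 >    shown above: the constant (6.10) plus the coefficient for the 65+ age
 >    group (1.24) gives a predicted mean of about 7.34 for that profile. A
 >    minimal sketch of the same sum in Stata, assuming that the GB model is
 >    re-estimated first so that it is the active set of estimates:
 >    
 >    reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "GB"
 >    lincom _cons + 65.age6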
 >    
 >  - To assess the overall quality of the models, you should rather read the RMSE.
 >    The Root-Mean-Square Error is the standard error of the regression: it shows
 >    by how much we mispredict the DV on average.
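 >    
 >    As a sketch of where that figure comes from, the RMSE is the square root
 >    of the residual mean square: in the French model above, this is
 >    sqrt(7981.98 / 1928), or sqrt(4.14), which is about 2.03. Assuming that a
 >    regression has just been run, its stored results should give the same
 >    value both ways:
 >    
 >    di sqrt(e(rss) / e(df_r))
 >    di e(rmse)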
 > 
 >    We later turn to regression diagnostics to explore the error term. */
 . 
 . 
 . * Using the -estout- command
 . * --------------------------
 . 
 . * Store model estimates.
 . eststo clear

 . bys cntry: eststo: qui reg hsat ib45.age6 female i.health lowinc ib2.pol3

 ------------------------------------------------------------------------------------
 -> FR
 (est1 stored)

 ------------------------------------------------------------------------------------
 -> GB
 (est2 stored)

 . 
 . * View stored model estimates.
 . eststo dir

 -------------------------------------------------------
        name | command      depvar       npar  title 
 -------------+-----------------------------------------
        est1 | regress      hsat           17  FR
        est2 | regress      hsat           17  GB
 -------------------------------------------------------

 . 
 . * View standardized coefficients.
 . esttab, wide nogaps beta(2) se(2) sca(rmse) mti("FR" "GB")

 ----------------------------------------------------------------------
                      (1)                          (2)                
                       FR                           GB                
 ----------------------------------------------------------------------
 15.age6              0.08**        (0.18)         0.02          (0.20)
 25.age6              0.07**        (0.16)        -0.01          (0.17)
 35.age6              0.05          (0.16)        -0.01          (0.15)
 45b.age6             0.00             (.)         0.00             (.)
 55.age6              0.00          (0.16)         0.06*         (0.16)
 65.age6              0.03          (0.15)         0.24***       (0.15)
 female              -0.09***       (0.09)        -0.12***       (0.09)
 1b.health            0.00             (.)         0.00             (.)
 2.health            -0.09**        (0.13)        -0.07**        (0.11)
 3.health            -0.10***       (0.14)        -0.06*         (0.14)
 4.health            -0.06*         (0.22)        -0.05*         (0.22)
 5.health            -0.05*         (0.45)        -0.00          (0.40)
 lowinc              -0.04          (0.13)        -0.05*         (0.13)
 1.pol3              -0.12***       (0.12)         0.03          (0.11)
 2b.pol3              0.00             (.)         0.00             (.)
 3.pol3               0.08**        (0.12)         0.01          (0.11)
 ----------------------------------------------------------------------
 N                    1942                         2079                
 rmse                2.035                        2.125                
 ----------------------------------------------------------------------
 Standardized beta coefficients; Standard errors in parentheses
 * p<0.05, ** p<0.01, *** p<0.001

 . 
 . * Export unstandardized coefficients.
 . esttab using week11_regressions.txt, replace ///
 >     nolines wide nogaps b(1) se(1) sca(rmse) mti("FR" "GB")
 (note: file week11_regressions.txt not found)
 (output written to week11_regressions.txt)

 . 
 . 
 . * Models with covariates
 . * ----------------------
 . 
 . * Store model estimates (again).
 . eststo clear

 . bys cntry: eststo: reg hsat ib45.age6 female i.health lowinc ib2.pol3

 ------------------------------------------------------------------------------------
 -> FR

      Source |       SS       df       MS              Number of obs =    1942
 -------------+------------------------------           F( 13,  1928) =   10.15
       Model |  546.017183    13  42.0013217           Prob > F      =  0.0000
    Residual |  7981.98076  1928  4.14003151           R-squared     =  0.0640
 -------------+------------------------------           Adj R-squared =  0.0577
       Total |  8527.99794  1941  4.39361048           Root MSE      =  2.0347

 ------------------------------------------------------------------------------
        hsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .5757433   .1831204     3.14   0.002     .2166085    .9348781
         25  |   .4181367   .1620445     2.58   0.010     .1003357    .7359377
         35  |   .2737408   .1555479     1.76   0.079     -.031319    .5788006
         55  |   .0178386   .1577826     0.11   0.910    -.2916038     .327281
         65  |   .1721822   .1509785     1.14   0.254    -.1239161    .4682806
             |
      female |  -.3929954   .0930814    -4.22   0.000    -.5755461   -.2104446
             |
      health |
          2  |   -.367027   .1261025    -2.91   0.004    -.6143386   -.1197154
          3  |  -.4762367   .1408647    -3.38   0.001    -.7524999   -.1999735
          4  |  -.5536348   .2210454    -2.50   0.012    -.9871479   -.1201216
          5  |  -1.020825   .4543408    -2.25   0.025    -1.911876   -.1297743
             |
      lowinc |  -.2263043   .1330334    -1.70   0.089    -.4872088    .0346003
             |
        pol3 |
          1  |  -.5218848   .1154915    -4.52   0.000    -.7483861   -.2953834
          3  |   .3431802   .1195869     2.87   0.004     .1086469    .5777134
             |
       _cons |   6.458039   .1756112    36.77   0.000     6.113631    6.802447
 ------------------------------------------------------------------------------
 (est1 stored)

 ------------------------------------------------------------------------------------
 -> GB

      Source |       SS       df       MS              Number of obs =    2079
 -------------+------------------------------           F( 13,  2065) =   13.25
       Model |  777.733419    13  59.8256476           Prob > F      =  0.0000
    Residual |  9321.59222  2065  4.51408824           R-squared     =  0.0770
 -------------+------------------------------           Adj R-squared =  0.0712
       Total |  10099.3256  2078  4.86011821           Root MSE      =  2.1246

 ------------------------------------------------------------------------------
        hsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .1577784   .1958508     0.81   0.421    -.2263073     .541864
         25  |  -.0492595   .1659278    -0.30   0.767    -.3746627    .2761437
         35  |  -.0429892   .1537828    -0.28   0.780    -.3445747    .2585963
         55  |   .3753688      .1624     2.31   0.021      .056884    .6938537
         65  |   1.240436   .1496894     8.29   0.000      .946878    1.533994
             |
      female |  -.5118235   .0939119    -5.45   0.000    -.6959954   -.3276516
             |
      health |
          2  |   -.293325   .1115856    -2.63   0.009    -.5121571   -.0744928
          3  |  -.3068507   .1356245    -2.26   0.024    -.5728256   -.0408758
          4  |   -.450338   .2229726    -2.02   0.044    -.8876125   -.0130635
          5  |  -.0378434   .4049108    -0.09   0.926    -.8319194    .7562325
             |
      lowinc |  -.3094155    .129387    -2.39   0.017     -.563158    -.055673
             |
        pol3 |
          1  |   .1433662   .1130665     1.27   0.205      -.07837    .3651024
          3  |   .0290743   .1146745     0.25   0.800    -.1958155    .2539641
             |
       _cons |   6.101753   .1548257    39.41   0.000     5.798122    6.405383
 ------------------------------------------------------------------------------
 (est2 stored)

 . 
 . * Run identical model on satisfaction with education.
 . bys cntry: eststo: reg esat ib45.age6 female i.health lowinc ib2.pol3

 ------------------------------------------------------------------------------------
 -> FR

      Source |       SS       df       MS              Number of obs =    1918
 -------------+------------------------------           F( 13,  1904) =    4.38
       Model |  239.038913    13  18.3876087           Prob > F      =  0.0000
    Residual |  7986.26714  1904  4.19446803           R-squared     =  0.0291
 -------------+------------------------------           Adj R-squared =  0.0224
       Total |  8225.30605  1917  4.29071781           Root MSE      =   2.048

 ------------------------------------------------------------------------------
        esat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |    .338069   .1848041     1.83   0.068    -.0243707    .7005088
         25  |   .1280543   .1638014     0.78   0.434    -.1931947    .4493033
         35  |   .3197301   .1573435     2.03   0.042     .0111463    .6283139
         55  |    .138363   .1598155     0.87   0.387    -.1750689    .4517949
         65  |   .0144126   .1534757     0.09   0.925    -.2865856    .3154108
             |
      female |   .0175677   .0942759     0.19   0.852    -.1673271    .2024626
             |
      health |
          2  |  -.1528419   .1275963    -1.20   0.231     -.403085    .0974013
          3  |  -.3176335   .1430491    -2.22   0.027     -.598183    -.037084
          4  |   -.406681   .2250693    -1.81   0.071    -.8480894    .0347273
          5  |  -.7693203   .4576307    -1.68   0.093    -1.666831    .1281899
             |
      lowinc |  -.4343483   .1345077    -3.23   0.001    -.6981462   -.1705504
             |
        pol3 |
          1  |  -.3903843    .116867    -3.34   0.001    -.6195851   -.1611835
          3  |   .0564588   .1212735     0.47   0.642    -.1813841    .2943017
             |
       _cons |   5.209674   .1780503    29.26   0.000      4.86048    5.558868
 ------------------------------------------------------------------------------
 (est3 stored)

 ------------------------------------------------------------------------------------
 -> GB

      Source |       SS       df       MS              Number of obs =    2028
 -------------+------------------------------           F( 13,  2014) =    3.10
       Model |  171.674796    13  13.2057536           Prob > F      =  0.0001
    Residual |  8578.75666  2014   4.2595614           R-squared     =  0.0196
 -------------+------------------------------           Adj R-squared =  0.0133
       Total |  8750.43146  2027  4.31693708           Root MSE      =  2.0639

 ------------------------------------------------------------------------------
        esat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .7480398   .1915159     3.91   0.000     .3724498     1.12363
         25  |   .3473778   .1630304     2.13   0.033     .0276521    .6671036
         35  |   .2107234   .1509536     1.40   0.163    -.0853182    .5067649
         55  |    .103085   .1603123     0.64   0.520    -.2113102    .4174802
         65  |   .3305423   .1478204     2.24   0.025     .0406455    .6204391
             |
      female |  -.0976372   .0924553    -1.06   0.291    -.2789551    .0836808
             |
      health |
          2  |  -.0752296   .1096759    -0.69   0.493    -.2903197    .1398605
          3  |  -.1966405   .1336877    -1.47   0.141     -.458821    .0655401
          4  |  -.4759092   .2179303    -2.18   0.029    -.9033015   -.0485169
          5  |  -.0885606   .3994685    -0.22   0.825    -.8719753    .6948541
             |
      lowinc |  -.1847112   .1266079    -1.46   0.145    -.4330074    .0635851
             |
        pol3 |
          1  |   .0261182   .1114108     0.23   0.815    -.1923743    .2446108
          3  |  -.3412746   .1128567    -3.02   0.003    -.5626027   -.1199465
             |
       _cons |   5.729339   .1530038    37.45   0.000     5.429277    6.029401
 ------------------------------------------------------------------------------
 (est4 stored)

 . 
 . * Run identical model on satisfaction with government.
 . bys cntry: eststo: reg gsat ib45.age6 female i.health lowinc ib2.pol3

 ------------------------------------------------------------------------------------
 -> FR

      Source |       SS       df       MS              Number of obs =    1927
 -------------+------------------------------           F( 13,  1913) =   62.78
       Model |  3143.99841    13  241.846032           Prob > F      =  0.0000
    Residual |   7368.8267  1913  3.85197423           R-squared     =  0.2991
 -------------+------------------------------           Adj R-squared =  0.2943
       Total |  10512.8251  1926  5.45837233           Root MSE      =  1.9626

 ------------------------------------------------------------------------------
        gsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .1415863   .1774439     0.80   0.425    -.2064176    .4895902
         25  |  -.1676006   .1563996    -1.07   0.284    -.4743322     .139131
         35  |  -.1988654   .1504437    -1.32   0.186    -.4939163    .0961855
         55  |   .0137179   .1526804     0.09   0.928    -.2857197    .3131554
         65  |   .2527315   .1462162     1.73   0.084    -.0340285    .5394915
             |
      female |  -.0071429    .090083    -0.08   0.937    -.1838141    .1695283
             |
      health |
          2  |  -.2660753   .1221563    -2.18   0.030    -.5056489   -.0265017
          3  |  -.3098499    .136446    -2.27   0.023    -.5774485   -.0422513
          4  |  -.5489245    .214653    -2.56   0.011     -.969903    -.127946
          5  |  -.6821846   .4385135    -1.56   0.120    -1.542199    .1778301
             |
      lowinc |  -.5982746   .1294904    -4.62   0.000    -.8522318   -.3443175
             |
        pol3 |
          1  |  -1.447156   .1121175   -12.91   0.000    -1.667042   -1.227271
          3  |   1.368768   .1158465    11.82   0.000     1.141569    1.595967
             |
       _cons |   4.312006   .1701966    25.34   0.000     3.978216    4.645797
 ------------------------------------------------------------------------------
 (est5 stored)

 ------------------------------------------------------------------------------------
 -> GB

      Source |       SS       df       MS              Number of obs =    2070
 -------------+------------------------------           F( 13,  2056) =    6.55
       Model |  442.594151    13   34.045704           Prob > F      =  0.0000
    Residual |  10690.3604  2056  5.19959165           R-squared     =  0.0398
 -------------+------------------------------           Adj R-squared =  0.0337
       Total |  11132.9546  2069  5.38083837           Root MSE      =  2.2803

 ------------------------------------------------------------------------------
        gsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .8183095   .2110845     3.88   0.000     .4043478    1.232271
         25  |   .1340198    .178367     0.75   0.453    -.2157789    .4838186
         35  |  -.0549785   .1651793    -0.33   0.739    -.3789146    .2689576
         55  |   .0306191   .1744077     0.18   0.861     -.311415    .3726533
         65  |   .3743036   .1611006     2.32   0.020     .0583662    .6902409
             |
      female |  -.2991263   .1010176    -2.96   0.003    -.4972338   -.1010188
             |
      health |
          2  |  -.1736806   .1198973    -1.45   0.148    -.4088133    .0614521
          3  |   -.495565   .1459817    -3.39   0.001    -.7818524   -.2092776
          4  |  -.5286341   .2402121    -2.20   0.028    -.9997185   -.0575497
          5  |  -.3879851   .4346074    -0.89   0.372    -1.240302    .4643315
             |
      lowinc |  -.4232963   .1390684    -3.04   0.002    -.6960259   -.1505666
             |
        pol3 |
          1  |   .4348585   .1217063     3.57   0.000      .196178     .673539
          3  |  -.2616768   .1232299    -2.12   0.034    -.5033452   -.0200084
             |
       _cons |   3.765473   .1664704    22.62   0.000     3.439004    4.091941
 ------------------------------------------------------------------------------
 (est6 stored)

 . 
 . * View updated list of model estimates.
 . eststo dir

 -------------------------------------------------------
        name | command      depvar       npar  title 
 -------------+-----------------------------------------
        est1 | regress      hsat           17  FR
        est2 | regress      hsat           17  GB
        est3 | regress      esat           17  FR
        est4 | regress      esat           17  GB
        est5 | regress      gsat           17  FR
        est6 | regress      gsat           17  GB
 -------------------------------------------------------

 . 
 . * Compare the DV to the other satisfaction measures in each country, using
 . * standardized coefficients, RMSE and R-squared to compare explained variance
 . * across the models.
 . esttab est1 est3 est5, lab nogaps beta(2) se(2) sca(rmse) r2 ///
 >         mti("Health" "Education" "Government") ti("France")

 France
 --------------------------------------------------------------------
                              (1)             (2)             (3)   
                           Health       Education      Government   
 --------------------------------------------------------------------
 15.Age groups                0.08**          0.05            0.02   
                           (0.18)          (0.18)          (0.18)   
 25.Age groups                0.07**          0.02           -0.03   
                           (0.16)          (0.16)          (0.16)   
 35.Age groups                0.05            0.06*          -0.03   
                           (0.16)          (0.16)          (0.15)   
 45b.Age groups               0.00            0.00            0.00   
                              (.)             (.)             (.)   
 55.Age groups                0.00            0.03            0.00   
                           (0.16)          (0.16)          (0.15)   
 65.Age groups                0.03            0.00            0.04   
                           (0.15)          (0.15)          (0.15)   
 Gender                      -0.09***         0.00           -0.00   
                           (0.09)          (0.09)          (0.09)   
 1b.Subjective gene~h         0.00            0.00            0.00   
                              (.)             (.)             (.)   
 2.Subjective gener~h        -0.09**         -0.04           -0.06*  
                           (0.13)          (0.13)          (0.12)   
 3.Subjective gener~h        -0.10***        -0.07*          -0.06*  
                           (0.14)          (0.14)          (0.14)   
 4.Subjective gener~h        -0.06*          -0.05           -0.06*  
                           (0.22)          (0.23)          (0.21)   
 5.Subjective gener~h        -0.05*          -0.04           -0.03   
                           (0.45)          (0.46)          (0.44)   
 Subjective low inc~e        -0.04           -0.08**         -0.09***
                           (0.13)          (0.13)          (0.13)   
 1.Political views ~)        -0.12***        -0.09***        -0.30***
                           (0.12)          (0.12)          (0.11)   
 2b.Political views~)         0.00            0.00            0.00   
                              (.)             (.)             (.)   
 3.Political views ~)         0.08**          0.01            0.28***
                           (0.12)          (0.12)          (0.12)   
 --------------------------------------------------------------------
 Observations                 1942            1918            1927   
 R-squared                   0.064           0.029           0.299   
 rmse                        2.035           2.048           1.963   
 --------------------------------------------------------------------
 Standardized beta coefficients; Standard errors in parentheses
 * p<0.05, ** p<0.01, *** p<0.001

 . 
 . esttab est2 est4 est6, lab nogaps beta(2) se(2) sca(rmse) r2 ///
 >         mti("Health" "Education" "Government") ti("UK")

 UK
 --------------------------------------------------------------------
                              (1)             (2)             (3)   
                           Health       Education      Government   
 --------------------------------------------------------------------
 15.Age groups                0.02            0.10***         0.10***
                           (0.20)          (0.19)          (0.21)   
 25.Age groups               -0.01            0.06*           0.02   
                           (0.17)          (0.16)          (0.18)   
 35.Age groups               -0.01            0.04           -0.01   
                           (0.15)          (0.15)          (0.17)   
 45b.Age groups               0.00            0.00            0.00   
                              (.)             (.)             (.)   
 55.Age groups                0.06*           0.02            0.00   
                           (0.16)          (0.16)          (0.17)   
 65.Age groups                0.24***         0.07*           0.07*  
                           (0.15)          (0.15)          (0.16)   
 Gender                      -0.12***        -0.02           -0.06** 
                           (0.09)          (0.09)          (0.10)   
 1b.Subjective gene~h         0.00            0.00            0.00   
                              (.)             (.)             (.)   
 2.Subjective gener~h        -0.07**         -0.02           -0.04   
                           (0.11)          (0.11)          (0.12)   
 3.Subjective gener~h        -0.06*          -0.04           -0.09***
                           (0.14)          (0.13)          (0.15)   
 4.Subjective gener~h        -0.05*          -0.05*          -0.05*  
                           (0.22)          (0.22)          (0.24)   
 5.Subjective gener~h        -0.00           -0.01           -0.02   
                           (0.40)          (0.40)          (0.43)   
 Subjective low inc~e        -0.05*          -0.03           -0.07** 
                           (0.13)          (0.13)          (0.14)   
 1.Political views ~)         0.03            0.01            0.08***
                           (0.11)          (0.11)          (0.12)   
 2b.Political views~)         0.00            0.00            0.00   
                              (.)             (.)             (.)   
 3.Political views ~)         0.01           -0.07**         -0.05*  
                           (0.11)          (0.11)          (0.12)   
 --------------------------------------------------------------------
 Observations                 2079            2028            2070   
 R-squared                   0.077           0.020           0.040   
 rmse                        2.125           2.064           2.280   
 --------------------------------------------------------------------
 Standardized beta coefficients; Standard errors in parentheses
 * p<0.05, ** p<0.01, *** p<0.001

 . 
 . /* Basic usage of -estout- commands:
 >   
 >  - The -estout- commands work by storing model estimates with -eststo- and then
 >    putting them into tables with -esttab-. Use these commands at the end of your
 >    models: start with -reg- and -leanout-, then use -eststo- and -esttab-.
 >    
 >  - The -estout- command is especially practical when you run many models, as
 >    shown here when we compare the model between country cases and then check
 >    how the DV model compares to other satisfaction measures (covariates). */
 . 
 . 
 . * ==========================
 . * = REGRESSION DIAGNOSTICS =
 . * ==========================
 . 
 . 
 . * Note: what we call 'diagnostics' at this stage actually covers a broader range
 . * of postestimation commands like -margins- and -marginsplot- (marginal effects)
 . * or seemingly unrelated regression (SUREG). The overall logic of these commands
 . * is to help with the detection of patterns that are not taken into account by
 . * our 'front-end' linear regression model.
 . 
 . 
 . * (1) France: Residuals
 . * ---------------------
 . 
 . reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "FR"

      Source |       SS       df       MS              Number of obs =    1942
 -------------+------------------------------           F( 13,  1928) =   10.15
       Model |  546.017183    13  42.0013217           Prob > F      =  0.0000
    Residual |  7981.98076  1928  4.14003151           R-squared     =  0.0640
 -------------+------------------------------           Adj R-squared =  0.0577
       Total |  8527.99794  1941  4.39361048           Root MSE      =  2.0347

 ------------------------------------------------------------------------------
        hsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .5757433   .1831204     3.14   0.002     .2166085    .9348781
         25  |   .4181367   .1620445     2.58   0.010     .1003357    .7359377
         35  |   .2737408   .1555479     1.76   0.079     -.031319    .5788006
         55  |   .0178386   .1577826     0.11   0.910    -.2916038     .327281
         65  |   .1721822   .1509785     1.14   0.254    -.1239161    .4682806
             |
      female |  -.3929954   .0930814    -4.22   0.000    -.5755461   -.2104446
             |
      health |
          2  |   -.367027   .1261025    -2.91   0.004    -.6143386   -.1197154
          3  |  -.4762367   .1408647    -3.38   0.001    -.7524999   -.1999735
          4  |  -.5536348   .2210454    -2.50   0.012    -.9871479   -.1201216
          5  |  -1.020825   .4543408    -2.25   0.025    -1.911876   -.1297743
             |
      lowinc |  -.2263043   .1330334    -1.70   0.089    -.4872088    .0346003
             |
        pol3 |
          1  |  -.5218848   .1154915    -4.52   0.000    -.7483861   -.2953834
          3  |   .3431802   .1195869     2.87   0.004     .1086469    .5777134
             |
       _cons |   6.458039   .1756112    36.77   0.000     6.113631    6.802447
 ------------------------------------------------------------------------------

 . 
 . * Variance inflation.
 . vif

    Variable |       VIF       1/VIF  
 -------------+----------------------
        age6 |
         15  |      1.47    0.682149
         25  |      1.60    0.623269
         35  |      1.66    0.601749
         55  |      1.64    0.608562
         65  |      1.81    0.551771
      female |      1.01    0.989806
      health |
          2  |      1.84    0.544567
          3  |      1.90    0.525788
          4  |      1.34    0.746783
          5  |      1.08    0.922074
      lowinc |      1.07    0.935009
        pol3 |
          1  |      1.48    0.676967
          3  |      1.50    0.666409
 -------------+----------------------
    Mean VIF |      1.49

 . 
 . * Residuals-versus-fitted values plot.
 . rvfplot, yline(0) ///
 >         name(rvf_fr, replace)

 . 
 . * Store the standardized residuals for the estimation sample (France only).
 . cap drop rst_fr

 . predict rst_fr if e(sample), rsta
 (2079 missing values generated)

 . 
 . * Distribution of the standardized residuals.
 . hist rst_fr, normal ///
 >         name(rst_fr_1, replace)
 (bin=32, start=-3.3201849, width=.17776279)

 . 
 . * Store the predicted values for the estimation sample (France only).
 . cap drop yhat_fr

 . predict yhat_fr if e(sample)
 (option xb assumed; fitted values)
 (2079 missing values generated)

 . 
 . * Plot the distribution of the standardized residuals over socio-demographics.
 . hist rst_fr, normal by(female age6, legend(off)) bin(10) xline(0) ///
 >         name(rst_fr_2, replace)

 . 
 . * Plot the residuals-versus-fitted values by income and political views.
 . sc rst_fr yhat_fr, by(pol3 lowinc, col(2) legend(off)) yline(0) ///
 >         name(rst_fr_3, replace)

 . 
 . 
 . * (2) France: Marginal effects
 . * ----------------------------
 . 
 . * Briefly recall the model by calling -reg- without any new specification.
 . reg

      Source |       SS       df       MS              Number of obs =    1942
 -------------+------------------------------           F( 13,  1928) =   10.15
       Model |  546.017183    13  42.0013217           Prob > F      =  0.0000
    Residual |  7981.98076  1928  4.14003151           R-squared     =  0.0640
 -------------+------------------------------           Adj R-squared =  0.0577
       Total |  8527.99794  1941  4.39361048           Root MSE      =  2.0347

 ------------------------------------------------------------------------------
        hsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .5757433   .1831204     3.14   0.002     .2166085    .9348781
         25  |   .4181367   .1620445     2.58   0.010     .1003357    .7359377
         35  |   .2737408   .1555479     1.76   0.079     -.031319    .5788006
         55  |   .0178386   .1577826     0.11   0.910    -.2916038     .327281
         65  |   .1721822   .1509785     1.14   0.254    -.1239161    .4682806
             |
      female |  -.3929954   .0930814    -4.22   0.000    -.5755461   -.2104446
             |
      health |
          2  |   -.367027   .1261025    -2.91   0.004    -.6143386   -.1197154
          3  |  -.4762367   .1408647    -3.38   0.001    -.7524999   -.1999735
          4  |  -.5536348   .2210454    -2.50   0.012    -.9871479   -.1201216
          5  |  -1.020825   .4543408    -2.25   0.025    -1.911876   -.1297743
             |
      lowinc |  -.2263043   .1330334    -1.70   0.089    -.4872088    .0346003
             |
        pol3 |
          1  |  -.5218848   .1154915    -4.52   0.000    -.7483861   -.2953834
          3  |   .3431802   .1195869     2.87   0.004     .1086469    .5777134
             |
       _cons |   6.458039   .1756112    36.77   0.000     6.113631    6.802447
 ------------------------------------------------------------------------------

 . 
 . * What is observable above is the (positive) linear effect of one predictor on
 . * the DV: all other things kept equal, rightwing views are associated with a
 . * higher level of satisfaction with health services, independently of age,
 . * gender, income and so on. You can show the same thing by computing the
 . * predictive margins of the IV (adjusted means of the DV) with -margins-.
 . margins pol3

 Predictive margins                                Number of obs   =       1942
 Model VCE    : OLS

 Expression   : Linear prediction, predict()

 ------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        pol3 |
          1  |   5.560562   .0752119    73.93   0.000      5.41315    5.707975
          2  |   6.082447   .0876891    69.36   0.000      5.91058    6.254315
          3  |   6.425627   .0804466    79.87   0.000     6.267955      6.5833
 ------------------------------------------------------------------------------

 . marginsplot, ///
 >         name(margins_pol3_fr, replace)

  Variables that uniquely identify margins: pol3

 . 
 . * Let's plot predictions for each combination of political views and health
 . * status. The model has no interaction term, so the gap between political
 . * views is the same at every health level; only the confidence intervals
 . * widen in poor health, where the estimates are less precise, and overlap.
 . margins health#pol3

 Predictive margins                                Number of obs   =       1942
 Model VCE    : OLS

 Expression   : Linear prediction, predict()

 ------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
 health#pol3 |
        1 1  |   5.903804   .1219382    48.42   0.000      5.66481    6.142799
        1 2  |   6.425689   .1317597    48.77   0.000     6.167445    6.683933
        1 3  |   6.768869   .1224024    55.30   0.000     6.528965    7.008773
        2 1  |   5.536777   .0923518    59.95   0.000     5.355771    5.717783
        2 2  |   6.058662   .1015952    59.64   0.000     5.859539    6.257785
        2 3  |   6.401842   .0963706    66.43   0.000     6.212959    6.590725
        3 1  |   5.427567   .1057111    51.34   0.000     5.220377    5.634757
        3 2  |   5.949452   .1147927    51.83   0.000     5.724463    6.174442
        3 3  |   6.292632   .1109473    56.72   0.000      6.07518    6.510085
        4 1  |   5.350169   .1974214    27.10   0.000     4.963231    5.737108
        4 2  |   5.872054    .203375    28.87   0.000     5.473446    6.270662
        4 3  |   6.215234   .2017018    30.81   0.000     5.819906    6.610562
        5 1  |   4.882979   .4425538    11.03   0.000     4.015589    5.750368
        5 2  |   5.404864   .4447404    12.15   0.000     4.533189    6.276539
        5 3  |   5.748044   .4453698    12.91   0.000     4.875135    6.620953
 ------------------------------------------------------------------------------

 . marginsplot, recast(line) recastci(rarea) ciopts(fi(25)) legend(row(1)) ///
 >         name(margins_health_pol3_fr, replace)

  Variables that uniquely identify margins: health pol3

 . 
 . 
 . * (3) Britain: Exercise
 . * ---------------------
 . 
 . reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "GB"

      Source |       SS       df       MS              Number of obs =    2079
 -------------+------------------------------           F( 13,  2065) =   13.25
       Model |  777.733419    13  59.8256476           Prob > F      =  0.0000
    Residual |  9321.59222  2065  4.51408824           R-squared     =  0.0770
 -------------+------------------------------           Adj R-squared =  0.0712
       Total |  10099.3256  2078  4.86011821           Root MSE      =  2.1246

 ------------------------------------------------------------------------------
        hsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .1577784   .1958508     0.81   0.421    -.2263073     .541864
         25  |  -.0492595   .1659278    -0.30   0.767    -.3746627    .2761437
         35  |  -.0429892   .1537828    -0.28   0.780    -.3445747    .2585963
         55  |   .3753688      .1624     2.31   0.021      .056884    .6938537
         65  |   1.240436   .1496894     8.29   0.000      .946878    1.533994
             |
      female |  -.5118235   .0939119    -5.45   0.000    -.6959954   -.3276516
             |
      health |
          2  |   -.293325   .1115856    -2.63   0.009    -.5121571   -.0744928
          3  |  -.3068507   .1356245    -2.26   0.024    -.5728256   -.0408758
          4  |   -.450338   .2229726    -2.02   0.044    -.8876125   -.0130635
          5  |  -.0378434   .4049108    -0.09   0.926    -.8319194    .7562325
             |
      lowinc |  -.3094155    .129387    -2.39   0.017     -.563158    -.055673
             |
        pol3 |
          1  |   .1433662   .1130665     1.27   0.205      -.07837    .3651024
          3  |   .0290743   .1146745     0.25   0.800    -.1958155    .2539641
             |
       _cons |   6.101753   .1548257    39.41   0.000     5.798122    6.405383
 ------------------------------------------------------------------------------

 . 
 . * As an exercise, run your own selection of regression diagnostics and marginal
 . * effects for the British model. Compare the predictors in each country and see,
 . * for instance, if age and political views have the same effects in Britain.
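 . 
 . /* A minimal sketch for this exercise, not run in this log (so no output is
 >    shown). It repeats the French diagnostics after the British model above;
 >    the names ending in _gb are hypothetical:
 > 
 >      vif
 >      rvfplot, yline(0) name(rvf_gb, replace)
 >      cap drop rst_gb
 >      predict rst_gb if e(sample), rsta
 >      hist rst_gb, normal name(rst_gb_1, replace)
 >      margins pol3
 >      marginsplot, name(margins_pol3_gb, replace)
 >      margins age6
 >      marginsplot, name(margins_age6_gb, replace)
 > 
 >    Run these right after the British -reg- command, since -vif-, -predict-
 >    and -margins- all use the most recent estimation results. */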
 . 
 . 
 . * ==============
 . * = EXTENSIONS =
 . * ==============
 . 
 . 
 . * Note: this section showcases some methods that are related to the content of
 . * the course, but go beyond its scope. Both techniques yield corrected standard
 . * errors, which is crucial for panel data analysis. These methods require more
 . * theoretical support (and possibly different data) to operate, and are shown
 . * here for demonstration purposes only.
 . 
 . 
 . * (1) Bootstrapping
 . * -----------------
 . 
 . reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "FR", ///
 >         vce(bootstrap, r(100))
 (running regress on estimation sample)

 Bootstrap replications (100)
 ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
 ..................................................    50
 ..................................................   100

 Linear regression                               Number of obs      =      1942
                                                Replications       =       100
                                                Wald chi2(13)      =    190.75
                                                Prob > chi2        =    0.0000
                                                R-squared          =    0.0640
                                                Adj R-squared      =    0.0577
                                                Root MSE           =    2.0347

 ------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
        hsat |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .5757433   .1552132     3.71   0.000      .271531    .8799556
         25  |   .4181367   .1511703     2.77   0.006     .1218484     .714425
         35  |   .2737408   .1576567     1.74   0.083    -.0352607    .5827422
         55  |   .0178386   .1424735     0.13   0.900    -.2614044    .2970815
         65  |   .1721822   .1371706     1.26   0.209    -.0966671    .4410316
             |
      female |  -.3929954   .0873834    -4.50   0.000    -.5642636   -.2217271
             |
      health |
          2  |   -.367027   .1157262    -3.17   0.002    -.5938462   -.1402078
          3  |  -.4762367   .1525456    -3.12   0.002    -.7752206   -.1772528
          4  |  -.5536348   .2265738    -2.44   0.015    -.9977112   -.1095583
          5  |  -1.020825   .5374374    -1.90   0.058    -2.074183    .0325327
             |
      lowinc |  -.2263043   .1409385    -1.61   0.108    -.5025387    .0499301
             |
        pol3 |
          1  |  -.5218848   .1215707    -4.29   0.000    -.7601589   -.2836106
          3  |   .3431802   .1201522     2.86   0.004     .1076861    .5786742
             |
       _cons |   6.458039   .1466394    44.04   0.000     6.170631    6.745447
 ------------------------------------------------------------------------------

 . reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "GB", ///
 >         vce(bootstrap, r(100))
 (running regress on estimation sample)

 Bootstrap replications (100)
 ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
 ..................................................    50
 ..................................................   100

 Linear regression                               Number of obs      =      2079
                                                Replications       =       100
                                                Wald chi2(13)      =    184.49
                                                Prob > chi2        =    0.0000
                                                R-squared          =    0.0770
                                                Adj R-squared      =    0.0712
                                                Root MSE           =    2.1246

 ------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
        hsat |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .1577784   .1683583     0.94   0.349    -.1721979    .4877546
         25  |  -.0492595   .1620299    -0.30   0.761    -.3668323    .2683134
         35  |  -.0429892   .1435742    -0.30   0.765    -.3243894     .238411
         55  |   .3753688   .1788985     2.10   0.036     .0247341    .7260035
         65  |   1.240436     .14953     8.30   0.000     .9473627    1.533509
             |
      female |  -.5118235    .086252    -5.93   0.000    -.6808744   -.3427726
             |
      health |
          2  |   -.293325   .1110796    -2.64   0.008    -.5110371   -.0756128
          3  |  -.3068507   .1450168    -2.12   0.034    -.5910784    -.022623
          4  |   -.450338    .227803    -1.98   0.048    -.8968236   -.0038524
          5  |  -.0378434   .4747934    -0.08   0.936    -.9684214    .8927345
             |
      lowinc |  -.3094155   .1357964    -2.28   0.023    -.5755715   -.0432595
             |
        pol3 |
          1  |   .1433662   .1139235     1.26   0.208    -.0799197    .3666521
          3  |   .0290743   .1141306     0.25   0.799    -.1946176    .2527663
             |
       _cons |   6.101753   .1599751    38.14   0.000     5.788207    6.415298
 ------------------------------------------------------------------------------

 . 
 . /* What happened here:
 > 
 >  - Bootstrapping is a simulation technique that resamples the data as many times
 >    as you ask it (here we ran 100 replications), re-estimates the model on each
 >    resample, and then computes the standard error of each coefficient from the
 >    standard deviation of its estimates across these simulations.
 > 
 >  - Resampling means that the data used in each simulation is randomly selected
 >    from the original dataset, with replacement: one value may appear many times.
 >    The result is 100 simulations of the data with slightly different values.
 >  
 >  - Bootstrapping is particularly useful when the assumptions behind the usual
 >    analytical standard error formula are in doubt, since it estimates the
 >    standard errors from the resampled data instead. It applies to estimation
 >    commands like -reg-, through the vce(bootstrap) option used above. */
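 . 
 . /* A minimal sketch, not run in this log: bootstrap results change from one
 >    run to the next unless the random-number seed is fixed. The seed value
 >    and the number of replications below are arbitrary choices:
 > 
 >    reg hsat ib45.age6 female i.health lowinc ib2.pol3 if cntry == "FR", ///
 >            vce(bootstrap, reps(1000) seed(20130817))
 > 
 >    More replications give more stable standard errors, at the cost of a
 >    longer run time. */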
 . 
 . 
 . * (2) Clustered standard errors
 . * -----------------------------
 . 
 . * Remember that we saved the initial models as 'est1' (FR) and 'est2' (GB).
 . eststo dir

 -------------------------------------------------------
        name | command      depvar       npar  title 
 -------------+-----------------------------------------
        est1 | regress      hsat           17  FR
        est2 | regress      hsat           17  GB
        est3 | regress      esat           17  FR
        est4 | regress      esat           17  GB
        est5 | regress      gsat           17  FR
        est6 | regress      gsat           17  GB
 -------------------------------------------------------

 . 
 . * The next command stores the right-hand side of the regression equation, i.e.
 . * the list of predictors (IVs), into a convenient string of text handled by
 . * Stata as a local macro. This works almost like the global macro trick we saw
 . * before, and becomes useful when you have long lists of predictors.
 . local rhs "ib45.age6 female i.health lowinc ib2.pol3"

 . 
 . * IMPORTANT: storing the variable names into a local macro is technically more
 . * appropriate than using a global one as we did in an earlier do-file. However,
 . * this comes with additional constraints: local macros are handled with `ticks'
 . * instead of the $dollar sign, and they have to be run in the same sequence as
 . * the regression commands to work properly, WITHOUT stopping execution. This
 . * means that your local macros will work only if you run the whole code block
 . * (the -local- line above AND the -reg- commands), or the whole do-file.
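 . 
 . /* A minimal sketch of the difference, not run in this log (the global macro
 >    name 'rhs_g' is hypothetical):
 > 
 >      global rhs_g "ib45.age6 female i.health lowinc ib2.pol3"
 >      reg hsat $rhs_g if cntry == "FR"   // global: $ prefix, stays defined
 > 
 >      local rhs "ib45.age6 female i.health lowinc ib2.pol3"
 >      reg hsat `rhs' if cntry == "FR"    // local: `ticks', same run only
 > 
 >    If the -local- line and the -reg- line are run separately from the
 >    do-file editor, `rhs' is empty when -reg- runs, and Stata silently fits
 >    a constant-only model instead of the intended one. */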
 . 
 . * Store models with standard errors clustered on regionfr and regiongb.
 . eststo FRr: reg hsat `rhs' if cntry == "FR", vce(cluster regionfr)

 Linear regression                                      Number of obs =    1942
                                                       F(  7,     8) =       .
                                                       Prob > F      =       .
                                                       R-squared     =  0.0640
                                                       Root MSE      =  2.0347

                               (Std. Err. adjusted for 9 clusters in regionfr)
 ------------------------------------------------------------------------------
             |               Robust
        hsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .5757433   .1394916     4.13   0.003     .2540751    .8974116
         25  |   .4181367   .2011238     2.08   0.071    -.0456555    .8819289
         35  |   .2737408   .1915253     1.43   0.191    -.1679173    .7153988
         55  |   .0178386   .2260695     0.08   0.939    -.5034787    .5391558
         65  |   .1721822   .1770583     0.97   0.359    -.2361149    .5804794
             |
      female |  -.3929954   .0839611    -4.68   0.002    -.5866099   -.1993808
             |
      health |
          2  |   -.367027   .1075007    -3.41   0.009    -.6149241   -.1191299
          3  |  -.4762367   .1366204    -3.49   0.008     -.791284   -.1611894
          4  |  -.5536348   .1513388    -3.66   0.006    -.9026228   -.2046468
          5  |  -1.020825   .4291435    -2.38   0.045    -2.010432   -.0312185
             |
      lowinc |  -.2263043    .136934    -1.65   0.137    -.5420745     .089466
             |
        pol3 |
          1  |  -.5218848   .1065375    -4.90   0.001    -.7675607   -.2762088
          3  |   .3431802   .1310755     2.62   0.031     .0409196    .6454407
             |
       _cons |   6.458039   .1159184    55.71   0.000     6.190731    6.725348
 ------------------------------------------------------------------------------

 . eststo GBr: reg hsat `rhs' if cntry == "GB", vce(cluster regiongb)

 Linear regression                                      Number of obs =    2079
                                                       F( 10,    11) =       .
                                                       Prob > F      =       .
                                                       R-squared     =  0.0770
                                                       Root MSE      =  2.1246

                              (Std. Err. adjusted for 12 clusters in regiongb)
 ------------------------------------------------------------------------------
             |               Robust
        hsat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        age6 |
         15  |   .1577784   .1483508     1.06   0.310    -.1687396    .4842963
         25  |  -.0492595   .2052812    -0.24   0.815    -.5010803    .4025613
         35  |  -.0429892   .1393327    -0.31   0.763    -.3496585    .2636801
         55  |   .3753688   .1441162     2.60   0.024     .0581713    .6925664
         65  |   1.240436   .1346074     9.22   0.000     .9441671    1.536705
             |
      female |  -.5118235   .0689489    -7.42   0.000     -.663579   -.3600681
             |
      health |
          2  |   -.293325    .138153    -2.12   0.057    -.5973977    .0107478
          3  |  -.3068507   .1851052    -1.66   0.126    -.7142644     .100563
          4  |   -.450338   .1975345    -2.28   0.044    -.8851085   -.0155674
          5  |  -.0378434   .5442734    -0.07   0.946    -1.235781    1.160094
             |
      lowinc |  -.3094155   .0971279    -3.19   0.009    -.5231926   -.0956384
             |
        pol3 |
          1  |   .1433662   .1649275     0.87   0.403    -.2196369    .5063693
          3  |   .0290743   .1177467     0.25   0.810    -.2300844    .2882331
             |
       _cons |   6.101753    .124926    48.84   0.000     5.826793    6.376713
 ------------------------------------------------------------------------------

 . 
 . * Compare both versions for a more realistic assessment of the standard errors.
 . esttab est1 FRr est2 GBr, nogaps b(2) se(2) sca(rmse) compress ///
 >         mti("FR" "FR robust" "GB" "GB robust")

 --------------------------------------------------------------
                 (1)          (2)          (3)          (4)   
                  FR    FR robust           GB    GB robust   
 --------------------------------------------------------------
 15.age6         0.58**       0.58**       0.16         0.16   
              (0.18)       (0.14)       (0.20)       (0.15)   
 25.age6         0.42**       0.42        -0.05        -0.05   
              (0.16)       (0.20)       (0.17)       (0.21)   
 35.age6         0.27         0.27        -0.04        -0.04   
              (0.16)       (0.19)       (0.15)       (0.14)   
 45b.age6        0.00         0.00         0.00         0.00   
                 (.)          (.)          (.)          (.)   
 55.age6         0.02         0.02         0.38*        0.38*  
              (0.16)       (0.23)       (0.16)       (0.14)   
 65.age6         0.17         0.17         1.24***      1.24***
              (0.15)       (0.18)       (0.15)       (0.13)   
 female         -0.39***     -0.39**      -0.51***     -0.51***
              (0.09)       (0.08)       (0.09)       (0.07)   
 1b.health       0.00         0.00         0.00         0.00   
                 (.)          (.)          (.)          (.)   
 2.health       -0.37**      -0.37**      -0.29**      -0.29   
              (0.13)       (0.11)       (0.11)       (0.14)   
 3.health       -0.48***     -0.48**      -0.31*       -0.31   
              (0.14)       (0.14)       (0.14)       (0.19)   
 4.health       -0.55*       -0.55**      -0.45*       -0.45*  
              (0.22)       (0.15)       (0.22)       (0.20)   
 5.health       -1.02*       -1.02*       -0.04        -0.04   
              (0.45)       (0.43)       (0.40)       (0.54)   
 lowinc         -0.23        -0.23        -0.31*       -0.31** 
              (0.13)       (0.14)       (0.13)       (0.10)   
 1.pol3         -0.52***     -0.52**       0.14         0.14   
              (0.12)       (0.11)       (0.11)       (0.16)   
 2b.pol3         0.00         0.00         0.00         0.00   
                 (.)          (.)          (.)          (.)   
 3.pol3          0.34**       0.34*        0.03         0.03   
              (0.12)       (0.13)       (0.11)       (0.12)   
 _cons           6.46***      6.46***      6.10***      6.10***
              (0.18)       (0.12)       (0.15)       (0.12)   
 --------------------------------------------------------------
 N               1942         1942         2079         2079   
 rmse            2.03         2.03         2.12         2.12   
 --------------------------------------------------------------
 Standard errors in parentheses
 * p<0.05, ** p<0.01, *** p<0.001

 . 
 . /* What happened here:
 > 
 >  - We clustered the data by geographical region in each regression, which means
 >    that the standard errors of the coefficients are adjusted for the correlation
 >    of observations within the same region. They can move in either direction,
 >    and a marked change indicates some macro-level effect.
 >  
 >  - In this example, we assume that poorer and/or less populated regions will not
 >    benefit from the same health care facilities as others, which will create
 >    differences between predicted means of the DV clustered by region.
 > 
 >  - The results show that the clustered models lose some significant coefficients
 >    in comparison to the original ones, which should invite us to correct some of
 >    our initial interpretations, or consider more advanced modelling.
 >    
 >  - Robust (corrected) standard errors become crucial when the data form a panel,
 >    as with cross-sectional time-series (CSTS) data, because the observations are
 >    then country-years and variance will exist between and within them. */
 . 
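 . * A minimal sketch, not run in this session: to separate the effect of
 . * clustering from that of heteroskedasticity-robust errors alone, one could
 . * refit the GB model with vce(robust). The estimate name GBh is hypothetical;
 . * `rhs', est2 and GBr are assumed to be defined as above.
 . eststo GBh: reg hsat `rhs' if cntry == "GB", vce(robust)
   (output omitted)

 . esttab est2 GBh GBr, nogaps b(2) se(2) sca(rmse) compress ///
 >         mti("GB" "GB robust" "GB clustered")
   (output omitted)

 . 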
 . 
 . * =======
 . * = END =
 . * =======
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done, and we have covered tons of stuff. Thanks for following!
 . * exit, clear
 . 
 end of do-file

 . 
 . * Check setup.
 . run setup/require estout fre scheme-burd spineplot

 . 
 . * Log results.
 . cap log using code/week12.log, replace

 . 
 . /* ------------------------------------------ SRQM Session 12 ------------------
 > 
 >    F. Briatte and I. Petev
 > 
 >  - TOPIC:  Sexual Partners in the United States
 > 
 >  - DATA:   U.S. General Social Survey (2010)
 >  
 >    What makes Americans likely to report high numbers of sexual partners in the
 >    last five years? What makes them more likely to report low numbers?
 >    
 >    For this session, all hypotheses are to be provided by the students.
 >    
 >    Last updated 2013-05-31.
 > 
 > ----------------------------------------------------------------------------- */
 . 
 . * Load GSS dataset for selected survey year.
 . use data/gss0012 if year == 2010, clear
 (U.S. General Social Survey 2000-2012)

 . 
 . * Inspect DV.
 . fre partnrs5

 partnrs5 -- how many sex partners r had in last 5 years
 -------------------------------------------------------------------------------
                                  |      Freq.    Percent      Valid       Cum.
 ----------------------------------+--------------------------------------------
 Valid   0  no partners            |        261      12.77      14.40      14.40
        1  1 partner              |        963      47.11      53.12      67.51
        2  2 partners             |        175       8.56       9.65      77.16
        3  3 partners             |        114       5.58       6.29      83.45
        4  4 partners             |         77       3.77       4.25      87.70
        5  5-10 partners          |        123       6.02       6.78      94.48
        6  11-20 partners         |         40       1.96       2.21      96.69
        7  21-100 partners        |         11       0.54       0.61      97.30
        8  more than 100 partners |          3       0.15       0.17      97.46
        9  1 or more, dk #        |         46       2.25       2.54     100.00
        Total                     |       1813      88.70     100.00           
 Missing .d                        |          2       0.10                      
        .i                        |        202       9.88                      
        .n                        |         27       1.32                      
        Total                     |        231      11.30                      
 Total                             |       2044     100.00                      
 -------------------------------------------------------------------------------

 . 
 . * Keep only valid observations, excluding respondents who reported one or more
 . * partners without knowing the number (code 9).
 . clonevar sxp = partnrs5 if partnrs5 < 9
 (277 missing values generated)

 . 
 . * Code missing values for deeper inspection.
 . gen missing = mi(sxp)
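
 . 
 . * A minimal sketch, not run in this session: the 277 missing values are the 231
 . * originally missing answers plus the 46 respondents coded 9, which a
 . * cross-tabulation with the original variable would confirm.
 . tab partnrs5 missing, mi
   (output omitted)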

 . 
 . * Generate six age groups (15-24, 25-34, ..., 65+).
 . gen age6:age6 = irecode(age, 24, 34, 44, 54, 64, .)
 (3 missing values generated)

 . 
 . * Code the value as the lower bound of the age groups (the data buckets).
 . replace age6 = 10 * age6 + 15
 (2041 real changes made)

 . 
 . * Assign value labels.
 . la def age6 15 "15-24" 25 "25-34" 35 "35-44" ///
 >         45 "45-54" 55 "55-64" 65 "65+", replace

 . la var age6 "Age groups"
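
 . 
 . * A minimal sketch, not run in this session: the recode can be checked by
 . * listing the age range covered by each group. Each group should span ten
 . * years, except the first (the youngest respondent is 18) and the last one.
 . tabstat age, by(age6) stat(min max n)
   (output omitted)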

 . 
 . * Inspect missing values by age and sex.
 . gr bar (count) age, over(missing) asyvars stack over(age6) over(sex) ///
 >         name(missing_agesex, replace)

 . 
 . * Chi-squared test for age groups.
 . bys sex: tab age6 missing, col nof chi2

 ------------------------------------------------------------------------------------
 -> sex = male

           |        missing
 Age groups |         0          1 |     Total
 -----------+----------------------+----------
     15-24 |      9.27       9.73 |      9.33 
     25-34 |     18.40       7.08 |     16.97 
     35-44 |     17.76      17.70 |     17.75 
     45-54 |     19.69      24.78 |     20.34 
     55-64 |     18.28      16.81 |     18.09 
       65+ |     16.60      23.89 |     17.53 
 -----------+----------------------+----------
     Total |    100.00     100.00 |    100.00 

          Pearson chi2(5) =  11.8447   Pr = 0.037

 ------------------------------------------------------------------------------------
 -> sex = female

           |        missing
 Age groups |         0          1 |     Total
 -----------+----------------------+----------
     15-24 |      8.60       7.98 |      8.51 
     25-34 |     20.14      16.56 |     19.64 
     35-44 |     19.03      14.11 |     18.33 
     45-54 |     17.21      14.72 |     16.85 
     55-64 |     16.60      12.88 |     16.07 
       65+ |     18.42      33.74 |     20.59 
 -----------+----------------------+----------
     Total |    100.00     100.00 |    100.00 

          Pearson chi2(5) =  20.4871   Pr = 0.001


 . 
 . * Proportions test for sex groups.
 . prtest missing, by(sex)

 Two-sample test of proportions                  male: Number of obs =      891
                                              female: Number of obs =     1153
 ------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        male |   .1268238   .0111484                      .1049733    .1486743
      female |   .1422376   .0102867                      .1220761    .1623992
 -------------+----------------------------------------------------------------
        diff |  -.0154138   .0151691                     -.0451448    .0143171
             |  under Ho:   .0152674    -1.01   0.313
 ------------------------------------------------------------------------------
        diff = prop(male) - prop(female)                          z =  -1.0096
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.1563         Pr(|Z| < |z|) = 0.3127          Pr(Z > z) = 0.8437

 . 
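 . * A minimal sketch, not run in this session: the z statistic above is the
 . * difference in proportions divided by its standard error under Ho, and the
 . * immediate form of the test takes the group sizes and proportions directly.
 . display %7.4f (.1268238 - .1422376) / .0152674
   (output omitted)

 . prtesti 891 .1268238 1153 .1422376
   (output omitted)

 . 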
 . * Comparison of average age between missing and nonmissing groups, by sex.
 . bys sex: ttest age, by(missing)

 ------------------------------------------------------------------------------------
 -> sex = male

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
       0 |     777    47.25097    .6029825    16.80797     46.0673    48.43464
       1 |     113    51.41593    1.717378    18.25598    48.01316    54.81869
 ---------+--------------------------------------------------------------------
 combined |     890    47.77978    .5713296     17.0444    46.65846    48.90109
 ---------+--------------------------------------------------------------------
    diff |           -4.164964    1.711306               -7.523641   -.8062872
 ------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -2.4338
 Ho: diff = 0                                     degrees of freedom =      888

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0076         Pr(|T| > |t|) = 0.0151          Pr(T > t) = 0.9924

 ------------------------------------------------------------------------------------
 -> sex = female

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
       0 |     988    47.26113    .5589403    17.56887    46.16429    48.35798
       1 |     163    53.26994    1.622315    20.71233    50.06633    56.47355
 ---------+--------------------------------------------------------------------
 combined |    1151    48.11208    .5352406    18.15878    47.06192    49.16223
 ---------+--------------------------------------------------------------------
    diff |           -6.008805    1.525558               -9.001996   -3.015614
 ------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -3.9388
 Ho: diff = 0                                     degrees of freedom =     1149

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 1.0000

 . 
 . * Inspect DV by age.
 . spineplot sxp age6, scheme(burd8) name(sp, replace)

 . 
 . * Inspect DV by age, sex and interviewer's sex.
 . gr bar sxp, over(sex) asyvars over(age6) by(intsex) ///
 >         name(dv_agesexint, replace)

 . 
 . * Inspect IVs.
 . fre sex age coninc educ marital wrkstat size, r(10)

 sex -- respondents sex
 --------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
 -----------------+--------------------------------------------
 Valid   1 male   |        891      43.59      43.59      43.59
        2 female |       1153      56.41      56.41     100.00
        Total    |       2044     100.00     100.00           
 --------------------------------------------------------------

 age -- age of respondent
 --------------------------------------------------------------------
                       |      Freq.    Percent      Valid       Cum.
 -----------------------+--------------------------------------------
 Valid   18             |         10       0.49       0.49       0.49
        19             |         24       1.17       1.18       1.67
        20             |         24       1.17       1.18       2.84
        21             |         35       1.71       1.71       4.56
        22             |         19       0.93       0.93       5.49
        :              |          :          :          :          :
        85             |          6       0.29       0.29      98.04
        86             |          7       0.34       0.34      98.38
        87             |          4       0.20       0.20      98.58
        88             |          9       0.44       0.44      99.02
        89 89 or older |         20       0.98       0.98     100.00
        Total          |       2041      99.85     100.00           
 Missing .n             |          3       0.15                      
 Total                  |       2044     100.00                      
 --------------------------------------------------------------------

 coninc -- family income in constant dollars
 ----------------------------------------------------------------
                   |      Freq.    Percent      Valid       Cum.
 -------------------+--------------------------------------------
 Valid   401.5      |         43       2.10       2.38       2.38
        1606       |         24       1.17       1.33       3.71
        2810.5     |         17       0.83       0.94       4.65
        3613.5     |          8       0.39       0.44       5.10
        4416.5     |         19       0.93       1.05       6.15
        :          |          :          :          :          :
        66247.5    |        129       6.31       7.15      80.50
        80300      |        111       5.43       6.15      86.65
        96360      |         69       3.38       3.82      90.47
        112420     |         57       2.79       3.16      93.63
        152927.23  |        115       5.63       6.37     100.00
        Total      |       1805      88.31     100.00           
 Missing .i         |        239      11.69                      
 Total              |       2044     100.00                      
 ----------------------------------------------------------------

 educ -- highest year of school completed
 -----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
 --------------+--------------------------------------------
 Valid   0     |          5       0.24       0.25       0.25
        1     |          1       0.05       0.05       0.29
        2     |          5       0.24       0.25       0.54
        3     |          4       0.20       0.20       0.74
        4     |          9       0.44       0.44       1.18
        :     |          :          :          :          :
        16    |        334      16.34      16.38      86.41
        17    |         71       3.47       3.48      89.90
        18    |        101       4.94       4.95      94.85
        19    |         33       1.61       1.62      96.47
        20    |         72       3.52       3.53     100.00
        Total |       2039      99.76     100.00           
 Missing .d    |          1       0.05                      
        .n    |          4       0.20                      
        Total |          5       0.24                      
 Total         |       2044     100.00                      
 -----------------------------------------------------------

 marital -- marital status
 ----------------------------------------------------------------------
                         |      Freq.    Percent      Valid       Cum.
 -------------------------+--------------------------------------------
 Valid   1  married       |        891      43.59      43.61      43.61
        2  widowed       |        181       8.86       8.86      52.47
        3  divorced      |        341      16.68      16.69      69.16
        4  separated     |         65       3.18       3.18      72.34
        5  never married |        565      27.64      27.66     100.00
        Total            |       2043      99.95     100.00           
 Missing .n               |          1       0.05                      
 Total                    |       2044     100.00                      
 ----------------------------------------------------------------------

 wrkstat -- labor force status
 -------------------------------------------------------------------------
                            |      Freq.    Percent      Valid       Cum.
 ----------------------------+--------------------------------------------
 Valid   1  working fulltime |        917      44.86      44.93      44.93
        2  working parttime |        234      11.45      11.46      56.39
        3  temp not working |         33       1.61       1.62      58.01
        4  unempl, laid off |        145       7.09       7.10      65.12
        5  retired          |        319      15.61      15.63      80.74
        6  school           |         93       4.55       4.56      85.30
        7  keeping house    |        235      11.50      11.51      96.82
        8  other            |         65       3.18       3.18     100.00
        Total               |       2041      99.85     100.00           
 Missing .n                  |          3       0.15                      
 Total                       |       2044     100.00                      
 -------------------------------------------------------------------------

 size -- size of place in 1000s
 -----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
 --------------+--------------------------------------------
 Valid   0     |         52       2.54       2.54       2.54
        1     |         62       3.03       3.03       5.58
        2     |         71       3.47       3.47       9.05
        3     |         80       3.91       3.91      12.96
        4     |        137       6.70       6.70      19.67
        :     |          :          :          :          :
        1518  |         16       0.78       0.78      95.40
        1954  |          6       0.29       0.29      95.69
        2896  |         21       1.03       1.03      96.72
        3695  |         19       0.93       0.93      97.65
        8008  |         48       2.35       2.35     100.00
        Total |       2044     100.00     100.00           
 -----------------------------------------------------------

 . 
 . * Drop missing values.
 . drop if mi(sxp, age, coninc, educ, marital, wrkstat)
 (439 observations deleted)

 . 
 . * Drop ambiguous wrkstat category "Other".
 . drop if wrkstat == 8
 (45 observations deleted)

 . 
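 . * Recode sex. Caution: sex == 1 codes males (see -fre sex- above), so this
 . * dummy equals 1 for men and 0 for women, despite its name.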
 . gen female = (sex == 1) if !mi(sex)
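
 . 
 . * A minimal sketch, not run in this session: a cross-tabulation shows how the
 . * dummy maps onto the original codes, and a dummy equal to 1 for women would
 . * test sex == 2 instead. The variable name woman is hypothetical.
 . tab sex female, mi
   (output omitted)

 . gen byte woman = (sex == 2) if !mi(sex)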

 . 
 . * Final sample size.
 . count
 1560

 . 
 . * Survey weights.
 . svyset vpsu [weight = wtssall], strata (vstrat)
 (sampling weights assumed)

      pweight: wtssall
          VCE: linearized
  Single unit: missing
     Strata 1: vstrat
         SU 1: vpsu
        FPC 1: <zero>

 . 
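 . * A minimal sketch, not run in this session: once the design is declared, the
 . * svy prefix applies the weights and design-based standard errors to the
 . * commands that support it.
 . svy: mean sxp, over(female)
   (output omitted)

 . svy: reg sxp i.female
   (output omitted)

 . 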
 . * Export summary stats.
 . stab using week12_stats.txt, replace ///
 >         mean(coninc educ size) ///
 >         prop(age6 marital wrkstat)
 (note: file week12_stats.txt not found)

 Variable                     mean           sd          min          max
 >                            mean           sd          min          max
 >                            mean           sd          min          max
 >                            mean           sd          min          max
 >                            mean           sd          min          max
 >                            mean           sd          min          max
 >                            mean           sd          min          max
 >                            mean           sd          min          max

 Age groups                      %            %            %            %            
 > %            %            %            %

 marital status                  %            %            %            %            
 > %            %            %            %

 labor force status              %            %            %            %            
 > %            %            %            %

 N = 1560
 File: week12_stats.txt

 . 
 . 
 . * ===================
 . * = DV DISTRIBUTION =
 . * ===================
 . 
 . 
 . * Explore the DV.
 . fre sxp

 sxp -- how many sex partners r had in last 5 years
 ------------------------------------------------------------------------------
                                 |      Freq.    Percent      Valid       Cum.
 ---------------------------------+--------------------------------------------
 Valid   0 no partners            |        201      12.88      12.88      12.88
        1 1 partner              |        856      54.87      54.87      67.76
        2 2 partners             |        161      10.32      10.32      78.08
        3 3 partners             |        106       6.79       6.79      84.87
        4 4 partners             |         70       4.49       4.49      89.36
        5 5-10 partners          |        116       7.44       7.44      96.79
        6 11-20 partners         |         37       2.37       2.37      99.17
        7 21-100 partners        |         10       0.64       0.64      99.81
        8 more than 100 partners |          3       0.19       0.19     100.00
        Total                    |       1560     100.00     100.00           
 ------------------------------------------------------------------------------

 . 
 . * Histogram for normality assessment.
 . hist sxp, bin(10) percent addl norm ///
 >         name(dv_hist, replace)
 (bin=10, start=0, width=.8)

 .         
 . * Bivariate hypothesis test: mean DV by sex.
 . ttest sxp, by(female)

 Two-sample t test with equal variances
 ------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
       0 |     861    1.513357    .0483794    1.419587    1.418401    1.608312
       1 |     699    1.958512     .065634    1.735272    1.829648    2.087376
 ---------+--------------------------------------------------------------------
 combined |    1560    1.712821    .0401031    1.583944    1.634159    1.791482
 ---------+--------------------------------------------------------------------
    diff |           -.4451556    .0798758               -.6018309   -.2884803
 ------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -5.5731
 Ho: diff = 0                                     degrees of freedom =     1558

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

 . 
 . 
 . * =====================
 . * = REGRESSION MODELS =
 . * =====================
 . 
 . 
 . * A simple linear regression model test.
 . reg sxp i.female

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F(  1,  1558) =   31.06
       Model |  76.4503376     1  76.4503376           Prob > F      =  0.0000
    Residual |  3834.89325  1558  2.46142057           R-squared     =  0.0195
 -------------+------------------------------           Adj R-squared =  0.0189
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.5689

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4451556   .0798758     5.57   0.000     .2884803    .6018309
       _cons |   1.513357   .0534677    28.30   0.000      1.40848    1.618233
 ------------------------------------------------------------------------------
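
 . 
 . * A minimal sketch, not run in this session: with a single dummy regressor,
 . * the model reproduces the t-test above. The constant is the mean of group 0,
 . * the coefficient is the difference in means with its sign reversed, and the
 . * F statistic is the square of the t statistic.
 . display %6.2f (-5.5731)^2
   (output omitted)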

 . 
 . * Let's add some of our control variables one by one. Let's first control for
 . * income: is higher income associated with a higher number of partners?
 . reg sxp i.female coninc

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F(  2,  1557) =   29.90
       Model |  144.654674     2  72.3273369           Prob > F      =  0.0000
    Residual |  3766.68892  1557  2.41919648           R-squared     =  0.0370
 -------------+------------------------------           Adj R-squared =  0.0357
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.5554

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4884352   .0796061     6.14   0.000     .3322887    .6445816
      coninc |  -5.25e-06   9.89e-07    -5.31   0.000    -7.19e-06   -3.31e-06
       _cons |   1.743573   .0684809    25.46   0.000     1.609248    1.877898
 ------------------------------------------------------------------------------

 . 
 . * Let's transform income into a more meaningful scale: a one-dollar change in
 . * income is too small to show a sizeable effect. Let's measure income in
 . * units of 10,000 USD.
 . gen inc = coninc / 10^4

 . 
 . * Regress again.
 . reg sxp i.female inc

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F(  2,  1557) =   29.90
       Model |  144.654674     2  72.3273371           Prob > F      =  0.0000
    Residual |  3766.68892  1557  2.41919648           R-squared     =  0.0370
 -------------+------------------------------           Adj R-squared =  0.0357
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.5554

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4884352   .0796061     6.14   0.000     .3322887    .6445816
         inc |  -.0525298   .0098932    -5.31   0.000    -.0719352   -.0331245
       _cons |   1.743573   .0684809    25.46   0.000     1.609248    1.877898
 ------------------------------------------------------------------------------
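
 . 
 . * A minimal sketch, not run in this session: rescaling a predictor rescales
 . * its coefficient and standard error by the same factor, and leaves the t
 . * statistic, the other coefficients and the model fit unchanged.
 . display -5.25e-06 * 10^4
   (output omitted)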

 . 
 . * Let's control for education as well.
 . reg sxp i.female inc educ

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F(  3,  1556) =   20.81
       Model |  150.858853     3  50.2862843           Prob > F      =  0.0000
    Residual |  3760.48474  1556  2.41676397           R-squared     =  0.0386
 -------------+------------------------------           Adj R-squared =  0.0367
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.5546

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4983867   .0798081     6.24   0.000     .3418439    .6549295
         inc |  -.0602989   .0110131    -5.48   0.000    -.0819011   -.0386968
        educ |   .0236624   .0147684     1.60   0.109    -.0053057    .0526304
       _cons |   1.449399    .195946     7.40   0.000     1.065053    1.833745
 ------------------------------------------------------------------------------

 . 
 . * Let's control for urban size.
 . reg sxp i.female inc educ size

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F(  4,  1555) =   17.99
       Model |    172.9803     4   43.245075           Prob > F      =  0.0000
    Residual |  3738.36329  1555  2.40409215           R-squared     =  0.0442
 -------------+------------------------------           Adj R-squared =  0.0418
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.5505

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4968538   .0796002     6.24   0.000     .3407187    .6529889
         inc |  -.0609605   .0109864    -5.55   0.000    -.0825102   -.0394109
        educ |   .0218699   .0147415     1.48   0.138    -.0070454    .0507851
        size |   .0001062    .000035     3.03   0.002     .0000375    .0001748
       _cons |   1.444007   .1954397     7.39   0.000     1.060654     1.82736
 ------------------------------------------------------------------------------

 . 
 . * How about working status?
 . reg sxp i.female inc educ size i.wrkstat

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F( 10,  1549) =   16.98
       Model |  386.460207    10  38.6460207           Prob > F      =  0.0000
    Residual |  3524.88338  1549  2.27558643           R-squared     =  0.0988
 -------------+------------------------------           Adj R-squared =  0.0930
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.5085

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |    .473281   .0797203     5.94   0.000     .3169099     .629652
         inc |  -.0637253   .0108897    -5.85   0.000    -.0850853   -.0423652
        educ |   .0178777    .014525     1.23   0.219     -.010613    .0463683
        size |    .000102   .0000341     2.99   0.003     .0000351     .000169
             |
     wrkstat |
          2  |  -.2652899    .124414    -2.13   0.033    -.5093275   -.0212522
          3  |  -.3174371   .2858415    -1.11   0.267    -.8781141      .24324
          4  |   .2735554   .1528662     1.79   0.074     -.026291    .5734019
          5  |  -.9783119   .1173846    -8.33   0.000    -1.208561   -.7480624
          6  |   .3471192   .1837131     1.89   0.059    -.0132334    .7074719
          7  |   -.373904   .1331162    -2.81   0.005    -.6350109   -.1127971
             |
       _cons |   1.699544   .2026013     8.39   0.000     1.302142    2.096945
 ------------------------------------------------------------------------------

 . fre wrkstat

 wrkstat -- labor force status
 ------------------------------------------------------------------------
                           |      Freq.    Percent      Valid       Cum.
 ---------------------------+--------------------------------------------
 Valid   1 working fulltime |        770      49.36      49.36      49.36
        2 working parttime |        186      11.92      11.92      61.28
        3 temp not working |         29       1.86       1.86      63.14
        4 unempl, laid off |        115       7.37       7.37      70.51
        5 retired          |        214      13.72      13.72      84.23
        6 school           |         76       4.87       4.87      89.10
        7 keeping house    |        170      10.90      10.90     100.00
        Total              |       1560     100.00     100.00           
 ------------------------------------------------------------------------

 . 
 . * Let's add a control for marital status.
 . reg sxp i.female inc educ size i.wrkstat i.marital

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F( 14,  1545) =   29.71
       Model |  829.670175    14  59.2621554           Prob > F      =  0.0000
    Residual |  3081.67341  1545  1.99461062           R-squared     =  0.2121
 -------------+------------------------------           Adj R-squared =  0.2050
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.4123

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4165171   .0754064     5.52   0.000     .2686074    .5644268
         inc |  -.0091845   .0110717    -0.83   0.407    -.0309016    .0125327
        educ |  -.0104019   .0137466    -0.76   0.449    -.0373658     .016562
        size |   .0000473   .0000323     1.46   0.143    -.0000161    .0001106
             |
     wrkstat |
          2  |  -.2570435   .1167104    -2.20   0.028     -.485971   -.0281161
          3  |  -.3358877    .268278    -1.25   0.211    -.8621152    .1903397
          4  |   .1626127   .1433551     1.13   0.257    -.1185785    .4438038
          5  |  -.6387619   .1158716    -5.51   0.000    -.8660441   -.4114796
          6  |  -.0355131   .1753451    -0.20   0.840    -.3794527    .3084266
          7  |  -.2376881   .1254588    -1.89   0.058    -.4837755    .0083994
             |
     marital |
          2  |  -.1436741   .1535087    -0.94   0.349    -.4447816    .1574333
          3  |   .6954988   .1079756     6.44   0.000     .4837047     .907293
          4  |   .5578938   .2157927     2.59   0.010     .1346163    .9811713
          5  |   1.346379   .0951923    14.14   0.000     1.159659    1.533098
             |
       _cons |    1.33734   .1949692     6.86   0.000     .9549077    1.719772
 ------------------------------------------------------------------------------

 . fre marital

 marital -- marital status
 ---------------------------------------------------------------------
                        |      Freq.    Percent      Valid       Cum.
 ------------------------+--------------------------------------------
 Valid   1 married       |        704      45.13      45.13      45.13
        2 widowed       |        113       7.24       7.24      52.37
        3 divorced      |        254      16.28      16.28      68.65
        4 separated     |         47       3.01       3.01      71.67
        5 never married |        442      28.33      28.33     100.00
        Total           |       1560     100.00     100.00           
 ---------------------------------------------------------------------

 . 
 . * Finally, let's control for age.
 . reg sxp i.female inc educ size i.wrkstat i.marital age

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F( 15,  1544) =   41.57
       Model |  1125.14908    15   75.009939           Prob > F      =  0.0000
    Residual |  2786.19451  1544  1.80453012           R-squared     =  0.2877
 -------------+------------------------------           Adj R-squared =  0.2807
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.3433

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4744982   .0718664     6.60   0.000     .3335321    .6154644
         inc |   .0000264   .0105555     0.00   0.998    -.0206783     .020731
        educ |  -.0030519   .0130878    -0.23   0.816    -.0287236    .0226198
        size |   .0000364   .0000307     1.18   0.236    -.0000239    .0000966
             |
     wrkstat |
          2  |  -.1573256   .1112833    -1.41   0.158    -.3756079    .0609567
          3  |  -.1842932   .2554498    -0.72   0.471    -.6853584     .316772
          4  |   .2808993   .1366664     2.06   0.040     .0128279    .5489708
          5  |   .1310702   .1255631     1.04   0.297     -.115222    .3773624
          6  |  -.3849779   .1690023    -2.28   0.023    -.7164761   -.0534797
          7  |  -.0950985   .1198503    -0.79   0.428    -.3301852    .1399881
             |
     marital |
          2  |   .4698645    .153682     3.06   0.002      .168417    .7713121
          3  |   .8262775   .1032092     8.01   0.000     .6238326    1.028722
          4  |   .5124794   .2052838     2.50   0.013     .1098149    .9151439
          5  |   .8806497   .0975843     9.02   0.000      .689238    1.072061
             |
         age |  -.0378645    .002959   -12.80   0.000    -.0436687   -.0320603
       _cons |   2.877276   .2210723    13.02   0.000     2.443643     3.31091
 ------------------------------------------------------------------------------

 . 
 .                  
 . * Reinterpretation of the constant
 . * --------------------------------
 . 
 . * Lastly, the constant is the predicted value of y when each categorical IV
 . * is at its reference category (male, working full-time, married) and each
 . * continuous IV equals 0 (income = 0, education = 0, age = 0, size = 0).
 . * For continuous variables, however, a value of 0 is often unlikely (educ = 0
 . * and income = 0) or unrealistic (age = 0 and size = 0), as it is here, so
 . * the constant has no meaningful interpretation. In such cases, it is best to
 . * center your continuous IVs so that their mean equals 0: the constant then
 . * refers to a respondent at the sample mean of each continuous IV.
 . * To do so, we simply need to subtract from each variable its mean.
 . 
 . su inc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         inc |      1560    4.751758    4.002815     .04015   15.29272

 . gen zinc = inc - r(mean)

 . 
 . su size

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
        size |      1560    319.8955    1123.792          0       8008

 . gen zsize = size - r(mean)

 . 
 . su age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
         age |      1560    46.68269    16.85957         18         89

 . gen zage = age - r(mean)

 . 
 . su educ

    Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
        educ |      1560    13.80385    2.970237          2         20

 . gen zeduc = educ - r(mean)

 . 
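 . * A minimal sketch, not run in this log: the four centering steps above could
 . * also be written as a loop. The c_ prefix is a hypothetical naming choice,
 . * used here to avoid clashing with the z* variables created above.
 . *
 . *   foreach v of varlist inc size age educ {
 . *       su `v', meanonly
 . *       gen c_`v' = `v' - r(mean)
 . *   }
 . 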
 . * Replicate the final regression model with transformed continuous variables.
 . reg sxp i.female zinc zeduc zsize i.wrkstat i.marital zage

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F( 15,  1544) =   41.57
       Model |  1125.14908    15  75.0099388           Prob > F      =  0.0000
    Residual |  2786.19451  1544  1.80453012           R-squared     =  0.2877
 -------------+------------------------------           Adj R-squared =  0.2807
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.3433

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4744982   .0718664     6.60   0.000     .3335321    .6154644
        zinc |   .0000264   .0105555     0.00   0.998    -.0206783     .020731
       zeduc |  -.0030519   .0130878    -0.23   0.816    -.0287236    .0226198
       zsize |   .0000364   .0000307     1.18   0.236    -.0000239    .0000966
             |
     wrkstat |
          2  |  -.1573256   .1112833    -1.41   0.158    -.3756079    .0609567
          3  |  -.1842932   .2554498    -0.72   0.471    -.6853584     .316772
          4  |   .2808993   .1366664     2.06   0.040     .0128279    .5489708
          5  |   .1310702   .1255631     1.04   0.297     -.115222    .3773624
          6  |  -.3849779   .1690023    -2.28   0.023    -.7164761   -.0534797
          7  |  -.0950985   .1198503    -0.79   0.428    -.3301852    .1399881
             |
     marital |
          2  |   .4698645    .153682     3.06   0.002      .168417    .7713121
          3  |   .8262775   .1032092     8.01   0.000     .6238326    1.028722
          4  |   .5124794   .2052838     2.50   0.013     .1098149    .9151439
          5  |   .8806497   .0975843     9.02   0.000      .689238    1.072061
             |
        zage |  -.0378645    .002959   -12.80   0.000    -.0436687   -.0320603
       _cons |   1.079297   .0741387    14.56   0.000     .9338733     1.22472
 ------------------------------------------------------------------------------

 . 
 . * The results do not change except for the constant. For this model, the constant
 . * stands for the average number of partners among respondents who are:
 . * - Male (female = 0)
 . * - With average income (zinc = 0)
 . * - With average education (...)
 . * - From a town of average size (zsize = 0)
 . * - Employed full-time
 . * - Married
 . * - Of average age (zage = 0)
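 . *
 . * As a rough check, not run in this log: centering shifts the constant by the
 . * sum of each continuous coefficient times the mean of its variable. Using the
 . * rounded values printed above, the following should return about 1.0793,
 . * the new constant up to rounding:
 . *
 . *   di 2.877276 + .0000264*4.751758 - .0030519*13.80385 ///
 . *      + .0000364*319.8955 - .0378645*46.68269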
 . 
 . 
 . * Standardized coefficients
 . * -------------------------
 . 
 . * Model with metric coefficients (in units of each variable).
 . reg sxp i.female zinc zeduc zsize i.wrkstat i.marital zage

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F( 15,  1544) =   41.57
       Model |  1125.14908    15  75.0099388           Prob > F      =  0.0000
    Residual |  2786.19451  1544  1.80453012           R-squared     =  0.2877
 -------------+------------------------------           Adj R-squared =  0.2807
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.3433

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4744982   .0718664     6.60   0.000     .3335321    .6154644
        zinc |   .0000264   .0105555     0.00   0.998    -.0206783     .020731
       zeduc |  -.0030519   .0130878    -0.23   0.816    -.0287236    .0226198
       zsize |   .0000364   .0000307     1.18   0.236    -.0000239    .0000966
             |
     wrkstat |
          2  |  -.1573256   .1112833    -1.41   0.158    -.3756079    .0609567
          3  |  -.1842932   .2554498    -0.72   0.471    -.6853584     .316772
          4  |   .2808993   .1366664     2.06   0.040     .0128279    .5489708
          5  |   .1310702   .1255631     1.04   0.297     -.115222    .3773624
          6  |  -.3849779   .1690023    -2.28   0.023    -.7164761   -.0534797
          7  |  -.0950985   .1198503    -0.79   0.428    -.3301852    .1399881
             |
     marital |
          2  |   .4698645    .153682     3.06   0.002      .168417    .7713121
          3  |   .8262775   .1032092     8.01   0.000     .6238326    1.028722
          4  |   .5124794   .2052838     2.50   0.013     .1098149    .9151439
          5  |   .8806497   .0975843     9.02   0.000      .689238    1.072061
             |
        zage |  -.0378645    .002959   -12.80   0.000    -.0436687   -.0320603
       _cons |   1.079297   .0741387    14.56   0.000     .9338733     1.22472
 ------------------------------------------------------------------------------

 . 
 . * Same model, adding standardized (beta) coefficients in the last column.
 . reg sxp i.female zinc zeduc zsize i.wrkstat i.marital zage, b

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F( 15,  1544) =   41.57
       Model |  1125.14908    15  75.0099388           Prob > F      =  0.0000
    Residual |  2786.19451  1544  1.80453012           R-squared     =  0.2877
 -------------+------------------------------           Adj R-squared =  0.2807
       Total |  3911.34359  1559  2.50887979           Root MSE      =  1.3433

 ------------------------------------------------------------------------------
         sxp |      Coef.   Std. Err.      t    P>|t|                     Beta
 -------------+----------------------------------------------------------------
    1.female |   .4744982   .0718664     6.60   0.000                 .1490217
        zinc |   .0000264   .0105555     0.00   0.998                 .0000666
       zeduc |  -.0030519   .0130878    -0.23   0.816                 -.005723
       zsize |   .0000364   .0000307     1.18   0.236                 .0258157
             |
     wrkstat |
          2  |  -.1573256   .1112833    -1.41   0.158                -.0321976
          3  |  -.1842932   .2554498    -0.72   0.471                -.0157207
          4  |   .2808993   .1366664     2.06   0.040                 .0463562
          5  |   .1310702   .1255631     1.04   0.297                 .0284779
          6  |  -.3849779   .1690023    -2.28   0.023                -.0523401
          7  |  -.0950985   .1198503    -0.79   0.428                -.0187146
             |
     marital |
          2  |   .4698645    .153682     3.06   0.002                 .0769167
          3  |   .8262775   .1032092     8.01   0.000                 .1926589
          4  |   .5124794   .2052838     2.50   0.013                 .0553248
          5  |   .8806497   .0975843     9.02   0.000                 .2506167
             |
        zage |  -.0378645    .002959   -12.80   0.000                -.4030311
       _cons |   1.079297   .0741387    14.56   0.000                        .
 ------------------------------------------------------------------------------

 . 
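 . * A rough check, not run in this log: a standardized coefficient is the metric
 . * coefficient multiplied by sd(x) / sd(y). For age, sd(x) is taken from the
 . * -su age- output above and sd(y) is the square root of the Total MS in the
 . * regression table. The result should be close to the Beta reported for zage
 . * (about -.403):
 . *
 . *   di -.0378645 * 16.85957 / sqrt(2.50887979)
 . 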
 . 
 . * Residuals
 . * ---------
 . 
 . * Get residuals.
 . predict r, resid

 . 
 . * Distribution of the residuals.
 . kdensity r, norm

 . 
 . * Residuals-versus-fitted values plot.
 . rvfplot

 . 
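 . * Further checks one might try, not run in this log. These are suggestions
 . * rather than part of the original session:
 . *
 . *   qnorm r                // residual quantiles against normal quantiles
 . *   rvfplot, yline(0)      // same plot with a reference line at zero
 . *   estat hettest          // Breusch-Pagan test for heteroskedasticity
 . 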
 . 
 . * Extensions
 . * ----------
 . 
 . recode partnrs5 (0 = 0) (1 = 1) (2 = 2) (3 = 3) (4 = 4) ///
 >         (5 = 8) (6 = 15) (7 = 60) (8 = 120) (else = .), gen(sxp_count)
 (166 differences between partnrs5 and sxp_count)

 . 
 . * Multiple linear regression.
 . eststo LIN: reg sxp_count i.female inc educ size i.wrkstat i.marital age

      Source |       SS       df       MS              Number of obs =    1560
 -------------+------------------------------           F( 15,  1544) =    8.80
       Model |  6862.01082    15  457.467388           Prob > F      =  0.0000
    Residual |  80250.7578  1544  51.9758794           R-squared     =  0.0788
 -------------+------------------------------           Adj R-squared =  0.0698
       Total |  87112.7686  1559  55.8773371           Root MSE      =  7.2094

 ------------------------------------------------------------------------------
   sxp_count |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   1.244053   .3856958     3.23   0.001     .4875102    2.000596
         inc |  -.0379668   .0566498    -0.67   0.503    -.1490854    .0731519
        educ |  -.1309148   .0702401    -1.86   0.063    -.2686908    .0068612
        size |   .0000231   .0001649     0.14   0.889    -.0003003    .0003465
             |
     wrkstat |
          2  |  -.9158042   .5972397    -1.53   0.125    -2.087291    .2556825
          3  |  -1.379695   1.370959    -1.01   0.314    -4.068833    1.309443
          4  |   .6738155   .7334673     0.92   0.358    -.7648818    2.112513
          5  |   .1258675   .6738774     0.19   0.852    -1.195944    1.447679
          6  |  -1.345912   .9070085    -1.48   0.138    -3.125011    .4331864
          7  |  -1.048245   .6432179    -1.63   0.103    -2.309918    .2134281
             |
     marital |
          2  |   1.339347   .8247873     1.62   0.105     -.278475    2.957168
          3  |   .9072996   .5539073     1.64   0.102    -.1791905     1.99379
          4  |    2.18047   1.101726     1.98   0.048     .0194335    4.341507
          5  |   1.998603   .5237195     3.82   0.000      .971326    3.025879
             |
         age |  -.0861799   .0158807    -5.43   0.000    -.1173299   -.0550299
       _cons |   7.521305    1.18646     6.34   0.000     5.194062    9.848548
 ------------------------------------------------------------------------------

 . 
 . * Negative binomial regression (for count data).
 . eststo NBR: nbreg sxp_count i.female inc educ size i.wrkstat i.marital age

 Fitting Poisson model:

 Iteration 0:   log likelihood = -4765.8027  
 Iteration 1:   log likelihood = -4765.5101  
 Iteration 2:   log likelihood =   -4765.51  

 Fitting constant-only model:

 Iteration 0:   log likelihood = -3370.3246  
 Iteration 1:   log likelihood = -3360.0198  
 Iteration 2:   log likelihood = -3360.0186  
 Iteration 3:   log likelihood = -3360.0186  

 Fitting full model:

 Iteration 0:   log likelihood = -3119.4143  
 Iteration 1:   log likelihood = -3092.2597  
 Iteration 2:   log likelihood = -3010.6521  
 Iteration 3:   log likelihood = -3009.4699  
 Iteration 4:   log likelihood = -3009.4695  
 Iteration 5:   log likelihood = -3009.4695  

 Negative binomial regression                      Number of obs   =       1560
                                                  LR chi2(15)     =     701.10
 Dispersion     = mean                             Prob > chi2     =     0.0000
 Log likelihood = -3009.4695                       Pseudo R2       =     0.1043

 ------------------------------------------------------------------------------
   sxp_count |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
    1.female |   .4677508   .0593528     7.88   0.000     .3514216    .5840801
         inc |  -.0098874     .00899    -1.10   0.271    -.0275075    .0077327
        educ |  -.0382383   .0114983    -3.33   0.001    -.0607746    -.015702
        size |   .0000109    .000025     0.44   0.661     -.000038    .0000598
             |
     wrkstat |
          2  |  -.2188368   .0960093    -2.28   0.023    -.4070116   -.0306621
          3  |  -.3174595   .2301175    -1.38   0.168    -.7684815    .1335625
          4  |   .2093893   .1045834     2.00   0.045     .0044096    .4143691
          5  |   .0796039   .1174751     0.68   0.498    -.1506432    .3098509
          6  |  -.3848193   .1280587    -3.01   0.003    -.6358097   -.1338289
          7  |     -.4379   .1063386    -4.12   0.000    -.6463198   -.2294802
             |
     marital |
          2  |   .2248589   .1511692     1.49   0.137    -.0714272    .5211451
          3  |   .5367478   .0863664     6.21   0.000     .3674727    .7060229
          4  |   .6616965   .1558389     4.25   0.000     .3562578    .9671352
          5  |    .574297   .0771317     7.45   0.000     .4231217    .7254723
             |
         age |  -.0375803   .0026774   -14.04   0.000    -.0428278   -.0323327
       _cons |   2.572448   .1841411    13.97   0.000     2.211538    2.933358
 -------------+----------------------------------------------------------------
    /lnalpha |  -.3374227    .050378                     -.4361617   -.2386838
 -------------+----------------------------------------------------------------
       alpha |   .7136071   .0359501                      .6465132    .7876639
 ------------------------------------------------------------------------------
 Likelihood-ratio test of alpha=0:  chibar2(01) = 3512.08 Prob>=chibar2 = 0.000

 . 
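 . * A sketch for interpretation, not run in this log: negative binomial
 . * coefficients are on the log scale, so exponentiating one gives an
 . * incidence-rate ratio. For 1.female, the first line should return about 1.60,
 . * i.e. an expected count roughly 60% higher when female = 1, other IVs held
 . * constant. Replaying the model with the irr option reports all ratios at once:
 . *
 . *   di exp(.4677508)
 . *   nbreg, irr
 . 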
 . * Compare models.
 . esttab LIN NBR, b(1) wide compress mti("Lin. reg." "Neg. bin.")

 --------------------------------------------------------
                 (1)                    (2)             
           Lin. reg.              Neg. bin.             
 --------------------------------------------------------
 main                                                    
 0b.female        0.0          (.)       0.0          (.)
 1.female         1.2**     (3.23)       0.5***    (7.88)
 inc             -0.0      (-0.67)      -0.0      (-1.10)
 educ            -0.1      (-1.86)      -0.0***   (-3.33)
 size             0.0       (0.14)       0.0       (0.44)
 1b.wrkstat       0.0          (.)       0.0          (.)
 2.wrkstat       -0.9      (-1.53)      -0.2*     (-2.28)
 3.wrkstat       -1.4      (-1.01)      -0.3      (-1.38)
 4.wrkstat        0.7       (0.92)       0.2*      (2.00)
 5.wrkstat        0.1       (0.19)       0.1       (0.68)
 6.wrkstat       -1.3      (-1.48)      -0.4**    (-3.01)
 7.wrkstat       -1.0      (-1.63)      -0.4***   (-4.12)
 1b.marital       0.0          (.)       0.0          (.)
 2.marital        1.3       (1.62)       0.2       (1.49)
 3.marital        0.9       (1.64)       0.5***    (6.21)
 4.marital        2.2*      (1.98)       0.7***    (4.25)
 5.marital        2.0***    (3.82)       0.6***    (7.45)
 age             -0.1***   (-5.43)      -0.0***  (-14.04)
 _cons            7.5***    (6.34)       2.6***   (13.97)
 --------------------------------------------------------
 lnalpha                                                 
 _cons                                  -0.3***   (-6.70)
 --------------------------------------------------------
 N               1560                   1560             
 --------------------------------------------------------
 t statistics in parentheses
 * p<0.05, ** p<0.01, *** p<0.001

 . 
 . * Export in wide format.
 . esttab LIN NBR using week12_regressions.txt, ///
 >         b(1) wide compress mti("Lin. reg." "Neg. bin.")
 (output written to week12_regressions.txt)

 . 
 . 
 . * ========
 . * = EXIT =
 . * ========
 . 
 . 
 . * Close log (if opened).
 . cap log close

 . 
 . * We are done. Just quit the application, have a nice week, and see you soon :)
 . * exit, clear
 . 
 end of do-file