mepsrajput/Data_Set_Operations.md

Last active November 14, 2021 02:59

Star (1) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/mepsrajput/f77bcb0471e7b37e5e50bdf339ea4bf6.js"></script>
Save mepsrajput/f77bcb0471e7b37e5e50bdf339ea4bf6 to your computer and use it in GitHub Desktop.

Download ZIP

My SAS Notes

Raw

Data_Set_Operations.md

1. Read Raw Data

1.1 Reading ASCII(Text) Data Set

DATA TEMP; 
   INFILE '/folders/myfolders/World Happiness/practice text dataset.txt' firstobs= 2; 
   INPUT @1 ID @5 Name $ 5-17 Location $;
RUN;
PROC PRINT DATA = TEMP;
RUN;

1.2 Reading Delimited Data

DATA TEMP; 
   INFILE '/folders/myfolders/World Happiness/2019.csv'  dlm="," firstobs=2; 
   INPUT Rank Region $ Score GDP_per_capita SS HL_Expectancy Freedom_To_Make_Life_Choices 
   		 Geneorisity Perception_Of_curroption;
RUN;
PROC PRINT DATA = TEMP;
RUN;

2. Write Data Set

2.1 PROC EXPORT

PROC EXPORT DATA = sashelp.cars OUTFILE = '/folders/myfolders/sasuser.v94/car_data.txt' DBMS = dlm;
   DELIMITER = ';';
RUN;

2.2 Writing a CSV file ;

PROC EXPORT DATA = sashelp.cars OUTFILE = '/folders/myfolders/sasuser.v94/car_data_csv.csv' DBMS = csv; RUN;

2.3 Writing a tab delimited file ;

PROC EXPORT DATA = sashelp.cars OUTFILE = '/folders/myfolders/sasuser.v94/car_tab.txt' DBMS = tab;
RUN;

3. Concatenate Data Sets

DATA ITDEPT; 
	INPUT empid name $ salary; 
	DATALINES; 
	1 Rick 623.3 
	3 Mike 611.5 
	6 Tusar 578.6 
	; 
RUN; 
DATA NON_ITDEPT; 
	INPUT empid name $ salary; 
	DATALINES; 
	2 Dan 515.2 
	4 Ryan 729.1 
	5 Gary 843.25 
	7 Pranab 632.8 
	8 Rasmi 722.5 
	;
RUN; 
DATA All_Dept; 
   SET ITDEPT NON_ITDEPT; 
RUN; 
PROC PRINT DATA = ITDEPT;
RUN;  
PROC PRINT DATA = NON_ITDEPT; * Merged ITDEPT and NON_ITDEPT vertically ;
RUN; 
PROC PRINT DATA = All_Dept; * Print Vertically Merged ITDEPT and NON_ITDEPT ;
RUN;

3.1 Scenarios

3.1.1 Different number of variables

DATA ITDEPT; 
	INPUT empid name $ salary DOJ date9.; 
	DATALINES; 
	1 Rick 623.3 02APR2001
	3 Mike 611.5 21OCT2000
	6 Tusar 578.6 01MAR2009  
	; 
RUN; 
DATA NON_ITDEPT; 
	INPUT empid name $ salary  ; 
	DATALINES; 
	2 Dan 515.2 
	4 Ryan 729.1 
	5 Gary 843.25 
	7 Pranab 632.8 
	8 Rasmi 722.5 
	;
RUN; 
DATA All_Dept; 
   SET ITDEPT NON_ITDEPT; 
RUN; 
PROC PRINT DATA = All_Dept; 
RUN;

3.1.2 Different variable name

DATA ITDEPT; 
	INPUT empid ename $ salary; 
	DATALINES; 
	1 Rick 623.3 
	3 Mike 611.5 
	6 Tusar 578.6 
	; 
RUN; 
DATA NON_ITDEPT; 
	INPUT empid empname $ salary; 
	DATALINES; 2 Dan 515.2 
	4 Ryan 729.1 
	5 Gary 843.25 
	7 Pranab 632.8 
	8 Rasmi 722.5 
	;
RUN; 
DATA All_Dept; 
   SET ITDEPT(RENAME =(ename = empname)) NON_ITDEPT; * renaming ename to empname & then merging both datset ;
RUN; 
PROC PRINT DATA = All_Dept; 
RUN;

3.1.3 Different variable lengths

DATA SALARY; 
	INPUT empid name $ salary; 
	DATALINES; 
	1 Rick 623.3		 
	2 Dan 515.2 		
	3 Mike 611.5 
	;
RUN; 
DATA DEPT; 
	INPUT empid dEPT $; 
	DATALINES; 
	1 IT 
	2 OPS
	3 IT
	;
RUN; 
DATA All_details;
	MERGE SALARY DEPT;
	BY empid;
RUN;
PROC PRINT DATA = All_details; 
RUN;

4. Merge Data Sets

4.1 Data Merging

DATA SALARY; 
	INPUT empid name $ salary; 
	DATALINES; 
	1 Rick 623.3		 
	2 Dan 515.2 		
	3 Jack 416.4
	;
RUN; 
DATA DEPT; 
	INPUT empid dEPT $ ; 
	DATALINES; 
	1 IT 
	2 OPS
	3 IT
	;
RUN; 
DATA All_details;
	MERGE SALARY DEPT;
	BY empid;
RUN;
PROC PRINT DATA = All_details; 
RUN;

4.1.1 Missing Values in the Matching Column ;

There may be cases when some values of the common variable will not match between the data sets. In such cases the data sets still get merged but give missing values in the result.

4.2.2 Merging only the Matches ;

To avoid the missing values in the result we can consider keeping only the observations with matched values for the common variable.

DATA SAL1; 
	INPUT empid name $ salary; 
	DATALINES; 
	1 Rick 623.3		 
	2 Dan 515.2 		
	4 Jack 416.4
	;
RUN; 
DATA DEPT1; 
	INPUT empid dEPT $ ; 
	DATALINES; 
	1 IT 
	2 OPS
	3 IT
	;
RUN;
DATA All_det;
	MERGE SAL1(IN = a) DEPT1(IN = b);
	BY empid;
	IF a = 1 and b = 1;
RUN;
PROC PRINT DATA = All_details; 
RUN;

5. Subsetting Data Sets

5.1 Subsetting Variables

DATA Employee; 
	INPUT empid ename $ salary DEPT $ ; 
	DATALINES; 
	1 Rick 623.3 IT 		 
	2 Dan 515.2 OPS	
	3 Mike 611.5 IT
	;
RUN;
DATA OnlyDept;
	SET Employee;
	KEEP ename DEPT;
RUN;
PROC PRINT DATA = OnlyDept; 
RUN;

The same result can be obtained by dropping the variables that are not required.

DATA OnlyDept2;
   SET Employee;
   DROP empid salary;
RUN;
PROC PRINT DATA = OnlyDept2; 
RUN;

5.2 Subsetting Observations: to extract only few observations from the entire data set. ;

DATA Employee; 
	INPUT empid name $ salary DEPT $ ; 
	DATALINES; 
	1 Rick 623.3 IT 		 
	2 Dan 515.2 OPS	
	3 Mike 611.5 IT 	
	4 Ryan 729.1 HR 
	5 Gary 843.25 FIN 
	;
RUN;
DATA OnlyDept;
	SET Employee;
	IF salary < 700 THEN DELETE;
RUN;
PROC PRINT DATA = OnlyDept; 
RUN;

6. Sort Data Sets ;

Syntax

PROC SORT DATA = original dataset OUT = Sorted dataset;
	BY variable name;

Variable name - column name on which the sorting happens. Original dataset - the dataset name to be sorted. Sorted dataset - the dataset name after it is sorted.

6.1 Ascending

DATA Employee; 
	INPUT empid name $ salary DEPT $; 
	DATALINES; 
	1 Rick 623.3 IT 		 
	2 Dan 515.2 OPS	
	3 Mike 611.5 IT
	;
RUN;
PROC SORT DATA = Employee OUT = Sorted_sal;
	BY salary;
RUN;
PROC PRINT DATA = Sorted_sal;
RUN;

6.2 Reverse Sorting

DATA Employee; 
	INPUT empid name $ salary DEPT $ ; 
	DATALINES; 
	1 Rick 623.3 IT 		 
	2 Dan 515.2 OPS	
	3 Mike 611.5 IT
	;
RUN;
PROC SORT DATA = Employee OUT = Sorted_sal_reverse ;
	BY DESCENDING salary;
RUN;
PROC PRINT DATA = Sorted_sal_reverse;
RUN;

6.3 Sorting Multiple Variables

DATA Employee; 
	INPUT empid name $ salary DEPT $ ; 
	DATALINES;
	1 Rick 623.3 IT 		 
	2 Dan 515.2 OPS	
	3 Mike 611.5 IT 	
	4 Ryan 729.1 HR 
	5 Gary 515.2 FIN
	6 Rina 515.2 ABC
	;
RUN;
PROC SORT DATA = Employee OUT = Sorted_dept_sal ;
	BY salary DEPT;
RUN;
PROC PRINT DATA = Sorted_dept_sal;
RUN;

7. Format Data Sets

DATA Employee; 
	INPUT empid name $ salary DEPT $; 
	FORMAT name $upcase9.;
	DATALINES; 
	1 Rick 623.3 IT 		 
	2 Dan 515.2 OPS	
	3 Mike 611.5 IT
	;
RUN;
PROC PRINT DATA = Employee; 
RUN;

7.1 Using PROC FORMAT ;

DATA Employee; 
	INPUT empid name $ salary DEPT $ ; 
	DATALINES; 
	1 Rick 623.3 IT 		 
	2 Dan 515.2 OPS
	3 Mike 611.5 IT 	
	4 Ryan 729.1 HR 
	5 Gary 843.25 FIN 
	;
PROC FORMAT;
	VALUE $DEP 'IT' = 'Information Technology' 'OPS'= 'Operations' ;
RUN;
   PROC PRINT DATA = Employee; 
   FORMAT name $upcase9. DEPT $DEP.; 
RUN;

8. SQL

8.1 SQL Create Operation

DATA TEMP;
	INPUT ID $ NAME $ SALARY DEPARTMENT $ 10.;
	DATALINES;
	1 Rick 623.3 IT
	2 Dan 515.2 Operations
	3 Michelle 611 IT
	4 Ryan 729 HR
	5 Gary 843.25 Finance
	;
RUN;
PROC SQL;
	CREATE TABLE EMPLOYEES AS SELECT * FROM TEMP;
QUIT;
PROC PRINT data = EMPLOYEES;
RUN;

8.2 SQL Read Operation

PROC SQL;
	SELECT make,model,type FROM SASHELP.CARS WHERE Make = "Audi" and Type = 'Sports';
QUIT;

8.3 SQL UPDATE Operation

DATA TEMP;
	INPUT ID $ NAME $ SALARY DEPARTMENT $;
	DATALINES;
	1 Rick 623.3 IT
	2 Dan 515.2 Operations
	3 Michelle 611 IT
	4 Ryan 729 HR
	5 Gary 843.25 Finance
	;
RUN;
PROC SQL;
	CREATE TABLE EMPLOYEES2 AS
	SELECT ID as EMPID,
	Name as EMPNAME, SALARY as SALARY, DEPARTMENT as DEPT, SALARY*0.23 as COMMISION
	FROM TEMP;
QUIT;
PROC SQL;
	UPDATE EMPLOYEES2 SET SALARY = SALARY*1.25;
QUIT;
PROC PRINT DATA = EMPLOYEES2;
RUN;

8.3 SQL DELETE Operation

PROC SQL;
	DELETE FROM EMPLOYEES2 WHERE SALARY > 900;
QUIT;
PROC PRINT DATA = EMPLOYEES2;
RUN;

9. ODS (Output Delivery System): To convert a SAS program to more user friendly forms like html or PDF ;

Syntax:

ODS outputtype
PATH path name
FILE = Filename and Path
STYLE = StyleName
;
PROC some proc
;
ODS outputtype CLOSE;

PATH: the statement used in case of HTML output, in other types of output we include the path in the filename.
Style: one of the in-built styles available in the SAS environment.

9.1 Creating HTML Output

ODS HTML 
	PATH = '/folders/myfolders/sasuser.v94/'
	FILE = 'CARS2.html'
	STYLE = EGDefault;
PROC SQL;
	SELECT make, model, invoice 
	FROM sashelp.cars
	WHERE make in ('Audi','BMW')
	AND type = 'Sports'
	;
QUIT;
PROC SQL;
	SELECT make, mean(horsepower) AS meanhp
	FROM sashelp.cars
	WHERE make IN ('Audi','BMW')
	GROUP BY make;
QUIT;
ODS HTML CLOSE;

9.2 Creating PDF Output

ODS PDF FILE = '/folders/myfolders/sasuser.v94/CARS2.pdf' STYLE = EGDefault;
PROC SQL;
	SELECT make, model, invoice 
	FROM sashelp.cars
	WHERE make IN ('Audi','BMW')
	AND type = 'Sports';
QUIT;
PROC SQL;
	SELECT make, mean(horsepower) AS meanhp
	FROM sashelp.cars
	WHERE make IN ('Audi','BMW')
	GROUP BY make;
QUIT;
ODS PDF CLOSE;

9.3 Creating TRF(Word) Output

ODS RTF 
FILE = '/folders/myfolders/sasuser.v94/CARS.rtf'
STYLE = EGDefault;
PROC SQL;
	SELECT make, model, invoice 
	FROM sashelp.cars
	WHERE make IN ('Audi','BMW')
	AND type = 'Sports'
	;
QUIT;
PROC SQL;
	SELECT make, mean(horsepower) AS meanhp
	FROM sashelp.cars
	WHERE make IN ('Audi','BMW')
	GROUP BY make;
QUIT;
ODS RTF CLOSE;

Raw

SAS_basics.md

1. Variables

DATA TEMP;
	INPUT ID NAME $ SALARY DEPT $ DOJ DATE9. ;
	FORMAT DOJ DATE9. ;
	DATALINES;
	1 Rick 623.3 IT 02APR2001
	2 Dan 515.2 OPS 11JUL2012
	3 Michelle 611 IT 21OCT2000
	;
RUN;
PROC PRINT DATA = TEMP;
RUN;

2. Strings

2.1 Declaring String Variables ;

data string_examples;
   /*String variables of length 6 and 5 */
   String1 = 'Hello 7';
   String2 = 'World';
   Joined_strings =  String1 ||String2 ;
run;
proc print data = string_examples noobs;
run;

2.2 String Functions

2.2.1 SUBSTRN: extracts a substring using the start and end positions.

data string_examples;
   String1 = 'Hello';
   sub_string1 = substrn(String1,2,4) ; * Extract from position 2 to 4 ;
   sub_string2 = substrn(String1,3) ; * Extract from position 3 onwards ;
run;
proc print data = string_examples noobs;
run;

2.2.2 TRIMN: removes the trailing space form a string.

data string_examples;
   LENGTH string1 $ 7  ;
   String1='Hello  ';
   length_string1 = lengthc(String1); * Length of String1 (7) ;
   length_trimmed_string = lengthc(TRIMN(String1));  * Length of String1 (5 after removing spaces with TRIMN);
run;
proc print data = string_examples noobs;
run;

3. Arrays: used to store and retrieve a series of values using an index value.

Syntax

ARRAY ARRAY-NAME(SUBSCRIPT) ($) VARIABLE-LIST ARRAY-VALUES

ARRAY : SAS keyword to declare an array.
ARRAY-NAME : name of the array which follows the same rule as variable names.
SUBSCRIPT : number of values the array is going to store.
($) : optional parameter to be used only if the array is going to store character values.
VARIABLE-LIST : optional list of variables which are the place holders for array values.
ARRAY-VALUES : actual values that are stored in the array, can be declared here or can be read from a file or dataline.

3.1 Various Ways of Array Declaration ;

ARRAY AGE[5] (12 18 5 62 44); * Declare an array of length 5 named AGE with values. ;
ARRAY COUNTRIES(0:8) A B C D E F G H I; * An array of length 5 named COUNTRIES, values starting at index 0. ;
ARRAY QUESTS(1:5) $ Q1-Q5; * Declare an array of length 5 named QUESTS which contain character values. ;
ARRAY ANSWER(*) A1-A100; * Declare an array of required length as per the number of values supplied. ;

3.1.1 Accessing Array Values

DATA array_example;
	INPUT a1 $ a2 $ a3 $ a4 $ a5 $;
	ARRAY colours(5) $ a1-a5;
	mix = a1||'+'||a2;
	DATALINES;
	yellow pink orange green blue
	;
RUN;
PROC PRINT DATA = array_example;
RUN;

3.1.2 Using the OF operator ;

DATA array_example_OF;
	INPUT A1 A2 A3 A4;
	ARRAY A(4) A1-A4;
	A_SUM = SUM(OF A(*));
	A_MEAN = MEAN(OF A(*));
	A_MIN = MIN(OF A(*));
	DATALINES;
	21 4 52 11
	96 25 42 6
	;
RUN;
PROC PRINT DATA = array_example_OF;
RUN;

3.1.3 Using the IN operator

DATA array_in_example;
	INPUT A1 $ A2 $ A3 $ A4 $;
	ARRAY COLOURS(4) A1-A4;
	IF 'yellow' IN COLOURS THEN available = 'Yes';ELSE available = 'No';
	DATALINES;
	Orange pink violet yellow
	;
RUN;
PROC PRINT DATA = array_in_example;
RUN;

4. Numeric Formats

Syntax

Varname Formatnamew.d

Varname: is the name of the variable.
Formatname: name of the numeric format applied to the variable.
w: the maximum number of data columns (including digits after . & the . itself) allowed to be stored for the variable.
d: is the number of digits to the right of the decimal.

4.1 Reading Numeric formats

Format - Use

n. : Maximum "n" number of columns with no decimal point.
n.p : Maximum "n" number of columns with "p" decimal points.
COMMAn.p : Maximum "n" number of columns with "p" decimal places which removes any comma or dollar signs.
COMMAn.p : Maximum "n" number of columns with "p" decimal places which removes any comma or dollar signs.

4.2 Displaying Numeric formats ;

Format - Use n. : Write maximum "n" number of digits with no decimal point. n.p : Write maximum "n.p" number of columns with "p" decimal points. DOLLARn.p : Write maximum "n" number of columns with p decimal places, leading dollar sign and a comma at the thousandth place.

Note:

If the number of digits after the decimal point is less than the format specifier then 'zeros will be appended' at the end.
If the number of digits after the decimal point is greater than the format specifier then the last digit will be 'rounded off'.

DATA MYDATA1;
	input x 6.; /*maxiiuum width of the data*/
	format x 6.3;
	datalines;
	8722
	93.2
	.1122
	15.116
	;
PROC PRINT DATA = MYDATA1;
RUN;

DATA MYDATA2;
	input y 6.; /*maximum width of the data*/
	format y 5.2;
	datalines;
	8722
	93.2
	.1122
	15.116
	;
PROC PRINT DATA = MYDATA2;
RUN;

DATA MYDATA3;
	input z 6.; /*maximum width of the data*/
	format z DOLLAR10.2;
	datalines;
	8722
	93.2
	.1122
	15.116
	;
PROC PRINT DATA = MYDATA3;
RUN;

Output:

MYDATA1 Obs x 1 8722.0 # Display 6 columns with zero appended after decimal. 2 93.200 # Display 6 columns with zero appended after decimal. 3 0.112 # No integers before decimal, so display 3 available digits after decimal. 4 15.116 # Display 6 columns with 3 available digits after decimal.

MYDATA2 Obs x 1 8722 # Display 5 columns. Only 4 are available. 2 93.20 # Display 5 columns with zero appended after decimal. 3 0.11 # Display 5 columns with 2 places after decimal. 4 15.12 # Display 5 columns with 2 places after decimal.

MYDATA3 Obs x 1 $8,722.00 # Display 10 columns with leading $ sign, comma at thousandth place and zeros appended after decimal. 2 $93.20 # Only 2 integers available before decimal and one available after the decimal. 3 $0.11 # No integers available before decimal and two available after the decimal. 4 $15.12 # Only 2 integers available before decimal and two available after the decimal.

5. Operators

5.1 Arithmetic Operators: Addition (+), Subtraction (-), Multiplication (*), Division (/), Exponentiation (**)

DATA MYDATA1;
	input @1 COL1 4.2 @7 COL2 3.1; 
	Add_result = COL1+COL2;
	Sub_result = COL1-COL2;
	Mult_result = COL1*COL2;
	Div_result = COL1/COL2;
	Expo_result = COL1**COL2;
	datalines;
	11.21 5.3
	3.11  11
	;
PROC PRINT DATA = MYDATA1;
RUN;

5.2 Logical Operators: ( AND (&), OR (|), NOT (~) )

DATA MYDATA1;
	input @1 COL1 5.2 @7 COL2 4.1; 
	and_=(COL1 > 10 & COL2 > 5 );
	or_ = (COL1 > 12 | COL2 > 15 );
	not_ = ~( COL2 > 7 );
	datalines;
	11.21 5.3
	3.11  11.4
	;
PROC PRINT DATA = MYDATA1;
RUN;

5.3 Comparison Operators: ( EQUAL (=), NOT EQUAL (^=), LESS THAN (<), LESS THAN or EQUAL TO (<=), GREATER THAN (>), GREATER THAN or EQUAL TO (>=), IN ) ;

DATA MYDATA1;
	input @1 COL1 5.2 @7 COL2 4.1; 
	EQ_ = (COL1 = 11.21);
	NEQ_= (COL1 ^= 11.21);
	GT_ = (COL2 => 8);
	LT_ = (COL2 <= 12);
	IN_ = COL2 in( 6.2,5.3,12 );
	datalines;
	11.21 5.3
	3.11  11.4
	;
PROC PRINT DATA = MYDATA1;
RUN;

5.4 Minimum/Maximum Operators: ( AND (&), OR (|), NOT (~) ) ;

DATA MYDATA1;
	input @1 COL1 5.2 @7 COL2 4.1 @12 COL3 6.3; 
	min_ = MIN(COL1 , COL2 , COL3);
	max_ = MAX( COL1, COl2 , COL3);
	datalines;
	11.21 5.3 29.012
	3.11  11.4 18.512
	;
PROC PRINT DATA = MYDATA1;
RUN;

5.5 Concatenation Operator: ( AND (&), OR (|), NOT (~) ) ;

DATA MYDATA1;
	input COL1 $ COL2 $ COL3 $; 
	concat_ = (COL1 || COL2 || COL3);
	datalines;
	Tutorial s point
	simple easy learning
	;
PROC PRINT DATA = MYDATA1;
RUN;

6. Loops

6.1 DO Index - loop continues from the start value till the stop value of the index variable.

DATA MYDATA1;
	SUM = 0;
	DO VAR = 1 to 5;
	   SUM = SUM+VAR;
	END;
RUN;
PROC PRINT DATA = MYDATA1;
RUN;

6.2 DO WHILE - loop continues till the while condition becomes false.

DATA MYDATA;
	SUM = 0;
	VAR = 1;
	DO WHILE(VAR<6) ;
	   SUM = SUM+VAR;
	   VAR+1;
	END;
RUN;
PROC PRINT;
RUN;

6.2 DO UNTIL - loop continues till the UNTIL condition becomes True.

DATA MYDATA;
	SUM = 0;
	VAR = 1;
	DO UNTIL(VAR>5);
	   SUM = SUM+VAR;
	   VAR+1;
	END;
PROC PRINT;
RUN;

7. Decision Making

7.1 IF

Syntax: IF (condition );

7.2 IF THEN ELSE

Syntax:

IF (condition ) THEN result1;
ELSE result2;

7.3 IF THEN ELSE IF

Syntax:

IF (condition1) THEN result1;
ELSE IF (condition2) THEN result2;
ELSE IF (condition3) THEN result3;

7.4 IF THEN DELETE

Syntax: IF (condition) THEN DELETE;

8. Functions

Syntax FUNCTIONNAME(argument1, argument2...argumentn)

8.1 Mathematical Functions: used to apply some mathematical calculations on the variable values.

DATA Math_functions;
	v1=21; v2=42; v3=13; v4=10; v5=29;
	max_val = MAX(v1,v2,v3,v4,v5); /* Get Maximum value */
	min_val = MIN (v1,v2,v3,v4,v5); /* Get Minimum value */
	med_val = MEDIAN (v1,v2,v3,v4,v5); /* Get Median value */
	rand_val = RANUNI(0); /* Get a random number */
	SR_val= SQRT(sum(v1,v2,v3,v4,v5)); /* Get Square root of sum of the values */
PROC PRINT DATA = Math_functions NOOBS;
run;

8.2 Date and Time Functions: used to process date and time values.

data date_functions;
	INPUT @2 date1 DDMMYY10. @13 date2 DDMMYY10.;
	Years_ = INTCK('YEAR', date1, date2); /* Get the interval between the dates in years */
	months_ = INTCK('MONTH', date1, date2); /* Get the interval between the dates in months */
	weekday_ =  WEEKDAY(date1); /* Get the week day from the date */
	today_ = TODAY(); /* Get Today's date in SAS date format */
	time_ = TIME(); /* Get current time in SAS time format */
	format date1 DATE9. date2 DATE9.;
	DATALINES;
	21/10/2000 16/08/1998
	01/03/2009 11/07/2012
	;
	RUN;
PROC PRINT DATA = date_functions NOOBS;
RUN;

8.3 Character Functions: used to process character or text values.

DATA character_functions;
	lowcse_ = LOWCASE('HELLO'); /* Convert the string into lower case */
	upcase_ = UPCASE('hello'); /* Convert the string into upper case */
	reverse_ = REVERSE('Hello'); /* Reverse the string */
	nth_letter_ = SCAN('Learn SAS Now', 2); /* Return the nth word */
RUN;
PROC PRINT DATA = character_functions NOOBS;
RUN;

8.4 Truncation Functions: used to truncate numeric values.

DATA trunc_functions;
	ceil_ = CEIL(11.85); /* Nearest greatest integer */
	floor_ = FLOOR(11.85); /* Nearest greatest integer */
	int_ = INT(32.41); /* Integer portion of a number */
	round_ = ROUND(5621.78); /* Round off to nearest value */
RUN;
PROC PRINT DATA = trunc_functions NOOBS;
RUN;

8.5 Miscellaneous Functions ;

DATA misc_functions;
	state2=zipstate('01040'); /* Nearest greatest integer */ 
	payment = mort(50000, . , .10/12,30*12); /* Amortization calculation */
RUN;
PROC PRINT DATA = misc_functions NOOBS;
RUN;

9. Input Methods

9.1 List Input Method

DATA TEMP;
	INPUT EMPID ENAME $ DEPT $ ;
	DATALINES;
	1 Rick IT
	2 Dan OPS
	3 Tusar IT
	4 Pranab OPS
	5 Rasmi FIN
	;
PROC PRINT DATA = TEMP;
RUN;

9.2 Named Input Method

DATA TEMP;
	INPUT EMPID = ENAME = $ DEPT = $ ;
	DATALINES;
EMPID=101 ENAME=Rick DEPT=IT
EMPID=102 ENAME=Dan DEPT=OPS
EMPID=103 ENAME=Tusar DEPT=IT
EMPID=104 ENAME=Pranab DEPT=OPS
EMPID=105 ENAME=Rasmi DEPT=FIN
;
PROC PRINT DATA = TEMP;
RUN;

9.3 Column Input Method

DATA TEMP;
	INPUT EMPID 1-3 ENAME $ 4-12 DEPT $ 13-16;
	DATALINES;
14 Rick     IT 
241Dan      OPS 
30 Sanvi    IT 
410Chanchal OPS 
52 Piyu     FIN 
;
PROC PRINT DATA = TEMP;
RUN;

9.4 Formatted Input Method

DATA TEMP;
	INPUT @2 EMPID $ @5 ENAME $ @14 DEPT $ ;
	DATALINES;
	14 Rick     IT 
	241 Dan      OPS 
	30 Sanvi    IT 
	410 Chanchal OPS 
	52 Piyu     FIN 
	;
PROC PRINT DATA = TEMP;
RUN;

10. Macros: hold a value to be used again & again by a SAS program

10.1 Global Macro variable

PROC PRINT DATA = sashelp.cars;
	where make = 'Audi' and type = 'Sports' ;
	TITLE "Sales as of &SYSDAY &SYSDATE";
RUN;

10.2 Local Macro variable: used to supply different varaibels to the same SAS statements so that they can process different observations of a data set

%LET make_name = 'Audi';
%LET type_name = 'Sports';
proc print data = sashelp.cars;
	where make = &make_name and type = &type_name ;
	TITLE "Sales as of &SYSDAY &SYSDATE";
RUN;

10.3 Macro Programs: group of SAS statements that is referred by a name and to use it in program anywhere, using that name. It starts with a %MACRO statement and ends with %MEND statement.

%MACRO show_result(make_ , type_);
PROC PRINT DATA = sashelp.cars;
	where make = "&make_" and type = "&type_" ;
	TITLE "Sales as of &SYSDAY &SYSDATE";
RUN;
%MEND;

%show_result(BMW,SUV);

10.4 Commonly Used Macros

10.4.1 %PUT

DATA _null_;
	CALL SYMPUT ('today', TRIM(PUT("&sysdate"d,worddate22.)));
RUN;
%PUT &today;

10.4.2 %RETURN

%MACRO check_condition(val);
   %IF &val = 10 %THEN %RETURN;
   DATA p;
      x = 34.2;
   RUN;  
%MEND check_condition;  
%check_condition(11);

10.4.3 %END

%MACRO test(finish);
   %LET i = 1;
   %DO %WHILE (&i <&finish);
      %PUT the value of i is &i;
      %LET i=%EVAL(&i+1);
   %END;
%MEND test;
%test(5)

11. Date & Time

Input Date        | Date width | Informat
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
03/11/2014        | 10	       | mmddyy10.
03/11/14          | 8	       | mmddyy8.
December 11, 2012 | 20	       | worddate20.
14mar2011         | 9	       | date9.
14-mar-2011       | 11	       | date11.
14-mar-2011       | 15	       | anydtdte15.

11.1 Date Informat & Output Format

DATA TEMP;
	INPUT @2 Date1 date11. @13 Date2 anydtdte15. @24 Date3 mmddyy10. ;
	FORMAT Date1 DATE9. Date2 date11. Date3 DATE9..;
	DATALINES;
	02-mar-2012 3/02/2012 3/02/2012
	;
PROC PRINT DATA = TEMP;
RUN;

Raw

SAS_notes.md

SAS (Statistical Analysis System)

Data entry, retrieval, and management
Report writing and graphics
Statistical and mathematical analysis

FORMATTING

SAS is more relaxed than other coding languages when it comes to capitalization, indentation, and line breaks.

SAS is not case sensitive; uppercase and lowercase letters are recognized as the same, even for variable names.
Indentations or spacing before a statement are ignored.
Extra lines between statements are ignored.
Multiple statements on the same line are okay, provided they are separated by a semicolon.
A statement can span more than one line, as long as it ends with a semicolon.

Comments

* Single Line Comment;

/* Multi-line Comment*/

Note: If you want to comment out one or more lines in your program, highlight the text you want to turn in the comments, then press Ctrl + /. To uncomment those lines, highlight the commented lines and press Ctrl + Shift + /.

RULES FOR SAS STATEMENTS

SAS statements may begin in any column of the line.
SAS statements end with a semicolon (;).
Some SAS statements consist of more than one line of commands.
A SAS statement may continue over more than one line.3

Note: One or more blanks should be placed between items in SAS statements. If the items are special characters such as '=', '+', '$', the blanks are not necessary.

RULES FOR SAS NAMES

SAS variable/variable names must be between 1 and 32 characters long.
The first character must be a letter or an underscore.
Characters after the first may be letters, digits or underscores.

Data Step (Data Creation and Manipulation)

defining the variables
read input files
assign values to the variables,
creating new variables,
merging two data sets
formatting and labeling variables
assignment of missing values.

Syntax

data dataset_name;
 INFILE "filename.csv" DSD MISSOVER FIRSTOBS=2;
 INPUT variable_name variable_type column(s);
 SET <dataset name> <OBS=n>;
RUN;

INFILE AND INPUT STATEMENT

INFILE is used to read external files (such as mainframe file, text files, comma delimited files etc).

Infile options

DLM=

The dlm= option can be used to specify the delimiter that separates the variables in your raw data file.
For example, dlm=’,’ indicates a comma is the delimiter (e.g., a comma separated file, .csv file). Or, dlm=’09’x indicates that tabs are used to separate your variables (e.g., a tab separated file).

DSD

The dsd option has 2 functions.

First, it recognizes two consecutive delimiters as a missing value. For example, if your file contained the line 20,30,,50 SAS will treat this as 20 30 50 but with the dsd option SAS will treat it as 20 30 . 50 , which is probably what you intended.
Second, it allows you to include the delimiter within quoted strings. For example, you would want to use the dsd option if you had a comma separated file and your data included values like "George Bush, Jr.". With the dsd option, SAS will recognize that the comma in "George Bush, Jr." is part of the name, and not a separator indicating a new variable.

FIRSTOBS=

This option tells SAS what on what line you want it to start reading your raw data file. If the first record(s) contains header information such as variable names, then set firstobs=n where n is the record number where the data actually begin.
For example, if you are reading a comma separated file or a tab separated file that has the variable names on the first line, then use firstobs=2 to tell SAS to begin reading at the second line (so it will ignore the first line with the names of the variables).

MISSOVER

This option prevents SAS from going to a new input line if it does not find values for all of the variables in the current line of data.
For example, you may be reading a space delimited file and that is supposed to have 10 values per line, but one of the line had only 9 values. Without the missover option, SAS will look for the 10th value on the next line of data. If your data is supposed to only have one observation for each line of raw data, then this could cause errors throughout the rest of your data file. If you have a raw data file that has one record per line, this option is a prudent method of trying to keep such errors from cascading through the rest of your data file.

OBS=

Indicates which line in your raw data file should be treated as the last record to be read by SAS.
This is a good option to use for testing your program.
For example, you might use obs=100 to just read in the first 100 lines of data while you are testing your program. When you want to read the entire file, you can remove the obs= option entirely.

INPUT is used to define names and order of variables for the SAS dataset. SET reads an existing SAS dataset or concatenate SAS datasets

Proc Step (statistical analysis)

Every data step begins with the word DATA and every proc step begins with the word PROC. For the most part, all data and proc steps should end with a RUN statement. The RUN statement tells SAS to execute the preceding block.

IF/THEN; ELSE; STATEMENT

Used for conditional checking. Used in a data step.

IF LANG='Spanish' or LANG='French' THEN
NEWLANG='NotEngl';
ELSE
NEWLANG='English';

SUBSETTING "IF" STATEMENT

To subset, or take a portion of the data set

DATA FORGNER;
  IF LANG= ‘ENGLISH’;
  IF TAX < 20000 THEN DELETE;
RUN;

In the above example it will select all observations where LANG equals ‘ENGLISH’ and where TAX greater than or equal to 20000.

LIBNAME STATEMENT

Associates a libref with a SAS library. It’s kind of location pointer.
This is generally used to save the SAS dataset in a permanent location. When a libref is not associated with SAS dataset, SAS assumes it is created or read from SAS work area which is temporary in nature.

LIBNAME EMPLIB1 'C:\data\SAS\EMPDATA';

The path 'C:\data\SAS\EMPDATA' will be assigned to libref EMPLIB1. Say there is a SAS dataset EMPDAT in that library, it can be access as,

DATA EMPFILE1;
 SET EMPLIB1.EMPDAT;
RUN;

MERGE STATEMENT

Joins corresponding observations from two or more SAS data sets. Two types one to one, match merging.

DATA NEW;
 MERGE NAMES
 SURVY;
 BY NAME; * Sorting By Name
RUN;

Input data sets must be sorted by the same BY variables before you can merge them. Usually merge statement followed by sort procedure.

SAS BUILT IN PROCEDURES

PROC CONTENTS

Lists structure of a SAS dataset. It is very useful when you receive an existing SAS file from someone and you need to start working on it. Before you start you may want to see the structure of this file.

PROC CONTENTS DATA = <SAS dataset name> <options> ;
RUN;

PROC PRINT

Lists data as a table of observations.
Large datasets are very hard to visualize, so it is useful when you want to print couple of observations for some variables.

PROC PRINT DATA=FILE1;
  VAR NAME AGE GENDER;
RUN;

PROC DELETE

Delete one or more SAS dataset. It is good programming practice to delete unwanted SAS datasets to increase SAS work space.

PROC DELETE DATA = <Data set name(s)>;
RUN;

PROC SORT

To rearrange the observations in SAS datasets according to one or more variables.

PROC SORT DATA = ORIG OUT = SORTORIG;
  BY GENDER DESCENDING AGE;
RUN;

Sorts file first in ascending order of GENDER and within those categories, 1 (Male) and 2 (Female), the cases are further sorted in descending order of AGE.
PROC SORT is used most often for sorting a data set so that other SAS procedures can process that data set in subsets using BY statements. Data sets must also be sorted before they can be merged or updated.

PROC FREQ

The FREQ procedure produces one way to n-way frequency tables and creates distribution reports of the variable values. This procedure used for categorical data.

PROC MEANS

The MEANS procedure summarizes data. It computes descriptive statistics for variables within groups of observations or across all observations. This procedure generally used for continuous data.

VAR salary;
  BY department;
  OUTPUT OUT = SumEmpFile;
RUN;

IMPORTANT SAS FUNCTIONS

DATE(): produces the current date as SAS date, value representing the number of days between January 1, 1960 and the current date.

E.g., Today_Date = DATE();
Then a SAS data format can be used to read the date. Refer to PUT function for details.

DAY (DATE): It returns an Integer representing the day of the month from SAS DATE value.

E.g. Day1 = DAY(Today_Date);
The value of Day1 will be 25.

MONTH (DATE): It returns the numeric value representing the month from SAS DATE value.

E.g. Month1 = MONTH(Today_Date);
The value of Month1 will be 8.

YEAR (DATE): It returns the 4 digits numeric value representing the year from SAS DATE value.

E.g. Year1 = YEAR(Today_Date);
The value of Year1 will be 2010.

LENGTH(argument): Calculate length of a variable or a constant. Default length of numeric variable is 8.

E.g.

String2 = 'Rocky Hill';
Len1 = LENGTH(String2);

The value of Len1 will be 10 which is the length of the string 'Rocky Hill'.

SUM(variable-1, variable-2, ….,variable-n): Calculates sum of non-missing arguments.

E.g.

Num1 = 5;
Num2 = 2.5;
Sum1 = SUM(Num1,Num2);

The value of Sum1 will be 7.5

MIN(variable-1, variable-2, ….,variable-n): Returns the value of the lowest of its non-missing arguments.

E.g. Min1 = MIN(9,0,-2,45);
The value of Min1 will be -2

MAX(variable-1, variable-2, ….,variable-n): Returns the value of the lowest of its non-missing arguments.

E.g. Max1 = MAX(9,0,-2,45);
The value of Max1 will be 45

INT(argument): This function taken one numeric argument and returns integer portion of the argument.

E.g. Int1 = INT(332.502);
The value of Int1 will be 332

CEIL(argument): The CEIL function returns the smallest integer that is greater than or equal to the argument.

E.g.

Ceil1 = CEIL(9.231);
Ceil2 = CEIL(-9.231);

The value of Ceil1 will be 10 and Ceil2 will be -9.

FLOOR(argument): The FLOOR function takes one numeric argument and returns the largest integer that is less than or equal to the argument.

E.g.

Floor1=FLOOR(9.231);
Floor2=FLOOR(-9.231);

The value of Floor1 will be 9 and Floor2 will be -10.

PUT (source,format): Write the value of source with a specified format. PUT function is used to convert numeric value to a character value and also to format numeric and date variables into various formats.

E.g. Date_Formatted = PUT(Today_Date, DDMMYY10.);
Here DDMMYY10. is the DATE format. Please refer to SAS online documentation to get a list of various formats.
The value of Date_Formatted will be 25/08/2010.

UPCASE(argument): Converts all letters in an argument to uppercase.

E.g.

String2='Rocky Hill'; 
Upcase1=UPCASE(String2);

The value of Upcase1 will be 'ROCKY HILL'.

LOWCASE(argument): Converts all letters in an argument to uppercase.

E.g.

String2='Rocky Hill';
Lowcase1=LOWCASE(String2);

The value of Lowcase1 will be 'rocky hill'.

CATS(variable-1, variable-2, ….,variable-n) and ||: It removes leading and trailing blanks, and concatenates the arguments. || (two pipe signs) also can be use to concatenate multiple strings. Unlike CATS function || does not remove leading and trailing blanks.

E.g.

Var1="United ";
       Var2=" States";
       Con1=CATS(Var1,Var2,"of America");
       Con2='Var1 || Var2 || "of America"';

The value of Con1 will be ‘UnitedStatesof America’ where as value of Con2 will be ‘United Statesof America’.

TRANWRD(source, target, replacement): TRANWRD replaces or removes all occurrences of a word in a string.

E.g.

OldText='This is OLD';
OldWord='OLD';
NewWord='NEW';
NewText=TRANWRD(OldText,OldWord,NewWord);

The value of NewText will be 'This is NEW'.

SUBSTR (argument, position<, n>) : It returns the portion of the string specified in the argument from the position up to the number of characters specified by n.

E.g.

String1='THE HARTFORD';
SubString1=SUBSTR(String1,5,4);

The value of SubString1 will be 'HART'.

TRIM (argument): It copies a character argument removing any trailing elements.

E.g.

String3='SAS Institute ';
Trim1=trim(String3);

The value of Trim1 will be 'SAS Institute'.

Reading Data

ASCII(Text) Data Set

data TEMP; 
 infile 
 '/folders/myfolders/sasuser.v94/TutorialsPoint/emp_data.txt'; 
 input empID empName $ Salary Dept $ DOJ date9. ;
 format DOJ date9.;
 run;
PROC PRINT DATA = TEMP;
RUN;

Delimited Data

data TEMP; 
 infile '/folders/myfolders/sasuser.v94/TutorialsPoint/emp.csv' dlm=","; 
 input empID empName $ Salary Dept $ DOJ date9. ;
 format DOJ date9.;
 run;
PROC PRINT DATA = TEMP;
RUN;

Excel Data

FILENAME REFFILE "/folders/myfolders/TutorialsPoint/emp.xls" TERMSTR = CR;

PROC IMPORT DATAFILE = REFFILE DBMS = XLS OUT = WORK.IMPORT;
 GETNAMES = YES;
RUN;

PROC PRINT DATA = WORK.IMPORT 
RUN;

Author

mepsrajput commented Nov 14, 2021

Hi Pradeep Singh,

I'm currently using SAS Enterprise Guide, and it seems I'm not able to create this .md (markdown) kind of file. May I know how did you create this in SAS? It would be very helpful for my learning, thank you for your time on this.

Cheers, AJ

Hi @ambareesh-j The .md can't be created in SAS. This file is created in GitHub. For using the code, simply copy-paste in SAS, and save it as .SAS file (if needed).