4. Introduction to the SAS Environment
1. SAS Introduction
2. SAS Programs
3. SAS Data Sets and Data Libraries
4. Creating SAS Data Sets
5. What is SAS?
• SAS is a comprehensive statistical software system which
integrates utilities for storing, modifying, analyzing, and
graphing data.
• SAS runs on both Windows and UNIX platforms
• SAS is used in a wide range of industries such as
healthcare, education, financial services, life sciences,…
• Check out the webpage to learn more
• http://www.sas.com/
11. SAS User Interface
[Screenshot of the SAS windowing environment with these labeled parts]
• Editor Window
• Log Window
• Explorer Window
• Output Window (not shown)
• Results Window (not shown)
• Tool bar, similar to those in Windows applications
– Run button: click to run SAS code
– Help button: click for SAS help
– New Window button
– Save button
14. Libraries Folder
• The Libraries folder lists the available SAS libraries.
• The Work folder holds the data sets created during the current SAS session,
whether by entering data or by running SAS programs.
15. Log Window
The Log Window contains a record
of all commands submitted to
SAS and shows errors in the
commands.
16. Output Window
The Output Window contains output
based on SAS programs submitted in the
Editor Window.
17. Results Window
The Results Window shows a
listing of SAS programs
that have been submitted
in the order that they were
submitted.
Click on any procedure to
view all output parts of the
procedure and click on any
individual part to view the
actual output.
19. SAS Programs
• File extension - .sas
• Editor window has four uses:
– Access and edit existing SAS
programs
– Write new SAS programs
– Submit SAS programs for
execution
– Save SAS programs
• SAS program
– Sequence of steps that the user
submits for execution
• Submitting SAS programs
– Entire program
– Selection of the program
• 2 Basic steps in SAS programs:
– Data Steps
• Typically used to create SAS
datasets and manipulate
data,
• Begins with DATA statement
– Proc Steps
• Typically used to process
SAS data sets
• Begins with PROC statement
• The end of a data or proc step
is indicated by:
– RUN statement – most steps
– QUIT statement – some steps
– Beginning of another step (DATA
or PROC statement)
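As a minimal sketch of the two kinds of steps, the program below has one DATA step followed by one PROC step, each ended by RUN. The data set name work.trips and its values are made up for illustration.

```sas
/* DATA step: build a small data set in the temporary WORK library */
data work.trips;
    input distance time;      /* read two numeric variables */
    datalines;
120 2
90 1.5
200 4
;
run;

/* PROC step: print the data set that was just created */
proc print data=work.trips;
run;
```

Submitting the whole program runs both steps in order; highlighting only the PROC PRINT step and submitting it runs that selection alone.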
20. • SAS Data Libraries
– Contain SAS data sets
– Identified by assigning a library
reference name – libref
– Temporary
• Work library
• SAS data files are deleted
when session ends
• Library reference name not
necessary
– Permanent
• SAS data sets are saved
after session ends
• SASUSER library
• You can create and access
your own libraries
SAS Data Sets and Data Libraries
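The sketch below shows one way a libref might be assigned with a LIBNAME statement. The libref mylib and the folder C:\sasdata are hypothetical; substitute a directory that exists on your system.

```sas
/* Hypothetical folder: replace with a directory that exists on your system */
libname mylib 'C:\sasdata';

/* Two-level name (libref.dataset): stored permanently in the mylib library */
data mylib.class_copy;
    set sashelp.class;
run;

/* One-level name: stored in WORK and deleted when the session ends */
data class_temp;
    set sashelp.class;
run;
```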
22. 1. Data Set Information
2. Data Set Manipulation
3. Combining Data Sets
A. Concatenating/Appending
B. Merging
Working With SAS Data Sets
23. • Proc Contents
– Output contains a table of contents of the specified data set
– Data Set Information
• Data set name
• Number of observations
• Number of Variables
– Variable Information
• Type (numeric or character)
• Length
– Syntax:
PROC CONTENTS DATA=input_data_set;
RUN;
Data Set Information
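For example, the following sketch lists the contents of SASHELP.CLASS, a sample data set that ships with SAS. The VARNUM option is an optional addition that lists variables in their order within the data set rather than alphabetically.

```sas
/* Describe the data set: name, number of observations, variables, types, lengths */
proc contents data=sashelp.class varnum;
run;
```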
24. • Create a new SAS data set using an existing SAS data set as input
– Specify name of the new SAS data set after the DATA statement
– Use SET statement to identify SAS data set being read
– Syntax:
DATA output_data_set;
SET input_data_set;
<additional SAS statements>;
RUN;
– By default the SET statement reads all observations and variables from the
input data set into the output data set.
Data Set Manipulation
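A short sketch of this pattern, using the SASHELP.CLASS sample data set as input. The output name work.teens and the WHERE and KEEP statements are illustrative choices for the "additional SAS statements" slot.

```sas
data work.teens;
    set sashelp.class;        /* by default reads all observations and variables */
    where age >= 13;          /* keep only observations with age 13 or above */
    keep name sex age;        /* keep only these variables in the output */
run;
```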
25. • Assignment Statements
– Evaluate an expression
– Assign resulting value to a variable
– General Form: variable = expression;
– Example: miles_per_hour = distance/time;
• SAS Functions
– Perform arithmetic functions, compute simple statistics, manipulate
dates, etc.
– General Form: variable=function_name(argument1, argument2,…);
– Example: Time_worked = sum(Day1,Day2, Day3, Day4, Day5);
Data Set Manipulation
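The two examples above can be tried in a small DATA step such as this sketch. The input values are invented, and the extra variable total_plus is added only to contrast the SUM function with the + operator.

```sas
data work.timesheet;
    input Day1 Day2 Day3 Day4 Day5 distance time;
    /* Assignment statement: evaluate the expression, store the result */
    miles_per_hour = distance / time;
    /* SUM function ignores missing values (Day4 is missing below) */
    Time_worked = sum(Day1, Day2, Day3, Day4, Day5);
    /* The + operator returns missing if any operand is missing */
    total_plus = Day1 + Day2 + Day3 + Day4 + Day5;
    datalines;
8 7.5 8 . 6 120 2
;
run;

proc print data=work.timesheet;
run;
```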
26. • Conditional Processing
– Uses IF-THEN-ELSE logic
– General Form: IF <expression1> THEN <statement>;
ELSE IF <expression2> THEN <statement>;
ELSE <statement>;
– <expression> is a true/false statement, such as:
• Day1=Day2, Day1 > Day2, Day1 < Day2
• Day1+Day2=10
• Sum(day1,day2)=10
• Day1=5 and Day2=5
Data Set Manipulation
27. • Conditional Processing
Symbolic    Mnemonic          Example
=           EQ                IF region = 'Spain';
~= or ^=    NE                IF region ne 'Spain';
>           GT                IF rainfall > 20;
<           LT                IF rainfall lt 20;
>=          GE                IF rainfall ge 20;
<=          LE                IF rainfall <= 20;
&           AND               IF rainfall ge 20 & temp < 90;
| or !      OR                IF rainfall ge 20 OR temp < 90;
            IN                IF region IN ('Rain', 'Spain', 'Plain');
            IS NOT MISSING    WHERE region IS NOT MISSING;
            BETWEEN AND       WHERE region BETWEEN 'Plain' AND 'Spain';
            CONTAINS          WHERE region CONTAINS 'ain';
Note: IS NOT MISSING, BETWEEN-AND, and CONTAINS are valid only in WHERE
expressions, not in IF statements.
Data Set Manipulation
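A minimal sketch of IF-THEN-ELSE logic, using the SASHELP.CLASS sample data set. The variable age_group and the cut-off ages are illustrative assumptions.

```sas
data work.class_groups;
    set sashelp.class;
    length age_group $ 8;     /* declare the length so no value is truncated */
    if age < 13 then age_group = 'Child';
    else if age <= 15 then age_group = 'Teen';
    else age_group = 'Older';
run;

proc print data=work.class_groups;
    var name age age_group;
run;
```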
28. • PROC SORT sorts data according to specified variables
• General Form:
PROC SORT DATA=input_data_set <options>;
BY Variable1 Variable2;
RUN;
• Sorts data by Variable1 and then, within Variable1, by Variable2
• By default, SAS sorts data in ascending order
– Numbers low to high
– A to Z
• Use the DESCENDING keyword before a variable in the BY statement for numbers high to low and letters Z to A
– BY City DESCENDING Population;
– SAS sorts data first by City A to Z and then by Population high to low
Data Set Manipulation
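As a sketch, the step below sorts the SASHELP.CLASS sample data set by sex (A to Z) and then by age from high to low. The OUT= option writes the sorted copy to a new data set (the name work.class_sorted is illustrative) instead of overwriting the input.

```sas
proc sort data=sashelp.class out=work.class_sorted;
    by sex descending age;    /* DESCENDING applies only to the variable after it */
run;
```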
29. • Merging Data Sets
– One-to-One Match Merge
• A single record in a data set corresponds to a single record in all other
data sets
• Example: Patient and Billing Information
– One-to-Many Match Merge
• Matching one observation from one data set to multiple observations in
other data sets
• Example: County and State Information
– Note: Each data set must be sorted by the matching (BY) variables before
a match merge can be done (PROC SORT)
Combining Data Sets
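A one-to-one match merge might look like the following sketch. The data sets patients and billing and their values are invented for illustration; both are sorted by the BY variable before the MERGE.

```sas
data work.patients;
    input patient name $;
    datalines;
101 Ann
102 Bob
103 Cal
;
run;

data work.billing;
    input patient amount;
    datalines;
103 250
101 100
102 75
;
run;

/* Both inputs must be sorted by the BY variable before merging */
proc sort data=work.patients; by patient; run;
proc sort data=work.billing;  by patient; run;

data work.patient_billing;
    merge work.patients work.billing;
    by patient;               /* match observations on patient */
run;
```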
30. • Concatenating (or Appending)
• Stacks each data set upon the other
• If one data set does not have a variable that the other datasets do, the
variable in the new data set is set to missing for the observations from
that data set.
• General Form:
DATA output_data_set;
SET data1 data2;
run;
• PROC APPEND may also be used
Combining Data Sets
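The sketch below stacks two subsets of the SASHELP.CLASS sample data set. The names boys, girls, and everyone are illustrative; weight is dropped from the first subset to show how a variable absent from one input ends up missing.

```sas
data work.boys;
    set sashelp.class;
    where sex = 'M';
    drop weight;              /* this data set has no weight variable */
run;

data work.girls;
    set sashelp.class;
    where sex = 'F';
run;

/* Stack the two data sets; weight is missing for rows that came from boys */
data work.everyone;
    set work.boys work.girls;
run;
```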
33. • PROC PRINT is used to print data to the output window
• By default, prints all observations and variables in the SAS data set
• General Form:
PROC PRINT DATA=input_data_set <options>;
<optional SAS statements>;
RUN;
• Some Options
– input_data_set (OBS=n): specifies the number of observations to be printed in the output
– NOOBS: suppresses printing of the observation number
– LABEL: prints variable labels instead of variable names
Print Procedure
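A sketch combining the three options above on the SASHELP.CLASS sample data set. The VAR and LABEL statements, and the label text, are illustrative choices for the optional statements.

```sas
proc print data=sashelp.class (obs=5) noobs label;
    var name age height;                  /* print only these variables */
    label height = 'Height (inches)';     /* shown because LABEL is specified */
run;
```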
34. • PROC PLOT is used to create basic scatter plots of the data
• Use PROC GPLOT or PROC SGPLOT for more sophisticated plots
• General Form:
PROC PLOT DATA=input_data_set;
PLOT vertical_variable * horizontal_variable/<options>;
RUN;
• By default, SAS uses letters to mark points on plots
– A for a single observation, B for two observations at the same point, etc.
• To specify a different character to represent a point
– PLOT vertical_variable * horizontal_variable = '*';
• To specify a third variable to use to mark points
– PLOT vertical_variable * horizontal_variable = third_variable;
• To plot more than one variable on the vertical axis
– PLOT vertical_variable1 * horizontal_variable='2'
vertical_variable2 * horizontal_variable='1'/OVERLAY;
Plot Procedure
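The three PLOT variations above could be tried on the SASHELP.CLASS sample data set as in this sketch. The choice of variables and plotting characters is illustrative.

```sas
proc plot data=sashelp.class;
    plot height * age = '*';              /* mark every point with an asterisk */
    plot height * age = sex;              /* mark points with the first character of sex */
    plot height * age = 'H'
         weight * age = 'W' / overlay;    /* two vertical variables on one plot */
run;
quit;
```

PROC PLOT keeps running after RUN so that more PLOT statements can be submitted; QUIT ends the procedure.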
35. • PROC UNIVARIATE is used to examine the distribution of data
• Produces summary statistics for a single variable
– Includes mean, median, mode, standard
deviation, skewness, kurtosis, quantiles, etc.
• General Form:
PROC UNIVARIATE DATA=input_data_set <options>;
VAR variable1 variable2 variable3;
RUN;
• If the VAR statement is not used, summary statistics will be produced for all
numeric variables in the input data set.
• Options include:
– PLOT – produces Stem-and-leaf plot, Box plot, and Normal probability plot;
– NORMAL – produces tests of Normality
Univariate Procedure
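As a sketch, the step below requests both options for two variables of the SASHELP.CLASS sample data set.

```sas
proc univariate data=sashelp.class plot normal;
    var height weight;        /* without VAR, all numeric variables are analyzed */
run;
```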
36. • PROC MEANS produces summary statistics, similar to the Univariate procedure
• General Form:
PROC MEANS DATA=input_data_set <options>;
<Optional SAS statements>;
RUN;
• With no options or optional SAS statements, the Means procedure will print out
the number of non-missing values, mean, standard deviation, minimum, and
maximum for all numeric variables in the input data set
• Optional SAS Statements
– VAR Variable1 Variable2;
• Specifies which numeric variables statistics will be produced for
– BY Variable1 Variable2;
• Calculates statistics for each combination of the BY variables
– Output out=output_data_set;
• Creates data set with the default statistics
Means Procedure
37. • Options
– Statistics Available
– Note: The default confidence level for the confidence limits is 95% (ALPHA=0.05).
Use the ALPHA= option to specify a different level.
Keyword          Statistic
CLM              Two-Sided Confidence Limits
LCLM             Lower Confidence Limit
UCLM             Upper Confidence Limit
CSS              Corrected Sum of Squares
USS              Uncorrected Sum of Squares
CV               Coefficient of Variation
KURTOSIS         Kurtosis
SKEWNESS         Skewness
MAX              Maximum Value
MIN              Minimum Value
RANGE            Range
MEAN             Mean
MEDIAN (or P50)  Median
N                Number of Non-missing Values
NMISS            Number of Missing Values
STDDEV           Standard Deviation
STDERR           Standard Error of the Mean
VAR              Variance
SUM              Sum
SUMWGT           Sum of the Weights
T                Student's t
PROBT            Probability for Student's t
P1               1% Quantile
P5               5% Quantile
P10              10% Quantile
Q1 (P25)         25% Quantile
Q3 (P75)         75% Quantile
P90              90% Quantile
P95              95% Quantile
P99              99% Quantile
Means Procedure
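A sketch that puts the optional statements and a few statistic keywords together on the SASHELP.CLASS sample data set. The output names class_by_sex and class_stats are illustrative, and ALPHA=0.10 is chosen only to show a 90% confidence level.

```sas
/* BY-group processing requires the input to be sorted by the BY variable */
proc sort data=sashelp.class out=work.class_by_sex;
    by sex;
run;

proc means data=work.class_by_sex n mean median clm alpha=0.10;
    var height weight;                /* variables to summarize */
    by sex;                           /* separate statistics for each sex */
    output out=work.class_stats;      /* data set with the default statistics */
run;
```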
38. • PROC FREQ is used to generate frequency tables
• The most common use is to create a table showing the distribution of categorical
variables
• General Form:
PROC FREQ DATA=input_data_set;
TABLE variable1*variable2*variable3/<options>;
RUN;
• Options
– LIST – prints cross tabulations in list format rather than grid
– MISSING – specifies that missing values should be included in the
tabulations
– OUT=output_data_set – creates a data set containing frequencies, list
format
– NOPRINT – suppresses printing in the output window
• Use a BY statement (on data sorted by the BY variable) to get percentages within each category of a variable
Freq Procedure
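For instance, this sketch cross-tabulates two variables of the SASHELP.CLASS sample data set in list format and saves the counts. The output name work.sex_age_freq is illustrative.

```sas
proc freq data=sashelp.class;
    /* Cross-tabulate sex by age, print as a list, save counts to a data set */
    tables sex * age / list missing out=work.sex_age_freq;
run;
```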
40. • Proc SQL is the SAS implementation of SQL
• Proc SQL is a powerful SAS procedure that combines the functionality
of the SAS data step with the SQL language
• Proc SQL can sort, subset, merge and summarize data – all at once
• Proc SQL can combine standard SQL functions with virtually all SAS
functions
• Proc SQL can work remotely with RDBMS such as Oracle
Introduction - What is PROC SQL
41. PROC SQL – What can it do?
– Perform a query, using the SELECT statement
– Save a query result into a SAS dataset, using the CREATE TABLE
statement
– Save the query itself, using the CREATE VIEW statement
– Sort a dataset
– Merge two or more datasets in a number of ways
– Import a dataset from Oracle Clinical to SAS
– Enter new records into a SAS dataset
– Modify or edit a SAS dataset
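The last two items can be sketched with the INSERT and UPDATE statements. The example works on a WORK copy of the SASHELP.CLASS sample data set; the new record and the updated value are invented for illustration.

```sas
/* Work on a copy so the sample data set itself is not changed */
data work.class;
    set sashelp.class;
run;

proc sql;
    /* Enter a new record */
    insert into work.class (name, sex, age, height, weight)
        values ('Zoe', 'F', 12, 58.0, 90.0);

    /* Modify an existing record */
    update work.class
        set weight = 92.0
        where name = 'Zoe';
quit;
```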
42. PROC SQL - Why
• The Advantage of using SQL
– Combined functionality
– Faster for smaller tables
– SQL code is more portable for non-SAS applications
– Does not require presorting
– Does not require common variable names to join on (the join columns
must be of the same type)
43. • A SELECT statement performs a query; on its own it does not create a dataset.
• The simplest PROC SQL code needs three statements: PROC SQL, SELECT … FROM, and QUIT
• By default the query result is printed; use the NOPRINT option to
suppress printing
• Begin with PROC SQL and end with QUIT; not RUN;
• At least one SELECT … FROM statement is needed
Performing Query – SELECT
Statement
44. PROC SQL;
SELECT *
FROM VITALS;
QUIT;
Performing Query – SELECT
Statement
To select all the variables
use ‘*’ after SELECT
statement
45. PROC SQL;
SELECT Patient, pulse
FROM VITALS;
QUIT;
Performing Query – SELECT
Statement
To select only particular variable(s) write down the variable names after SELECT
statement. Variable names should be separated by commas.
46. PROC SQL;
SELECT DISTINCT Patient
FROM VITALS;
QUIT;
Performing Query – SELECT
Statement
DISTINCT returns only the unique observations, removing duplicates from the query result.
47. PROC SQL ;
SELECT *
FROM Vitals
ORDER BY date;
QUIT;
Ordering/Sorting Query Results
• SELECT * means we select all variables from dataset VITALS
• Put ORDER BY after FROM.
Sorting by Date
48. PROC SQL;
SELECT *
FROM vitals
WHERE Name CONTAINS 'J';
QUIT;
Subsetting:
- Character searching in WHERE
• Always put WHERE after FROM
• CONTAINS in a WHERE clause works only for character variables
Print observations with name
containing ‘J’.
49. PROC SQL;
SELECT *
FROM vitals
WHERE Name LIKE '%o%';
QUIT;
Subsetting
- Character searching in WHERE
• LIKE in a WHERE clause works only for character variables
• % matches any sequence of characters, including none
Print observations with name
containing ‘o’ anywhere.
50. • In SELECT, the results of a query are converted to an output object (printing).
• Query results can also be stored as data.
• The CREATE TABLE statement creates a table with the results of a query.
• The CREATE VIEW statement stores the query itself as a view. Either way, the
data identified in the query can be used in later SQL statements or in other SAS
steps.
Creating New Data
51. PROC SQL;
CREATE TABLE bp
AS SELECT
patient, date, pulse
FROM Vitals
WHERE temp>98.5;
QUIT;
Creating New Data - Create Table
The CREATE TABLE … AS … statement creates a new
table from an existing table.
With SELECT *, the statements below
copy all the variables to
the new dataset
PROC SQL;
CREATE TABLE bp
AS SELECT *
FROM Vitals
WHERE temp>98.5;
QUIT;
52. Creating New Data - Create Table
We can also assign a different variable name, label, length, and format
PROC SQL;
CREATE TABLE bp
AS SELECT
patient AS Patient LABEL='Subject number' LENGTH =5,
date AS Date LABEL='Date of Expt' FORMAT=WORDDATE8.,
pulse,
temp
FROM Vitals
WHERE temp>98.5;
QUIT;
53. PROC SQL;
CREATE VIEW bp
AS SELECT patient, date, pulse, temp
FROM Vitals
WHERE temp>98.5;
QUIT;
Creating New Data - Create View
• First step: creating a view. No output is produced.
• When a table is created, the query is executed and the resulting data is stored
in a file. When a view is created, the query itself is stored in the file. The data is
not accessed at all in the process of creating a view.
54. • The order of the clauses is important
• CASE … END AS should be placed between SELECT and FROM
• Use WHEN … THEN … ELSE … to redefine values
• The new variable GENDER is created from PATIENT.
Case Logic
- reassigning/recategorize
PROC SQL;
CREATE TABLE BP AS
SELECT Patient, Pulse,
CASE Patient
WHEN 101 THEN 'Male'
WHEN 102 THEN 'Female'
WHEN 103 THEN 'Female'
ELSE 'Male'
END AS Gender
FROM Vitals;
QUIT;
New Variable
Source variable
58. • No prior sorting is required, one advantage over the DATA step MERGE
• Use a comma (,) to separate the two datasets in FROM
• Without WHERE, every possible combination of rows from the two tables is
produced (a Cartesian product), and all columns are included
Join Tables (Merge datasets)
- Inner Join: Using WHERE
PROC SQL;
CREATE TABLE new AS
SELECT dosing.patient,
dosing.date,
dosing.med,
vitals.pulse,
vitals.temp
FROM dosing, vitals
WHERE dosing.patient=vitals.patient
AND dosing.date=vitals.date;
QUIT;
60. The resulting dataset contains every observation from the DOSING dataset, plus the
matching values from VITALS; VITALS columns are missing where there is no match.
Join Tables (Merge datasets)
- Left Joins using ON
PROC SQL;
CREATE TABLE new1 AS
SELECT dosing.patient,
dosing.date,
dosing.med,
vitals.pulse,
vitals.temp
FROM dosing LEFT JOIN vitals
ON dosing.patient=vitals.patient
AND dosing.date=vitals.date;
QUIT;
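62. The resulting dataset contains every observation from the VITALS dataset, plus the
matching values from DOSING; DOSING columns (including patient and date as selected
here) are missing where there is no match.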
Join Tables (Merge datasets)
- Right Joins using ON
PROC SQL;
CREATE TABLE new1 AS
SELECT dosing.patient,
dosing.date,
dosing.med,
vitals.pulse,
vitals.temp
FROM dosing RIGHT JOIN vitals
ON dosing.patient=vitals.patient
AND dosing.date=vitals.date;
QUIT;
64. The resulting dataset contains every observation that appears in at least one of the
two datasets.
Join Tables (Merge datasets)
- Full Joins using ON
PROC SQL;
CREATE TABLE new1 AS
SELECT dosing.patient,
dosing.date,
dosing.med,
vitals.pulse,
vitals.temp
FROM dosing FULL JOIN vitals
ON dosing.patient=vitals.patient
AND dosing.date=vitals.date;
QUIT;
66. SQL Functions
♦ PROC SQL supports almost all the functions available in the SAS DATA
step; they can be used in a PROC SQL SELECT statement
♦ Common Functions:
◘ COUNT
◘ DISTINCT
◘ MAX
◘ MIN
◘ SUM
◘ AVG
◘ VAR
◘ STD
◘ STDERR
◘ NMISS
◘ RANGE
◘ SUBSTR
◘ LENGTH
◘ UPPER
◘ LOWER
◘ CONCAT
◘ ROUND
◘ MOD
67. PROC SQL functions
PROC SQL;
SELECT avg(Age) AS mean,
std(Age) AS sd,
min(Age) AS min,
max(Age) AS max,
count(Age) AS count,
N (Age) AS Count
FROM sashelp.class;
quit;
68. PROC SQL functions
PROC SQL;
SELECT sex,
avg(Age) AS mean,
std(Age) AS sd,
min(Age) AS min,
max(Age) AS max,
count(Age) AS count,
N (Age) AS Count
FROM sashelp.class
GROUP BY Sex;
quit;
69. /*Deleting rows*/
PROC SQL;
DELETE
FROM class
WHERE age le 13;
QUIT;
Editing Data – Deleting rows and
Dropping columns
/*Dropping variables*/
PROC SQL;
CREATE TABLE New (DROP=age) AS
SELECT *
FROM Class;
QUIT;
• Columns can be removed either by listing only the wanted columns in SELECT or by using the DROP= option on the created table