Graduation - Retention Database
University Planning and Analysis
05/07/2002
Overview
The purpose of this revised system is to make improvements to an already
sound Attrition Database. The principle improvements add more information about
a student at each time point, in a more efficient storage format, at more time
points per year. There are two files that make up the database, a demographic
file where there is only one record per student in a cohort, and a term file
where there is one record for each time point in each student’s undergraduate
career.
This discussion will assume familiarity with the source datasets and a
student’s progress through the R&R system as coded in the data.
Cohorts are created from the SDFX and then merged with other information
from the term process to produce demographic dataset and the term dataset. The
term information comes from the “SUPERRAT” files taken at census and at the end
of each semester. The STUSEM started in 1992 and we can go back to 1989 by
using the SUPERRAT, therefore SUPERRAT is the choice. The R&R and the ACS
applications development teams assure us that the data is equivalent (the STUSEM
just has other data items for more other uses). To get the DE and Coop program
information the STUCRS (back to 1992) and Revised Rolls (pre-1992) must be read
for census and end-of-semester. The MOUSE history file is used to add degrees
conferred.
Specific improvements over the ‘old’ system:
- Changed database format from the ‘old’ array of time points contained in a
single record per student to multiple term records per student, one for each
time point. “Empty” time points do not have a record in the tall stack of
records and must be accounted for in report processing logic.
- Added official student class status from registration system at each time
point.
- Added DE and coop flags to cohort record and specific term record as well.
This is a cumulative flag that indicates the presence of at least one semester
in that status.
- Added more time points, there are now 4 census and 4 end-of-semester
time points with existing system data captured at every time point. This does
lead to more problems when interpreting data, such as the presence of an end-of-term
record with no matching census record, but guarantees that the database shows
all data that is present.
- Graduation information is recorded from degree files at each time point it
occurs, and multiple degrees are all recorded at each time point if
present.
- Degree information summary is added to cohort record.
Processing Information:
Program library: k:\student\attrition\new
Central program: new_atr_0.sas
defines libraries, files, parameters, and calls all processing programs
Report Model: report2.sas
Process to update database for adding new semesters:
At this time new semesters are added by using a new end date and re-running
the entire build process from the beginning. At some point the begin date can
move forward to drop off old semesters. All files used in entire process must
be present on PC hard disk (or network drive) in ascii-text or SAS format.
Carefully place source data and work libraries on opposed disks to achieve
maximum efficiency in processing.
The best time to build a new database will be in the middle of a semester
after all graduation and end-of-term data is in for the previous semesters. In
the mid-fall run for new retention and graduation rates, the most-current-
semester census-data files must also be included even though the history files
will not include the end of that semester. In the mid-spring run there should
be no need to include spring census, summer run can be any time after spring
graduation data is in, with no need for summer census data.
- Download needed base files from mainframe (see below),
- Assemble needed base SAS datasets (see below),
- Change end date parameter at top of central program,
- Submit, (on Lewis’ PC should take about 30 minutes),
- Look for problems in added semesters,
- Distribute.
Specific datasets required
- SSN change file from mainframe database by running
P010501.J9550.SOURCLIB (GETSSNEX). Download is automatic to proper dataset on K
drive. As long as file date is greater than data date, the file is good to
go.
- Census SUPERRAT look for job in P010501.J9550.SOURCLIB to copy tape
superrat to P010501.TEMP.SUPERRAT and then download from disk (SUPERRAT may
already be on disk in which case download directly).
- History SUPERRAT look for job in P010501.J9550.SOURCLIB to copy tape
superrat to P010501.TEMP.SUPERRAT and then download from disk (SUPERRAT may
already be on disk in which case download directly). Note one history SUPERRAT
will have multiple semesters in it by the end of the year.
- MOUSE history look for job in P010501.J9550.SOURCLIB to copy tape
mouse to P010501.TEMP.MOUSE and then download from disk (MOUSE may already be
on disk in which case download directly).
- Census STUCRSlook for job in P010501.J9550.SOURCLIB to copy tape
STUCRS to P010501.TEMP. STUCRS and then download from disk (STUCRS may already
be on disk in which case download directly).
- History STUCRS and / or GRADES file these should already be
accessible on PC in SAS format, if not look for job in P010501.J9550.SOURCLIB to
copy tape STUCRS to P010501.TEMP. STUCRS and then download from disk (STUCRS may
already be on disk in which case download directly).
- SDFX files census for each semester should already be in SAS format
waiting.
- Note for all these files: the ASCII text will be read in from SUPERRATs and
MOUSE files, while the SDFX, STUCRS, and GRADES files must be in SAS dataset
format.
Check For Problems in Processing Outputs:
- Proc freq from each semester should have non-zero counts. Check the
academic performance flags with abbreviations in database to see if a new flag
was added. If so then code in a new flag IF statement and re-run database.
- Proc freq of each MOUSE can be checked for zeroes.
- The list of student Ids with no term records needs to be checked. There are
many records in the list now, which tells me the change to new ID is in error.
These are checked against the change SSN source file and put out into
SSNCHECK3.TXT file. Take changes and put correct adjustments into SSNCHECK4.TXT
file which will feed back into the merge step.
- List of students in more than one cohort needs to be checked. If there is
an error condition, the record in error can be deleted in the MERGE_FINAL.SAS
program around statement 30 in the change SSN re-processing. There are
legitimate reasons to be in multiple cohorts. See programming notes below.
Programming notes:
- At the current time, rebuilding the whole database is cheaper than making a
routine just to add a new semester(s) on the end of the existing. The old
database will have to be processed for SSN changes and resorted anyway, and
there may be other issues to address as well (such as adding data variables, or
changing length of existing variables).
- There is no attempt to put in a DE cohort, or load term records for those in
a DE cohort. This would require a lot of assumptions in the datasets, such as
how to construct old cohorts, and how to pull in census information not present
in the TRACS system until after RT census.
- Note that SSN changes are stored as SAS datasets that are read into a format
at each semester processing. When running the build process, this is about half
the time span of processing. If re-running the process in the same SAS session,
then the ‘adjust_ssns’ step may be bypassed (commented out in _0 module) and
existing SSN changes will work. Typically programming changes in RAT, Mouse, or
SDFX modules are made, so rerun will be much faster when SSN processing is
skipped.
- Adding a field to all records means going to the KEEP statement and
including that field, both the ‘read_rat_cens.sas’ and ‘read_mouse.sas’ output
fields must match (the mouse data is appended with FORCE specified so make sure
there are no extra fields in the MOUSE process, or that numeric / character
mismatches do not occur). Note that the KEEP statements are in the SET part of
the data step, and changes must be coordinated for all places (rat census, rat
history, mouse history).
- RAT processing is done with a common set of code in a macro loaded in PREPS
module. Date of birth is hardwired to 20th century, COM college
conversion is hardwired in, etc.
- DE and COOP process are done with a common set of code in a macro loaded in
PREPS module. Additions here must work for both census and history time
points.
Next Steps:
- At this time there are about 6 students who seem to be legitimately placed
in 2 cohorts. The current system structure ignores this possibility and the
‘extra’ record will be removed at reporting time with a nodupkey sort option.
This will place the student in the first entry cohort and compute time
accordingly (incorrectly). The correct way to fix this is to add a record ID to
the end of the student id that would identify the second and subsequent
cohorts. This record ID would be 1 for first cohorts, the second cohort would
be identified with a 2, and so on. The term file build logic would have to be
smart enough to decide when a second cohort entry was started and to increment
the record ID for all term records after that new cohort membership
started.
- Add data field = GPA in major at history time point in fall and spring from
SUPEREXT files.
Report Printing:
- Program named ‘report2.sas’ in home directory produces a report equivalent
to the old upside-down stairstep report.
- Comparison with attrition database is part of process. The status of each
student is compared semester by semester and exceptions are saved to a
dataset.
- The report process needs to do more work than the old report, such as taking
care of dirty data conditions that exist in the data. An example is where a
student registers for the semester after the census and so will be missing an
'C' record, but will have an 'E' record for the history of that semester. Other
dirty data conditions are where a student takes 5-week and 10-week courses in
the same summer season.
Output from SAS PROC CONTENTS of variables, attributres, and labels
Cohort Dataset
Cohort and Term Dataset Contents 10:08
Tuesday, April 30, 2002 1
The
CONTENTS Procedure
Data Set Name: NEWATR.COHORT
Observations: 67603
Member Type: DATA
Variables: 42
Engine: V8
Indexes: 0
Created: 10:26 Tuesday,
April 30, 2002 Observation Length: 168
Last Modified: 10:26 Tuesday,
April 30, 2002 Deleted Observations: 0
Protection:
Compressed: NO
Data Set Type: Sorted:
NO
Label:
-----Engine/Host
Dependent Information-----
Data Set Page Size:
16384
Number of Data Set Pages:
698
First Data Page:
1
Max Obs per Page:
97
Obs in First Data Page:
64
Number of Data Set Repairs:
0
File Name:
d:\student\newatr\cohort.sas7bdat
Release Created:
8.0202M0
Host Created:
WIN_PRO
Cohort Variables
Cohort and Term Dataset Contents
10:08 Tuesday, April 30, 2002 2
The
CONTENTS Procedure
-----Variables Ordered
by Position-----
# Variable Type
Len Label
1 SSN Char
9 Student ID
2 SENRSTAT Char
1 Enroll Status
3 SRESID Char
3 GA Res Code
4 SRACE Char
1 Race
5 SSEX Char
1 Sex
6 SCITIZ Char
1 Citizenship
7 SDEGI Char
1 Degree Intent
8 STRINST Char
6 Transfer Inst
9 SSATV Char
3 SAT Verbal
10 SSATM Char
3 SAT Math
11 SHSPCTL Char
2 Class Rank
12 SHSGPA Char
3 Hign School GPA
13 SUPGPA Char
3 Predicted GPA
14 SHSFLAG Char
1 HS Grad Yr Flag
15 SCEEBA Char
6 HS CEEB Code
16 SCOLL Char
2 College
17 SCLASS Char
2 Class
18 SCUR Char
5 Curriculum
19 SNAME Char 23
Student Name
20 SRES Char
3 NCSU Res Code
21 SHSRANK Num
8 HS Rank
22 SHSIZE Num
8 HS Class Size
23 dobym Char
6 Date of birth, YYYYMM
24 yrsem Char
5 Cohort semester
25 matdate Char
5 SDFX Matriculation Date
26 scoop Char
1 Cooperative Ed student?
27 sde Char
1 Distance Ed student?
28 dcoll01 Char 2
Degree 1 college
29 dcoll02 Char
2 Degree 2 college
30 dcoll03 Char
2
31 dcoll04 Char
2
32 dcoll05 Char
2
33 dcurr01 Char
3 Degree 1 curriculum
34 dcurr02 Char
3 Degree 2 curriculum
35 dcurr03 Char
3
36 dcurr04 Char
3
37 dcurr05 Char 3
38 dsem01 Char
5 Degree 1 semester
39 dsem02 Char
5 Degree 2 semester
40 dsem03 Char
5
41 dsem04 Char
5
42 dsem05 Char
5
Term Dataset
Cohort and Term Dataset Contents
10:08 Tuesday, April 30, 2002 3
The
CONTENTS Procedure
Data Set Name: NEWATR.TERM
Observations: 1047295
Member Type: DATA
Variables: 23
Engine: V8
Indexes: 0
Created: 10:26 Tuesday,
April 30, 2002 Observation Length: 88
Last Modified: 10:26 Tuesday,
April 30, 2002 Deleted Observations: 0
Protection:
Compressed: NO
Data Set Type:
Sorted: NO
Label:
-----Engine/Host Dependent
Information-----
Data Set Page Size:
8192
Number of Data Set Pages:
11385
First Data Page:
1
Max Obs per Page:
92
Obs in First Data Page:
51
Number of Data Set Repairs:
0
File Name:
d:\student\newatr\term.sas7bdat
Release Created:
8.0202M0
Host Created:
WIN_PRO
Term Variables
-----Variables
Ordered by Position-----
# Variable Type
Len Label
1 SSN Char
9 Student ID
2 rectype Char
1 Time Point, E,C,G
3 ssemhr Num
4 Semester hours
4 scumhrsp Num
4 Cumulative hours passed
5 ssemhrsp Num
4 Semester hours passed
6 ssemhrsf Num
4 Semester hours failed
7 stranshr Num
4 Transfer hours
8 senrst Char
1 Enroll status
9 sschool Char
2 College
10 sclass Char
2 Class
11 scurr Char
3 Curriculum
12 stuires Char
3 Tuition residence (NCSU)
13 scumgpa Num
8 Cumulative GPA
14 ssemgpa Num
8 Semester GPA
15 shouse Char
1 Housing code
16 sregst Char
1 Registration status
17 sdblmaj Char
5 Double major
18 acaperf Char
3 Academic performance flag
19 yrsem Char
5 Cohort semester
20 scoop Char
1 Cooperative Ed student?
21 sde Char
1 Distance Ed student?
22 sdehrs Num
8 DE hours this term
23 sdecourses Num
4 DE courses this term