Webinar PowerPoint Slides - Center on Response to Intervention


Iowa’s Application of Rubrics to Evaluate Screening and Progress Tools
John L. Hosp, PhD, University of Iowa

Overview of this Webinar
• Share rubrics for evaluating screening and progress tools
• Describe the process the Iowa Department of Education used to apply the rubrics

Purpose of the Review
• Survey the universal screening and progress tools currently being used by LEAs in Iowa
• Review these tools for technical adequacy
• Incorporate one tool into the new state data system
• Provide access to tools for all LEAs in the state

Collaborative Effort

The National Center on Response to Intervention

Structure of the Review Process
• Core Group: IDE staff responsible for administration and coordination of the effort
• Vetting Group: Other IDE staff as well as stakeholders from LEAs, AEAs, and IHEs from across the state
• Work Group: IDE and AEA staff who conducted the actual reviews

Overview of the Review Process
• The work group was divided into 3 groups:
  ▫ Group A: Key elements of tools (name, what it measures, grades it is used with, how it is administered, cost, time to administer)
  ▫ Group B: Technical features (reliability, validity, classification accuracy, relevance of the criterion measure)
  ▫ Group C: Application features (alignment with the Iowa CORE, training time, computer system feasibility, turnaround time for data, sample, disaggregated data)
• Within each group, members worked in pairs

Overview of the Review Process
• Each pair:
  ▫ had a copy of the materials needed to conduct the review
  ▫ reviewed and scored their parts together and then swapped with the other pair in their group
• Pairs within each group met only if there were discrepancies in scoring
  ▫ A lead person from one of the other groups participated to mediate reconciliation

• This allowed each tool to be reviewed by every work group member

Overview of the Review Process
• All reviews will be completed and brought to a full work group meeting
• Results will be compiled and shared
• Final determinations across groups for each tool will be shared with the vetting group two weeks later
• The vetting group will have one month to review the information and provide feedback to the work group

Structure and Rationale of Rubrics
• Separate rubrics for universal screening and progress monitoring
  ▫ Many tools reviewed for both
  ▫ Different considerations
• Common header and descriptive information
• Different criteria for each group (A, B, C)

Universal Screening Rubric
Header on cover page: Iowa Department of Education Universal Screening Rubric for Reading (Revised 10/24/11)
What is a Universal Screening Tool in Reading: It is a tool that is administered at school with ALL students to identify which students are at risk for reading failure on an outcome measure. It is NOT a placement screener and would not be used with just one group of students (e.g., a language screening test).
Why use a Universal Screening Tool: It tells you which students are at risk for not performing at the proficient level on an end-of-year outcome measure. These students need something more and/or different to increase their chances of becoming a proficient reader.
What feature is most critical: Classification Accuracy, because it provides a demonstration of how well a tool predicts who may and may not need something more. It is critical that Universal Screening Tools identify the correct students with the greatest degree of accuracy so that resources are allocated appropriately and students who need additional assistance get it.

Group A
Information relied on to make determinations (circle all that apply, minimum of two): Manual from publisher / NCRtI Tool Chart / Buros Mental Measurements Yearbook / On-line publisher info. / Outside resource other than publisher or researcher of tool
Name of Screening Tool:
Skill/Area Assessed with Screener:
Grades (circle all that apply): K  1  2  3  4  5  6  Above 6
How Screener Administered (circle one): Group or Individual

Criterion: Cost (minus administrative fees like printing)
Justification: Tools need to be economically viable, meaning the cost would be considered "reasonable" for the state or a district to use. Funds that are currently available can be used and can be sustained. One-time funding to purchase something would not be considered sustainable.
• Score 3: Free
• Score 2: $0.01 to $1.00 per student
• Score 1: $1.01 to $2.00 per student
• Score 0: $2.01 to $2.99 per student
• Kicked out if: $3.00 or more per student

Criterion: Student time spent engaged with tool
Justification: The amount of student time required to obtain the data. This does not include set-up and scoring time.
• Score 3: 5 minutes or less per student
• Score 2: 6 to 10 minutes per student
• Score 1: 11 to 15 minutes per student
• Kicked out if: More than 15 minutes per student
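The thresholds above lend themselves to a simple lookup. As a minimal sketch (not part of the Iowa rubric materials; the function names and the KICKED_OUT sentinel are illustrative assumptions), a reviewer's script might encode the Group A cost and time criteria like this:

```python
# Hypothetical helpers for the Group A cost and time criteria.
# Thresholds follow the rubric above; names and structure are illustrative.

KICKED_OUT = "kicked out"

def score_cost(cost_per_student: float):
    """Return the rubric score (3-0) or KICKED_OUT for per-student cost in dollars."""
    if cost_per_student == 0:
        return 3            # Free
    if cost_per_student <= 1.00:
        return 2            # $0.01 to $1.00 per student
    if cost_per_student <= 2.00:
        return 1            # $1.01 to $2.00 per student
    if cost_per_student < 3.00:
        return 0            # $2.01 to $2.99 per student
    return KICKED_OUT       # $3.00 or more per student

def score_student_time(minutes_per_student: float):
    """Return the rubric score (3-1) or KICKED_OUT for student testing time in minutes."""
    if minutes_per_student <= 5:
        return 3
    if minutes_per_student <= 10:
        return 2
    if minutes_per_student <= 15:
        return 1
    return KICKED_OUT       # more than 15 minutes per student

print(score_cost(1.50), score_student_time(8))   # -> 1 2
```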

Group B

Criterion: Criterion Measure used for Classification Accuracy (Sheet for Judging Criterion Measure)
Justification: The measure that is being used as a comparison must be determined to be appropriate as the criterion. In order to make this determination, several features of the criterion measure must be examined.
• Score 3: 15-12 points on criterion measure form
• Score 2: 11-8 points on criterion measure form
• Score 1: 7-4 points on criterion measure form
• Score 0: 3-0 points on criterion measure form
• Kicked out if: Same test but uses a different subtest or composite, OR the same test given at a different time

Criterion: Classification Accuracy (Sheet for Judging Classification Accuracy for Screening Tool)
Justification: Tools need to demonstrate they can accurately determine which students are in need of assistance based on current performance and predicted performance on a meaningful outcome measure. This is evaluated with Area Under the Curve (AUC), Specificity, and Sensitivity.
• Score 3: 9-7 points on classification accuracy form
• Score 2: 6-4 points on classification accuracy form
• Score 1: 3-1 points on classification accuracy form
• Score 0: 0 points on classification accuracy form
• Kicked out if: No data provided

Criterion: Criterion Measure used for Universal Screening Tool (Sheet for Judging Criterion Measure)
Justification: The measure that is being used as a comparison must be determined to be appropriate as the criterion. In order to make this determination, several features of the criterion measure must be examined.
• Score 3: 15-12 points on criterion measure form
• Score 2: 11-8 points on criterion measure form
• Score 1: 7-4 points on criterion measure form
• Score 0: 3-0 points on criterion measure form
• Kicked out if: Same test but uses a different subtest or composite, OR the same test given at a different time

Judging Criterion Measure
Additional Sheet for Judging the External Criterion Measure (Revised 10/24/11)
Used for (circle all that apply): Screening: Classification Accuracy / Screening: Criterion Validity / Progress Monitoring: Criterion Validity
Name of Criterion Measure: Gates
How Criterion Administered (circle one): Group or Individual
Information relied on to make determinations (circle all that apply): Manual from publisher / NCRtI Tool Chart / Buros Mental Measurements Yearbook info. / On-line publisher info. / Outside resource other than publisher or researcher of measure

1. An appropriate Criterion Measure is:
a) External to the screening or progress monitoring tool
b) A broad skill rather than a specific skill
c) Technically adequate for reliability
d) Technically adequate for validity
e) Validated on a broad sample that would also represent Iowa's population

Judging Criterion Measure (cont.)

Feature a) External to the Screening or Progress Monitoring Tool
Justification: The criterion measure should be separate from and not related to the screening or progress monitoring tool, meaning the outside measure should be by a different author/publisher and use a different sample (e.g., NWF can't predict ORF by the same publisher).
• Score 3: External with no/little overlap; different author/publisher and standardization group
• Lower scores: External with some/a lot of overlap; same author/publisher and standardization group
• Kicked out: Internal (same test using a different subtest or composite, OR the same test given at a different time)

Feature b) A broad skill rather than a specific skill
Justification: We are interested in generalizing to a larger domain; therefore, the criterion measure should assess a broad area rather than splinter skills.
• Score 3: Broad reading skills are measured (e.g., total reading score on the ITBS)
• Score 2: Broad reading skills are measured but in one area (e.g., comprehension made up of two subtests)
• Score 1: Specific skills measured in two areas (e.g., comprehension and decoding)
• Score 0: Specific skill measured in one area (e.g., PA, decoding, vocabulary, spelling)

Judging Criterion Measure (cont.)

Feature c) Technically adequate for Reliability
Justification: Student performance needs to be consistently measured. This is typically demonstrated with reliability under different items (alternate form, split half, coefficient alpha).
• Score 3: Some form of reliability above .80
• Score 2: Some form of reliability between .70 and .80
• Score 1: Some form of reliability between .60 and .70
• Score 0: All forms of reliability below .50

Feature d) Technically adequate for Validity
Justification: The tool measures what it purports to measure. We focused on criterion-related validity to make this determination: the extent to which this criterion measure relates to another external measure that is determined to be good.
• Score 3: Criterion ≥ .70
• Score 2: Criterion .50 to .69
• Score 1: Criterion .30 to .49
• Score 0: Criterion .10 to .29

Feature e) A broad sample is used
Justification: The sample used in determining the technical adequacy of a tool should represent a broad audience. While a representative sample by grade is desirable, it is often not reported; therefore, taken as a whole, does the population used represent all students, or is it specific to a region or state?
• Score 3: National sample
• Score 2: Several states (3 or more) across more than one region
• Score 1: States (3, 2, or 1) in one region
• Score 0: Sample of convenience; does not represent a state
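Criterion-related validity evidence, as referenced in (d), is typically a correlation between scores on the tool and scores on the external criterion. A minimal sketch, assuming NumPy and invented paired scores, of computing a Pearson r and reading it against the bands above:

```python
import numpy as np

# Hypothetical paired scores: screening/progress tool vs. external criterion measure.
tool_scores = np.array([12, 18, 25, 31, 40, 44, 52, 60])
criterion_scores = np.array([310, 335, 350, 372, 401, 420, 435, 460])

r = np.corrcoef(tool_scores, criterion_scores)[0, 1]  # Pearson r

# Rubric bands from the sheet above.
if r >= 0.70:
    score = 3
elif r >= 0.50:
    score = 2
elif r >= 0.30:
    score = 1
else:
    score = 0   # .29 or below (the sheet lists .10 to .29 for Score 0)

print(f"criterion validity r = {r:.2f}, rubric score = {score}")
```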

Judging Classification Accuracy
Additional Sheet for Judging Classification Accuracy for Screening Tool (Revised 10/24/11)
Assessment (include name and grade):
Complete the Additional Sheet for Judging the Criterion Measure. If the tool is not kicked out, complete the review for:
1) Area Under the Curve (AUC)
2) Specificity/Sensitivity
3) Lag time between when the assessments are given

Feature 1) Technical Adequacy is Demonstrated for Area Under the Curve
Justification: Area Under the Curve (AUC) is one way to gauge how accurately a tool identifies students in need of assistance. It is derived from Receiver Operating Characteristic (ROC) curves and is presented as a number to 2 decimal places. One AUC is reported for each comparison: each grade level, each subgroup, each outcome tool, etc.
• Score 3: AUC ≥ .90
• Score 2: AUC ≥ .80
• Score 1: AUC ≥ .70
• Score 0: AUC < .70
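Publishers usually report AUC directly, but with raw data a reviewer could reproduce it from screening scores and a binary outcome. A minimal sketch, assuming scikit-learn is available and using invented data, with the outcome coded 1 for at-risk/non-proficient (the condition the screener is meant to flag):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical data: 1 = at-risk / non-proficient on the end-of-year outcome, 0 = proficient.
outcome_at_risk = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

# Screening scores; lower scores indicate greater risk, so negate them
# (roc_auc_score expects higher values to indicate the positive class).
screening_scores = np.array([8, 12, 15, 30, 28, 10, 35, 27, 33, 29])

auc = roc_auc_score(outcome_at_risk, -screening_scores)
print(f"AUC = {auc:.2f}")  # one AUC per grade / subgroup / outcome comparison
```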

Judging Classification Accuracy (cont.)

Feature 2) Technical Adequacy is Demonstrated for Specificity or Sensitivity
Justification: Specificity/Sensitivity is another way to gauge how accurately a tool identifies students in need of assistance. Specificity and Sensitivity can give the same information, depending on how the developer reported the comparisons. Sensitivity is often reported as the accuracy of positive prediction (yes on both tools). Therefore, if the developer predicted positive/proficient performance, Sensitivity will express how well the screening tool identifies students who are proficient; if the developer predicted at-risk or non-proficient performance, that is what Sensitivity shows. It is important to verify what the developer is predicting so that consistent comparisons across tools can be made (see below).
• Score 3: Sensitivity or Specificity ≥ .90
• Score 2: Sensitivity or Specificity ≥ .85
• Score 1: Sensitivity or Specificity ≥ .80
• Score 0: Sensitivity or Specificity < .80

Feature 3) Lag time between when the criterion and screening assessments are given
Justification: The time between when the assessments are given should be short to eliminate effects associated with differential instruction.
• Score 3: Under two weeks
• Score 2: Between two weeks and 1 month
• Score 1: Between 1 month and 6 months
• Score 0: Over 6 months

Sensitivity and Specificity Considerations and Explanations

Explanations: "True" means "in agreement between screening and outcome." So "true" can be negative to negative in terms of student performance (i.e., negative meaning at-risk or nonproficient). This could be considered either positive or negative prediction, depending on which the developer intends the tool to predict. As an example, a tool whose primary purpose is identifying students at risk for future failure would probably use "true positives" to mean "those students who were accurately predicted to fail the outcome test."
Sensitivity = true positives / (true positives + false negatives)
Specificity = true negatives / (true negatives + false positives)

Key: + = proficiency/mastery; - = nonproficiency/at-risk; 0 = unknown. (In the original figures, shading marks the cells that enter the Sensitivity and Specificity calculations.)
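Here is a small sketch of the two formulas, using invented counts. It also illustrates Consideration 1 below: if the developer flips which direction is being predicted, the cells trade roles, so Sensitivity computed under one convention equals Specificity under the other.

```python
# Hypothetical 2x2 screening-vs-outcome counts, coded with at-risk as the
# "positive" class (the tool is predicting risk of failure on the outcome).
true_positives = 40   # flagged at-risk on screening AND non-proficient on outcome
false_negatives = 10  # not flagged, but non-proficient on outcome
true_negatives = 130  # not flagged AND proficient on outcome
false_positives = 20  # flagged at-risk, but proficient on outcome

sensitivity = true_positives / (true_positives + false_negatives)   # 40/50 = 0.80
specificity = true_negatives / (true_negatives + false_positives)   # 130/150 ~= 0.87
print(f"Sensitivity = {sensitivity:.2f}, Specificity = {specificity:.2f}")

# If the developer instead predicts proficiency (positive-to-positive prediction),
# the same table is read with proficient as the "positive" class: what was
# Specificity above becomes Sensitivity, and vice versa.
sensitivity_flipped = true_negatives / (true_negatives + false_positives)
specificity_flipped = true_positives / (true_positives + false_negatives)
print(f"Flipped: Sensitivity = {sensitivity_flipped:.2f}, Specificity = {specificity_flipped:.2f}")
```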

Consideration 1: Determine whether the developer is predicting a positive outcome (i.e., proficiency, success, mastery, at or above a criterion or cut score) from positive performance on the screening tool (i.e., at or above a benchmark, criterion, or cut score), or a negative outcome (i.e., failure, nonproficiency, below a criterion or cut score) from negative performance on the screening tool (i.e., below a benchmark, criterion, or cut score). Prediction is almost always positive to positive or negative to negative; however, in rare cases it might be positive to negative or negative to positive.

[Figure 1a: Screening-by-Outcome table] This is an example of positive to positive prediction. In this case, Sensitivity is positive performance on the screening tool predicting a positive outcome.

[Figure 1b: Screening-by-Outcome table] This is the opposite prediction (negative to negative as the main focus). In this case, Sensitivity is negative (or at-risk) performance on the screening tool predicting a negative outcome. Using the same information in these two tables, Sensitivity in the top table will equal Specificity in the second table. Because our purpose is to predict proficiency, in this instance we would use Specificity as the metric for judging.

Consideration 2: Some developers may include a third category, unknown prediction. If this is the case, it is still important to determine whether they are predicting a positive or negative outcome, because Sensitivity and Specificity are still calculated the same way.

[Figure 2a: Screening-by-Outcome table with +, 0, and - categories] This is an example of positive to positive prediction. In this case, Sensitivity is positive performance on the screening tool predicting a positive outcome. It represents a similar comparison to that in Figure 1a.

[Figure 2b: Screening-by-Outcome table with +, 0, and - categories] This is the opposite prediction (negative to negative as the main focus). In this case, Sensitivity is negative (or at-risk) performance on the screening tool predicting a negative outcome. It represents a similar comparison to that in Figure 1b.

Using the same information in these two tables, Sensitivity in the top table will equal Specificity in the second table. Because our purpose is to predict proficiency, in this instance we would use Specificity as the metric for judging.

Consideration 3: In (hopefully) rare cases, the developer will set up the tables in opposite directions (reversing screening and outcome, or using a different direction for the positive/negative for one or both). This illustrates why it is important to consider which column or row is positive and negative for both the screening and outcome tools.

[Figure: table with the Screening and Outcome axes transposed] Notice that the Screening and Outcome tools are transposed. This makes Sensitivity and Specificity align within rows rather than columns.

Group B (cont.)

Criterion: Criterion Validity for Universal Screening Tool (from technical manual)
Justification: Tools need to demonstrate that they actually measure what they purport to measure (i.e., validity). We focused on criterion-related validity because it is a determination of the relation between the screening tool and a meaningful outcome measure.
• Score 3: Criterion ≥ .70
• Score 2: Criterion .50 to .69
• Score 1: Criterion .30 to .49
• Score 0: Criterion .10 to .29
• Kicked out if: Criterion < .10 or no information provided

Criterion: Reliability for Universal Screening Tool
Justification: Tools need to demonstrate that the test scores are stable across items and/or forms. We focused on alternate form, split half, and coefficient alpha.
• Score 3: Alternate form > .80; split-half > .80; coefficient alpha > .80
• Score 2: Alternate form > .70; split-half > .70; coefficient alpha > .70
• Score 1: Alternate form > .60; split-half > .60; coefficient alpha > .60
• Score 0: Alternate form > .50; split-half > .50; coefficient alpha > .50
• Kicked out if: There is no evidence of reliability

Criterion: Reliability across raters for Universal Screening Tool
Justification: How reliable scores are across raters is critical to the utility of the tool. If the tool is complicated to administer and score, it can be difficult to train people to use it, leading to different scores from person to person.
• Score 3: Rater ≥ .90
• Score 2: Rater .89 to .85
• Score 1: Rater .84 to .80
• Score 0: Rater ≤ .75
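Publishers normally report these reliability coefficients, but as a sketch of what the terms mean, coefficient alpha and a Spearman-Brown-corrected split-half can be computed from an item-level score matrix. The data below are simulated and the helper names are illustrative; NumPy only:

```python
import numpy as np

# Simulated item-level scores: rows = students, columns = items on the screener.
rng = np.random.default_rng(0)
ability = rng.normal(0, 1, size=200)
items = ability[:, None] + rng.normal(0, 1, size=(200, 10))  # 10 correlated items

def coefficient_alpha(x):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def split_half(x):
    """Odd-even split-half correlation with the Spearman-Brown correction."""
    odd = x[:, 0::2].sum(axis=1)
    even = x[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

print(f"alpha = {coefficient_alpha(items):.2f}, split-half = {split_half(items):.2f}")
```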

Group C

Criterion: Alignment with Iowa CORE / Demonstrated Content Validity
Justification: It is critical that tools assess skills identified in the Iowa Core.
Literature & Informational: Key Ideas & Details; Craft & Structure; Integration of Knowledge & Ideas; Range of Reading & Level of Text Complexity
Foundational (K-1): Print Concepts; Phonological Awareness; Phonics and Word Recognition; Fluency
Foundational (2-5): Phonics and Word Recognition; Fluency
• Score 3: Has a direct alignment with the Iowa CORE (provide Broad Area)
• Score 2: Has alignment with the Iowa CORE (provide Broad Area and Specific Skill)
• Kicked out if: Has no alignment with the Iowa CORE

Group C (cont.)

Criterion: Training Required
Justification: The amount of time needed for training is one consideration related to the utility of the tool. Tools that can be learned in a matter of hours, not days, would be considered appropriate.
• Score 3: Less than 5 hours of training (1 day)
• Score 2: 5.5 to 10 hours of training (2 days)
• Score 1: 10.5 to 15 hours of training (3 days)
• Score 0: Over 15.5 hours of training (4+ days)

Criterion: Computer Application (tool and data system)
Justification: Many tools are given on a computer, which can be helpful if schools have computers, the computers are compatible with the software, and the data reporting can be separated from the tool itself. It is also a viable option if hard copies of the tools can be used when computers are not available.
• Score 3: Computer or hard copy of the tool available; data reporting is separate
• Score 2: Computer application only; data reporting is separate
• Score 1: Computer or hard copy of the tools available; data reporting is part of the system
• Score 0: Computer application only; data reporting is part of the system

Criterion: Data Administration and Data Scoring
Justification: The number of people needed to administer and score the data speaks to the efficiency of how data are collected and the reliability of scoring.
• Score 3: Student takes the assessment on a computer, and it is automatically scored by the computer at the end of the test
• Score 2: An adult administers the assessment to the student and enters the student's responses (in real time) into a computer, and it is automatically scored by the computer at the end of the test
• Score 1: An adult administers the assessment to the student and then calculates a score at the end of the test by conducting multiple steps
• Score 0: An adult administers the assessment to the student and then calculates a score at the end of the test by conducting multiple steps AND referencing additional materials to get a score (having to look up information in additional tables)

Group C (cont.)

Criterion: Data Retrieval (time for data to be useable)
Justification: The data need to be available in a timely manner in order to use the information to make decisions about students.
• Score 3: Data can be used instantly
• Score 2: Data can be used the same day
• Score 1: Data can be used the next day
• Score 0: Data are not available until 2 to 5 days later
• Kicked out if: Takes 5+ days to use data (have to send data out to be scored)

Criterion: A broad sample is used
Justification: The sample used in determining the technical adequacy of a tool should represent a broad audience. While a representative sample by grade is desirable, it is often not reported; therefore, taken as a whole, does the population used represent all students, or is it specific to a region or state?
• Score 3: National sample
• Score 2: Several states (3 or more) across more than one region
• Score 1: States (3, 2, or 1) in one region
• Score 0: Sample of convenience; does not represent a state

Criterion: Disaggregated Data
Justification: Viewing disaggregated data by subgroups (i.e., race, English language learners, economic status, special education status) helps determine how the tool works with each group. This information is often not reported, but it should be considered if it is available.
• Score 3: Race, economic status, and special education status are reported separately
• Score 2: At least two disaggregated groups are listed
• Score 1: One disaggregated group is listed
• Score 0: No information on disaggregated groups

Progress Monitoring Rubric
Header on cover page: Iowa Department of Education Progress Monitoring Rubric (Revised 10/24/11)
Why use Progress Monitoring Tools: They quickly and efficiently provide an indication of a student's response to instruction. Progress monitoring tools are sensitive to student growth (i.e., skills) over time, allowing for more frequent changes in instruction. They allow teachers to better meet the needs of their students and determine how best to allocate resources.
What feature is most critical: A sufficient number of equivalent forms so that student skills can be measured over time. In order to determine if students are responding positively to instruction, they need to be assessed frequently to evaluate their performance and the rate at which they are learning.

Descriptive info on each work group's section
Information relied on to make determinations (circle all that apply, minimum of two): Manual from publisher / NCRtI Tool Chart / Buros Mental Measurements Yearbook / On-line publisher info. / Outside resource other than publisher or researcher of tool
Name of Progress Monitoring Tool:
Name of Criterion Measure:
Skill/Area Assessed with Progress Monitoring Tool:
Grades (circle all that apply): K  1  2  3  4  5  6  Above 6
How Progress Monitoring Administered (circle one): Group or Individual
How Criterion Administered (circle one): Group or Individual

Group A

Criterion: Number of equivalent forms
Justification: Progress monitoring requires frequently assessing a student's performance and making determinations based on their growth (i.e., rate of progress). In order to assess students' learning frequently, progress monitoring is typically conducted once a week. Therefore, most progress monitoring tools have 20 to 30 alternate forms.
• Score 3: 20 or more alternate forms
• Score 2: 15 to 19 alternate forms
• Score 1: 10 to 14 alternate forms
• Score 0: 9 alternate forms
• Kicked out if: Fewer than 9 alternate forms

Criterion: Cost (minus administrative fees like printing)
Justification: Tools need to be economically viable, meaning the cost would be considered "reasonable" for the state or a district to use. Funds that are currently available can be used and can be sustained. One-time funding to purchase something would not be considered sustainable.
• Score 3: Free
• Score 2: $0.01 to $1.00 per student
• Score 1: $1.01 to $2.00 per student
• Score 0: $2.01 to $2.99 per student
• Kicked out if: $3.00 or more per student

Criterion: Student time spent engaged with tool
Justification: The amount of student time required to obtain the data. This does not include set-up and scoring time. Tools need to be efficient to use; this is especially true of measures that teachers would be using on a more frequent basis.
• Score 3: 5 minutes or less per student
• Score 2: 6 to 10 minutes per student
• Score 1: 11 to 15 minutes per student
• Kicked out if: More than 15 minutes per student

Group B

Criterion: Forms are of Equivalent Difficulty (need to provide detail of what these are when the review is published)
Justification: Alternate forms need to be of equivalent difficulty to be useful as a progress monitoring tool. Having many forms of equivalent difficulty allows a teacher to determine how the student is responding to instruction, because the change in score can be attributed to student skill rather than to a change in the measure. Approaches include readability formulae (e.g., Flesch-Kincaid, Spache, Lexile, FORCAST), Euclidean distance, equipercentiles, and stratified item sampling (a readability-based check is sketched after this section).
• Score 3: Addressed equating in multiple ways
• Score 2: Addressed equating in 1 way that is reasonable
• Score 0: Addressed equating in a way that is NOT reasonable
• Kicked out if: Does not provide any indication of equating forms

Criterion: Judgment of Criterion Measure (see separate sheet for judging criterion measure)
Justification: The measure that is being used as a comparison must be determined to be appropriate as the criterion. In order to make this determination, several features of the criterion measure must be examined.
• Score 3: 15-12 points on criterion measure form
• Score 2: 11-8 points on criterion measure form
• Score 1: 7-4 points on criterion measure form
• Score 0: 3-0 points on criterion measure form

Criterion: Technical Adequacy is Demonstrated for Validity of Performance Score (sometimes called Level)
Justification: A performance score is a student's performance at a given point in time rather than a measure of his/her performance over time (i.e., rate of progress). We focused on criterion-related validity to make this determination because it is a determination of the relation between the progress monitoring tool and a meaningful outcome.
• Score 3: Criterion ≥ .70
• Score 2: Criterion .50 to .69
• Score 1: Criterion .30 to .49
• Score 0: Criterion .10 to .29
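As a rough illustration of the readability-formula approach to form equivalence named above, the Flesch-Kincaid grade level of each alternate passage can be computed and compared. The syllable counter below is a crude vowel-group heuristic and the passages are placeholders, so treat this as a sketch rather than how any particular publisher equates forms:

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of vowels (adequate only for a sketch)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Placeholder alternate forms; in practice these would be the full passages.
forms = {
    "Form A": "The dog ran to the park. It saw a red ball and chased it.",
    "Form B": "A small cat sat by the window. It watched the birds outside all day.",
}
for name, passage in forms.items():
    print(f"{name}: estimated grade level {flesch_kincaid_grade(passage):.1f}")
```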

Group B (cont.)

Criterion: Technical Adequacy is Demonstrated for Reliability of Performance Score
Justification: Tools need to demonstrate that the test scores are stable across item samples/forms, raters, and time. Across item samples/forms: coefficient alpha, split half, KR-20, alternate forms. Across raters: interrater (i.e., interscorer, interobserver). Across time: test-retest.
Across item samples/forms:
• Score 3: ≥ .80
• Score 2: .79 to .70
• Score 1: .69 to .60
• Score 0: ≤ .59
Across raters:
• Score 3: ≥ .90
• Score 2: .89 to .85
• Score 1: .84 to .80
• Score 0: ≤ .75
Across time:
• Score 3: ≥ .80
• Score 2: .79 to .70
• Score 1: .69 to .60
• Score 0: ≤ .59

Criterion: Technical Adequacy is Demonstrated for Reliability of Slope
Justification: The reliability of the slope tells us how well the slope represents a student's rate of improvement. Two criteria are used: the number of observations (that is, the student data points needed to calculate the slope) and the coefficient (that is, the reliability of the slope). This should be reported via HLM (also called LMM or MLM) results; if calculated via OLS, the coefficients are likely to be lower.
Number of observations/data points:
• Score 3: 10 or more observations/data points
• Score 2: 9 to 7 observations/data points
• Score 1: 6 to 4 observations/data points
• Score 0: 3 or fewer observations/data points
Coefficient:
• Score 3: Coefficient > .80
• Score 2: Coefficient > .70
• Score 1: Coefficient > .60
• Score 0: Coefficient ...
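For the slope itself, the sketch below fits an ordinary-least-squares line to one student's invented weekly scores. Note the rubric's caveat that reliability-of-slope coefficients should come from HLM/multilevel results; OLS-based estimates tend to run lower.

```python
import numpy as np

# Hypothetical weekly progress monitoring scores for one student (e.g., words correct per minute).
weeks = np.arange(1, 11)                     # 10 data points meets the Score 3 threshold
scores = np.array([22, 25, 24, 28, 31, 30, 34, 36, 35, 39])

slope, intercept = np.polyfit(weeks, scores, deg=1)  # OLS line: score ~ intercept + slope * week
print(f"estimated growth of about {slope:.2f} points per week")
```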