Build your evaluation framework

This step explains how to identify and organise the information you will need to support the evaluation’s purpose(s). This can be described as preparing an ‘evaluation framework’. The evaluation framework:

  • Provides a common framework to use for STEM education initiatives — it will work for all initiatives, big or small, no matter where they are undertaken.
  • Is based on frameworks that have been used successfully by professional evaluators, with opportunities to adjust for smaller, less complex evaluations.
  • Helps manage some of the challenges in evaluating STEM education initiatives, such as the difficulty of finding reliable evidence that directly demonstrates impact or outcomes.

By using the framework provided in this Toolkit, you will be able to map out the information you need and follow the logic of the initiative you are evaluating. You will also be collecting information in a consistent way, which will help contribute to the broader, shared understanding of what works in STEM education initiatives.

By the end of this step you will:

  • Understand the STEM Toolkit evaluation framework and how to use it.
  • Customise the evaluation framework to include measures important to your evaluation.

Understand the STEM Toolkit evaluation framework and how to use it

You’ve done the background work by writing down your objective and speaking to the people you want involved. Now you can pull your thinking together in one place through an evaluation framework. An evaluation framework describes and organises the information that you need to collect in an evaluation. Even in a complex evaluation, you can summarise this on a page.

Every evaluation is different. This means it’s important to capture measures that are specific to your STEM education initiative. Fortunately, STEM education initiatives have enough in common that we can work from a standard evaluation framework.

Start with a key evaluation question

Return to the key evaluation question: Did the initiative achieve its intended objectives? In other words, did the initiative do what you wanted it to do? Before going any further, work out what evidence you will need to answer this question.

Categories of evidence

To answer your key evaluation question, you need evidence about the initiative from four categories: design, implementation, outputs and outcomes. These are defined below:

Key evaluation question: Did the initiative achieve its intended outcomes?

  • Design: Does the initiative's design set it up for success?
  • Implementation: How has the initiative been implemented in practice?
  • Outputs: What has the initiative produced or delivered?
  • Outcomes: What impacts or consequences did the initiative have for students?

These categories intentionally flow from each other. You start with the initiative’s design and end with its outcomes. This is so you can easily spot bottlenecks, or things that block progress, during the evaluation. For example, if there are problems in the implementation phase it’s likely this will have consequences for outcomes.

To answer the question in each category, you need to identify specific measures:

Design measures

Design measures look at whether the design of the initiative set it up for success from the beginning. Measures tend to ask whether there was justification or evidence for particular design decisions of an initiative, or whether a process took place to make specific design decisions. Good examples are whether you had a reasonable process to choose a target audience, and whether you considered initiative options before settling on a preferred approach.

Implementation measures

Implementation measures look at how the initiative has been implemented in practice. These measures fall into two groups that look at whether the initiative lived up to expectations:

  • The rollout of the initiative: for example, did the initiative meet its intention to organise five scientists to mentor school chemistry students?
  • Generic program factors: for example, did the initiative deliver what it was supposed to on time and within budget?

Output measures

Outputs are sometimes confused with outcomes, but the two are different. You can think about outputs as things produced on the way to achieving outcomes. For example, a 2 km run is an output on the way to achieving a fitness outcome. Outputs fall into three categories:

  • Number of ‘things’ the initiative produced (e.g. resources, equipment / technology distributed).
  • Number of people the initiative reached (e.g. number of students or teachers).
  • Time spent in initiative-related activities (e.g. teacher hours in professional learning).

While these measures are useful, it is also important to capture demographic information (that is, information about the people and place such as age, gender, location) as an additional type of measure. This is so the evaluation can analyse ‘who received what?’ as certain groups might have received more resources / events / opportunities than others, or achieved better outcomes.
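The ‘who received what?’ analysis can be sketched in a few lines of Python. This is a minimal illustration only: the records, the field names (gender, location, hours) and the function summarise_by_group are hypothetical and are not part of the Toolkit.

```python
from collections import defaultdict

# Hypothetical participation records. Field names and values are
# illustrative assumptions, not data from a real initiative.
records = [
    {"student": "A", "gender": "female", "location": "regional", "hours": 4},
    {"student": "B", "gender": "male", "location": "metro", "hours": 6},
    {"student": "C", "gender": "female", "location": "metro", "hours": 6},
    {"student": "D", "gender": "male", "location": "regional", "hours": 2},
]


def summarise_by_group(records, group_field):
    """Return people reached and total hours for each value of group_field."""
    summary = defaultdict(lambda: {"people_reached": 0, "hours": 0})
    for record in records:
        group = summary[record[group_field]]
        group["people_reached"] += 1
        group["hours"] += record["hours"]
    return dict(summary)


# Break the output measures down by location to see who received what.
by_location = summarise_by_group(records, "location")
for location, totals in by_location.items():
    print(location, totals)
```

In this made-up data both locations reached the same number of students, but metro students received twice as many hours. A difference like that is what the demographic breakdown is meant to surface.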

Outcome measures

Outcome measures look at what actually changed as a result of the initiative. Outcomes can be about student engagement or student achievement.

It’s great when student engagement and achievement outcomes can be measured directly. Direct measures are where you can observe or count the exact thing you are trying to achieve. For example, you might observe whether students are engaged in a class where they are learning about STEM and see how many complete a particular task. Or you might be able to count test results.

Direct measures are good to use when they are available, but often they will not be. For example, it can be hard to get evidence about a specific set of students. Or the evidence you gather may not allow you to isolate only the impact of your STEM initiative.

When direct measures are not available you can use proxies instead. A proxy is something associated with what you are trying to measure. Common types of proxies for STEM education evaluations are:

  • Behaviours as proxies: You might observe particular behaviours associated with improved engagement / achievement and infer these behaviours will lead to improved outcomes. For example, teachers report that students speak up more in science and can answer / ask harder questions about the content. This might suggest improvements in their achievement in science.
  • Beliefs as proxies: You might hear or observe particular beliefs associated with improved engagement / achievement and infer these beliefs will lead to improved outcomes. For example, teachers’ enjoyment of teaching maths increases, which can suggest improvements to students’ engagement.
  • Engagement as a proxy for achievement: You might measure positive impacts on student engagement but have no means of capturing impact on achievement. However, research shows that improved engagement is linked to improved achievement, so the evaluator may infer that improvements in engagement would be likely to lead to improved achievement.

The framework, with the types of measures in each category, can be summarised as:

Key evaluation question: Did the initiative achieve its intended outcomes?

Design: Does the initiative's design set it up for success?
  • Measures for design

Implementation: How has the initiative been implemented in practice?
  • Measures for rollout
  • Generic program measures

Outputs: What has the initiative produced or delivered?
  • Measures for things produced
  • Measures for people reached
  • Measures for time spent
  • Measures about who received the initiative

Outcomes: What impacts or consequences did the initiative have for students?
  • Direct measures of engagement and achievement
    • Direct measures of engagement
    • Direct measures of achievement
  • Proxies to measure engagement and achievement
    • Behaviours as proxies
    • Beliefs as proxies
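
If you keep your framework in a spreadsheet or script, the four categories and their measures can be held in a simple structure. The Python sketch below is one possible way to do this, not a prescribed format. The category names and guiding questions come from the framework, while the example measures and the function names (build_framework, categories_missing_measures) are hypothetical.

```python
# Categories and guiding questions come from the evaluation framework.
FRAMEWORK_QUESTIONS = {
    "design": "Does the initiative's design set it up for success?",
    "implementation": "How has the initiative been implemented in practice?",
    "outputs": "What has the initiative produced or delivered?",
    "outcomes": "What impacts or consequences did the initiative have for students?",
}


def build_framework(measures_by_category):
    """Attach the chosen measures to each of the four categories."""
    framework = {}
    for category, question in FRAMEWORK_QUESTIONS.items():
        framework[category] = {
            "question": question,
            "measures": list(measures_by_category.get(category, [])),
        }
    return framework


def categories_missing_measures(framework):
    """List the categories that do not yet have any measures."""
    return [name for name, entry in framework.items() if not entry["measures"]]


# Hypothetical draft: the measures below are examples only.
draft = build_framework({
    "design": ["A documented process was used to choose the target audience"],
    "implementation": [
        "Five scientists were organised as mentors",
        "Delivered on time and within budget",
    ],
    "outputs": [
        "Number of students reached",
        "Teacher hours in professional learning",
    ],
})

# Check which categories still need measures before collecting evidence.
missing = categories_missing_measures(draft)
print("Categories still needing measures:", missing)
```

Because the categories flow from design through to outcomes, a check like this makes it easy to see where the framework still has a gap. In this draft, no outcome measures have been chosen yet.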

One particular measurement challenge in STEM education is the lag time before the thing you are ultimately trying to measure can be observed. For example, an initiative for Year 5 students might aim to improve enrolments in senior secondary STEM subjects (Years 11 and 12). This timeframe may be too long for an evaluation to measure enrolments directly, so the measure might instead rely on what students think their future choices will be, and then infer that this indicates their senior subject choices.