A common criticism of project assessment is the subjectivity and inconsistency of raters in scoring. In the present article, we report the results of validity and inter-rater reliability tests of a project assessment instrument. The instrument, accompanied by a rubric, was used to assess grade eight students’ project tasks on the topic of functions and relations. The task was adapted from mathematics textbooks used in the schools. The instrument was tested with 10 raters (teachers) and 94 grade eight students from three schools (in Surabaya and Gresik). Data were collected through the project assessment sheet along with its rubric as the scoring guidance for the teachers. Construct validity was analyzed through confirmatory factor analysis, while reliability was tested by using the inter-rater reliability method with

The approach in mathematics assessment continues to change (^{st}-century context. It is due to the assessment in mathematics learning relies more on tests. The changes were triggered by two things: the constantly changing of school mathematics curriculum and the development of mathematics learning theories (

Changes in assessment at the secondary school level also occurred in Korea, following the change of its curriculum implemented in 2009 (

Authentic assessment is an assessment that requires students to perform real-world tasks and shows the essence of knowledge and skill implementation (

Project tasks enable students to learn actively through inquiry because students are required to carry out a series of activities to solve contextual mathematical problems. In solving the problems, students must go through a series of activities (planning, implementing, and reporting the project result), so they have to apply their potential and knowledge. When carrying out the project, students are required to actively seek or collect the required data, and they usually do so in a group. Therefore, students are encouraged to use higher-order thinking through problem-solving skills when they work on a project task (

A problem arises when the instrument to assess project tasks is still limited (

Referring to

Some studies (e.g.,

The aforementioned studies (e.g.,

Prior studies paid attention to the validity and reliability testing of performance assessment instruments (

The participants in this study were 94 lower secondary school students from three different schools in Surabaya and Gresik, East Java. In the first school, 28 students were formed into 5 groups, with 3 mathematics teachers. In the second school, 34 students were divided into 5 groups, with 4 mathematics teachers. In the third school, 32 students were formed into 5 groups, with 3 mathematics teachers. The five groups in each school were asked to do a project task, and the teachers assessed the project results produced by the students.

The selection of students involved in this study was entirely determined by the teachers in each school. Meanwhile, teachers were selected based on their teaching experience and their expertise in assessment. The teachers’ teaching experience ranged from 5 to 25 years. Further, the teachers' expertise in assessment was based on their experience of participating in training related to assessment in Curriculum 2013. Project assessment was carried out in the learning process over three meetings conducted by the 10 mathematics teachers (raters).

At the first meeting, students were asked to make plans for the project along with its date and day. They also discussed task sharing among the group members. Furthermore, raters evaluated the project plans made by each group. At the second meeting, students were asked to process and analyze the data obtained through observation. At the end of the activity, the raters gave a conclusion about the relevance of the topics to the project being worked on. At the third meeting, students were asked to present the results of the project. Then, the raters assessed the students' presentations, the results of data processing, the analysis and conclusion-drawing process, and the systematics of the students' written reports. The raters conducted the assessment using the prepared assessment rubric guidelines, with scores for each criterion ranging from 1 to 4. The sequence of students' and teachers' activities is depicted in Figure 1.

Data in this study were collected through a project assessment sheet. The sheet comprises the criteria or aspects that will be assessed in the project task and the score that must be given by the raters or teachers. The sheet was used to assess project tasks given to students. To guide the teacher in assessing the project, a rubric was developed. The rubric was used to assure the objectivity of the assessment. The rubric contained criteria or aspects that are assessed along with its descriptions for a score of 1 to 4.

Project tasks were adapted from the textbook (Curriculum 2013) published by the Ministry of Education and Culture on the topic of relations and functions. Furthermore, in this study, the task was modified by adding goals and indicators of the problems. The language in the project assessment sheet was also clarified for each aspect or assessment criteria, while the rubric was improved by clarifying descriptors on each assessment criterion. In this case, the content of the tasks remains similar to the original one in the textbooks.

The instructions on the project task sheet consist of three stages: (a) planning the project, seeking information on how to determine telephone rates, and sharing tasks among group members; (b) processing and analyzing the data obtained, linking the results of the observations with the topic of relations and functions, and presenting the results of the observations in the form of diagrams (bar charts, tables, lines, and ordered pairs); and (c) writing a project report from the observations and presenting it to the class.

Furthermore, students completed the project tasks, and each of the 10 teachers was asked to assess them against ten predetermined criteria: planning the stages of the project, sharing tasks among the group members, determining the tools and materials needed, time of project implementation, quantity of data sources, data processing, data analysis, drawing conclusions (the relationship between relations and functions in daily life), the format of the report, and the presentation. The maximum score in this assessment was 40 because each of the ten criteria had a maximum score of 4.

The teachers assessed those 10 criteria on three occasions. First, the teachers were asked to rate four criteria in the planning and preparation stages of the project, including the plan of project implementation, division of tasks to group members, tools and materials needed to carry out project tasks and time allocation of the project. Second, the teachers were asked to rate four criteria in the project implementation stage, including the amount of data obtained, data processing, data analysis, and drawing a conclusion. Third, the teachers were asked to rate two criteria in the stages of the project report, namely systematic writing of the report and presentation.
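The grouping of the ten criteria into three rating occasions, and the resulting maximum score of 40, can be sketched as a small data structure. This is only an illustration: the dictionary layout and the names `STAGES`, `max_total`, and `total_score` are assumptions, not part of the instrument.

```python
# Criteria grouped by the three rating occasions described above.
# Labels paraphrase the text; the dict layout is an illustrative assumption.
STAGES = {
    "planning and preparation": [
        "plan of project implementation",
        "division of tasks to group members",
        "tools and materials needed",
        "time allocation of the project",
    ],
    "implementation": [
        "amount of data obtained",
        "data processing",
        "data analysis",
        "drawing a conclusion",
    ],
    "reporting": [
        "systematic writing of the report",
        "presentation",
    ],
}
MIN_SCORE, MAX_SCORE = 1, 4  # rubric range for every criterion


def max_total(stages=STAGES, max_score=MAX_SCORE):
    """Highest total one rater can give one group."""
    return sum(len(criteria) for criteria in stages.values()) * max_score


def total_score(scores):
    """Sum one rater's scores after checking each lies in the rubric range."""
    for s in scores:
        if not MIN_SCORE <= s <= MAX_SCORE:
            raise ValueError(f"score {s} is outside {MIN_SCORE}-{MAX_SCORE}")
    return sum(scores)


print(max_total())  # 10 criteria x 4 points
```

Checking the range before summing makes data-entry slips visible before the scores are analyzed.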

The assessment data from the ten teachers were compiled in a table. The first column contains the criteria in the rubric, and the second to the eleventh columns contain each rater's score for each criterion. Table 1 shows a sample of the teachers’ scores.

No | Criteria | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Plan a project | 4 | 3 | 3 | 3 | 4 | 3 | 3 | 3 | 3 | 5 |
2 | ... | | | | | | | | | | |
. | ... | | | | | | | | | | |
10 | ... | | | | | | | | | | |

Furthermore, a Confirmatory Factor Analysis (CFA) was carried out, namely variable selection using the SPSS version 21. The procedures for carrying out factor analysis were: the selection of the variables, formation of factors, interpreting analysis result, and factor naming. In the process of analyzing the variable selection, KMO-MSA (Kaiser-Meyer-Olkin Measure of Sampling Adequacy) was required. The provisions of the KMO-MSA refer to

Assessment results from the 10 raters on the student project tasks were entered in a table. The first column contains the students' group number (group 1 to group 5), the second column contains the variables or criteria for assessment (criteria 1 to 10), and the third to the twelfth columns contain each rater's assessment of each variable for each group. Then an inter-rater reliability test, the ICC (Intraclass Correlation Coefficient), was conducted by using SPSS version 21. The inter-rater reliability coefficient was interpreted using the following criteria: < 0.40 (less), 0.40 - 0.59 (low), 0.60 - 0.74 (good), and 0.75 - 1.00 (very good) (
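The interpretation bands for the reliability coefficient can be written as a small lookup. The sketch below uses the labels given in the text. Because the published bands are stated to two decimals, it assumes half-open intervals (for example, 0.595 is treated as "low"). The function name `interpret_icc` is hypothetical.

```python
def interpret_icc(icc):
    """Map an ICC value to the category labels used in this article.

    Assumption: the bands are half-open, so a value between two printed
    bands (e.g. 0.595) falls into the lower category.
    """
    if icc > 1.0:
        raise ValueError("an ICC cannot exceed 1")
    if icc < 0.40:
        return "less"
    if icc < 0.60:
        return "low"
    if icc < 0.75:
        return "good"
    return "very good"


print(interpret_icc(0.672), interpret_icc(0.953))
```

Applied to the coefficients reported later in the article, 0.672 falls in the "good" band and 0.953 in the "very good" band.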

Validity is the accuracy of a measuring instrument in carrying out its measuring functions (

No | Criteria or factors | Item | Number of items |
---|---|---|---|
1 | Preparation and Planning stage | 1, 2, 3, 4 | 4 |
2 | Implementation stage | 5, 6, 7, 8 | 4 |
3 | Reporting stage | 9, 10 | 2 |
Total | | | 10 |

The statistical test used in this study was CFA with the help of SPSS version 21. CFA is a method used to determine the construct validity that has been done by previous researchers related to the field of psychology and education (

In the variable selection step, the computational results showed that the Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO-MSA) was 0.645 with a significance of 0.000. Because the KMO-MSA value is above 0.500, it falls in the good category. Bartlett's Test of Sphericity gave a Chi-Square of 125.063 with 45 degrees of freedom and a significance of 0.000 < 0.05. This means that the correlation matrix was not an identity matrix, so factor analysis can be carried out (

The next process was a factor formation analysis that aimed to simplify a set of initial variables. The results of factor formation analysis in the Total Variance Explained table show that the characteristic values (eigen value) of all factors were above 1 (> 1) (in Table 3). As recommended by

Test | Statistic | Value |
---|---|---|
Kaiser-Meyer-Olkin Measure of Sampling Adequacy | | .645 |
Bartlett's Test of Sphericity | Approx. Chi-Square | 125.063 |
Bartlett's Test of Sphericity | Df | 45 |
Bartlett's Test of Sphericity | Sig. | .000 |
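The degrees of freedom reported for Bartlett's test follow from the number of variables: for p variables there are p(p - 1)/2 distinct correlations, which gives 45 for the ten criteria. The sketch below checks this and encodes the decision rule stated in the text (KMO-MSA above 0.500 and significance below 0.05). The function names are hypothetical and the code does not reproduce the SPSS computation.

```python
def bartlett_df(n_variables):
    """Degrees of freedom of Bartlett's test of sphericity: p(p - 1) / 2."""
    return n_variables * (n_variables - 1) // 2


def factor_analysis_feasible(kmo, bartlett_sig, kmo_min=0.500, alpha=0.05):
    """Decision rule used in the text: KMO-MSA above 0.500 and Sig. below 0.05."""
    return kmo > kmo_min and bartlett_sig < alpha


print(bartlett_df(10))                         # ten assessment criteria
print(factor_analysis_feasible(0.645, 0.000))  # values reported in the table
```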

In addition, the factor loadings show how much of the variance in the quality of the students' projects each factor explains. The first factor explained 26.676% of the variance, the second factor 21.434%, the third factor 13.265%, and the fourth factor 10.401% of the total variance (Table 4).

Factors | Eigenvalues | Percentage of Variance | Cumulative Percentage |
---|---|---|---|
I | 2.668 | 26.676 | 26.676 |
II | 2.143 | 21.434 | 48.110 |
III | 1.326 | 13.265 | 61.375 |
IV | 1.040 | 10.401 | 71.775 |

The scree plot (Figure 2) shows the decreasing trend of the eigenvalues. This diagram can also be used to judge subjectively the number of factors to retain. It shows that the eigenvalue of the fifth factor is below 1, which indicates that the four factors described earlier are enough to summarize the ten existing variables.
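The percentages in Table 4 can be reproduced from the eigenvalues. For an analysis of a correlation matrix with p standardized variables, each factor explains eigenvalue / p of the total variance. The sketch below assumes p = 10 (the ten criteria) and applies the eigenvalue-greater-than-1 rule used in the text. The function names are hypothetical, and the fifth eigenvalue in the usage line is a made-up placeholder because the article only states that it is below 1.

```python
def percent_variance(eigenvalues, n_variables):
    """Share of total variance per factor, in percent."""
    return [100.0 * ev / n_variables for ev in eigenvalues]


def cumulative(values):
    """Running total of a list of numbers."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


def kaiser_retained(eigenvalues):
    """Number of factors with an eigenvalue above 1."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


table4 = [2.668, 2.143, 1.326, 1.040]      # eigenvalues from Table 4
shares = percent_variance(table4, 10)
print([round(s, 2) for s in shares])
print(round(cumulative(shares)[-1], 2))
print(kaiser_retained(table4 + [0.9]))     # 0.9 is a hypothetical fifth eigenvalue
```

The computed shares agree with Table 4 up to the rounding of the printed eigenvalues.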

After analyzing factor formation, the next step was the interpretation of factor analysis results. Table 5 contains a Rotated Component Matrix that can be used to determine which factor is suitable for a variable. According to

Furthermore, item 3 has a factor loading above 0.500 on two components, namely component 1 and component 2. That is, item 3 could be assigned to either component 1 or component 2. However, its loading is larger on component 2, so item 3 is included in component 2. In the same way, items 4 and 5 belong to component 2, and items 6 and 7 are both included in component 3. Items 8, 9, and 10 belong to component 4.

Items | Components | |||
---|---|---|---|---|

1 | 2 | 3 | 4 | |

1 | .787 | .218 | .182 | |

2 | .780 | -.116 | -.254 | |

3 | .511 | .630 | -.334 | .110 |

4 | .104 | .850 | .190 | |

5 | .708 | .572 | ||

6 | .124 | .109 | .811 | |

7 | .172 | .827 | .113 | |

8 | .508 | -.429 | .118 | .517 |

9 | .173 | .817 | ||

10 | .238 | -.104 | .438 | .526 |
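The assignment rule described above (place an item on the component where its loading is largest, and note items that load above 0.500 on more than one component) can be sketched as follows. Only the rows of Table 5 that show all four loadings (items 3, 8, and 10) are used. The function names are hypothetical, and taking the absolute value of the loading is an assumption.

```python
def assign_component(loadings):
    """1-based index of the component with the largest absolute loading."""
    best = max(range(len(loadings)), key=lambda i: abs(loadings[i]))
    return best + 1


def cross_loaded(loadings, threshold=0.500):
    """1-based components on which the item loads above the threshold."""
    return [i + 1 for i, v in enumerate(loadings) if abs(v) > threshold]


# Rows of Table 5 for which all four loadings are printed.
rows = {
    3: [0.511, 0.630, -0.334, 0.110],
    8: [0.508, -0.429, 0.118, 0.517],
    10: [0.238, -0.104, 0.438, 0.526],
}
for item, loadings in rows.items():
    print(item, assign_component(loadings), cross_loaded(loadings))
```

For these three rows the rule gives the same assignments as the text: item 3 to component 2, and items 8 and 10 to component 4.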

After the factors were formed from the items under study, the last stage was to name the four factors based on the characteristics of their members. Factor 1 consists of two items, namely item 1 (the planned stages of project implementation) and item 2 (the division of tasks among group members). Given the characteristics of these two items, the project planning stage is a suitable name for factor 1. Factor 2 consists of three items, namely item 3 (determining the tools and materials needed), item 4 (project processing time), and item 5 (quantity of data). Judging from these items, an appropriate name for factor 2 is the project preparation stage. Factor 3 consists of two items, namely item 6 (data processing) and item 7 (data analysis), so factor 3 is named the project implementation stage. Factor 4 contains three items, namely item 8 (conclusion), item 9 (systematic report writing), and item 10 (presentation). Given the characteristics of these three items, factor 4 is named the final stage or reporting the project.

Therefore, the composition of items in each factor differs from that in Table 2. The final composition of each factor and the items it contains are presented in Table 6. Comparing Table 2 and Table 6, the number of factors in the initial construction (3 factors) differs from the number obtained in the empirical test (4 factors). Thus, it can be said that the project assessment instrument used is invalid in terms of construct validity.

No. | Factors or Criteria | Item | Number of items |
---|---|---|---|
1 | Project planning stage | 1, 2 | 2 |
2 | Project preparation stage | 3, 4, 5 | 3 |
3 | Project implementation stage | 6, 7 | 2 |
4 | Final stage or reporting the project | 8, 9, 10 | 3 |
Total | | | 10 |

The invalidity of the project assessment instrument may be influenced by many things. One argument that can be put forward is that the instrument was not constructed through a theoretical review, since it was taken directly from the book published by the Ministry of Education and Culture for the implementation of Curriculum 2013, with improvements as needed, such as to the language. In fact, in determining factors or criteria in the development of affective domain assessment instruments (including skills), it is important to consider the necessary theories carefully (

Related to this,

Those facts indicate that care must be taken when developing a project assessment instrument, especially when developing the operational definitions that are further developed into the factors or criteria to be assessed. A good operational definition, according to

After conducting the construct validity test, the next step was the inter-rater reliability test. The reliability test was carried out after the project assessment instrument had been adjusted to its final condition, namely with the aspects or items in the instrument arranged and grouped into 4 factors as in Table 5.

For this purpose, the raters (the teachers involved in the research) were briefed at the beginning of the activity to reach a shared understanding of how to use the project assessment instrument and its rubric, including the meaning of each aspect in the rubric. In this way, it was expected that the raters would share the same understanding and that, when they used the instrument to assess student work, their scores would not differ widely.

Furthermore, the level of inter-rater reliability among the ten teachers can be seen from the inter-rater reliability coefficient calculated using the Intraclass Correlation Coefficient (ICC). A summary of the ICC results from SPSS version 21 is presented in Table 7.

Measure | Intraclass Correlation |
---|---|
Single Measures | 0.672 |
Average Measures | 0.953 |
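The two coefficients in Table 7 are linked by the Spearman-Brown relation: for k raters, the average-measures ICC equals k times the single-measures ICC divided by 1 + (k - 1) times the single-measures ICC. The sketch below is a consistency check on the reported values, assuming both come from the same ICC model with k = 10 raters. The function name is hypothetical.

```python
def average_measures_icc(single_icc, n_raters):
    """Spearman-Brown step-up from single-measures to average-measures ICC."""
    return n_raters * single_icc / (1.0 + (n_raters - 1) * single_icc)


print(round(average_measures_icc(0.672, 10), 3))  # 0.953, as in Table 7
```

The single-measures value describes the reliability of one typical rater, while the average-measures value describes the reliability of the mean score across all ten raters, which is why it is much higher.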

Table 7 shows that, for the 10 aspects in the assessment instrument, the average-measures reliability across the raters is 0.953, while the single-measures reliability for an individual rater is 0.672. Referring to the opinion of

The statistical results above are still general, and the variance between raters on each aspect or item needs to be explored further. Such an analysis is important for identifying the aspects of the rubric that still lead to notable differences in interpretation among the raters. Hence, it can be used as a basis for improving the instrument at the next stage.

Table 8 presents a case processing summary that can be used as a basis for examining rater behavior when using the project assessment instrument with a rubric guide. Table 8 shows that 20 data points from the raters' assessments were excluded. The excluded data are those for which the score given by one rater differed greatly from the scores given by the others, from score 1 to score 4 on a certain item or aspect.

Cases | N | % |
---|---|---|
Valid | 50 | 71.4 |
Excluded^{a} | 20 | 28.6 |
Total | 70 | 100 |

The rubric descriptions that produced a wide range of scores between raters occur in item 2 (division of tasks among group members), item 3 (determining the tools and materials needed), item 4 (project processing time), and item 9 (systematics of writing the report). These four aspects have a wide range of scores between raters because the raters understood the rubric descriptions differently. One cause of this different understanding is that the descriptors are unclear and too long.
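One simple way to locate such items is to compute, for each item, the gap between the highest and lowest score given by the raters. The sketch below is an illustration only: the scores are hypothetical, not the study's data, and the cut-off of 2 points for a "wide" range is an assumption.

```python
def score_range(scores):
    """Gap between the highest and lowest score given to one item."""
    return max(scores) - min(scores)


def wide_range_items(scores_by_item, min_range=2):
    """Items whose score range across raters is at least min_range."""
    return [
        item
        for item, scores in scores_by_item.items()
        if score_range(scores) >= min_range
    ]


# Hypothetical scores from ten raters on three items (rubric range 1 to 4).
example = {
    1: [4, 3, 3, 3, 4, 3, 3, 3, 3, 4],
    2: [1, 3, 4, 2, 3, 4, 3, 2, 4, 3],
    9: [2, 4, 3, 1, 4, 3, 2, 4, 3, 3],
}
print(wide_range_items(example))
```

Items flagged in this way are candidates for shorter and clearer descriptors in the rubric.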

Referring to

Drawing from the findings and discussion, we note two important points: (1) to develop a valid project assessment instrument, the constructs that become the factors or criteria to be assessed must be decided carefully, and (2) the descriptors in the rubric should be clear and short, which helps to overcome the raters’ difficulty and promote mutual understanding. A different understanding among raters may weaken the reliability of a project assessment instrument. In this respect, the current study contributes to the work of curriculum developers, especially the authors of mathematics textbooks.

The validity test of the project assessment instrument shows that the instrument used does not have construct validity. The invalidity is indicated by the difference in the number of factors, which changed from 3 factors in the initial construction to 4 factors after the empirical test. It is conjectured that the development of the instrument was not supported by a relevant theoretical review; a representative theoretical review would contribute greatly to the validity of the instrument. However, in terms of inter-rater reliability, the project assessment instrument is reliable and falls in the high category. Several weaknesses emerged during this research. The criteria that are the object of assessment and set out in the project assessment sheet were not based on an in-depth theoretical review. In addition, no further validity testing of the new project assessment sheet has been carried out. In response to these weaknesses, the following suggestions are raised: (a) assessment criteria or aspects should be derived from in-depth theoretical studies, since a deep and strong theoretical basis produces a valid operational definition for each criterion or aspect; (b) once the new project assessment sheet is obtained, the aspects as regrouped by the rotation should be tested further for validity, so that the validity of the new instrument will be known; and (c) teachers or education practitioners should formulate short and clear descriptors in the assessment rubric, because a descriptor that is too long often confuses the teacher and may in the end lead to an incorrect assessment.