The application of GPT-4 in grading design university students’ assignment and providing feedback: An exploratory study
by Qian Huang, Thijs Willems, King Wang Poon https://arxiv.org/abs/2409.17698
Study Aims:
- Investigate if GPT-4 can effectively grade design university assignments and provide useful feedback
- Design projects have subjective nature, leading to grading inconsistencies between raters
- Employ iterative research approach to develop a Custom GPT for reliable results
Findings:
- Inter-reliability: Between GPT and human raters reached acceptable level by providing accurate prompts and iterating a Custom GPT
- Intra-reliability: Consistency of GPT's scoring is between 0.65 and 0.78, indicating reliable results with adequate instructions
- Comparability and consistency: A precondition for educational assessment; Custom GPT adheres to these rules
- Useful feedback: Tested whether Custom GPT can provide students with constructive feedback
- Educator's role: Reflecting on how educators can develop and iterate a Custom GPT as a complementary rater.
ChatGPT's Impact on Education
Revolutionizing Industries:
- ChatGPT has revolutionized various industries, including education (Montenegro-Rueda et al., 2023)
- Potential for ChatGPT to transform educational settings has sparked significant debate and discussion (Arif, Munaf & Ul-Haque, 2023; Hwang & Chen, 2023)
Embracing Technology in Education:
- A growing number of educators recognize the necessity of embracing technology to enhance learning outcomes (Sim, 2023)
- Universities like Arizona State University have integrated ChatGPT into their teaching and research frameworks
Guidelines for Ethical and Effective Use:
- Organizations and educational institutions are beginning to establish guidelines for the ethical and effective use of generative AI in educational contexts (Halaweh, 2023)
Advancements with GPT-4:
- Significant leap forward in AI accuracy and utility in academic contexts (Chen, 2023; Nori et al., 2023)
- Custom GPT models like "Scholar GPT2" have been developed to enhance research productivity and tailor solutions to specific professional needs
Limitations of GPT-4:
- Struggles with reasoning tasks and consistently solving complex problems (Arkoudas, 2023)
- Prone to hallucination and fabrication in responses (Currie, 2023)
- Concerns regarding potential erosion of students' critical thinking and problem-solving skills due to over-reliance on AI (Luckin, 2017)
Ethical Considerations:
- Potential for AI systems, including ChatGPT, to perpetuate biases present in their training data (Weidinger et al., 2022)
- Continuous efforts to audit and refine AI models are essential to ensure fairness and inclusivity (Halaweh, 2023)
Study Objective:
- This study seeks to empirically study the extent to which a Custom GPT-4 model can assist teachers in grading subjective and open-ended assignments, such as design projects
ChatGPT in Education
Applications of ChatGPT:
- Provides personalized learning experiences
- Facilitates administrative tasks
- Reshapes traditional educational landscapes
Educators can leverage ChatGPT:
- Assist in creating and delivering content
- Generate lecture materials, interactive discussions, and tailor assessments to student needs
- Support innovative testing approaches like adaptive testing
Transformative capabilities of ChatGPT in Education:
- Academic writing: Error detection, writing improvement, content generation
- Personalized feedback: Generating personalized feedback for programming assignments
- Preparing teaching material: Designing teaching contents, materials, quizzes
- Analyzing qualitative data: Enhancing research capabilities in various fields
Reliability of Educational Assessment
Inter-Rater Reliability:
- Degree of consistency among multiple raters in their evaluations
- Measured using intraclass correlation coefficient (ICC)
- ICC above 0.85 recommended for high-stakes assessments
- ICC between 0.60 and 0.74 acceptable for normal assessments
- ICC above 0.70 good, indicating strong consistency
Intra-Rater Reliability:
- Degree of consistency of a single rater over time
- Measured using intraclass correlation coefficient (ICC)
- ICC values and interpretation similar to inter-rater reliability
Importance of Ensuring Assessment Reliability:
- Critical for valid and equitable evaluation of student performance
- Intra-rater and inter-rater reliability essential for maintaining consistency and fairness in assessments.
Study Background
- Conducted in a design thinking course at a university in Singapore for first-year students
- Observed that engineering instructors and architecture instructors gave different advice, leading to student confusion on how to balance feedback
- Researchers used ChatGPT-4 to grade design assignments of 10 students (20 posters)
- Employed Design-Based Research (DBR) approach: multiple iterations, prototypes, and testing for continuous refinement
Study Process
-
Part I: Comparison without role-play
- 1st iteration: General GPT grades assignments
- 2nd iteration: ChatGPT-4 with instructor rubrics
- 3rd iteration: Custom GPT using good example as benchmark
-
Part II: Comparison with role-play instructions
- 1st iteration: General GPT for grading (no role-play)
- 2nd iteration: ChatGPT-4 for grading with rubrics (role-play)
- 3rd iteration: Custom GPT, added rubrics and a good example (role-play)
-
Part III: Intra-reliability of GPT assessment
- Checked if custom GPT evaluates assignments consistently over time
Research Questions
- Can ChatGPT accurately grade students' design course assignments?
- Under what conditions can a Custom ChatGPT act as a reliable grader to serve as a complementary rater?
Assumptions
- Assumption 1: Inter-reliability increases progressively in Part I without role-play
- Assumption 2: Inter-reliability increases progressively in Part II with role-play
- Assumption 3: Custom ChatGPT has an intra-reliability greater than 0.5 when scoring the same assignment at different times.
1st Iteration:
- General GPT process: Graded student assignments using three dimensions - Design goal, Site drawing, MacroAEIOU
- Compared to instructor reliability: Low inter-reliability between Architect Instructor and Engineer Instructor (0.167)
- Comparison with GPT: Inter-reliability was slightly higher with Engineer instructor (average scores of two instructors at 0.473), but still below 0.5
2nd Iteration:
- Custom GPT model built using rubrics provided by instructors
- Researchers used screenshots and custom-GPT for scoring assignments with feedback based on rubrics
- Improved inter-reliability between GPT and both instructors (Architect Instructor: 0.5803, Engineer Instructor: 0.5803)
- Inter-reliability with average scores of two instructors also improved (0.5672)
3rd Iteration:
- Revised instructions to Custom-GPT by giving it a good example for reference
- Improved inter-reliability between GPT and both instructors further: Architect Instructor (0.7652), Engineer Instructor (0.7652)
- Inter-reliability with average scores of two instructors reached 0.7285, meeting high-stakes exam requirements.
Assumptions:
- In iterations without role-playing (Part I), inter-reliability increases progressively.
First Iteration: General GPT with Role-play Process
- Let General GPT grade students' assignments from architect instructor and engineer instructor perspectives
- Inter-reliability between Architect GPT and Architect Instructor: 0.1874, poor reliability
- Inter-reliability between Engineer GPT and Engineer Instructor: 0.5423, moderate reliability
- Inter-reliability between two GPts: 0.5407, moderate reliability
- Comparison of ICC Reliability for Iteration 1 (Part II with role-playing) in Table 4
Second Iteration: Custom GPT (with rubrics added, with role-playing)
- Added rubrics used by both instructors and conducted role-play to score from their perspectives
- Inter-reliability between Architect GPT and Architect Instructor: 0.5150, moderate reliability +0.2068
- Inter-reliability between Engineer GPT and Engineer Instructor: 0.7652, high reliability -0.3024
- Inter-reliability between two GPts: 0.5407 - 0.0132, moderate reliability
- Comparison of ICC Reliability for Iteration 2 (Part II with role-playing) in Table 5
Third Iteration: Custom GPT (with a good example added, with role-play)
- Included a good example and conducted scoring from both instructors' perspectives
- Inter-reliability between Architect GPT and Architect Instructor: 0.7199, high reliability +0.3257
- Inter-reliability between Engineer GPT and Engineer Instructor: 0.5554, moderate reliability +0.3155
- Inter-reliability between two GPts: 0.5536, moderate reliability +0.0261
- Comparison of ICC Reliability for Iteration 3 (Part II with role-playing) in Table 6
Discussion:
- The results show that inter-reliability increases progressively as the GPts adapt to the nuanced expectations of human instructors.
- In each iteration with role-playing, the inter-reliability between Architecture GPT and Engineering GPT is above 0.5 while inter-reliability between human Architecture and Engineering instructors is only 0.167.
- Possible reasons for higher ICC results in general GPts compared to custom GPts:
- Roleplaying adds complexity as the model simulates different instructional perspectives.
- Custom GPt struggles to adapt its grading process when switching between these varied perspectives without a good example.
GPT Scoring Reliability: Intra-Reliability Study Results
Assumption 3 Verification:
- Researchers checked GPT's scoring consistency across different times using General GPT
- Three comparisons were conducted on different days
Project Scores Over Time (ICC Values):
Project | 1st time | 2nd time | 3rd time | ICC with 1st | ICC between 2nd and 3rd | ICC between 1st and 3rd |
---|---|---|---|---|---|---|
Accessed on: May 21, 2024 | 89 | 92 | 89 | 0.6485 (ICC) | N/A | N/A |
Accessed on: May 27, 2024 | 94 | 93 | N/A | 0.7336 (ICC between 1st and 2nd) | N/A | N/A |
Accessed on: June 3, 2024 | 78 | 79 | N/A | 0.7750 (ICC between 2nd and 3rd) | N/A | 0.6983 (ICC between 1st and 3rd) |
Assumption 3 Conclusion:
- High intra-reliability of GPT is confirmed with an ICC greater than 0.5 for repeated scoring at different times.
GPT's Role in Providing Timely Feedback for Design Students:
- Contributes significantly to providing feedback on a large scale
- Instructors lack capacity for individualized, continuous feedback
- Grades assignments with personalized feedback from different perspectives (engineer and architect) (Screenshot 3)
- Synthesizes opinions of multiple instructors for balanced feedback (Screenshot 4)
- Compares students' work to renowned designers, providing suggestions for improvement (Screenshot 8)
- Provides round-the-clock personalized feedback and stimulates ideas
- Serves as a co-worker during instructors' grading process.
GPT's Impact on Design Students:
- Receives timely, continuous feedback from GPT
- Balances different perspectives in their work through feedback (Screenshot 3)
- Aspires to reach or exceed industry benchmarks by learning from renowned designers (Screenshot 8)
- Receives constructive criticism and suggestions for improvement.
Study Findings on Using ChatGPT for Educational Grading:
Iterative Process:
- Study demonstrates two lines of iteration on how GPT can effectively grade students' design assignments
- High inter-reliability and intra-reliability achieved with appropriate instruction
Guidelines for Customizing GPT:
- Provide accurate and consistent prompts
- Establish rubrics consistent with instructor's use
- Give a clear example to clarify good project/assignment
Implications of Findings:
- ChatGPT can serve as a co-worker for instructors in educational settings
- Importance of structured rater training, clear rubrics, and regular feedback mechanisms
- Innovations in tools and methods for training and calibrating raters essential to maintain fairness
Use of GPT Models:
- Provide consistent, unbiased scoring guidelines
- Assist human raters in understanding and correcting biases
- Concerns regarding transparency, potential biases, and reduced critical engagement by human raters
Custom-GPT's Impact on Learning:
- Promotes personalized learning for students (Baidoo-Anu & Owusu Ansah, 2023)
- Aligns with insights from recent studies on language learning (Dai et al., 2023)
Productivity Enhancement:
- GPT can increase productivity for educators by providing detailed feedback and interacting in formative assessments.