Performance and advantages of OpenAI ChatGPT on the VNHSGE English dataset

2025-09-11 13:42:40

1. Introduction

With the rapid development of artificial intelligence (AI) technology, large language models (LLMs), such as OpenAI's ChatGPT, Microsoft's Bing Chat, and Google's Bard, have achieved remarkable results in various fields. The potential of LLMs in education has been widely explored, especially their impact on language learning, grammatical analysis, and personalized learning. In Vietnam, high school English education continues to face significant challenges in improving teaching quality and student English proficiency. The English test of the Vietnam National High School Graduation Examination (VNHSGE), a key measure of students' English proficiency, has become an important benchmark for evaluating the effectiveness of educational technology tools.

This paper explores the performance and advantages of OpenAI's ChatGPT on the VNHSGE English dataset, focusing on performance comparisons with Microsoft Bing Chat and Google Bard. By comparing the performance of these three large-scale language models, this paper attempts to reveal the potential of ChatGPT in Vietnamese education and its advantages in practical teaching. Analysis of experimental results will provide a better understanding of the application prospects of LLMs in Vietnamese English teaching, particularly in integrating exam-oriented education with daily learning.

2. Literature Review

2.1 Application of Large Language Models in Education

Large Language Models (LLMs), such as the GPT series, BERT, and T5, have made revolutionary progress in natural language processing (NLP) in recent years. Trained on massive amounts of text data, they generate high-quality text output and perform well in a variety of NLP tasks, including machine translation, text summarization, sentiment analysis, and question-answering systems. In education, the application of LLMs is gradually expanding, particularly in language learning and intelligent tutoring systems. A growing body of research demonstrates that LLMs can provide personalized learning experiences, helping students receive immediate feedback during their learning process, thereby improving the efficiency and effectiveness of language learning.

In Vietnam, English education faces challenges such as teacher shortages and uneven distribution of educational resources. Providing students with supplementary learning tools using LLMs can alleviate these problems to a certain extent. In particular, for standardized tests like the VNHSGE, LLMs can provide students with real-time practice and mock exams, improving their test-taking skills.

2.2 VNHSGE Dataset Overview

The VNHSGE English dataset is drawn from the English portion of the Vietnam National High School Graduation Examination and covers a wide range of language tasks, including listening comprehension, reading comprehension, grammatical selection, writing, and translation. The dataset is not only an important tool for measuring the English proficiency of Vietnamese high school students but also a valuable resource for research on using technology to improve English education. Testing models on the VNHSGE dataset helps researchers evaluate how well different language processing tools work in real-world education, particularly on standardized tests.

2.3 Comparison of OpenAI ChatGPT, Bing Chat, and Bard

OpenAI's ChatGPT (based on GPT-3.5), Microsoft's Bing Chat, and Google's Bard are three of the most representative language modeling tools currently on the market. While all three tools are based on similar deep learning architectures, they differ in their specific implementation details and optimization approaches.

According to Xuan-Quy Dao (2023), "Comparison of Large-Scale Language Models on the VNHSGE English Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard," Bing Chat achieved the highest accuracy on the VNHSGE English dataset at 92.4%, ahead of Bard (86%) and ChatGPT (79.2%). The study reported that Bing Chat excelled in response speed and grammatical comprehension, enabling it to provide more accurate feedback on VNHSGE English practice tests.

However, despite Bing Chat's accuracy advantage, ChatGPT continues to attract widespread attention for its flexible text generation capabilities and robust interactivity. It shows particular strengths when addressing grammatical and language generation challenges. Furthermore, while Bard's accuracy on the VNHSGE dataset fell short of Bing Chat's, its close integration with search gives it a competitive edge in specific application scenarios.

3. Methodology

3.1 Data Source and Selection

This study uses the VNHSGE dataset, which contains English test questions taken by Vietnamese high school students in the national graduation exam. The VNHSGE dataset consists of four main sections: listening comprehension, reading comprehension, grammar and vocabulary selection, and writing tasks. Together, these sections reflect students' overall English proficiency. This study focuses on evaluating the models' performance in reading comprehension, grammar selection, and writing tasks. Analyzing the test results from these sections allows us to assess the feasibility of the three large language models for English education in Vietnam.

3.2 Evaluation Criteria

To objectively evaluate the performance of ChatGPT, Bing Chat, and Bard, this study used the following criteria:

Accuracy: Measures the correctness of each model's answers, i.e., the ratio of questions the model answered correctly to the total number of questions.

Generated Text Quality: Evaluates the fluency, grammatical correctness, and semantic consistency of the text the model generates in writing and translation tasks.

Response Time: Measures the model's response speed to each question or task to assess its real-time feedback capabilities.

Task Adaptability: Analyzes the model's adaptability and performance stability across different problem types (e.g., grammar, reading comprehension, and translation).
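
The accuracy criterion above, and the per-task breakdown implied by task adaptability, can be sketched as follows. This is a minimal illustration, not the scoring code used in the study: the function names, the record layout (task type, predicted answer, gold answer), and the toy records are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy(predictions, answers):
    """Return the fraction of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_by_task(records):
    """Return accuracy per task type.

    records is an iterable of (task_type, predicted, gold) tuples;
    this layout is a hypothetical choice for the sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for task, predicted, gold in records:
        totals[task] += 1
        hits[task] += int(predicted == gold)
    return {task: hits[task] / totals[task] for task in totals}


# Hypothetical toy records, not taken from the VNHSGE dataset.
records = [
    ("grammar", "A", "A"),
    ("grammar", "B", "C"),
    ("reading", "D", "D"),
    ("reading", "A", "A"),
]

overall = accuracy([r[1] for r in records], [r[2] for r in records])
per_task = accuracy_by_task(records)
print(overall)   # 0.75
print(per_task)  # {'grammar': 0.5, 'reading': 1.0}
```

Comparing the per-task values against the overall figure is one simple way to see whether a model's performance is stable across question types.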

3.3 Experimental Setup

The experimental setup includes three main steps: data preprocessing, model selection and testing, and performance evaluation. During the data preprocessing phase, we categorized all questions in the VNHSGE dataset by task type and removed noisy entries to ensure data accuracy. We then selected three models for comparative testing: OpenAI's ChatGPT, Microsoft's Bing Chat, and Google's Bard. Each model completed five simulated exams on each VNHSGE task, producing the performance data analyzed below. All experiments were conducted in the same hardware environment to ensure comparability of the results.
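
Since each model sits five simulated exams per task, the per-run scores need to be combined into a single figure. One plausible way to do this, sketched here under the assumption that each run yields one accuracy value, is to report the mean and the sample standard deviation. The model names and scores below are placeholders for illustration only and are not results from this study.

```python
from statistics import mean, stdev


def summarize_runs(run_scores):
    """Map each model name to (mean, sample std dev) of its per-run accuracies."""
    summary = {}
    for model, scores in run_scores.items():
        if len(scores) < 2:
            raise ValueError(f"need at least two runs for {model}")
        summary[model] = (mean(scores), stdev(scores))
    return summary


# Placeholder values for two hypothetical models, five runs each.
run_scores = {
    "model_a": [0.80, 0.78, 0.82, 0.79, 0.81],
    "model_b": [0.90, 0.92, 0.91, 0.93, 0.89],
}

summary = summarize_runs(run_scores)
for model, (avg, sd) in summary.items():
    print(f"{model}: mean={avg:.3f}, sd={sd:.3f}")
```

Reporting the spread alongside the mean shows whether a gap between two models is larger than the run-to-run variation.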

4. Experimental Results and Discussion

4.1 Performance Comparison

According to the experimental results, Bing Chat achieved the best performance on the VNHSGE dataset, with an accuracy of 92.4%. This is substantially higher than the accuracy of Bard (86%) and ChatGPT (79.2%). A detailed analysis revealed that Bing Chat had clear advantages in grammar selection and reading comprehension tasks, accurately understanding and answering questions involving complex grammatical structures. Bing Chat also responded quickly, providing rapid feedback that suits students' need for fast-paced practice.
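
The size of these gaps can be made explicit in percentage points. The short sketch below uses only the three accuracy figures reported above; the helper name `gaps_to_best` is a hypothetical choice for the example.

```python
# Accuracy figures (in percent) as reported in the text.
reported = {"Bing Chat": 92.4, "ChatGPT": 79.2, "Bard": 86.0}


def gaps_to_best(scores):
    """Return the top-scoring model and each other model's gap to it, in percentage points."""
    best_model = max(scores, key=scores.get)
    best = scores[best_model]
    gaps = {
        model: round(best - score, 1)
        for model, score in scores.items()
        if model != best_model
    }
    return best_model, gaps


best_model, gaps = gaps_to_best(reported)
print(best_model, gaps)  # Bing Chat {'ChatGPT': 13.2, 'Bard': 6.4}
```

These are differences in percentage points, not relative improvements, and they say nothing on their own about statistical significance.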

In contrast, while ChatGPT was weaker on some tasks, its generative capabilities and flexibility still gave it an advantage in some writing and translation tasks. In these tasks, ChatGPT generated relatively natural and coherent text, despite occasional minor grammatical errors or inappropriate vocabulary choices. ChatGPT adapted particularly well to writing tasks requiring creativity and contextual coherence.

Bard's performance was relatively mediocre. While it performed reasonably well in grammar comprehension and reading comprehension tasks, it was relatively weak in terms of the fluency and linguistic diversity of generated text. This result suggests that, despite its strong capabilities in the search engine domain, Bard still has certain limitations in language generation.

4.2 Model Educational Potential

From the perspective of educational applications, each of the three models has its own strengths and weaknesses. With its high accuracy and fast response time, Bing Chat is likely to become the preferred tool for English education in Vietnamese high schools, especially for simulations and practice for the VNHSGE English test. ChatGPT, with its strong language generation capabilities, is suitable as a supplementary tool for writing and translation tasks, particularly those requiring creative thinking and expression.

Bard, on the other hand, is suitable for scenarios requiring immediate query or information retrieval, but its language generation and depth of task understanding may require further optimization.

4.3 Model Limitations

Although the three models differ in performance, they share certain limitations. First, when dealing with Vietnamese-specific expressions and cultural context, all three models fail to fully adapt to local requirements. For example, Bing Chat and Bard fail to provide accurate answers for certain culturally specific idioms or vocabulary. ChatGPT performs well in language generation but occasionally produces inaccurate responses when dealing with localized language differences and complex sentence structures.

Furthermore, all three models performed imperfectly on writing tasks. While they were able to generate fluent text, there is still room for improvement in grammatical accuracy and word usage. In particular, when dealing with student essays containing numerous grammatical errors, the models failed to provide sufficiently precise feedback, potentially impacting learning outcomes.

5. Conclusions and Future Research Directions

5.1 Summary of Research Findings

By comparing the performance of OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard on the VNHSGE English dataset, this study found that Bing Chat achieved the highest overall accuracy, while ChatGPT retained distinct advantages in creative text generation and contextual coherence. Bard performed worse than Bing Chat and ChatGPT on certain tasks, notably text generation, but it still has potential for application in specific scenarios.

5.2 Impact on Education

This study provides valuable insights for the Vietnamese education system, particularly regarding standardized testing in English education and the application of technology in everyday learning. Large language models can serve as a supplementary tool to help students improve their English proficiency, particularly in exam-oriented education settings, by facilitating mock exams and rapid practice.

5.3 Future Research Directions

Future research can focus on localized model optimization, particularly on improving model accuracy and adaptability when handling cultural differences and grammatical complexity. Furthermore, as models continue to evolve, applying these models to a wider range of educational scenarios, such as classroom interaction and teacher assistance, will also be an important research direction.
