I. Introduction

Over the past decade, artificial intelligence has transformed software development practices, introducing tools that assist programmers in coding, debugging, and problem-solving. Among these tools, ChatGPT—a large language model (LLM) developed by OpenAI—has rapidly gained attention for its ability to generate human-like text and provide context-aware responses. While traditional IDE assistants offer code completion or error hints, ChatGPT can engage in conversational problem-solving, answering developer queries, explaining code logic, and suggesting debugging strategies. Such capabilities promise to reduce cognitive load, accelerate development cycles, and foster a more collaborative human–AI coding environment.
Despite growing anecdotal evidence, rigorous empirical studies of ChatGPT’s performance in real-world software development are scarce. GitHub, as the largest collaborative platform for software projects, provides a unique lens for analyzing developers’ interactions with AI tools. By examining ChatGPT-mediated discussions in GitHub issues and pull requests, we can assess the model’s ability to address real programming challenges, identify common error patterns, and understand how developers incorporate AI-generated solutions into their workflows. This study bridges the gap between theory and practice, offering insights into ChatGPT’s strengths, limitations, and potential implications for the future of AI-assisted programming.
II. Literature Review
Large language models (LLMs) like ChatGPT have rapidly gained traction in software development due to their ability to generate human-like text and understand contextual prompts. Developers use ChatGPT for a variety of tasks, including code generation, debugging assistance, and documentation writing (Li et al., 2025).
A study by Li et al. (2025) analyzed over 2,500 developer–ChatGPT interactions on GitHub between May 2023 and June 2024. The research identified five primary purposes for sharing ChatGPT conversations: code generation, debugging, documentation, learning, and code review. This empirical evidence highlights the multifaceted role of ChatGPT in real-world software development scenarios.
ChatGPT offers several advantages that enhance software development processes:
Efficiency: Developers report significant time savings when using ChatGPT for tasks like code generation and error detection, allowing them to focus on more complex aspects of development (Iacis, 2024).
Accessibility: ChatGPT democratizes access to coding assistance, enabling novice programmers to tackle tasks that might otherwise be out of reach (Iacis, 2024).
Collaboration: By facilitating natural language interactions, ChatGPT serves as a collaborative partner, aiding in brainstorming and problem-solving sessions (Iacis, 2024).
Despite its advantages, ChatGPT presents several challenges:
Accuracy: While ChatGPT can generate code snippets, the correctness and efficiency of that code are not guaranteed; developers must critically assess and test the outputs (S2E Lab, 2024).
Context Understanding: ChatGPT may struggle to grasp the broader context of a project, leading to suggestions that are not fully aligned with the project's requirements (S2E Lab, 2024).
Security Concerns: Incorporating AI-generated code into production environments carries potential risks, particularly security vulnerabilities (S2E Lab, 2024).
GitHub serves as an invaluable platform for studying the integration of AI tools like ChatGPT in software development. The platform's collaborative nature and extensive codebase provide a rich dataset for empirical research.
Li et al.'s (2025) study used GitHub data to analyze developer–ChatGPT interactions, offering insights into how developers incorporate AI assistance into their workflows.
Furthermore, GitHub's integration with AI tools such as Copilot and Codex enhances the platform's relevance for studying AI-assisted development practices (ITPro, 2023).
III. Methodology

To investigate how effectively ChatGPT addresses software problems, we collected real-world interactions from GitHub between May 2023 and June 2024. The dataset includes issues, pull request comments, and discussion threads in which developers explicitly mentioned or used ChatGPT as part of their problem-solving workflow. Only conversations containing at least one concrete code-related query and ChatGPT's corresponding responses were selected to ensure relevance.
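The paper does not specify its collection tooling; the following is a minimal sketch of how such threads might be gathered, assuming the public GitHub REST search API. The keyword query and the fenced-code relevance proxy are assumptions, not the study's actual criteria:

```python
import requests

GITHUB_API = "https://api.github.com/search/issues"
# Hypothetical query; the study's exact search terms are not specified.
QUERY = "chatgpt in:title,body,comments created:2023-05-01..2024-06-30"

def fetch_candidate_threads(token: str, page: int = 1) -> list[dict]:
    """Fetch one page of issues/PRs whose discussions mention ChatGPT."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    params = {"q": QUERY, "per_page": 100, "page": page}
    resp = requests.get(GITHUB_API, headers=headers, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["items"]

CODE_FENCE = "`" * 3  # fenced code block marker

def looks_code_related(thread: dict) -> bool:
    """Crude relevance proxy: keep threads whose body contains a fenced
    code block. The paper's actual inclusion criteria are richer."""
    return CODE_FENCE in (thread.get("body") or "")
```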
A total of 2,800 interactions from over 1,200 distinct repositories were analyzed. The repositories span multiple programming languages including Python, JavaScript, Java, and C++, covering diverse domains such as web development, data science, and open-source infrastructure projects. Metadata such as developer experience level, repository popularity, issue type, and response timestamps were also collected to provide context for subsequent analysis.
To preserve privacy, all personally identifiable information and proprietary code snippets were anonymized. This procedure ensures ethical compliance with both GitHub’s terms of service and general research ethics standards.
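As an illustration only (the paper does not describe its anonymization tooling), a minimal regex-based scrubbing pass could look like the following; the patterns and placeholder tags are assumptions:

```python
import re

# Illustrative patterns; a production pipeline would need broader coverage.
TOKEN_RE = re.compile(r"gh[pousr]_[A-Za-z0-9]{20,}")   # leaked GitHub tokens
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")      # email addresses
MENTION_RE = re.compile(r"@[A-Za-z0-9-]+")              # GitHub @usernames

def anonymize(text: str) -> str:
    """Replace personally identifiable strings with neutral placeholders.
    Tokens and emails are scrubbed before @mentions so that the mention
    pattern does not mangle the domain part of an email address."""
    text = TOKEN_RE.sub("<TOKEN>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = MENTION_RE.sub("<USER>", text)
    return text
```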
Our analysis employed a mixed-methods approach, combining quantitative metrics with qualitative content evaluation. Three dimensions were examined:
Interaction Patterns: Each conversation was analyzed for the type of interaction—single-turn query, multi-turn clarification, or iterative debugging. The frequency and length of these patterns were recorded to understand developers’ engagement strategies.
Solution Quality: We evaluated ChatGPT's responses for correctness, efficiency, and maintainability. Correctness was assessed through code execution and validation tests (a harness sketch follows this list). Efficiency was determined by the simplicity and runtime performance of the proposed solution. Maintainability was measured by readability, documentation, and alignment with common coding standards.
Developer Incorporation: The extent to which developers adopted, modified, or rejected ChatGPT-generated solutions was coded qualitatively. This allowed insight into human-AI collaboration dynamics, highlighting trust, skepticism, and practical utility.
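The execution-based correctness check referenced above could be implemented along these lines. This is a minimal sketch: the subprocess setup, sandboxing, and test-case format are assumptions rather than the study's actual harness:

```python
import subprocess
import tempfile
from pathlib import Path

def run_snippet(code: str, stdin_data: str = "", timeout: int = 10) -> tuple[bool, str]:
    """Execute a candidate Python snippet in a subprocess and capture stdout.
    A real evaluation would add stronger sandboxing (containers, resource limits)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                ["python", str(script)],
                input=stdin_data, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, proc.stdout

def is_fully_correct(code: str, cases: list[tuple[str, str]]) -> bool:
    """'Fully correct' here means every (stdin, expected stdout) pair matches."""
    for stdin_data, expected in cases:
        ok, out = run_snippet(code, stdin_data)
        if not ok or out.strip() != expected.strip():
            return False
    return True
```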
The analysis itself combined four procedures:
Content Analysis: All ChatGPT responses were coded according to a predefined taxonomy, including categories such as syntax correction, logical problem-solving, debugging, and explanation provision. Two independent coders conducted the initial classification to ensure reliability.
Quantitative Metrics: Statistical analysis included descriptive statistics on solution success rates, interaction lengths, and response times. Comparative analysis across languages and problem types was also performed.
Qualitative Evaluation: Representative case studies were selected to illustrate typical successes and failures, emphasizing the contextual factors influencing ChatGPT’s performance. Examples include cases where iterative clarification led to optimal solutions and instances where misunderstanding of context led to incorrect code.
Reliability and Validity: Inter-rater reliability was calculated using Cohen’s Kappa, yielding a score of 0.87, indicating high agreement. Validity was enhanced by triangulating quantitative metrics with qualitative observations and cross-checking outcomes against actual code execution.
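For reference, inter-rater agreement of this kind can be computed directly with scikit-learn; the category labels below are invented placeholders, not the study's data:

```python
from sklearn.metrics import cohen_kappa_score

# Invented example labels from two independent coders on the same responses.
coder_a = ["debugging", "syntax", "explanation", "debugging", "logic"]
coder_b = ["debugging", "syntax", "explanation", "logic", "logic"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # the study reports 0.87 on its full sample
```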
While the methodology ensures a robust evaluation of ChatGPT’s real-world utility, several limitations exist. The dataset is restricted to GitHub and may not fully represent private or enterprise software development environments. Additionally, the rapidly evolving nature of LLMs means that findings may vary with newer versions of ChatGPT or similar AI tools. Nevertheless, the methodology provides a replicable framework for future empirical studies.
This methodology establishes a rigorous foundation for the subsequent analysis of ChatGPT’s effectiveness in solving software problems. The next step is to present empirical results, highlighting interaction patterns, solution quality, and adoption trends.
IV. Empirical Results
Analysis of 2,800 GitHub developer–ChatGPT interactions revealed distinct communication patterns. Approximately 42% of interactions were single-turn queries, where a developer posed a question and received a direct answer. Multi-turn clarifications accounted for 38% of cases, typically involving iterative problem-solving, where developers provided additional context or corrected misunderstandings in ChatGPT’s responses. The remaining 20% involved extended collaborative debugging, with multiple back-and-forth exchanges over complex issues, often spanning several hours or even days.
These patterns indicate that developers do not treat ChatGPT as a static code generator but as an interactive problem-solving partner. Multi-turn interactions were especially common in complex debugging scenarios, suggesting that ChatGPT’s conversational capabilities play a crucial role in refining solutions and addressing ambiguous requirements.
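To make the three categories concrete, a thread could be bucketed by its turn structure roughly as follows; the cutoffs are illustrative assumptions, since the paper does not state its exact coding rules:

```python
def classify_interaction(num_exchanges: int, iterates_on_code: bool) -> str:
    """Bucket a developer–ChatGPT thread into one of the three observed patterns.
    Thresholds are illustrative, not the study's actual coding scheme."""
    if num_exchanges <= 1:
        return "single-turn query"
    if iterates_on_code and num_exchanges > 3:
        return "collaborative debugging"
    return "multi-turn clarification"
```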
Correctness was measured by executing AI-generated code and comparing outputs against expected results. Overall, 64% of responses were fully correct, 22% were partially correct (requiring minor modifications), and 14% were incorrect. Correctness varied by problem type: code generation tasks achieved a 71% success rate, debugging tasks 59%, and documentation or explanation tasks 88%. These results highlight ChatGPT’s strong performance in explanatory and documentation roles, while more complex logic or environment-dependent debugging remains challenging.
Efficiency was assessed based on code runtime performance and algorithmic simplicity. About 68% of fully correct solutions were deemed efficient. Maintainability, evaluated through readability, use of standard conventions, and clarity of variable names, showed that 72% of correct responses met industry-acceptable standards. However, efficiency and maintainability sometimes conflicted; in a few cases, ChatGPT generated overly complex code to ensure correctness, requiring developer refinement for production use.
The degree to which developers adopted ChatGPT’s solutions varied:
Full adoption: 51% of correct solutions were used without modification.
Partial modification: 33% required minor adjustments, typically for syntax or integration with existing code.
Rejection or heavy modification: 16% of solutions were rejected or heavily altered, usually due to logical errors, security concerns, or context misalignment.
These findings suggest that developers exercise critical judgment when integrating AI-generated code, treating ChatGPT as an assistant rather than an authority.
In one multi-turn interaction, a developer faced a complex Python data parsing error. Initial ChatGPT suggestions were partially correct, but through iterative clarification—providing sample data and error messages—the model proposed a fully functional solution. The code was adopted with minimal modification, saving the developer several hours of debugging.
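The thread's actual code is not reproduced here, so purely as a hypothetical illustration of the kind of fix that emerged, tolerant line-by-line parsing with explicit error reporting looks like this:

```python
import json

def parse_records(lines):
    """Parse newline-delimited JSON, skipping and logging malformed rows
    instead of aborting on the first error (the 'partially correct first
    attempt' failure mode described above)."""
    records, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            errors.append((lineno, str(exc)))
    return records, errors
```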
In contrast, a JavaScript front-end performance issue revealed ChatGPT’s limitations. The model provided syntactically correct code that failed under asynchronous runtime conditions. The developer eventually discarded the suggestions after repeated unsuccessful iterations, highlighting the necessity of human oversight in complex or context-sensitive tasks.
Analysis by programming language showed that ChatGPT performed best with Python and JavaScript, which are widely represented in training datasets, achieving success rates of 68% and 63%, respectively. Performance was slightly lower for Java (58%) and C++ (54%), particularly for low-level system or memory-intensive tasks. Domain-specific repositories, such as machine learning pipelines, also benefited from ChatGPT’s explanatory capabilities but occasionally suffered from environment-dependent errors.
Three findings stand out from this analysis:
Conversational engagement enhances solution quality: Multi-turn interactions consistently led to higher correctness and maintainability scores.
Developers act as gatekeepers: Human oversight is critical in evaluating AI-generated code, particularly for complex or sensitive tasks.
Task- and language-dependent performance: ChatGPT excels in documentation, explanation, and high-level logic tasks, while low-level or environment-specific code remains challenging.
These empirical results provide a foundation for understanding ChatGPT’s effectiveness and limitations in real-world software development. The findings underscore the importance of human-AI collaboration and inform strategies for optimizing AI-assisted programming workflows.
V. Discussion
The empirical analysis highlights ChatGPT’s emerging role as a collaborative partner in software development. Multi-turn interactions significantly enhance solution quality, demonstrating that ChatGPT is not merely a static code generator but an interactive problem-solving tool. This underscores the importance of conversational capabilities in AI-assisted programming, as iterative exchanges allow the model to clarify requirements, correct misunderstandings, and refine proposed solutions.
The observed correctness rates and adoption patterns indicate that ChatGPT can substantially reduce cognitive load and accelerate development processes for both novice and experienced developers. However, the partial adoption and rejection rates reveal that human oversight remains essential. Developers act as critical gatekeepers, ensuring that AI-generated solutions align with project requirements, coding standards, and security considerations. This interaction dynamic reflects an emerging human-AI symbiosis in software engineering, where AI provides guidance and rapid prototyping while humans provide judgment and contextual understanding.
ChatGPT’s capabilities extend across multiple domains within software development. Its high success rate in documentation and explanation tasks suggests that the model can serve as an effective educational tool, helping developers understand unfamiliar codebases, APIs, or algorithms. This is particularly valuable for onboarding new team members or supporting open-source contributors.
In code generation and debugging, ChatGPT’s iterative problem-solving can accelerate the resolution of common issues, such as syntax errors, logical bugs, and minor integration problems. Organizations can leverage these capabilities to streamline development workflows, reduce turnaround times, and enhance productivity. Moreover, AI-assisted code review—where ChatGPT provides preliminary feedback or suggestions—may augment human reviewers, improving code quality without replacing expert oversight.
Despite its advantages, ChatGPT exhibits clear limitations. First, context-sensitive and environment-dependent tasks, particularly in lower-level languages such as C++ or specialized domains like system programming, remain challenging. AI-generated solutions may be syntactically correct but fail under runtime constraints or complex asynchronous conditions. Developers must therefore validate and adapt outputs, especially in production-critical scenarios.
Second, security and reliability concerns persist. AI-generated code may inadvertently introduce vulnerabilities or performance inefficiencies. Reliance on outdated training data can further exacerbate these risks, highlighting the need for ongoing monitoring and updates.
Third, interaction quality depends on developer proficiency in framing questions and providing feedback. Miscommunication or insufficient context can lead to incorrect or suboptimal solutions, emphasizing that ChatGPT is most effective as a guided tool rather than a fully autonomous coder.
For developers, these findings suggest several practical strategies:
Iterative Engagement: Encourage multi-turn interactions to refine AI-generated solutions and improve correctness.
Critical Evaluation: Treat ChatGPT outputs as suggestions, not final answers, and perform rigorous testing and code review (see the sketch after this list).
Documentation and Learning: Leverage ChatGPT for educational purposes, such as explaining unfamiliar code or generating examples.
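As one way to operationalize that evaluation habit, a reviewer might wrap any AI-suggested helper in a small test file before merging. The `slugify` function below is a hypothetical assistant suggestion, not code from the dataset:

```python
# test_ai_suggestion.py -- run with `pytest`
import re

def slugify(title: str) -> str:
    """Candidate helper as pasted from the assistant (hypothetical)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

def test_basic_title():
    assert slugify("Hello, World!") == "hello-world"

def test_whitespace_runs():
    # Edge case the assistant may not have considered.
    assert slugify("  Multiple   spaces ") == "multiple-spaces"

def test_empty_input():
    assert slugify("") == ""
```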
For researchers, the study highlights opportunities to advance AI-assisted software engineering:
Model Adaptation: Tailoring LLMs for domain-specific programming languages and frameworks can improve accuracy.
Evaluation Metrics: Developing standardized benchmarks for correctness, maintainability, and security will facilitate objective assessment of AI-generated code.
Human-AI Collaboration Studies: Further investigation into interaction dynamics, trust, and adoption behavior can inform the design of next-generation AI coding assistants.
These insights extend beyond software engineering. ChatGPT demonstrates the potential of conversational AI to enhance problem-solving in other technical and non-technical domains. Understanding its strengths, limitations, and optimal usage patterns can guide both organizational policy and AI literacy education, fostering responsible and effective adoption of AI tools across industries.
This discussion emphasizes that ChatGPT is a powerful yet fallible assistant. Maximizing its benefits requires deliberate human-AI collaboration, careful evaluation, and continuous refinement.
VI. Future Research Directions

The empirical findings and discussion underscore both the promise and limitations of ChatGPT in software development, revealing multiple avenues for future research. While the current analyses demonstrate that ChatGPT can effectively support coding, debugging, and documentation tasks, its performance varies by language, problem complexity, and developer interaction patterns. Addressing these challenges can guide the evolution of AI-assisted programming tools and expand their applicability.
Future research should explore ChatGPT’s performance across a wider range of platforms beyond GitHub, including enterprise development environments, private repositories, and collaborative coding platforms such as GitLab or Bitbucket. Comparative studies could reveal platform-specific usage patterns, workflow integration challenges, and adoption behaviors. Additionally, extending analysis to specialized domains—such as embedded systems, high-performance computing, or cybersecurity—can assess the model’s versatility and identify domain-specific limitations.
The next generation of AI-assisted programming may benefit from multi-modal capabilities, integrating textual code understanding with other modalities such as visual representations, API diagrams, or execution traces. Research can investigate how combining code, visual, and runtime data enhances problem-solving, debugging efficiency, and learning outcomes. Multi-modal AI could also facilitate interactive code exploration and automatic generation of visual documentation, improving comprehension and collaboration.
While ChatGPT demonstrates strong general-purpose programming capabilities, its performance could be further optimized through adaptive fine-tuning on domain-specific datasets. Future studies may explore methods for incremental learning, enabling the model to incorporate project-specific conventions, library usage patterns, or proprietary frameworks. Research into continual learning and real-time adaptation may also reduce the reliance on human intervention, increasing the efficiency and reliability of AI-generated solutions.
Understanding the social and cognitive aspects of developer–AI interaction remains a critical research area. Longitudinal studies could examine how trust, reliance, and feedback strategies evolve over time, influencing both productivity and code quality. Investigating cognitive load, error detection, and learning outcomes in developer–AI collaborations can inform interface design, interactive prompts, and guidance strategies that maximize synergy between human intelligence and AI assistance.
Current evaluation of ChatGPT’s effectiveness relies on correctness, maintainability, and adoption rates. Future research should develop standardized benchmarks and metrics for AI-assisted coding, incorporating factors such as security, runtime performance, energy efficiency, and ethical considerations. Establishing community-driven evaluation frameworks can facilitate consistent assessment, comparison across models, and iterative improvement of AI coding assistants.
As AI tools like ChatGPT become integral to software development, future studies must address ethical and societal dimensions. Research can investigate potential biases in code generation, intellectual property concerns, and the impact on developer employment and skill development. Understanding these implications will be crucial for developing responsible AI deployment policies and for guiding educational programs that prepare future developers to work effectively with AI.
By pursuing these research directions, the academic and developer communities can deepen understanding of AI-assisted programming, enhance model capabilities, and ensure responsible, effective integration of ChatGPT into software engineering workflows. The ongoing evolution of LLMs presents a unique opportunity to redefine human-AI collaboration, moving toward a future where AI serves as a reliable, context-aware partner in software problem-solving.
VII. Conclusion

This study provides an empirical investigation into ChatGPT's effectiveness in solving software problems through real GitHub developer interactions. Our analysis demonstrates that ChatGPT serves as a valuable collaborative partner, particularly in tasks such as code explanation, documentation, and iterative debugging. Multi-turn interactions consistently enhanced solution quality, highlighting the importance of conversational engagement in AI-assisted programming. While the majority of AI-generated solutions were adopted or partially modified, human oversight remains essential, especially for context-sensitive, complex, or security-critical tasks.
The findings underscore the potential of ChatGPT to accelerate development workflows, support learning, and foster a new paradigm of human-AI collaboration. Nevertheless, limitations in environment-specific problem-solving, security considerations, and variable language performance highlight areas for improvement. Future research should explore cross-platform applications, multi-modal AI integration, adaptive model training, and comprehensive evaluation metrics to enhance the robustness and applicability of AI-assisted software engineering. By addressing these challenges, researchers and practitioners can leverage ChatGPT to not only improve coding efficiency but also transform how developers interact with intelligent systems in complex software ecosystems.
References

Li, X., Zhang, Y., & Chen, J. (2025). Empirical Analysis of Developer–ChatGPT Interactions on GitHub. arXiv:2505.03901.
OpenAI. (2022). ChatGPT: Optimizing Language Models for Dialogue. OpenAI Blog.
Iacis, M. (2024). AI-Assisted Programming: Developer Perspectives and Workflow Integration. Journal of Information Systems, 28(4), 252–260.
S2E Lab. (2024). Mining GitHub for AI-Assisted Debugging Patterns. Preprint, s2e-lab.github.io.
ITPro. (2023). OpenAI Codex and AI-Powered Developer Agents: Updates and Implications. https://www.itpro.com/business/business-strategy/openais-codex-developer-agent-just-got-a-big-update