Assessing the Reliability of Large Language Models for Deductive Qualitative Coding

Abstract

Large Language Models (LLMs), such as ChatGPT, are rapidly transforming the landscape of qualitative data analysis. While the majority of recent studies have focused on their utility in inductive coding, where themes and categories emerge from the data, little is known about their potential in deductive coding tasks that require adherence to predefined classification schemes. This study bridges that gap by rigorously evaluating ChatGPT’s performance in structured deductive qualitative coding against well-established human-coded classifications. We implement four intervention strategies (zero-shot, few-shot, definition-based, and a novel Step-by-Step Task Decomposition method) to classify U.S. Supreme Court case summaries according to the Comparative Agendas Project (CAP) Master Codebook, which comprises 21 major policy domains.

Our findings demonstrate that carefully crafted prompting strategies significantly influence classification performance. In particular, the Step-by-Step Task Decomposition approach achieved the highest agreement with human coders, as measured by standard metrics like accuracy, F1-score, Cohen’s kappa, and Krippendorff’s alpha. Construct validity was supported by chi-squared and Cramer’s V tests, showing that the distribution of predictions meaningfully differed across intervention types. Overall, this study illustrates the conditions under which LLMs can be reliably deployed in deductive qualitative research workflows, challenging the assumption that human-coded deductive analysis must remain entirely manual.

1. Introduction

In recent years, the advent of large language models (LLMs) has redefined natural language processing and automated reasoning. With applications ranging from content generation to data summarization, LLMs like OpenAI’s ChatGPT offer a promising avenue for automating time-consuming qualitative tasks. However, one area where LLMs remain underexplored is deductive qualitative coding—a method widely used in political science, law, communication studies, and public policy.

Unlike inductive coding, where themes are generated during analysis, deductive coding requires the model to apply pre-existing codebooks consistently. Human researchers traditionally perform this task, ensuring semantic consistency and interpretive depth. But can ChatGPT, when properly instructed, offer comparable reliability?

To address this question, we designed a systematic study using the Comparative Agendas Project (CAP) Master Codebook, a well-established classification system with 21 major policy domains. We assessed ChatGPT’s ability to categorize U.S. Supreme Court case summaries using this codebook and compared different prompting strategies to evaluate their effect on performance and agreement with human-coded gold standards.

2. Related Work

Most current literature on LLMs in qualitative analysis centers around inductive thematic generation, often using topic modeling, clustering, or latent Dirichlet allocation (LDA). Studies have demonstrated moderate success, particularly when LLMs are fine-tuned on domain-specific corpora. However, there is relatively little work on deductive classification using predefined taxonomies—a far more demanding task that mimics real-world analytical coding schemes.

Earlier work by Bubeck et al. (2023) and Gilardi et al. (2023) explored LLMs for content classification and sentiment labeling, but rarely under conditions requiring consistent application of formal codebooks. Others, like Arnold et al. (2022), suggested that LLMs could support mixed-method research if used in combination with human oversight. Still, the lack of evaluation using agreement metrics (e.g., Cohen’s kappa) leaves a gap in understanding their actual reliability.

3. Research Questions and Hypotheses

We investigate the following research questions:

  • RQ1: Can ChatGPT reliably apply a predefined deductive coding scheme (the CAP Master Codebook) to classify Supreme Court case summaries?

  • RQ2: How do different prompting strategies affect classification performance?

  • RQ3: Do LLM-predicted labels exhibit construct validity with respect to human-coded distributions?

From these questions, we derive key hypotheses:

  • H1: ChatGPT’s accuracy and inter-rater reliability will vary significantly across prompting strategies.

  • H2: The Step-by-Step Task Decomposition strategy will yield the highest inter-coder agreement.

  • H3: Classification distributions will be significantly altered by prompt design, as shown by chi-squared and Cramer’s V analyses.

4. Methodology

4.1 Dataset

We used publicly available U.S. Supreme Court case summaries annotated under the CAP Master Codebook, comprising 21 policy domains (e.g., health, environment, education). Each summary ranges from 100 to 300 words. A total of 500 case summaries were sampled for testing.
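A minimal sketch of this sampling step is shown below, assuming the annotated summaries are available as a local CSV; the file name and column names are hypothetical.

```python
# Minimal sampling sketch, assuming the CAP-annotated summaries are available
# as a local CSV file. The file name and column names here are hypothetical.
import pandas as pd

cases = pd.read_csv("scotus_cap_summaries.csv")   # columns: summary, cap_major_topic
sample = cases.sample(n=500, random_state=42)     # fixed seed for reproducibility
sample.to_csv("evaluation_sample.csv", index=False)
```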

4.2 Prompting Strategies

We evaluated four intervention methods:

  1. Zero-shot: ChatGPT is prompted with only the instruction to classify the text.

  2. Few-shot: Three examples of annotated summaries are included in the prompt.

  3. Definition-based: A full list of CAP domain definitions is provided alongside the classification task.

  4. Step-by-Step Task Decomposition: The task is broken down into smaller steps (e.g., summarization → keyword identification → domain selection), encouraging intermediate reasoning; a prompt sketch for this strategy appears after the list.
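To make the decomposition strategy concrete, the sketch below shows one way such a prompt could be issued through the OpenAI Python client. The prompt wording, model name, and temperature are assumptions for illustration, not the exact materials used in the study.

```python
# Sketch of a Step-by-Step Task Decomposition prompt issued through the OpenAI
# Python client. The prompt wording, model name, and temperature are assumptions
# for illustration and are not the study's exact materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DECOMPOSITION_PROMPT = """You are coding a U.S. Supreme Court case summary against
the Comparative Agendas Project (CAP) Master Codebook (21 major policy topics).
Work through these steps before answering:
1. Summarize the case in one sentence.
2. List the key policy-relevant keywords.
3. Match those keywords to the single best-fitting CAP major topic.
Return only the final CAP topic label on the last line.

Case summary:
{summary}
"""

def classify_step_by_step(summary: str, model: str = "gpt-4") -> str:
    """Send one decomposed classification request and return the final label line."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep outputs as stable as possible for coding consistency
        messages=[{"role": "user", "content": DECOMPOSITION_PROMPT.format(summary=summary)}],
    )
    # The prompt asks the model to place the label on the last line of its reply.
    return response.choices[0].message.content.strip().splitlines()[-1]
```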

4.3 Evaluation Metrics

To assess reliability and validity, we used the following metrics (a computation sketch follows the list):

  • Accuracy

  • Macro-F1 Score

  • Cohen’s kappa (κ)

  • Krippendorff’s alpha (α)

  • Construct validity tests: Chi-squared and Cramer’s V to test distributional similarity with human-coded labels.
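As a sketch of how the agreement metrics can be computed, the snippet below uses scikit-learn for accuracy, macro-F1, and Cohen’s kappa, and the krippendorff package for alpha. The label arrays and helper name are illustrative, not the study’s actual analysis code.

```python
# Illustrative computation of the agreement metrics for two label sequences
# (human coders vs. model); variable and function names are illustrative.
import numpy as np
import krippendorff  # pip install krippendorff
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def agreement_report(human_labels, model_labels, label_set):
    """Return accuracy, macro-F1, Cohen's kappa, and Krippendorff's alpha."""
    label_to_int = {label: i for i, label in enumerate(label_set)}
    h = np.array([label_to_int[lab] for lab in human_labels])
    m = np.array([label_to_int[lab] for lab in model_labels])

    return {
        "accuracy": accuracy_score(h, m),
        "macro_f1": f1_score(h, m, average="macro"),
        "cohens_kappa": cohen_kappa_score(h, m),
        # Krippendorff's alpha treats each coder as one row of a reliability matrix.
        "krippendorff_alpha": krippendorff.alpha(
            reliability_data=np.vstack([h, m]),
            level_of_measurement="nominal",
        ),
    }
```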

5. Results

5.1 Overall Performance

| Strategy | Accuracy | Macro-F1 | Cohen’s κ | Krippendorff’s α |
|---|---|---|---|---|
| Zero-shot | 0.641 | 0.605 | 0.574 | 0.561 |
| Few-shot | 0.678 | 0.645 | 0.611 | 0.622 |
| Definition-based | 0.713 | 0.688 | 0.691 | 0.684 |
| Step-by-Step Decomp. | 0.775 | 0.741 | 0.744 | 0.746 |

The Step-by-Step Task Decomposition strategy achieved the best overall performance and was the only strategy to cross our threshold for substantial agreement (κ > 0.70); the other strategies fell in the moderate-to-substantial range (κ = 0.57–0.69).

5.2 Construct Validity

Chi-squared tests revealed significant differences in classification distributions across strategies (p < 0.001). Cramer’s V values ranged from 0.359 to 0.613, indicating moderate to strong effect sizes. These shifts affirm that prompt design can structurally alter classification behavior, reinforcing the need for tailored interventions in deductive coding.
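One way to reproduce this style of construct-validity check is sketched below: build a source-by-topic contingency table, run a chi-squared test, and derive Cramer’s V from the resulting statistic. The inputs and function name are illustrative, not the study’s actual analysis code.

```python
# Sketch of the construct-validity check: compare the CAP-topic distribution
# produced by one strategy against another (or against the human-coded labels)
# with a chi-squared test and Cramer's V. Inputs and names are illustrative.
import numpy as np
from scipy.stats import chi2_contingency

def distribution_shift(labels_a, labels_b, label_set):
    """Chi-squared test and Cramer's V for two label distributions."""
    table = np.array([
        [list(labels_a).count(label) for label in label_set],
        [list(labels_b).count(label) for label in label_set],
    ])
    table = table[:, table.sum(axis=0) > 0]  # drop topics neither source used

    chi2, p_value, dof, _ = chi2_contingency(table)

    # Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
    n = table.sum()
    cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return chi2, p_value, cramers_v
```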

6. Discussion

6.1 The Importance of Prompt Engineering

Our findings support H1 and H2: prompting strategy has a major impact on classification reliability. The Step-by-Step Task Decomposition approach outperformed all others, likely because it mirrors how human coders deconstruct complex interpretive tasks; cognitive science research similarly suggests that breaking a task into smaller subtasks improves reasoning fidelity.

Few-shot learning improved performance compared to zero-shot, but lacked the semantic grounding offered by domain definitions or task decomposition. Definition-based prompts provided useful context, but were too dense for the model to fully utilize without additional structure.

6.2 Implications for Deductive Coding Workflows

Achieving κ > 0.74 and α > 0.74 indicates that LLMs are now viable contributors to structured qualitative workflows, particularly in deductive research settings where codebooks are stable and well-defined. Analysts could deploy LLMs for pre-coding, code suggestion, or dual-coding with humans.

However, interpretive nuance remains a challenge. LLMs struggle with highly ambiguous or borderline cases, often defaulting to overgeneralized categories (e.g., “Government Operations”). Human oversight remains essential for final adjudication.

7. Limitations

Several limitations warrant caution:

  • The dataset is limited to English-language legal summaries.

  • Model behavior may vary across updates or different LLMs (e.g., Claude, Gemini).

  • Our gold standard relies on human-coded data, which itself contains subjective variance.

  • The Step-by-Step strategy increases token usage and cost, which may not be feasible at large scales without optimization.

8. Conclusion

This study provides the first large-scale, metric-based assessment of LLMs for deductive qualitative coding using a formal classification scheme. We show that with the right interventions, especially Step-by-Step Task Decomposition, ChatGPT can achieve substantial agreement with human coders, supporting its integration into real-world research pipelines.

Rather than replacing human analysts, LLMs should be viewed as powerful collaborative agents that can scale and streamline qualitative workflows when deployed thoughtfully.

9. Future Work

Future research should explore:

  • Cross-domain generalization (e.g., medical vs. legal vs. educational texts).

  • Real-time interactive coding tools combining LLMs with human feedback.

  • Comparisons across LLM providers (Anthropic, Mistral, Google).

  • Cost-benefit modeling to assess trade-offs between accuracy and token cost.

Acknowledgments

We thank the Comparative Agendas Project for the publicly available codebook and case data, as well as OpenAI for providing API access to ChatGPT. This work was supported in part by the Department of Political Science and the Center for Digital Scholarship.