Large Language Models (LLMs), such as ChatGPT, are rapidly transforming the landscape of qualitative data analysis. While the majority of recent studies have focused on their utility in inductive coding, where themes and categories emerge from the data, little is known about their potential in deductive coding tasks that require adherence to predefined classification schemes. This study bridges that gap by rigorously evaluating ChatGPT's performance in structured deductive qualitative coding against a set of well-established human-coded classifications. We implement four intervention strategies (zero-shot, few-shot, definition-based, and a novel Step-by-Step Task Decomposition method) to classify U.S. Supreme Court case summaries according to the Comparative Agendas Project (CAP) Master Codebook, which comprises 21 major policy domains.
Our findings demonstrate that carefully crafted prompting strategies significantly influence classification performance. In particular, the Step-by-Step Task Decomposition approach achieved the highest agreement with human coders, as measured by standard metrics like accuracy, F1-score, Cohen’s kappa, and Krippendorff’s alpha. Construct validity was supported by chi-squared and Cramer’s V tests, showing that the distribution of predictions meaningfully differed across intervention types. Overall, this study illustrates the conditions under which LLMs can be reliably deployed in deductive qualitative research workflows, challenging the assumption that human-coded deductive analysis must remain entirely manual.
In recent years, the advent of large language models (LLMs) has redefined natural language processing and automated reasoning. With applications ranging from content generation to data summarization, LLMs like OpenAI’s ChatGPT offer a promising avenue for automating time-consuming qualitative tasks. However, one area where LLMs remain underexplored is deductive qualitative coding—a method widely used in political science, law, communication studies, and public policy.
Unlike inductive coding, where themes are generated during analysis, deductive coding requires the model to apply pre-existing codebooks consistently. Human researchers traditionally perform this task, ensuring semantic consistency and interpretive depth. But can ChatGPT, when properly instructed, offer comparable reliability?
To address this question, we designed a systematic study using the Comparative Agendas Project (CAP) Master Codebook, a well-established classification system with 21 major policy domains. We assessed ChatGPT’s ability to categorize U.S. Supreme Court case summaries using this codebook and compared different prompting strategies to evaluate their effect on performance and agreement with human-coded gold standards.
Most current literature on LLMs in qualitative analysis centers around inductive thematic generation, often using topic modeling, clustering, or latent Dirichlet allocation (LDA). Studies have demonstrated moderate success, particularly when LLMs are fine-tuned on domain-specific corpora. However, there is relatively little work on deductive classification using predefined taxonomies—a far more demanding task that mimics real-world analytical coding schemes.
Earlier work by Bubeck et al. (2023) and Gilardi et al. (2023) explored LLMs for content classification and sentiment labeling, but rarely under conditions requiring consistent application of formal codebooks. Others, like Arnold et al. (2022), suggested that LLMs could support mixed-method research if used in combination with human oversight. Still, the lack of evaluation using agreement metrics (e.g., Cohen’s kappa) leaves a gap in understanding their actual reliability.
We investigate the following research questions:
RQ1: Can ChatGPT reliably apply a predefined deductive coding scheme (the CAP Master Codebook) to classify Supreme Court case summaries?
RQ2: How do different prompting strategies affect classification performance?
RQ3: Do LLM-predicted labels exhibit construct validity with respect to human-coded distributions?
From these questions, we derive key hypotheses:
H1: ChatGPT’s accuracy and inter-rater reliability will vary significantly across prompting strategies.
H2: The Step-by-Step Task Decomposition strategy will yield the highest inter-coder agreement.
H3: Classification distributions will be significantly altered by prompt design, as shown by chi-squared and Cramer’s V analyses.
We used publicly available U.S. Supreme Court case summaries annotated under the CAP Master Codebook, which comprises 21 policy domains (e.g., health, environment, education). Each summary is 100–300 words long, and a total of 500 case summaries were sampled for testing.
We evaluated four intervention methods, each illustrated with a prompt sketch after this list:
Zero-shot: ChatGPT is prompted with only the instruction to classify the text.
Few-shot: Three examples of annotated summaries are included in the prompt.
Definition-based: A full list of CAP domain definitions is provided alongside the classification task.
Step-by-Step Task Decomposition: The task is broken down into smaller steps (e.g., summarization → keyword identification → domain selection), encouraging intermediate reasoning.
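Below is a minimal sketch of how these four prompt templates could be assembled in Python. The instruction wording, helper names, and abbreviated domain list are illustrative assumptions; they approximate the structure of each intervention rather than reproduce the exact prompts used in the study.

```python
# Hypothetical prompt builders for the four intervention strategies.
# The exact wording used in the study may differ; this mirrors only the structure.

CAP_DOMAINS = ["Macroeconomics", "Civil Rights", "Health", "Agriculture", "..."]  # 21 CAP major topics (abbreviated here)

def zero_shot(summary: str) -> str:
    # Instruction only: no examples, no definitions.
    return (
        "Classify the following U.S. Supreme Court case summary into exactly one "
        f"CAP major policy domain from this list: {', '.join(CAP_DOMAINS)}.\n\n"
        f"Summary: {summary}\n\nDomain:"
    )

def few_shot(summary: str, examples: list[tuple[str, str]]) -> str:
    # Three annotated (summary, domain) pairs are prepended to the target summary.
    shots = "\n\n".join(f"Summary: {s}\nDomain: {d}" for s, d in examples)
    return f"{shots}\n\nSummary: {summary}\nDomain:"

def definition_based(summary: str, definitions: dict[str, str]) -> str:
    # Full CAP domain definitions accompany the classification request.
    defs = "\n".join(f"- {name}: {text}" for name, text in definitions.items())
    return (
        f"CAP major policy domain definitions:\n{defs}\n\n"
        f"Classify the following summary into one domain.\n\nSummary: {summary}\n\nDomain:"
    )

def step_by_step(summary: str) -> str:
    # The task is decomposed into intermediate reasoning steps.
    return (
        "Work through the following steps before answering.\n"
        "Step 1: Restate the core legal and policy issue in one sentence.\n"
        "Step 2: List the policy-relevant keywords in the summary.\n"
        "Step 3: Choose the single best-fitting CAP major policy domain.\n\n"
        f"Summary: {summary}\n\nReport only the domain chosen in Step 3:"
    )
```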
To assess reliability and validity, we used the following measures (a computation sketch follows the list):
Accuracy
Macro-F1 Score
Cohen’s kappa (κ)
Krippendorff’s alpha (α)
Construct validity tests: Chi-squared and Cramer’s V to test distributional similarity with human-coded labels.
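The agreement metrics can be computed with scikit-learn and the krippendorff package, as in the sketch below; the function name and the label-encoding step are our own illustrative choices rather than the study's released code.

```python
# Agreement between human gold labels and model predictions for one strategy.
# Minimal sketch; assumes `pip install scikit-learn krippendorff numpy`.
import krippendorff
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def agreement_report(human: list[str], model: list[str]) -> dict[str, float]:
    # Encode the 21 CAP domain names as integers for Krippendorff's alpha.
    codes = {label: i for i, label in enumerate(sorted(set(human) | set(model)))}
    h = np.array([codes[x] for x in human])
    m = np.array([codes[x] for x in model])
    return {
        "accuracy": accuracy_score(human, model),
        "macro_f1": f1_score(human, model, average="macro"),
        "cohens_kappa": cohen_kappa_score(human, model),
        "krippendorff_alpha": krippendorff.alpha(
            reliability_data=[h, m], level_of_measurement="nominal"
        ),
    }
```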
| Strategy | Accuracy | Macro-F1 | Cohen’s κ | Krippendorff’s α |
|---|---|---|---|---|
| Zero-shot | 0.641 | 0.605 | 0.574 | 0.561 |
| Few-shot | 0.678 | 0.645 | 0.611 | 0.622 |
| Definition-based | 0.713 | 0.688 | 0.691 | 0.684 |
| Step-by-Step Decomposition | 0.775 | 0.741 | 0.744 | 0.746 |
The Step-by-Step Task Decomposition strategy achieved the best overall performance and was the only one to exceed the κ > 0.70 threshold commonly required for reliable coding; the remaining strategies produced moderate to substantial agreement (κ = 0.57–0.69).
Chi-squared tests revealed significant differences in classification distributions across strategies (p < 0.001). Cramer’s V values ranged from 0.359 to 0.613, indicating moderate to strong effect sizes. These shifts affirm that prompt design can structurally alter classification behavior, reinforcing the need for tailored interventions in deductive coding.
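As a sketch of how such a check can be run, the chi-squared statistic and Cramer's V can be derived from a contingency table built over any two label series (e.g., predictions from two strategies, or model predictions versus human codes); the helper below is illustrative.

```python
# Chi-squared test and Cramer's V over a contingency table of two label series.
# Illustrative sketch; assumes pandas and scipy are installed.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_and_cramers_v(labels_a: list[str], labels_b: list[str]) -> tuple[float, float, float]:
    table = pd.crosstab(pd.Series(labels_a), pd.Series(labels_b))
    chi2, p_value, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    # Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
    v = float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))
    return chi2, p_value, v
```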
Our findings clearly support H1 and H2: prompting strategy has a major impact on classification reliability. The Step-by-Step Task Decomposition approach outperformed all others, likely because it aligns with how human coders deconstruct complex interpretive tasks. This aligns with cognitive science insights that breaking tasks into smaller subtasks improves reasoning fidelity.
Few-shot learning improved performance compared to zero-shot, but lacked the semantic grounding offered by domain definitions or task decomposition. Definition-based prompts provided useful context, but were too dense for the model to fully utilize without additional structure.
Achieving κ > 0.74 and α > 0.74 indicates that LLMs are now viable contributors to structured qualitative workflows, particularly in deductive research settings where codebooks are stable and well-defined. Analysts could deploy LLMs for pre-coding, code suggestion, or dual-coding with humans.
However, interpretive nuance remains a challenge. LLMs struggle with highly ambiguous or borderline cases, often defaulting to overgeneralized categories (e.g., “Government Operations”). Human oversight remains essential for final adjudication.
Several limitations warrant caution:
The dataset is limited to English-language legal summaries.
Model behavior may vary across updates or different LLMs (e.g., Claude, Gemini).
Our gold standard relies on human-coded data, which itself contains subjective variance.
The Step-by-Step strategy increases token usage and cost, which may not be feasible at large scales without optimization.
This study provides the first large-scale, metric-based assessment of LLMs for deductive qualitative coding using a formal classification scheme. We show that with the right interventions—especially task decomposition—ChatGPT can achieve substantial agreement with human coders, supporting its integration into real-world research pipelines.
Rather than replacing human analysts, LLMs should be viewed as powerful collaborative agents that can scale and streamline qualitative workflows when deployed thoughtfully.
Future research should explore:
Cross-domain generalization (e.g., medical vs. legal vs. educational texts).
Real-time interactive coding tools combining LLMs with human feedback.
Comparisons across LLM providers (Anthropic, Mistral, Google).
Cost-benefit modeling to assess trade-offs between accuracy and token cost.
We thank the Comparative Agendas Project for the publicly available codebook and case data, as well as OpenAI for providing API access to ChatGPT. This work was supported in part by the Department of Political Science and the Center for Digital Scholarship.