ChatGPT is not A Man but Das Man: Structural Failures and Representational Gaps in LLM-Generated Silicon Samples

Abstract

As large language models (LLMs) like ChatGPT (GPT-4) and Meta’s Llama continue to rise in prominence, their usage has extended beyond creative and analytical tasks into the realm of social simulation. One emergent trend is the use of LLMs as "silicon samples"—stand-ins for human participants in social science research, policymaking simulations, and opinion polling. However, beneath the surface of convenience lies a deeper set of concerns: do these models truly reflect the diverse and structurally complex reality of human opinion?

Figure: ChatGPT's training data includes software manual pages, information about internet phenomena such as bulletin board systems, multiple programming languages, and the text of Wikipedia.

This article builds upon recent research that probes these questions, emphasizing two foundational issues: structural inconsistency and homogenization in LLM responses. We expand upon a comparative study that used prompts from the American National Election Studies (ANES) 2020 dataset on sensitive political issues, revealing how LLMs systematically diverge from human data. We argue that the emerging trend of LLM-as-human-substitute risks reinforcing mainstream stereotypes and marginalizing dissenting or minority voices. Moreover, the push for modal "accuracy" in these models—what we term the accuracy-optimization hypothesis—inadvertently suppresses diversity in favor of consistency, raising fundamental concerns for the ethical deployment of LLMs in sociopolitical research.

1. Introduction: Silicon Samples and the Rise of AI-Social Simulation

With the rapid advancement of LLMs like OpenAI’s GPT-4 and Meta’s Llama 3.1 series, artificial intelligence is no longer confined to backend automation or productivity enhancement. It now plays an increasingly central role in social simulation, including being deployed as representative stand-ins for real humans in opinion studies, forecasting exercises, and even the design of policies and political messages.

This practice, dubbed the use of “silicon samples,” treats language model outputs as analogous to human responses. Researchers might input survey questions from the real world and treat the resulting responses as if they came from a statistically weighted sample of the population. But how valid is this assumption?

Recent empirical findings suggest deep cracks in this emerging methodology, with LLMs failing to uphold structural consistency across demographic slices and homogenizing opinion distributions—leading to significant representational failures. In other words, these models may replicate the "average" view (or the dominant one), but lose the diversity, nuance, and internal contradiction that are hallmarks of real human populations.

2. Theoretical Foundation: Heidegger's "Das Man" and LLM Homogenization

The title of the original study—"ChatGPT is not a man, but Das Man"—is a philosophical provocation rooted in the work of Martin Heidegger. In his seminal text Being and Time, Heidegger introduces das Man (the "they") as the abstract, impersonal social norm that dictates behavior, preferences, and attitudes in everyday life. It represents the averaged-out expectations of society, not the authentic self.

In this context, ChatGPT and similar LLMs are not “men” (representing individuals), but “das Man”—producing utterances that conform to dominant, expected, and generalized norms. By collapsing internal diversity in favor of modal responses, LLMs are not reflecting the richness of lived human experience, but rather, the societal averages.

This theoretical framing is essential to understanding the risks of using LLMs as survey participants. While superficially accurate, they are structurally hollow, lacking the pluralistic internal structures that characterize populations. This difference is not trivial—it undermines the utility of LLMs in social science and may reinforce majoritarian narratives that actively exclude minority voices.

3. Methodology: Testing ChatGPT and Llama with ANES Questions

To empirically test these concerns, the researchers prompted GPT-4 and Meta’s Llama 3.1 (8B, 70B, 405B variants) with questions drawn from the American National Election Studies (ANES) 2020, specifically targeting highly polarized topics like abortion rights and unauthorized immigration.

These questions were selected for their sensitivity to demographic variation—real human responses to them are known to diverge significantly across lines of age, gender, education, race, political affiliation, and religious belief. The assumption was that if LLMs can adequately model human opinion, they should preserve this internal structural variation.
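To make the prompting setup concrete, here is a minimal sketch of how persona-conditioned queries of this kind might be constructed in Python. The item wording is a paraphrase of the ANES abortion question, the demographic attributes are illustrative, and `query_model` is a hypothetical stand-in for whatever chat-completion client is actually used; none of this is taken from the study's released materials.

```python
from itertools import product

# Paraphrase of the ANES 2020 abortion item (illustrative wording, not the exact text).
ANES_ITEM = (
    "Which comes closest to your view on abortion? "
    "(1) Never permitted. (2) Permitted only in cases of rape, incest, or danger "
    "to the woman's life. (3) Permitted for other clearly established reasons. "
    "(4) Always permitted as a matter of personal choice. "
    "Answer with a single option number."
)

PERSONA_TEMPLATE = (
    "Answer the following survey question as a {age}-year-old {race} {gender} "
    "from the {region} with {education} education who identifies as {party} "
    "and is {religion}."
)

# Illustrative demographic attributes; a real audit would mirror the ANES codebook.
ATTRIBUTES = {
    "age": ["25", "45", "65"],
    "race": ["white", "Black", "Hispanic"],
    "gender": ["man", "woman"],
    "region": ["Midwest", "South", "Northeast", "West"],
    "education": ["a high-school", "a college"],
    "party": ["a Democrat", "a Republican", "an independent"],
    "religion": ["Christian", "religiously unaffiliated"],
}

def persona_prompts():
    """Yield (persona, prompt) pairs for every cell of the demographic cross."""
    keys = list(ATTRIBUTES)
    for values in product(*ATTRIBUTES.values()):
        persona = dict(zip(keys, values))
        yield persona, PERSONA_TEMPLATE.format(**persona) + "\n\n" + ANES_ITEM

def collect_responses(query_model, n_repeats=5):
    """Query the model repeatedly per persona to estimate a response distribution."""
    responses = {}
    for persona, prompt in persona_prompts():
        key = tuple(sorted(persona.items()))
        responses[key] = [query_model(prompt) for _ in range(n_repeats)]
    return responses
```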

However, findings showed:

  • Structural Inconsistency: When prompted to simulate responses from different demographic groups (e.g., "Answer this as a conservative, Christian woman from the Midwest"), the LLM outputs often contradicted the overall response distribution or failed to reflect known correlations.

  • Homogenization: When asked to simulate a population-level response (e.g., "What do Americans think about abortion?"), LLMs gravitated toward median or modal answers, heavily underrepresenting fringe, minority, or radical positions that are statistically present in human samples.

These failures point to a fundamental issue: LLMs are not composed of real populations, but rather trained on vast corpora where dominant voices are overrepresented, and demographic nuance is flattened.

4. Structural Inconsistency: When Accuracy Falls Apart at Scale

In statistical modeling, structural consistency refers to the idea that the relationships observable at an individual level should aggregate logically to the population level. For example, if individual liberals support abortion access, then a population composed mainly of liberals should also support abortion access.

In LLM outputs, this does not hold. A model might generate a liberal viewpoint when prompted as an individual liberal but give a surprisingly moderate or contradictory answer when prompted to simulate "a liberal demographic." This suggests that LLMs lack a coherent internal structure, where demographic features interact predictably with beliefs.

In practice, this flaw means that LLMs are unreliable for any task that requires compositional reasoning across groups—like predicting electoral outcomes, estimating support for policies, or designing public messaging. They can parrot beliefs, but not simulate belief systems.
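To see what a structural-consistency check amounts to in practice, the sketch below aggregates persona-level answer distributions under assumed group weights and compares the result with the distribution obtained from a direct population-level prompt. All numbers are invented for illustration; only the comparison logic matters.

```python
import numpy as np

# Hypothetical answer distributions over a four-option abortion item
# (probabilities over options 1-4); values are illustrative only.
persona_distributions = {
    "liberal":      np.array([0.03, 0.07, 0.15, 0.75]),
    "moderate":     np.array([0.08, 0.22, 0.35, 0.35]),
    "conservative": np.array([0.25, 0.40, 0.25, 0.10]),
}
population_weights = {"liberal": 0.30, "moderate": 0.40, "conservative": 0.30}

# Bottom-up aggregate: weight persona-level distributions by assumed group shares.
aggregated = sum(population_weights[g] * p for g, p in persona_distributions.items())

# Top-down estimate: the distribution the model produces when asked directly
# "What do Americans think about abortion?" (again, illustrative numbers).
direct_population = np.array([0.05, 0.15, 0.45, 0.35])

# Total variation distance: 0 means the two views of the population agree;
# large values indicate structural inconsistency between the two prompting modes.
tv_distance = 0.5 * float(np.abs(aggregated - direct_population).sum())
print(f"Total variation distance: {tv_distance:.3f}")
```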

5. Homogenization: The Accuracy-Optimization Hypothesis

The second major failure—homogenization—stems from a deeper structural feature of LLM training. Because these models are trained to maximize predictive accuracy across next-token prediction tasks, they gravitate toward the most likely response.

This accuracy-optimization hypothesis explains why LLMs tend to produce the “average” opinion: when faced with ambiguity or polarization, the safest move is to output the most statistically probable response based on training data. Minority opinions—especially those not well represented in training corpora—are either absent or minimized.
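A toy numerical illustration, with invented numbers: if decoding concentrates probability mass on the most likely option (for example via low-temperature sampling), the model's answer distribution loses most of the entropy present in the human one, and minority options effectively vanish.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; ignores zero-probability options."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def sharpen(p, temperature):
    """Rescale a categorical distribution as softmax(log p / T);
    low temperatures push mass toward the modal option."""
    logits = np.log(p) / temperature
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Illustrative human opinion distribution over four answer options.
human = np.array([0.10, 0.20, 0.45, 0.25])

for t in (1.0, 0.5, 0.1):
    model_like = sharpen(human, t)
    print(f"T={t}: dist={np.round(model_like, 3)}, entropy={entropy(model_like):.2f} bits")

# At T=1.0 the distribution matches the human one (about 1.8 bits of entropy);
# as T falls, the output collapses toward the single modal answer (near 0 bits),
# which is the homogenization pattern described above.
```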

The result is the suppression of dissenting voices, marginalized perspectives, and cultural variability, particularly in contexts involving race, gender identity, religion, or political radicalism. This is not just a computational limitation; it becomes an epistemological and ethical issue, especially when LLMs are used to design policies or to understand human beliefs.

6. The Illusion of Representativeness

Many proponents of LLM-as-silicon-sample argue that with proper prompting (e.g., conditioning responses by demographic traits), models can simulate populations effectively. However, this assumption fails under scrutiny.

In practice:

  • Demographic prompts have inconsistent effects.

  • Models often default to mainstream U.S. cultural norms (typically white, educated, centrist-liberal).

  • Fine-grained identity intersectionality (e.g., Black conservative Muslims) is poorly modeled or collapsed into more dominant identity components.

Thus, representativeness is an illusion, not a feature. The model is not simulating diversity; it is simulating the dominant average.

7. Ethical and Epistemic Risks

The findings from this study are not merely academic—they have real-world consequences. As governments, NGOs, corporations, and researchers increasingly turn to LLMs for:

  • Opinion modeling

  • Focus group simulations

  • Media testing

  • Forecasting social behavior

…they risk internalizing the very biases that LLMs encode. Specifically:

  • Marginalized groups may be rendered invisible.

  • Policy decisions may be based on incorrect assumptions about public support.

  • Stereotypes may be reinforced, as the model reflects the training data rather than challenging it.

Even worse, the aura of objectivity surrounding LLMs ("It's just data") can mask these distortions, making it harder to detect and challenge them.

8. Toward Better Models: Recommendations

If LLMs are to be used in social research, we need new practices and safeguards, including:

  1. Structured Audits: Systematic testing of LLM responses across demographic permutations to evaluate structural consistency (a minimal harness is sketched after this list).

  2. Minority-Representation Penalties: During training, penalize models that overfit to modal answers at the cost of minority opinion diversity.

  3. Transparent Training Data: Open datasets with clear demographic provenance can help identify over- or under-represented groups.

  4. Human Calibration: Always compare LLM outputs with real-world human data (e.g., ANES, Pew, Gallup) before drawing conclusions.

  5. Bias Acknowledgment: Make it standard practice to publish bias disclosures when using LLMs in research or public policy.
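As a minimal sketch of the structured audit in point 1, assuming persona-level and population-level responses have already been collected (for instance with a prompt loop like the one sketched in Section 3); the consistency metric and the flagging threshold are illustrative choices, not the study's protocol.

```python
import numpy as np
from collections import Counter

def response_distribution(answers, options=("1", "2", "3", "4")):
    """Turn a list of raw option answers into a probability vector over options."""
    counts = Counter(answers)
    total = sum(counts.get(o, 0) for o in options) or 1
    return np.array([counts.get(o, 0) / total for o in options])

def audit_consistency(persona_answers, persona_weights, population_answers, threshold=0.15):
    """Compare the weight-aggregated persona-level distribution against the
    distribution from direct population-level prompts and flag large gaps.

    persona_answers: dict mapping persona key -> list of raw answers
    persona_weights: dict mapping persona key -> population share (should sum to 1)
    population_answers: list of raw answers to the population-level prompt
    """
    aggregated = sum(
        persona_weights[k] * response_distribution(a) for k, a in persona_answers.items()
    )
    direct = response_distribution(population_answers)
    tv = 0.5 * float(np.abs(aggregated - direct).sum())
    return {"tv_distance": tv, "flagged": tv > threshold}

# Illustrative usage with made-up answers:
report = audit_consistency(
    persona_answers={"liberal": ["4", "4", "3"], "conservative": ["2", "1", "2"]},
    persona_weights={"liberal": 0.5, "conservative": 0.5},
    population_answers=["3", "3", "4", "2"],
)
print(report)
```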

9. Conclusion: Not A Man, But Das Man

In conclusion, large language models like ChatGPT are not digital humans. They are not a diverse assembly of individuals but an engineered statistical consensus. They do not represent "people"—they represent das Man, the faceless “they” of societal norms.

Treating LLMs as silicon samples is not only methodologically flawed but ethically fraught. Without intervention, this practice may silence precisely the voices that social science and public policy aim to uplift.

As we enter an era where artificial intelligence shapes our understanding of ourselves, we must ask not only what these models say, but whom they leave out—and why.

