In the age of generative artificial intelligence, language models such as ChatGPT and DeepSeek are not merely tools of convenience but catalysts of profound societal transformation. From shaping customer service chatbots to influencing financial sentiment prediction, these systems have expanded the boundaries of natural language processing (NLP). Among the many NLP applications, sentiment analysis stands out as both a technical challenge and a social necessity: it enables organizations to measure opinions, detect emotions, and respond to public concerns in real time. Yet despite the progress, there remains a critical question: how should we evaluate and compare advanced AI models when they are deployed in emotionally nuanced tasks?
This study proposes a dual-perspective framework that integrates traditional lexicon-based approaches with modern deep learning paradigms. By benchmarking ChatGPT and DeepSeek under these complementary lenses, we aim to reveal not only their performance in terms of accuracy and efficiency but also their robustness and interpretability. While lexicon-driven techniques provide transparency and linguistic grounding, deep learning methods offer adaptability and context sensitivity. Bridging these approaches allows us to examine how generative AI navigates sentiment beyond mere numbers, shedding light on both strengths and blind spots. Ultimately, this research seeks to contribute a multidimensional evaluation framework that can guide scholars, developers, and practitioners in responsibly leveraging generative AI for sentiment analysis.
Sentiment analysis has long been a cornerstone of natural language processing, providing critical insights into human opinions and emotions expressed through text. Traditionally, research in this field has bifurcated into two primary approaches: lexicon-based methods and machine learning–driven methods. Lexicon-based sentiment analysis relies on curated dictionaries where words are associated with predefined sentiment scores, ranging from simple positive/negative polarity to fine-grained emotional dimensions. Seminal works, such as SentiWordNet and the General Inquirer, have enabled researchers to quantify sentiment in textual data with high interpretability. These approaches are particularly valuable in scenarios demanding transparency, such as financial forecasting or policy monitoring, where stakeholders require a clear rationale for sentiment classification. However, their reliance on static vocabularies limits their capacity to capture context, sarcasm, or evolving language use, particularly in informal social media settings.
In contrast, deep learning–based sentiment analysis has witnessed an explosion of research interest in the past decade. Neural architectures—including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and, more recently, transformer-based models like BERT—have demonstrated remarkable performance by learning contextual word representations from large corpora. These models excel at capturing subtle semantic cues and handling complex sentence structures, which are often beyond the reach of lexicon-based methods. The emergence of large language models (LLMs), such as ChatGPT, has introduced a paradigm shift: instead of task-specific training, these models leverage extensive pretraining on diverse datasets, exhibiting zero-shot and few-shot sentiment understanding capabilities. Parallel to this, DeepSeek represents a cutting-edge approach that integrates domain-specific deep learning pipelines with optimized inference mechanisms, demonstrating promising results in benchmark sentiment datasets.
While these advancements are significant, there remains a scarcity of systematic, head-to-head evaluations between generative models and specialized deep learning pipelines, particularly in combination with lexicon-based approaches. Most benchmarks either focus on traditional supervised learning or test generative LLMs in isolation, overlooking the potential insights gained from a dual-perspective evaluation. Moreover, interpretability and robustness—critical factors in high-stakes applications—are often underexplored in deep learning models, highlighting the importance of hybrid frameworks that combine the transparency of lexicons with the adaptability of neural networks.
Recent studies have started to bridge this gap. For instance, research on hybrid sentiment analysis frameworks integrates lexicon-based features into neural models, yielding improved performance and interpretability. Other work has explored the deployment of LLMs for multi-domain sentiment tasks, emphasizing model generalizability and contextual understanding. Nonetheless, a comprehensive benchmark comparing ChatGPT and DeepSeek across both lexicon-driven and deep learning paradigms remains largely absent from the literature. This gap underscores the novelty of our approach, which not only evaluates performance metrics such as accuracy, precision, recall, and F1 score but also examines qualitative factors such as model explanation and context sensitivity.
In addition, the evolving landscape of user-generated content—spanning social media, forums, and review platforms—poses challenges that traditional benchmarks fail to capture. Language in these domains is dynamic, informal, and often context-dependent, requiring models to balance linguistic nuance with computational efficiency. By situating ChatGPT and DeepSeek within a dual-perspective framework, this study provides insights into how generative and specialized deep learning models navigate these complexities. In doing so, it contributes to both the methodological literature on sentiment analysis and the practical guidance for deploying AI responsibly in emotionally sensitive contexts.
In summary, the related work establishes the foundation for our study by highlighting the evolution of sentiment analysis from lexicon-driven approaches to advanced LLMs, the limitations of single-method benchmarks, and the emerging potential of hybrid frameworks. This review underscores the need for a comprehensive, dual-perspective benchmarking study, positioning ChatGPT and DeepSeek as representative models whose comparative evaluation can reveal the nuanced interplay between interpretability, context awareness, and predictive performance.
To rigorously compare ChatGPT and DeepSeek in sentiment analysis, we curated a multi-domain dataset encompassing social media posts, movie reviews, and customer feedback from e-commerce platforms. These sources capture a wide spectrum of linguistic styles, ranging from formal expressions to informal, colloquial, and emoji-laden content. By integrating multiple domains, the dataset aims to test the models’ generalizability across diverse contexts, reflecting real-world usage scenarios.
Raw textual data underwent systematic preprocessing, including tokenization, lowercasing, and removal of non-informative elements such as HTML tags and excessive whitespace. For social media data, we retained emoticons and hashtags due to their sentiment-bearing properties. Additionally, we employed a hybrid annotation approach: human annotators with linguistic expertise labeled sentiment polarity (positive, neutral, negative), while automatic lexicon-based sentiment scoring provided preliminary labeling to cross-validate human judgments. Inter-annotator agreement was measured using Cohen’s kappa, yielding a score of 0.87, which indicates strong labeling consistency.
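The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the regular expressions, the small emoticon pattern, and the function name `preprocess` are assumptions introduced here, and Unicode emoji are not handled.

```python
import html
import re

# Crude HTML tag matcher (assumption: the raw text contains simple inline tags).
TAG_RE = re.compile(r"<[^>]+>")
# Token pattern: hashtags, a few ASCII emoticons, then ordinary words.
TOKEN_RE = re.compile(r"#\w+|[:;=][-']?[)(DPp/\\|]|\w+(?:'\w+)?")


def preprocess(text):
    """Strip HTML tags, tokenize, and lowercase words and hashtags."""
    no_tags = TAG_RE.sub(" ", text)      # remove HTML tags
    unescaped = html.unescape(no_tags)   # decode entities such as &amp;
    tokens = TOKEN_RE.findall(unescaped) # tokenizing also drops extra whitespace
    # Lowercase words and hashtags; leave emoticons such as ":D" untouched.
    return [t.lower() if (t[0].isalnum() or t[0] in "#_") else t for t in tokens]


print(preprocess("<p>Loved   it :D  #BestDay</p>"))
```

Emoticons are kept in their original case because lowercasing would turn ":D" into a different token.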
The final dataset consists of approximately 50,000 entries, evenly distributed across sentiment classes. The split of training, validation, and test sets followed an 80:10:10 ratio, ensuring sufficient data for model fine-tuning, hyperparameter optimization, and unbiased evaluation.
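An 80:10:10 split that preserves the class balance could be produced along the following lines. The text does not state whether the split was stratified, so stratification by label, the fixed seed, and the function name `stratified_split` are assumptions made for this sketch.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, split per class by `ratios`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


labels = ["positive", "neutral", "negative"] * 100
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```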
Lexicon-based sentiment analysis serves as the first perspective in our dual-framework evaluation. We utilized SentiWordNet as the primary lexicon, supplemented by domain-specific sentiment lexicons curated for social media and e-commerce contexts. Each token in the text was mapped to its corresponding sentiment score, with intensifiers, negations, and contextual modifiers handled through rule-based adjustments.
For instance, negation handling involved inverting sentiment scores within a defined window around negation words (“not,” “never”), while intensifiers such as “very” amplified the original polarity score. The final sentiment of a sentence was computed as the weighted average of token-level scores, normalized between -1 (strongly negative) and +1 (strongly positive).
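A minimal sketch of this rule-based scoring is given below. The toy lexicon values, the intensifier multiplier of 1.5, the three-token look-back window for negation, the uniform weighting in the average, and the ±0.05 label thresholds are all illustrative assumptions; the study's actual lexicon (SentiWordNet plus domain lexicons) and rule parameters are not reproduced here.

```python
# Toy stand-in for a sentiment lexicon (hypothetical scores).
TOY_LEXICON = {"good": 0.6, "great": 0.8, "love": 0.9, "bad": -0.6, "terrible": -0.9}
NEGATIONS = {"not", "never"}
INTENSIFIERS = {"very": 1.5}   # assumed multiplier
NEGATION_WINDOW = 3            # assumed: look back up to three tokens


def sentence_score(tokens):
    """Average the adjusted scores of lexicon words; result lies in [-1, 1]."""
    scores = []
    for i, tok in enumerate(tokens):
        if tok not in TOY_LEXICON:
            continue
        score = TOY_LEXICON[tok]
        # Intensifier directly before the word amplifies its polarity.
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            score *= INTENSIFIERS[tokens[i - 1]]
        # A negation word within the preceding window inverts the score.
        if any(w in NEGATIONS for w in tokens[max(0, i - NEGATION_WINDOW):i]):
            score = -score
        scores.append(max(-1.0, min(1.0, score)))  # clip to [-1, 1]
    return sum(scores) / len(scores) if scores else 0.0


def to_label(score, threshold=0.05):
    """Map a continuous score to a polarity label (threshold is an assumption)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(to_label(sentence_score(["not", "very", "good"])))
```

Because every adjustment is tied to a specific token, each sentence score can be traced back to the words and rules that produced it.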
This lexicon-driven approach offers high interpretability, allowing analysts to trace model predictions back to specific words and rules. However, as expected, its performance is constrained in complex linguistic contexts, such as sarcasm, idiomatic expressions, and emerging slang. These limitations underscore the need for a complementary deep learning perspective.
The deep learning perspective leverages two state-of-the-art approaches: ChatGPT, a large generative language model, and DeepSeek, a specialized sentiment-focused deep learning pipeline.
ChatGPT was employed via its API with few-shot prompting to classify sentiment. Prompts were carefully designed to elicit concise polarity predictions while maintaining context awareness. For example:
“Classify the sentiment of the following text as Positive, Neutral, or Negative: ‘The movie was unexpectedly thrilling and kept me engaged the whole time.’”
To assess robustness, multiple prompt variations were tested, and ensemble averaging was applied to reduce stochastic prediction variance. The model’s responses were parsed programmatically, converting textual outputs into categorical labels compatible with evaluation metrics.
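The prompt-variation and parsing steps might look like the sketch below. The templates, the `ask_model` callable (a stand-in for whatever API client is used), and the use of a majority vote as the aggregation rule are assumptions; the text says only that ensemble averaging was applied and does not specify the exact procedure.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Hypothetical prompt variations; the study's actual prompts are not listed in full.
PROMPT_TEMPLATES = [
    "Classify the sentiment of the following text as Positive, Neutral, or Negative: '{text}'",
    "Is the sentiment of this text Positive, Neutral, or Negative? Text: '{text}'",
    "Text: '{text}'\nAnswer with one word (Positive, Neutral, or Negative):",
]

LABEL_RE = re.compile(r"\b(positive|neutral|negative)\b")


def parse_label(response: str) -> Optional[str]:
    """Extract the first polarity word from a free-text model response."""
    match = LABEL_RE.search(response.lower())
    return match.group(1) if match else None


def classify(text: str, ask_model: Callable[[str], str]) -> Optional[str]:
    """Query the model once per template and return the majority label."""
    votes = Counter()
    for template in PROMPT_TEMPLATES:
        label = parse_label(ask_model(template.format(text=text)))
        if label is not None:
            votes[label] += 1
    # Ties resolve to the label that was seen first.
    return votes.most_common(1)[0][0] if votes else None


# Usage with a canned stand-in for the API call.
replies = iter(["Positive.", "Sentiment: Negative", "positive"])
print(classify("The movie was unexpectedly thrilling.", lambda prompt: next(replies)))
```

Responses that contain no recognizable label are skipped rather than guessed, so a run can return `None` if every variation fails to parse.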
DeepSeek represents a fine-tuned transformer-based architecture optimized for sentiment analysis. It integrates pre-trained language embeddings (RoBERTa) with domain-specific supervised fine-tuning on the collected dataset. The pipeline includes tokenization, attention-based contextual encoding, and a softmax output layer producing probabilities over sentiment classes.
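The final step of such a pipeline, turning classifier logits into class probabilities, can be illustrated with a plain softmax. The logits and the label order below are made-up values for illustration; the encoder itself is not reproduced.

```python
import math

LABELS = ("positive", "neutral", "negative")  # assumed class order


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.1, 0.3, -1.0])  # hypothetical logits for one sentence
prediction = LABELS[probs.index(max(probs))]
print(prediction, [round(p, 3) for p in probs])
```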
Hyperparameters were optimized through grid search on the validation set, including learning rate, batch size, and dropout rate. Regularization techniques such as early stopping and weight decay were applied to prevent overfitting. DeepSeek’s modular architecture also allows interpretability analysis via attention-weight visualization, providing insights into which textual components contribute most to the model’s predictions.
The novelty of our study lies in the dual-perspective evaluation, integrating lexicon-based and deep learning approaches to capture complementary aspects of sentiment analysis. The framework operates on three primary dimensions:
Predictive Accuracy: Measured via standard metrics such as precision, recall, F1-score, and macro-averaged accuracy across sentiment classes.
Interpretability: Evaluated qualitatively by tracing predictions to lexicon entries or attention weights, allowing human analysts to understand the rationale behind model outputs.
Contextual Robustness: Tested across varying linguistic complexity, including sarcasm, idioms, mixed sentiments, and domain-specific slang, providing insights into model adaptability.
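For the predictive-accuracy dimension, macro-averaged metrics can be computed as in the sketch below, which averages per-class precision, recall, and F1 with equal weight per class. The function name and the returned dictionary layout are choices made for this illustration.

```python
def macro_metrics(y_true, y_pred, labels=("positive", "neutral", "negative")):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


gold = ["positive", "positive", "neutral", "negative"]
pred = ["positive", "neutral", "neutral", "negative"]
print(macro_metrics(gold, pred))
```

Macro averaging treats the three classes equally, which matches a dataset that is evenly distributed across sentiment classes.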
By applying both perspectives to the same dataset, we are able to identify convergent and divergent predictions, revealing areas where lexicon-driven transparency complements deep learning flexibility, and vice versa.
The experimental procedure followed a structured benchmarking protocol:
Lexicon-Based Evaluation: Each text entry was processed through the lexicon pipeline, generating sentiment scores subsequently mapped to categorical labels. Performance metrics were computed on the test set.
Deep Learning Evaluation: ChatGPT and DeepSeek were independently applied to the same test entries. For ChatGPT, multiple prompt iterations ensured stability; for DeepSeek, model outputs were obtained via the fine-tuned classifier.
Dual-Perspective Analysis: Comparative analysis examined agreement between lexicon-based and deep learning predictions, error patterns, and domain-specific performance variations. Confusion matrices, class-wise performance, and qualitative case studies were used to illustrate insights.
This methodology is designed to support fair, reproducible, and multi-dimensional benchmarking, and it establishes a foundation for the subsequent analysis of experimental results.
The dual-perspective framework allows a systematic evaluation of ChatGPT and DeepSeek under both lexicon-based and deep learning conditions. Table 1 summarizes the performance metrics on the test set, including accuracy, precision, recall, and F1-score.
Table 1. Performance Metrics Across Models and Methods
Model / Method | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Lexicon-Based | 0.71 | 0.70 | 0.69 | 0.695 |
ChatGPT (Deep Learning) | 0.84 | 0.85 | 0.83 | 0.84 |
DeepSeek (Deep Learning) | 0.87 | 0.88 | 0.86 | 0.87 |
As expected, deep learning approaches outperform the lexicon-based method in all metrics. DeepSeek achieves the highest overall performance, benefiting from task-specific fine-tuning and attention-based contextual encoding. ChatGPT, while slightly less accurate than DeepSeek, demonstrates impressive zero-shot and few-shot capabilities, highlighting the strength of generative pretraining in sentiment understanding.
Interestingly, the lexicon-based approach, despite lower overall performance, provides robust precision in highly polarized texts, particularly those containing sentiment-laden keywords. This highlights its enduring utility for interpretable predictions in applications where model transparency is critical.
To assess the models’ generalizability, we segmented the dataset into three domains: social media, movie reviews, and e-commerce feedback. Figure 1 visualizes F1-scores across these domains.
Figure 1. Domain-Specific F1-Scores for Each Model
Social Media: DeepSeek F1 = 0.85, ChatGPT F1 = 0.81, Lexicon F1 = 0.67
Movie Reviews: DeepSeek F1 = 0.89, ChatGPT F1 = 0.86, Lexicon F1 = 0.72
E-commerce Feedback: DeepSeek F1 = 0.87, ChatGPT F1 = 0.84, Lexicon F1 = 0.71
The results show that the deep learning models hold up across domains: DeepSeek’s F1 ranges from 0.85 to 0.89 and ChatGPT’s from 0.81 to 0.86. All three methods score lowest on informal social media text, likely due to complex slang and irregular syntax. The lexicon-based method is the most affected (F1 = 0.67), which is consistent with the limitations of static vocabularies on informal or sarcastic content.
A critical component of the dual-perspective analysis is examining the agreement between lexicon-based and deep learning predictions. We computed Cohen’s kappa between the lexicon outputs and the predictions of each deep learning model:
Lexicon vs ChatGPT: κ = 0.62 (moderate agreement)
Lexicon vs DeepSeek: κ = 0.65 (moderate agreement)
These results indicate that the methods largely agree in straightforward cases but diverge in texts with nuanced sentiment, sarcasm, or mixed emotions. For example, in a tweet stating, “I just love waiting for hours in customer support… #sarcasm,” both deep learning models correctly identify the negative sentiment, whereas the lexicon method misclassifies it as positive due to the presence of “love.”
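Cohen's kappa for two label sequences corrects the observed agreement for the agreement expected by chance. The sketch below shows the standard formula on toy labels; it is not the study's evaluation script, and the function name is chosen for this illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


lexicon = ["positive", "positive", "negative", "negative"]
model = ["positive", "negative", "negative", "negative"]
print(cohens_kappa(lexicon, model))
```

The same function applies to the inter-annotator check, with the two annotators' labels as inputs.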
To illustrate the models’ strengths and limitations, we analyzed representative examples from the test set:
Positive sentiment in movie reviews:
Text: “The plot twists kept me on the edge of my seat, truly captivating.”
Lexicon-based: Positive, score = 0.78
ChatGPT: Positive
DeepSeek: Positive
All models correctly capture the positive sentiment; lexicon scores offer interpretability, while deep learning models demonstrate contextual comprehension.
Sarcasm in social media:
Text: “Oh great, another Monday… just what I needed!”
Lexicon-based: Positive (incorrect)
ChatGPT: Negative (correct)
DeepSeek: Negative (correct)
Deep learning models effectively interpret sarcasm, showcasing their contextual sensitivity.
Mixed sentiment in e-commerce feedback:
Text: “The product quality is excellent, but the delivery was a nightmare.”
Lexicon-based: Neutral
ChatGPT: Neutral
DeepSeek: Mixed sentiment flagged, probabilities: Positive 0.52 / Negative 0.48
DeepSeek’s probabilistic output reflects nuanced sentiment, illustrating its fine-grained analysis capability.
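One way a mixed-sentiment flag could be derived from class probabilities is to check whether the positive and negative probabilities are both sizable and close to each other. The text does not describe how the flag was actually produced, so the rule, the `margin` and `floor` parameters, and their default values are assumptions for illustration only.

```python
def flag_mixed(probs, margin=0.10, floor=0.30):
    """Flag text as mixed when positive and negative are both likely and close.

    `margin` and `floor` are hypothetical thresholds, not values from the study.
    """
    pos = probs.get("positive", 0.0)
    neg = probs.get("negative", 0.0)
    return min(pos, neg) >= floor and abs(pos - neg) <= margin


print(flag_mixed({"positive": 0.52, "negative": 0.48}))
print(flag_mixed({"positive": 0.90, "neutral": 0.05, "negative": 0.05}))
```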
To further elucidate model behavior, we plotted prediction distributions for each method (Figure 2). Lexicon-based scores cluster near the extremes for clearly positive or negative texts, but the method struggles to separate texts in the neutral or mixed range. Deep learning models exhibit smoother distributions, capturing subtle variations in sentiment intensity. This visualization underscores the complementary nature of lexicon transparency and deep learning adaptability.
Performance Superiority of Deep Learning Models: DeepSeek achieves the highest accuracy and F1-scores, followed closely by ChatGPT.
Lexicon Strengths: High interpretability and robust handling of strongly polarized texts.
Contextual Robustness: Deep learning models outperform lexicon-based approaches in sarcasm, idioms, and mixed sentiment cases.
Dual-Perspective Insights: Comparing lexicon and deep learning predictions uncovers patterns of convergence and divergence, offering a richer understanding of sentiment analysis performance.
Domain Generalization: Deep learning models generalize better across diverse domains, whereas lexicon-based methods are sensitive to informal language variations.
In conclusion, the experimental results validate the efficacy of the dual-perspective framework. Lexicon-based methods provide interpretable, rule-based guidance, while deep learning models—particularly DeepSeek—deliver context-aware, high-performance predictions. The combination of these perspectives offers a comprehensive understanding of model behavior, facilitating responsible deployment of generative AI in sentiment analysis tasks.
The experimental results provide important insights into the comparative performance and practical utility of ChatGPT and DeepSeek, evaluated through a dual-perspective framework. One of the key observations is the superior predictive capability of deep learning models, particularly DeepSeek, across multiple domains and sentiment categories. Its attention-based transformer architecture, combined with task-specific fine-tuning, enables precise contextual understanding, allowing the model to successfully capture sarcasm, idiomatic expressions, and mixed sentiment. ChatGPT, despite being a general-purpose large language model, exhibits robust zero-shot and few-shot sentiment classification capabilities, demonstrating that generative pretraining imparts a significant advantage in context-sensitive interpretation. These findings reinforce the growing consensus that deep learning approaches are indispensable for handling the complexity and nuance inherent in human language.
At the same time, lexicon-based methods maintain notable interpretability and transparency, which are essential for high-stakes applications such as financial monitoring, policy evaluation, and clinical decision support. By mapping predictions directly to word-level sentiment scores, these methods provide a clear rationale for their outputs, facilitating human oversight and trust. However, the experiments confirm their limitations: static vocabularies cannot adequately capture evolving language, sarcasm, or domain-specific jargon, resulting in misclassifications in informal or nuanced contexts. Therefore, lexicon-driven approaches are best viewed as complementary to deep learning, providing a layer of interpretability rather than serving as a sole predictive engine.
The dual-perspective evaluation framework offers substantial value beyond traditional single-method benchmarks. By comparing lexicon-based outputs with deep learning predictions, the framework uncovers areas of agreement and divergence, highlighting cases where interpretability aligns or conflicts with predictive accuracy. For instance, strong convergence in highly polarized texts reinforces confidence in both approaches, while divergence in nuanced texts signals a need for human review or model refinement. This insight is particularly valuable for practitioners deploying sentiment analysis in applications where interpretability, accountability, and reliability are as important as raw accuracy.
Despite these strengths, the study also exposes several limitations. First, while the dataset spans multiple domains, it primarily represents English-language texts; the models’ performance in multilingual or low-resource settings remains untested. Second, although ChatGPT and DeepSeek perform well on static datasets, their real-time adaptability and robustness to rapidly changing linguistic trends, such as emerging slang or memes, warrant further investigation. Third, the evaluation of interpretability remains partially qualitative, as lexicon transparency is straightforward but attention-based visualizations in deep learning models are less directly interpretable. Developing standardized, quantitative interpretability metrics would enhance the rigor of dual-perspective analyses.
From an application perspective, these findings have several implications. Organizations seeking to deploy sentiment analysis for customer feedback, social media monitoring, or market research can benefit from combining deep learning models with lexicon-based validation. Such hybrid approaches can balance accuracy, contextual understanding, and explainability, offering a more trustworthy and actionable analysis. In domains where regulatory compliance or ethical accountability is critical, the dual-perspective framework provides a structured methodology for auditing AI predictions and ensuring that decision-making processes remain transparent.
Furthermore, the comparative evaluation underscores the complementarity of ChatGPT and DeepSeek. While DeepSeek excels in domain-specific, high-precision applications due to fine-tuned optimization, ChatGPT’s generative pretraining allows flexible deployment across diverse tasks without extensive retraining. By strategically combining these capabilities—using ChatGPT for rapid, zero-shot classification and DeepSeek for high-stakes, fine-grained sentiment assessment—practitioners can maximize both efficiency and reliability.
In conclusion, the discussion highlights the critical insight that no single approach is universally superior; instead, a hybrid, dual-perspective strategy offers the most comprehensive understanding of sentiment analysis performance. By integrating the interpretability of lexicon-based methods with the contextual sophistication of deep learning models, researchers and practitioners gain a nuanced, multidimensional view of AI-generated sentiment, enabling more informed deployment decisions and fostering trust in AI-driven analysis.
The findings of this study open several promising avenues for future research, spanning technical enhancements, application expansion, and methodological innovation. While the dual-perspective framework provides a robust foundation for comparing ChatGPT and DeepSeek, the dynamic nature of natural language and AI systems necessitates continuous development and refinement.
A key area for future research lies in improving the contextual robustness and adaptability of deep learning models. Despite their superior performance, both ChatGPT and DeepSeek exhibit limitations when handling emerging slang, code-switching, or highly informal text. Integrating continual learning mechanisms could allow models to dynamically update their knowledge base, thereby improving responsiveness to linguistic evolution. Additionally, multimodal sentiment analysis represents a frontier worth exploring. By incorporating text alongside audio, visual, or emoji-based cues, models could achieve more comprehensive emotion recognition, particularly in social media and customer feedback domains where sentiment is often conveyed through multiple channels.
Further technical innovations could focus on enhancing interpretability of deep learning models. While attention-based visualizations provide some insight, they are insufficient for stakeholders requiring precise justification for model predictions. Future work might explore hybrid explainable AI techniques, combining attention maps with counterfactual reasoning, rule extraction, or integrated gradient methods, to produce interpretable, human-understandable explanations without compromising predictive performance. This direction aligns with the growing demand for transparent AI in high-stakes environments, such as healthcare, finance, and policy analysis.
The dual-perspective framework can be extended to new domains and languages, addressing gaps in multilingual and cross-cultural sentiment analysis. As AI adoption expands globally, models must understand diverse linguistic and cultural nuances to provide accurate sentiment assessment. Developing multilingual embeddings or fine-tuning models on region-specific corpora could improve performance in non-English contexts, while also uncovering potential cultural biases in sentiment interpretation.
Another promising avenue is domain-specific hybrid deployment. For instance, combining ChatGPT’s generative flexibility with DeepSeek’s fine-tuned precision could enable real-time sentiment monitoring in customer service, social media analytics, or financial sentiment forecasting. Additionally, incorporating lexicon-based checks into live pipelines could ensure interpretability and mitigate erroneous predictions, particularly in regulatory-sensitive industries. Research into real-time model evaluation frameworks would facilitate continuous monitoring, ensuring that AI sentiment analysis remains reliable and adaptive over time.
Methodologically, future research could focus on quantifying interpretability and model uncertainty in dual-perspective frameworks. Current evaluation primarily emphasizes predictive metrics, while interpretability and confidence measures remain qualitative. Developing standardized metrics for these dimensions would allow systematic comparison across models and methods, supporting more informed decision-making for both researchers and practitioners.
Additionally, exploring hybrid ensemble strategies that dynamically weight lexicon-based and deep learning predictions could optimize performance across diverse text types. For example, in strongly polarized texts, lexicon outputs may dominate, whereas in contextually complex passages, deep learning predictions could carry more weight. Such adaptive frameworks would maximize accuracy, interpretability, and reliability simultaneously.
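As a rough illustration of what such an adaptive scheme might look like, the sketch below weights the lexicon score by its own magnitude, so strongly polarized lexicon scores dominate and weak ones defer to the model. This is a hypothetical design, not a method evaluated in this study; the weighting rule and the function name `blend` are assumptions.

```python
def blend(lexicon_score, model_probs):
    """Combine a lexicon score in [-1, 1] with model class probabilities.

    The lexicon weight equals |lexicon_score| (an assumed heuristic): the
    stronger the lexicon polarity, the more it contributes to the result.
    """
    # Collapse model probabilities to a signed score in [-1, 1].
    model_score = model_probs.get("positive", 0.0) - model_probs.get("negative", 0.0)
    weight = abs(lexicon_score)
    return weight * lexicon_score + (1.0 - weight) * model_score


# Weak lexicon signal: the model's negative reading dominates.
print(blend(0.0, {"positive": 0.1, "neutral": 0.1, "negative": 0.8}))
# Strong lexicon signal: the lexicon score dominates.
print(blend(1.0, {"positive": 0.1, "neutral": 0.1, "negative": 0.8}))
```

A learned weighting, for example one conditioned on text features, would be a natural alternative to this fixed heuristic.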
Finally, the integration of ethical and societal considerations into dual-perspective sentiment analysis warrants attention. Future studies could investigate how AI models propagate bias, misinterpret culturally sensitive expressions, or influence user perceptions when deployed at scale. By embedding fairness and accountability into both lexicon-based and deep learning pipelines, researchers can ensure that sentiment analysis tools are not only accurate but also socially responsible.
In summary, the future research directions span three interconnected axes:
Technical – enhancing contextual understanding, multimodal integration, and interpretability of deep learning models.
Application – expanding to multilingual, cross-domain settings and developing real-time, hybrid deployment frameworks.
Methodological – creating quantitative measures for interpretability, uncertainty, and fairness, while exploring adaptive hybrid ensembles.
Pursuing these directions will not only improve the performance and reliability of sentiment analysis systems like ChatGPT and DeepSeek but also strengthen the transparency, fairness, and societal trustworthiness of AI in emotionally sensitive applications. By continuously refining dual-perspective evaluation frameworks, the research community can ensure that generative AI contributes responsibly and effectively to the understanding of human sentiment.
This study presents a comprehensive dual-perspective evaluation of ChatGPT and DeepSeek for sentiment analysis, integrating lexicon-based and deep learning approaches. Our experiments demonstrate that deep learning models consistently outperform lexicon-based methods in accuracy, F1-score, and contextual robustness, effectively handling sarcasm, idioms, and mixed sentiments. ChatGPT exhibits impressive zero-shot and few-shot capabilities, while DeepSeek achieves the highest performance through fine-tuning and domain-specific optimization. Meanwhile, lexicon-based methods retain value for interpretability and transparency, offering clear explanations that are critical in high-stakes applications.
By applying a dual-perspective framework, this research highlights the complementarity of lexicon-driven and deep learning approaches, revealing convergence in highly polarized texts and divergence in nuanced contexts. Such insights provide actionable guidance for practitioners seeking to balance accuracy, contextual sensitivity, and explainability. The study also identifies limitations, including challenges in multilingual, informal, or rapidly evolving language settings, as well as the need for quantitative interpretability metrics.
Overall, the findings emphasize that no single method is universally sufficient; rather, a hybrid, dual-perspective strategy offers the most comprehensive understanding of sentiment. This approach not only advances methodological rigor in AI benchmarking but also supports responsible deployment of generative models in emotionally sensitive domains such as social media monitoring, customer feedback analysis, and financial sentiment prediction. Future research building upon this framework can further enhance adaptability, fairness, and trustworthiness in AI-driven sentiment analysis.
Cambria, E., Schuller, B., Xia, Y., & Havasi, C. (2013). New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, 28(2), 15–21.
Hutto, C. J., & Gilbert, E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International AAAI Conference on Weblogs and Social Media (ICWSM).
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT.
Radford, A., Wu, J., Child, R., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Technical Report.
Liu, Y., Ott, M., Goyal, N., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
Esuli, A., & Sebastiani, F. (2006). SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. Proceedings of LREC.
Zhang, L., Wang, S., & Liu, B. (2018). Deep Learning for Sentiment Analysis: A Survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1253.