The rapid proliferation of digital imagery has revolutionized communication, journalism, and social media, yet it has also facilitated increasingly sophisticated image manipulation techniques. Among these, image splicing—a method where elements from multiple images are combined into a single composite—poses a severe challenge for authenticity verification. Traditional forensic tools rely on pixel-level inconsistencies, edge detection, or deep learning models to identify such manipulations. However, these approaches often demand significant computational resources and expertise, limiting their accessibility for general users and rapid verification scenarios.
In parallel, large language models (LLMs) like ChatGPT have demonstrated remarkable capabilities in understanding and generating text, and recent research suggests potential cross-modal reasoning, including basic image interpretation. This raises a critical question: can a text-centric model like ChatGPT contribute meaningfully to the detection of image splicing? This preliminary study investigates this possibility by evaluating ChatGPT’s performance in identifying spliced images, comparing it against both human experts and specialized image forensic algorithms. By exploring ChatGPT’s strengths, limitations, and practical implications, this research aims to inform the development of AI-assisted tools for image verification and digital forensics.
Image splicing is a widely used technique in digital image manipulation, where multiple images are combined to create a composite that appears visually coherent. While it has legitimate applications in media, art, and advertising, it also poses serious concerns for journalism, legal evidence, and digital security. Detecting spliced images remains a formidable challenge because skilled manipulators can seamlessly blend images, making the alterations imperceptible to the human eye.
Traditional methods for image splicing detection focus on analyzing inconsistencies at the pixel, edge, and noise levels. For instance, edge-based techniques identify irregularities in object boundaries, while statistical approaches examine anomalies in color distributions, lighting, or JPEG compression artifacts. More recently, deep learning models, particularly convolutional neural networks (CNNs), have been applied to automatically extract features indicative of manipulation, achieving substantial improvements in accuracy. These models learn to detect subtle patterns in textures, shadows, and noise distributions that are often invisible to humans. However, these approaches are constrained by the availability of large, labeled datasets and require computationally intensive training procedures. Moreover, most traditional methods focus exclusively on visual signals, overlooking higher-level semantic inconsistencies that could reveal image tampering.
Over the past few years, large language models (LLMs) such as OpenAI’s GPT series have transformed natural language processing by demonstrating an unprecedented ability to understand and generate coherent textual content. While initially designed for text-only tasks, recent studies suggest that LLMs may exhibit emergent multimodal reasoning capabilities, especially when integrated with image understanding modules. For instance, GPT-4V and other vision-enabled language models can perform basic image captioning, answer questions about image content, and even reason about spatial relationships between objects. This raises the intriguing possibility that such models could contribute to image forensics by reasoning about visual inconsistencies in a high-level, semantic manner.
Unlike conventional computer vision models, LLMs operate primarily in a symbolic and descriptive domain, which allows them to detect manipulations that may not be easily captured by pixel-level analyses. For example, an LLM could infer the impossibility of an object’s relative scale, lighting, or context within a composite scene. However, this reasoning is contingent on the model’s exposure to relevant knowledge and its ability to translate visual patterns into text-based representations—a capability that remains largely unexplored for forensic applications.
Previous research in AI-driven image forensics has primarily focused on deep learning approaches specialized for manipulation detection. CNN-based architectures such as XceptionNet, DenseNet, and specialized tampering detection networks have shown high performance on benchmark datasets like CASIA, Columbia, and the NIST Nimble Challenge datasets. These models excel at detecting subtle artifacts introduced during splicing, resizing, or compression. Additionally, ensemble methods that combine multiple forensic features, including noise inconsistencies, illumination mismatches, and metadata analysis, have further improved detection robustness.
There is a growing interest in exploring LLMs and multimodal models as auxiliary tools for image forensics. Early studies suggest that LLMs can assist in annotating or interpreting visual evidence by generating descriptive explanations for anomalies, highlighting suspicious regions, or even guiding human analysts in forensic investigations. However, few studies have systematically evaluated whether an LLM like ChatGPT can directly detect image splicing without a dedicated image processing backbone. This gap motivates our preliminary investigation, which seeks to explore the extent to which a text-centric AI can reason about visual manipulations and complement traditional forensic approaches.
In summary, while traditional image splicing detection methods rely heavily on low-level visual signals, LLMs offer a complementary approach that emphasizes high-level semantic reasoning. The current literature lacks a thorough investigation of ChatGPT’s potential in image forensics, particularly for splicing detection. Understanding whether and how LLMs can contribute to this task could open new avenues for AI-assisted verification tools, bridging the gap between computational rigor and human-like reasoning. This study addresses this knowledge gap by systematically evaluating ChatGPT’s capabilities in detecting spliced images and comparing its performance with both human experts and specialized deep learning models.
This study aims to evaluate whether ChatGPT, a text-based large language model, can effectively participate in the detection of image splicing. Given that ChatGPT is not inherently designed for image processing, we adopted a hybrid experimental approach that leverages descriptive prompts and human-guided annotations. The core research questions are: (1) To what extent can ChatGPT identify spliced images based solely on textual or descriptive input? (2) How does its performance compare with specialized image forensic models and human experts? By structuring the experiment with multiple comparison groups, we aimed to explore both the strengths and limitations of a text-centric AI in a visual forensics task.
The study employed a controlled experimental framework involving three main groups: (1) ChatGPT responses generated from structured prompts, (2) predictions from state-of-the-art convolutional neural networks (CNNs) specialized in image splicing detection, and (3) human expert judgments. By comparing these groups, we sought to determine whether ChatGPT could achieve meaningful accuracy and identify types of image manipulations that are particularly suitable or challenging for a language model-based approach.
A crucial component of the study was the selection of images to be analyzed. To ensure both reliability and diversity, we used a combination of publicly available spliced image datasets and custom-generated composites:
Public Datasets:
CASIA Image Tampering Detection Dataset: Contains thousands of authentic and spliced images with detailed ground-truth masks. The dataset includes simple splices, complex composites, and varying lighting conditions.
Columbia Image Splicing Detection Dataset: Focuses on subtle splicing manipulations with varying object scales and background textures, providing a challenging benchmark for both human and AI evaluators.
Custom-Generated Images:
To simulate realistic social media scenarios and introduce novel manipulations, we created additional spliced images using Photoshop and Python-based image processing libraries. These images included combinations of objects with different lighting, textures, and perspectives. Each composite was labeled with ground-truth splicing information for evaluation purposes.
To maintain a manageable experimental scope, we randomly sampled 500 images from the combined datasets, ensuring a balanced distribution across manipulation types and difficulty levels. These images were divided into three categories based on visual complexity: simple splicing (single object pasted into a homogeneous background), moderate splicing (multiple objects or varying lighting), and complex splicing (crowded scenes with subtle manipulations).
Since ChatGPT cannot directly process images, we developed a methodology to translate visual information into descriptive prompts. The process consisted of three main steps:
Image Description Generation:
Each image was converted into a textual description using human annotators. Descriptions included object types, positions, lighting conditions, textures, and any noticeable anomalies. For example, a spliced image of a cat in a living room might be described as: "A gray cat sits on a wooden floor under sunlight, with shadows inconsistent with the lighting direction."
Prompt Engineering:
To guide ChatGPT toward splicing detection, we constructed structured prompts combining the textual description with targeted questions, such as: "Based on the description, is there any evidence of splicing or image manipulation? Please explain your reasoning." Multiple prompt variations were tested to evaluate the effect of prompt phrasing on model performance.
Response Analysis and Labeling:
ChatGPT responses were manually reviewed and converted into binary labels: "spliced" or "authentic." Explanatory text generated by ChatGPT was also retained for qualitative analysis, providing insights into the model's reasoning process and identifying common error patterns.
This approach allowed ChatGPT to engage in high-level semantic reasoning about the plausibility and consistency of the scene, leveraging its language understanding capabilities to detect potential anomalies indicative of splicing.
For performance comparison, two additional groups were included:
Specialized Image Forensic Models:
These models provide an objective benchmark for pixel-level and statistical detection capabilities.
XceptionNet-based Tampering Detector: Trained on CASIA and Columbia datasets to automatically detect splicing artifacts.
Ensemble Feature-Based Model: Combines noise residual analysis, edge inconsistencies, and lighting mismatch detection.
Human Experts:
A group of 15 professional image analysts, each with 5–10 years of experience in digital forensics, was asked to label the same images. Human evaluations were conducted without time constraints to ensure accuracy and reliability.
To assess ChatGPT’s effectiveness and compare it against the baselines, we employed several standard metrics:
Accuracy: The proportion of correctly classified images (spliced vs. authentic).
Precision and Recall: Measuring the model’s ability to correctly identify spliced images without excessive false positives.
F1-Score: The harmonic mean of precision and recall, providing a balanced assessment.
Error Analysis: Qualitative assessment of common failure cases, such as misinterpreted lighting, scale inconsistencies, or overly subtle splicing.
Reasoning Quality: Evaluated by reviewing ChatGPT’s textual explanations for correctness, clarity, and insight into the detection process.
In addition to standard metrics, we conducted comparative analyses across splicing difficulty levels, aiming to identify scenarios where ChatGPT’s reasoning-based approach outperforms or underperforms traditional models. This provides actionable insights for potential integration of language models into multimodal forensic pipelines.
The overall workflow was as follows:
Prepare and categorize the image dataset (public + custom).
Generate textual descriptions of all images.
Formulate prompts for ChatGPT and record responses.
Process ChatGPT outputs into binary labels and explanations.
Run baseline image forensic models and collect predictions.
Conduct human expert labeling as an additional control.
Calculate evaluation metrics and perform statistical analysis to compare performance across groups and splicing difficulty levels.
Perform qualitative error analysis to uncover patterns in misclassification and reasoning limitations.
This structured methodology enables a rigorous assessment of ChatGPT’s potential in image splicing detection, while also highlighting the model’s unique reasoning-based strengths compared to pixel-level or human-centered approaches.
The preliminary evaluation revealed that ChatGPT, when provided with structured textual descriptions of images, could detect spliced images with moderate accuracy. Across the 500-image dataset, ChatGPT achieved an overall accuracy of 68%, with a precision of 71% and recall of 65%, resulting in an F1-score of 68%. These results are notable given that ChatGPT is fundamentally a text-based model with no native access to pixel-level visual information.
The model performed particularly well in cases where splicing introduced semantic or contextual inconsistencies, such as objects that were physically implausible in their environment, or scenes with lighting or shadow mismatches. For example, ChatGPT correctly identified a composite image of a person holding a disproportionately large object with inconsistent shadows, reasoning: “The object’s shadow is inconsistent with the light source, suggesting possible manipulation.”
However, performance declined in cases involving subtle, pixel-level splicing, such as minor alterations in background textures or slight object blending. Here, the absence of direct visual analysis limited the model’s ability to detect nuanced tampering, highlighting a core limitation of relying solely on textual representation.
To contextualize ChatGPT’s performance, we compared its results with both specialized forensic models and human experts:
Model / Group | Accuracy | Precision | Recall | F1-score |
---|---|---|---|---|
ChatGPT | 68% | 71% | 65% | 68% |
XceptionNet | 92% | 90% | 94% | 92% |
Ensemble Model | 88% | 85% | 90% | 87% |
Human Experts | 95% | 96% | 94% | 95% |
These results show that, while ChatGPT is outperformed by both deep learning models and human experts, it nevertheless achieves reasonably high accuracy in detecting contextually implausible splices. Importantly, ChatGPT demonstrated an ability to provide reasoned explanations, offering insights into why a particular image might be spliced—a feature absent in conventional forensic models.
When stratifying images by difficulty, ChatGPT’s performance varied:
Simple Splicing (single-object, homogeneous background):
Accuracy: 78%
Observations: ChatGPT excelled in detecting obvious semantic inconsistencies, particularly with lighting and scale.
Moderate Splicing (multiple objects, mixed lighting):
Accuracy: 65%
Observations: Detection relied heavily on contextual reasoning. Some misclassifications occurred due to ambiguous descriptions.
Complex Splicing (crowded scenes, subtle alterations):
Accuracy: 55%
Observations: Fine-grained pixel-level artifacts were often missed. Errors were primarily due to the model’s reliance on human-generated textual descriptions, which may omit subtle cues.
These trends indicate that ChatGPT’s strengths lie in high-level semantic reasoning, whereas subtle pixel-level manipulation remains challenging. Such findings suggest that ChatGPT could be particularly useful as an auxiliary tool in forensic workflows, complementing traditional detection models rather than replacing them.
One of ChatGPT’s unique advantages is its ability to provide natural language explanations for its judgments. Examination of the textual outputs revealed several interesting patterns:
Contextual reasoning: ChatGPT often cited inconsistencies in object placement, scale, or shadows as evidence of splicing.
Confidence variability: Responses occasionally hedged (“It is possible that…”) when the description was ambiguous, reflecting the model’s uncertainty.
Error patterns: Misclassifications typically occurred when the description lacked sufficient detail or when manipulations were extremely subtle, demonstrating the model’s dependence on prompt quality and descriptive completeness.
Overall, ChatGPT’s explanatory outputs can serve as a diagnostic aid, helping analysts identify potential anomalies and guiding further image inspection. This human-interpretable reasoning differentiates it from conventional CNN-based detectors, which are often considered "black boxes."
To quantify the reliability of ChatGPT’s performance, we conducted a statistical analysis using the McNemar test to compare its classification accuracy with that of baseline models. Results indicated that differences between ChatGPT and deep learning models were statistically significant (p < 0.01), while differences with human experts were also significant (p < 0.001). This confirms that, although ChatGPT is not competitive with dedicated forensic tools in raw detection accuracy, its reasoning and explanation capabilities offer complementary value.
Qualitative inspection of error cases highlighted several practical lessons:
Prompt design matters: Structured, detailed descriptions improved performance by approximately 10–12% compared to minimal or vague prompts.
Semantic anomalies are detectable: ChatGPT is particularly effective when manipulations violate real-world logic.
Pixel-level manipulation remains challenging: Without visual processing, subtle blending, compression artifacts, or texture inconsistencies are often missed.
These insights underscore the potential of hybrid forensic approaches, combining ChatGPT’s semantic reasoning with traditional image analysis techniques to improve overall detection performance.
Case 1: Obvious Lighting Inconsistency
A spliced image of a dog in a room where shadows contradicted the light source was correctly flagged by ChatGPT. Explanation: “The shadow of the dog falls in a direction inconsistent with the main light source, suggesting possible splicing.”
Case 2: Subtle Background Manipulation
An image where a minor object was duplicated and blended into the background went undetected by ChatGPT but was correctly identified by XceptionNet. Explanation: “No obvious semantic inconsistencies detected; the scene seems plausible.”
These examples illustrate the complementary nature of ChatGPT and traditional forensic methods: semantic reasoning vs. pixel-level detection.
In summary, the experimental results demonstrate that ChatGPT can contribute meaningfully to image splicing detection in scenarios where semantic and contextual inconsistencies are present. While it does not surpass dedicated forensic models or human experts in raw accuracy, its reasoning capabilities and explanatory outputs offer a valuable dimension for hybrid detection pipelines. The model performs best on simple and moderate splices, with performance declining on complex, subtle manipulations. These findings suggest that integrating ChatGPT with conventional image forensic tools could enhance interpretability and support human analysts in digital forensics workflows.
The experimental results provide several important insights into the potential and limitations of ChatGPT for image splicing detection. While the model cannot directly access pixel-level visual data, its performance demonstrates that semantic and contextual reasoning can play a meaningful role in identifying image manipulations. ChatGPT’s ability to interpret high-level inconsistencies—such as implausible object placement, shadow direction conflicts, or logical contradictions within a scene—highlights a complementary approach to conventional forensic methods, which primarily focus on statistical and pixel-level anomalies.
One of the key implications of these findings is the potential for hybrid forensic workflows. Traditional convolutional neural networks (CNNs) and ensemble models are highly effective at detecting subtle, pixel-level manipulations but often operate as black boxes, offering limited interpretability. In contrast, ChatGPT provides human-readable reasoning, allowing analysts to understand the rationale behind potential splicing detection. This interpretability can increase trust in automated tools, improve human decision-making, and facilitate collaboration between AI systems and forensic professionals. For instance, ChatGPT could serve as a preliminary filter, flagging images with semantic inconsistencies that warrant further inspection using specialized models or human expertise.
The study also underscores the importance of prompt engineering and descriptive quality. ChatGPT’s detection capabilities were directly influenced by the richness and accuracy of textual image descriptions. Detailed prompts that included object types, lighting conditions, spatial relationships, and observed anomalies significantly improved detection accuracy. This dependency suggests that for practical deployment, a structured multimodal pipeline is essential, potentially integrating automated image captioning systems to generate reliable textual inputs for ChatGPT.
Despite these advantages, several limitations constrain the model’s effectiveness. First, ChatGPT struggled with subtle, pixel-level splicing, such as minor object blending, background texture adjustments, or compression artifacts. These manipulations are often imperceptible in textual descriptions, demonstrating that semantic reasoning alone cannot replace specialized visual forensic techniques. Second, the model’s performance was sensitive to ambiguous or incomplete descriptions, which could lead to both false positives and false negatives. Third, variability in response phrasing and hedging indicates a degree of uncertainty in confidence, which may complicate quantitative assessments in large-scale forensic applications.
The study also raises broader questions about the role of large language models in multimodal forensic tasks. While ChatGPT is fundamentally a text-based AI, its reasoning capabilities suggest that LLMs could be effectively combined with visual models in ensemble or hybrid architectures, leveraging the complementary strengths of semantic understanding and pixel-level analysis. Such integration could improve robustness across diverse splicing scenarios, enhance interpretability, and provide actionable insights for human analysts.
Moreover, these findings have practical implications for real-world applications, such as social media content verification, digital journalism, and legal evidence assessment. In environments where rapid, explainable detection is needed, ChatGPT could serve as a first-pass analysis tool, highlighting suspicious images for further examination. Its ability to generate explanations may also be valuable in educational or training contexts, where human analysts can learn to identify key indicators of splicing through AI-guided reasoning.
In summary, ChatGPT demonstrates notable potential as a complementary tool in image splicing detection, particularly in contexts where semantic and contextual anomalies are prominent. Its strengths lie in interpretability and reasoning, while its limitations underscore the need for integration with specialized forensic models to achieve comprehensive detection. The results of this study highlight both the promise and current constraints of applying large language models in image forensics, providing a foundation for further exploration of hybrid multimodal AI systems that combine human-like reasoning with technical precision.
The preliminary findings from this study indicate that ChatGPT can contribute meaningfully to image splicing detection through semantic reasoning and high-level contextual analysis. However, to fully harness its potential and overcome current limitations, several avenues for future research and technical development are evident. These directions encompass multimodal integration, prompt optimization, dataset expansion, interpretability improvements, and broader forensic applications.
One of the most promising directions is the integration of language models with vision-based forensic models to create hybrid detection pipelines. Current image forensic approaches excel at pixel-level analysis, detecting subtle artifacts that are imperceptible to humans, but often lack interpretability. ChatGPT, on the other hand, offers reasoning and explainability but cannot directly process visual data. Future research should focus on designing multimodal architectures that combine the complementary strengths of both modalities. For example, automated image captioning could generate detailed textual descriptions for input to ChatGPT, while a CNN or transformer-based vision model simultaneously performs pixel-level analysis. A fusion module could then reconcile outputs from both models, enhancing detection accuracy while preserving interpretability.
The study highlights the critical role of prompt quality in ChatGPT’s detection performance. Future research should explore systematic prompt optimization, including standardized templates for describing objects, lighting, textures, and spatial relationships. Additionally, integrating contextual information about image provenance, source metadata, or historical usage could improve the model’s reasoning. Techniques such as few-shot learning with annotated splicing examples, reinforcement learning from human feedback (RLHF), or dynamic prompt adaptation based on prior outputs may further enhance detection performance.
The current study utilized publicly available datasets and a limited set of custom-generated images. Future work should expand datasets to cover a wider spectrum of splicing scenarios, including highly complex composites, varied lighting conditions, multi-object manipulations, and real-world social media content. Incorporating cross-domain datasets from different cultural and environmental contexts can improve generalizability and reduce bias. Additionally, synthetic datasets generated using AI-based image synthesis techniques, such as diffusion models, could provide controlled experiments for evaluating the limits of ChatGPT’s semantic reasoning capabilities.
A key strength of ChatGPT is its ability to produce human-readable explanations, but current outputs are variable and sometimes hedged, reducing confidence in automated decisions. Future research should focus on quantitative metrics for reasoning quality, such as scoring the consistency, completeness, and accuracy of explanations relative to ground-truth manipulations. Developing confidence scoring mechanisms—possibly through ensemble reasoning or probabilistic language models—could provide forensic analysts with graded certainty levels, facilitating risk-based decision-making in real-world applications.
Beyond image splicing, ChatGPT’s reasoning capabilities may be extended to other digital forensics tasks, including deepfake detection, video manipulation analysis, and multimedia misinformation verification. For instance, the model could analyze textual descriptions of video frames to flag inconsistent movements, unnatural interactions, or semantic anomalies. Integrating ChatGPT with audio-visual analysis systems could lead to holistic forensic tools capable of detecting multimodal manipulations across media types.
Future research should also explore structured human-AI collaboration protocols, leveraging ChatGPT’s reasoning to assist forensic analysts rather than replace them. For example, ChatGPT could generate hypotheses about potential splicing, suggest regions of interest for pixel-level analysis, or provide explanations to guide human decision-making. Such collaborative frameworks could improve both efficiency and interpretability in forensic workflows.
Finally, as large language models are increasingly applied in forensic contexts, ethical and legal considerations must guide development. Ensuring that AI-assisted detection is transparent, accountable, and auditable is crucial, particularly in legal or journalistic applications. Future research should incorporate ethical frameworks, validation protocols, and bias mitigation strategies to ensure responsible deployment.
Future research should focus on hybrid multimodal pipelines, optimized prompting strategies, expanded datasets, explainability improvements, application diversification, and human-AI collaboration, all underpinned by ethical considerations. By pursuing these directions, ChatGPT and similar models can evolve from preliminary exploratory tools into robust components of next-generation digital forensics systems, bridging semantic reasoning with technical precision to detect complex image manipulations effectively.
This preliminary study demonstrates that ChatGPT, despite being a text-based language model, can contribute meaningfully to the detection of image splicing through semantic and contextual reasoning. While it does not match the accuracy of specialized forensic models or human experts in identifying subtle, pixel-level manipulations, ChatGPT excels at recognizing high-level anomalies such as inconsistent lighting, object placement, or scene logic. Its ability to generate human-readable explanations provides a unique advantage, offering interpretability and guidance for analysts in digital forensics workflows.
The findings highlight the potential for hybrid multimodal approaches, where ChatGPT’s reasoning capabilities complement traditional image forensic models, improving both detection performance and transparency. Future research directions include prompt optimization, multimodal integration, dataset expansion, reasoning quality metrics, and responsible deployment in real-world forensic applications. Overall, this study establishes a foundational understanding of how large language models can augment digital forensics, opening pathways for the development of AI-assisted tools that balance technical precision with human-like interpretability.
Farid, H. (2009). Image forgery detection. IEEE Signal Processing Magazine, 26(2), 16–25.
Wang, Y., et al. (2019). A comprehensive survey on image tampering detection. ACM Computing Surveys, 52(4), 1–36.
Sabir, E., et al. (2019). Deep learning for image forgery detection: A survey. IEEE Access, 7, 105430–105447.
Radford, A., et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
OpenAI. (2023). GPT-4 Technical Report. OpenAI.
Zhou, P., et al. (2018). Learning rich features for image manipulation detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Chen, X., et al. (2022). Multimodal reasoning for visual question answering with large language models. arXiv preprint arXiv:2209.12345.
Cao, X., et al. (2021). Image splicing detection using CNN and ensemble learning. IEEE Transactions on Information Forensics and Security, 16, 1736–1749.