Text summarization has long been a cornerstone of Natural Language Processing (NLP). From enabling users to quickly digest news stories, to helping enterprises process large volumes of customer feedback, the ability to condense long-form content into coherent and representative summaries is both practical and transformative. Over the past decade, research in automatic summarization has evolved from simple extractive approaches (where sentences are directly pulled from the source text) to abstractive methods (where models generate new phrases that capture the meaning of the original).
With the advent of Large Language Models (LLMs), the boundaries of summarization have shifted dramatically. Modern LLMs are capable not only of understanding context at scale but also of producing human-like summaries that balance fluency, relevance, and informativeness. In this study, we focus on three prominent instruction-tuned LLMs that have demonstrated strong performance on downstream tasks: MPT-7b-instruct, Falcon-7b-instruct, and OpenAI’s text-davinci-003 (a GPT-3.5-series model).
Our objective is to conduct a comparative evaluation of these models on benchmark datasets and to assess their effectiveness using widely recognized evaluation metrics. By doing so, we aim to offer insights into how different LLM architectures and training paradigms influence summarization outcomes, as well as to provide practitioners with practical guidelines for selecting models for real-world applications.
Historically, automatic summarization techniques were based on statistical and rule-based methods. Early algorithms often relied on sentence scoring, keyword frequency, or positional heuristics to extract “important” sentences from documents. While these extractive methods were computationally inexpensive, they often produced summaries that lacked coherence or failed to capture the deeper semantics of the text.
The rise of deep learning shifted the field toward abstractive summarization. Neural networks, especially sequence-to-sequence (Seq2Seq) architectures with attention mechanisms, allowed models to generate summaries that paraphrased and compressed the source text. With the emergence of transformers, the ability of models to capture long-range dependencies and nuanced meaning further revolutionized summarization.
Today, LLMs represent the state of the art. Instruction-tuned models are capable of adapting to summarization tasks with minimal or no fine-tuning, making them attractive for research and commercial deployment alike.
MPT-7b-instruct is part of the MosaicML Pretrained Transformer family. It is a 7-billion parameter model optimized for instruction following, which makes it suitable for downstream tasks such as summarization, classification, and open-domain question answering. MPT-7b-instruct emphasizes efficient training pipelines and scalable deployment, allowing it to be fine-tuned or adapted with relative ease.
Notable features:
Based on a decoder-only transformer architecture.
Trained with high-quality curated datasets.
Instruction fine-tuned to improve generalization.
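For orientation, a model like this can be loaded and prompted through the Hugging Face transformers library. The snippet below is a minimal sketch rather than the exact configuration used in our experiments; the Alpaca-style prompt template and the generation settings are illustrative assumptions.

```python
# Minimal sketch: summarizing with MPT-7b-instruct via Hugging Face transformers.
# Prompt template and sampling settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

article = "..."  # replace with the source document to summarize

# Alpaca-style instruction format (an assumption; adjust to the model's expected template)
prompt = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\nSummarize the following article in three sentences.\n\n"
    f"{article}\n\n### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
summary = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)
```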
Falcon-7b-instruct, developed by the Technology Innovation Institute (TII), is part of the Falcon family of models. It is designed with a focus on efficiency and open-access deployment. The “instruct” version is specifically fine-tuned to better follow human prompts, making it versatile for summarization tasks.
Key strengths:
Optimized architecture for reduced inference cost.
Widely recognized for strong performance among open-access models.
Balances quality with computational efficiency.
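As with MPT, Falcon-7b-instruct can be driven through the transformers text-generation pipeline. The sketch below is illustrative only; the prompt wording and sampling settings are assumptions rather than our exact setup.

```python
# Minimal sketch: summarizing with Falcon-7b-instruct via the text-generation pipeline.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    torch_dtype=torch.bfloat16,   # reduces memory footprint on supported hardware
    device_map="auto",            # requires the accelerate package
)

article = "..."  # replace with the source document to summarize
prompt = f"Summarize the following article in one short paragraph:\n\n{article}\n\nSummary:"

result = generator(
    prompt,
    max_new_tokens=120,
    do_sample=True,
    top_p=0.9,
    return_full_text=False,  # return only the generated continuation
)
print(result[0]["generated_text"])
```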
Text-davinci-003, an earlier member of OpenAI’s GPT-3.5 family, remains one of the most widely used models for general-purpose natural language generation. Although more recent iterations such as GPT-4 have surpassed it, text-davinci-003 is still a significant reference point for academic benchmarking. It is instruction-tuned, capable of following detailed prompts, and often produces highly fluent and coherent text.
Highlights:
Trained on massive and diverse datasets.
Strong generalization across NLP tasks.
Outperforms many open-source alternatives in coherence and fidelity.
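Because text-davinci-003 was served through OpenAI’s Completions endpoint, a minimal call looks roughly like the following; this uses the legacy openai-python (< 1.0) interface, the model has since been deprecated by OpenAI, and the prompt shown is illustrative.

```python
# Minimal sketch: summarizing with text-davinci-003 via the legacy Completions endpoint.
import os
import openai  # legacy openai-python (< 1.0) interface

openai.api_key = os.environ["OPENAI_API_KEY"]

article = "..."  # replace with the source document to summarize

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Summarize the following article in three sentences:\n\n{article}\n\nSummary:",
    max_tokens=150,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())
```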
The CNN/Daily Mail dataset is one of the most widely used benchmarks for summarization. It consists of news articles paired with human-written highlights. The dataset is suitable for testing models’ ability to generate multi-sentence summaries that capture both factual accuracy and narrative coherence.
The Extreme Summarization (XSum) dataset is designed for more challenging abstractive summarization tasks. Each article is paired with a single-sentence summary that is highly abstractive. Unlike CNN/Daily Mail, where summaries often overlap with input text, XSum requires models to rephrase and generalize. This makes it a stronger test of models’ semantic understanding.
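Both benchmarks are available through the Hugging Face datasets library; the sketch below shows one way to load them, with the configuration name and field names following the public dataset cards.

```python
# Minimal sketch: loading the two summarization benchmarks with Hugging Face datasets.
from datasets import load_dataset

# CNN/DailyMail pairs news articles with multi-sentence human-written highlights.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")
print(cnn_dm[0]["article"][:300])
print(cnn_dm[0]["highlights"])

# XSum pairs BBC articles with a single, highly abstractive summary sentence.
xsum = load_dataset("xsum", split="test")
print(xsum[0]["document"][:300])
print(xsum[0]["summary"])
```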
To evaluate the models, we employed a combination of widely accepted metrics:
BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between generated and reference summaries. While traditionally used in machine translation, it provides a measure of lexical similarity in summarization.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): The most common summarization metric, especially ROUGE-1, ROUGE-2, and ROUGE-L, which measure unigram, bigram, and longest common subsequence overlaps.
BERTScore: A more recent metric that leverages contextual embeddings from BERT (Bidirectional Encoder Representations from Transformers). It evaluates semantic similarity between the generated and reference summaries beyond surface-level word overlap.
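All three metrics can be computed with the Hugging Face evaluate library; the snippet below is a minimal sketch with toy inputs rather than our actual evaluation pipeline.

```python
# Minimal sketch: scoring generated summaries with ROUGE, sacreBLEU, and BERTScore.
import evaluate

# Toy example: one generated summary and its reference
predictions = ["The cabinet approved the new climate bill on Tuesday."]
references = ["On Tuesday the cabinet passed a new bill on climate policy."]

rouge = evaluate.load("rouge")
sacrebleu = evaluate.load("sacrebleu")
bertscore = evaluate.load("bertscore")

# ROUGE-1 / ROUGE-2 / ROUGE-L F1 scores
print(rouge.compute(predictions=predictions, references=references))

# sacreBLEU expects a list of reference lists per prediction
print(sacrebleu.compute(predictions=predictions, references=[[r] for r in references]))

# BERTScore precision / recall / F1 from contextual embeddings
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```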
The experiments were conducted by generating summaries using each model under varying hyperparameters, including:
Temperature: Controlled creativity and randomness (ranging from 0.3 to 0.9).
Max tokens: Defined output length constraints.
Top-k and top-p sampling: Tuned to balance diversity and coherence.
The outputs were compared across both datasets, and evaluations were conducted using the aforementioned metrics.
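For illustration, the sketch below shows the shape of such a sweep using Hugging Face generate(); the parameter grids and the summarize helper are illustrative assumptions, and the example loop presumes a model and tokenizer loaded as in the earlier MPT sketch.

```python
# Minimal sketch: sweeping sampling hyperparameters for summary generation.
import itertools

# Illustrative grids; not the exact values used in our runs
temperatures = [0.3, 0.6, 0.9]
top_ps = [0.8, 0.95]
top_ks = [40, 80]

def summarize(model, tokenizer, prompt, temperature, top_p, top_k, max_new_tokens=128):
    """Generate one summary under a given sampling configuration."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,        # creativity / randomness
        top_p=top_p,                    # nucleus sampling
        top_k=top_k,                    # top-k sampling
        max_new_tokens=max_new_tokens,  # output length constraint
    )
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Example sweep (assumes `model`, `tokenizer`, and `prompt` are set up as in the MPT sketch):
# for temperature, top_p, top_k in itertools.product(temperatures, top_ps, top_ks):
#     summary = summarize(model, tokenizer, prompt, temperature, top_p, top_k)
```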
On CNN/Daily Mail, text-davinci-003 consistently outperformed the open-source models, achieving higher ROUGE-L and BERTScore values. Its summaries were generally more fluent and closer to the human-written highlights.
MPT-7b-instruct produced competitive summaries but sometimes leaned toward extractive approaches.
Falcon-7b-instruct showed efficiency in generating concise summaries but occasionally omitted important contextual details.
On XSum, text-davinci-003 again led, particularly excelling at abstraction: its summaries demonstrated genuine paraphrasing and semantic compression.
MPT-7b-instruct tended to fall back on extractive patterns, producing summaries that lacked abstraction.
Falcon-7b-instruct performed better than MPT on conciseness but sometimes sacrificed factual accuracy.
BLEU and ROUGE scores were highest for text-davinci-003 across both datasets.
BERTScore showed that, in addition to producing lexically simpler summaries, MPT and Falcon also achieved lower semantic similarity with the reference texts.
The experimental results suggest that text-davinci-003 remains a stronger choice for abstractive summarization, especially when fluency and coherence are prioritized. However, several insights emerge from the comparative analysis:
Open-source vs. Proprietary Models:
While open-source models like MPT-7b-instruct and Falcon-7b-instruct offer transparency, customizability, and cost savings, they currently lag behind proprietary models like text-davinci-003 in abstractive summarization quality.
Extractive vs. Abstractive Tendencies:
MPT and Falcon occasionally favored extractive strategies, while text-davinci-003 leaned more toward abstractive summarization. This distinction is important for applications where paraphrasing and rewording are critical.
Efficiency vs. Quality Trade-offs:
Falcon-7b-instruct is highly efficient and can be deployed at scale more economically than text-davinci-003. For businesses seeking lightweight summarization without strict fidelity requirements, Falcon may be sufficient.
Organizations that process large volumes of reports, emails, or customer reviews can leverage summarization models to extract actionable insights quickly. Text-davinci-003’s fluency may be preferable for executive summaries, whereas Falcon-7b-instruct could be used for bulk processing at lower cost.
Summarization models enable the creation of news digests. The CNN/Daily Mail results suggest that text-davinci-003 produces summaries closer to the human-written highlights, which could enhance user trust in media platforms.
For literature reviews, summarization models can help scholars condense papers into digestible abstracts. MPT-7b-instruct’s transparent training pipeline may be advantageous for reproducibility and fine-tuning.
Although not tested here, models like text-davinci-003 can be adapted for multilingual summarization, an area where evaluation could extend in future research.
Dataset Bias:
Both CNN/Daily Mail and XSum are English-centric. The results may not generalize to other languages or domains. Future work should include multilingual datasets such as WikiLingua.
Human Evaluation:
While automatic metrics are useful, human judgment remains crucial. Future experiments could incorporate user studies to assess readability and informativeness.
Model Scaling:
Larger versions of Falcon and MPT (e.g., Falcon-40b) may close the gap with proprietary models. Evaluating them on summarization tasks would provide a more balanced comparison.
Factual Consistency:
Ensuring factual alignment is a known challenge for LLMs. Although text-davinci-003 performed best overall, occasional hallucinations were observed. Future systems should incorporate factuality checks.
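As one illustration of what such a check might look like, the sketch below scores a source/summary pair with an off-the-shelf natural language inference (NLI) model. This was not part of our experimental pipeline; the model choice and example sentences are purely illustrative.

```python
# Hedged sketch: flagging possible factual inconsistencies with an NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # illustrative choice of entailment model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

source = "The company reported a 12% rise in quarterly revenue."
summary = "The company's quarterly revenue fell by 12%."

# Score whether the source (premise) entails the summary (hypothesis)
inputs = tokenizer(source, summary, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Label order for roberta-large-mnli: contradiction, neutral, entailment
labels = ["contradiction", "neutral", "entailment"]
print({label: round(p.item(), 3) for label, p in zip(labels, probs)})
# A high contradiction probability flags a likely factual inconsistency.
```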
This comparative study highlights the evolving capabilities of LLMs in text summarization. While OpenAI’s text-davinci-003 outperformed MPT-7b-instruct and Falcon-7b-instruct across multiple benchmarks, each model has unique strengths that suit different contexts:
Text-davinci-003: Superior fluency and abstraction, best suited for high-stakes applications.
MPT-7b-instruct: Transparent and flexible, valuable for research and fine-tuning.
Falcon-7b-instruct: Efficient and cost-effective, suitable for scalable deployment.
By systematically comparing these models, this research provides actionable insights for NLP researchers and industry practitioners seeking to harness LLMs for summarization. The findings reinforce the notion that while proprietary models currently lead in performance, open-source alternatives are rapidly advancing, promising a more democratized future for AI-powered summarization.