AI Content Showdown: Which LLM Writes Like a Pro Journalist? (The Results Will Shock You)
We tested 10 leading AI models on the Ukraine conflict — Claude demolished the competition while two popular models completely flopped
In today's business landscape, the ability to generate high-quality content efficiently is increasingly valuable. To assess which AI models are truly business-ready, I've evaluated outputs from leading LLMs tasked with writing a journalistic article about the origins of the Ukraine war—a complex topic requiring nuance, structure, and balanced perspective.
This analysis focuses on practical business applications, specifically which models can produce content that would satisfy professional standards in corporate communications, journalism, or strategic analysis.
The Instruction Set: Crafting AI Journalism
The instruction set was created in Perplexity using OpenAI's o3-mini reasoning model, which analyzed the work of journalists such as Matt Taibbi and Glenn Greenwald to identify what makes their writing compelling. It includes seven key principles:
Structural precision: Three-act narrative structure
Ethical transparency: Proper source attribution
Lexical consistency: Strategic use of repeated key phrases
Rhetorical rhythm: Calculated sentence pacing patterns
Narrative dynamism: Adapting to emerging contexts
Implicative framing: Connecting details to larger themes
Style maintenance: Self-auditing protocols
Keep in mind that I asked each model to deliver 2,500 words. This is clearly a stretch goal given the models' differing output token limits, but I wanted to see how close each one could come to the ask.
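To make the setup concrete, here is a minimal sketch of how the seven principles and the word target could be folded into a single prompt that every model receives unchanged. The actual prompt text is not published in this post, so the wording of the prompt, the `PRINCIPLES` list layout, and the `build_prompt` helper are all illustrative assumptions; only the principle names, their short descriptions, and the 2,500-word target come from the article.

```python
# Hypothetical sketch: the real instruction set is not reproduced in this post.
# Principle names and descriptions are copied from the list above.
PRINCIPLES = [
    ("Structural precision", "Three-act narrative structure"),
    ("Ethical transparency", "Proper source attribution"),
    ("Lexical consistency", "Strategic use of repeated key phrases"),
    ("Rhetorical rhythm", "Calculated sentence pacing patterns"),
    ("Narrative dynamism", "Adapting to emerging contexts"),
    ("Implicative framing", "Connecting details to larger themes"),
    ("Style maintenance", "Self-auditing protocols"),
]


def build_prompt(topic: str, target_words: int = 2500) -> str:
    """Assemble one instruction string so every model gets identical input."""
    lines = [
        f"Write a journalistic article of about {target_words:,} words on: {topic}",
        "Follow these principles:",
    ]
    for number, (name, description) in enumerate(PRINCIPLES, start=1):
        lines.append(f"{number}. {name}: {description}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("the origins of the Ukraine war"))
```

Keeping the prompt in one function is a simple way to guarantee that each model is tested against the same text, which is the condition the comparison depends on.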
The Contenders
I analyzed outputs from Claude 3.7 Sonnet, GPT-4.5, Gemini 2.5 Pro, Gemini 2.0 Advanced, Google Docs-native Gemini, LLaMA 3, DeepSeek, Qwen 2.5-Max, Reka AI, and Grok 3. Each model received identical instructions in a clean project instance with no prior context, and was asked to produce a journalistic article exploring multiple perspectives on why the Ukraine war started.
Let's examine how each performed, with specific strengths and weaknesses:
Claude 3.7 Sonnet (1900 words)
Claude 3.7 produced a well-structured piece that reads like professional journalism from a respected international affairs publication. The article maintains proper attribution throughout and presents multiple viewpoints with appropriate context. Its particular strength is organization—using clear sections that build upon each other logically while maintaining journalistic integrity.
Strengths:
Exceptional journalistic structure with clear section headings
Meticulous source attribution that builds credibility (e.g., "[Source: Independent Investigation]")
A balanced presentation of multiple perspectives without bias
Professional tone consistent with high-quality publications
Logical progression that guides readers through complex information
Clean transitions between sections
Weaknesses:
Occasionally repetitive in reinforcing the "competing narratives" theme
Some source attributions feel formulaic
Less stylistic flair than some competitors
GPT-4.5 (1046 words)
GPT-4.5 delivered a concise yet surprisingly nuanced take with polished, flowing prose. The writing demonstrates a sophisticated understanding of the complexities, reading like a thoughtful op-ed from a quality publication. Despite being shorter than some competitors, it compensates with a density of insight and elegant expression, making each paragraph count.
Strengths:
Elegant, sophisticated prose with excellent rhythm and flow
Natural transitions that create a cohesive narrative
Highly human-like writing style with an authentic journalistic voice
Effective use of quotations and source attributions
Dense with insight despite the shorter length
Avoids redundancy and maintains reader engagement
Weaknesses:
Lacks explicit section headings for easier navigation
Less comprehensive than longer submissions
Some quotations lack specific attribution
DeepSeek R1 (1297 words)
DeepSeek produced perhaps the most distinctive article, adopting a bold, conversational tone with clever section headings and colorful metaphors. The piece stands out for its personality and willingness to employ phrases like "Putin treats history like a drunk uses lampposts—for support, not illumination." The writing has a voice that would feel at home in publications like The Atlantic—informed and analytical but with personality.
Strengths:
Distinctive, engaging voice with memorable metaphors
Creative section headings (e.g., "The Kremlin's Greatest Hits")
Effectively uses wit and colorful language without sacrificing credibility
Good use of numbered source citations
Takes creative risks that mostly pay off
A clear structure that avoids feeling formulaic
Weaknesses:
Occasionally crosses the line from wry to flippant
Some metaphors may be too casual for conservative business contexts
A few sections feel slightly rushed
Alibaba Qwen 2.5-Max (2686 words)
Qwen 2.5-Max delivered the longest and most thorough article, demonstrating impressive depth and breadth. The piece methodically explores various dimensions of the conflict with sophisticated analysis. The writing is professional and exhibits strong organization, though its comprehensiveness occasionally comes at the expense of concision and engagement.
Strengths:
The most comprehensive and in-depth analysis
Meticulously structured with thematic sections
Strong paragraph-level organization
Excellent coverage of multiple dimensions (historical, geopolitical, informational)
Sophisticated analysis of complex factors
Professional, authoritative tone throughout
Weaknesses:
Length leads to occasional repetition
Some sections could be tightened without losing substance
Less engaging than top performers
Occasionally academic rather than journalistic in tone
Google Gemini 2.5 Pro Experimental (2686 words)
This 2,686-word article demonstrates a strong journalistic structure and a sophisticated writing style. It provides a nuanced, chronological analysis of the Ukraine conflict with careful attribution of sources and a balanced presentation of multiple perspectives. The writing is notably more literary and academic in tone than some of the other samples, with rich metaphors and complex sentence structures. However, this occasionally results in moments of overwrought language.
Strengths:
Sophisticated, literary writing style with strong metaphors ("labyrinth of competing narratives")
Excellent structure with a clear three-act format following chronological progression
Thorough historical context and multiple perspectives presented
A balanced presentation of competing viewpoints
Effective use of source attributions throughout
Strong paragraph-level organization with a logical flow
Weaknesses:
Some unnecessarily complex sentence structures and vocabulary
Occasional overwrought language ("the persistent echoes of unresolved historical traumas")
A few tangential references that disrupt the narrative flow
Source attributions sometimes feel artificially inserted
Some repetition in the exploration of competing narratives
Grok 3 (1526 words)
Grok 3 produced a meta-aware article that explicitly references its own structure and methodology. While the content itself is reasonably well-organized and covers the necessary ground, the self-referential approach and "audit" section significantly diminish its professional quality. It reads like a solid first draft that needs editorial refinement to remove the meta-commentary.
Strengths:
Clear three-act structure with distinct sections
Good overall organization of information
Includes source attributions
Competent writing at the sentence level
Covers multiple perspectives on the conflict
Weaknesses:
Self-referential meta-commentary breaks immersion
Source attributions feel artificial rather than organic
"Post-Generation Audit" section entirely breaks the fourth wall
A somewhat mechanical approach to the structure
Gemini Advanced 2.0 (800 words)
Gemini Advanced produced a citation-heavy article that methodically presents multiple perspectives. The piece is more academic than journalistic in tone, with extensive numbered references that feel awkward for a news article. While professionally written, it lacks the narrative flow of top-tier journalistic writing, reading more like a well-researched but somewhat dry overview.
Strengths:
Structured, systematic presentation
Heavy use of citations (numbered 14-31)
Logical organization of perspectives
Professional language and vocabulary
Solid factual coverage of the topic
Weaknesses:
The citation style feels academic rather than journalistic
Lacks narrative flow and engaging transitions
Rigid, formulaic structure
Limited stylistic variation
Google Docs-native Gemini (1297 words)
The Google Docs Gemini sample presents a traditional opinion piece with an unusual header image of barbed wire fences. The article shows decent organization but exhibits stylistic quirks that diminish its professional impact. The embedded image is a unique feature compared to other outputs but doesn't substantially enhance the content.
Strengths:
Includes visual element (header image)
Clear section headings
Logical organization of content
Professional vocabulary
Solid coverage of main perspectives
Weaknesses:
Overreliance on rhetorical questions
Occasionally overwrought tone
Some sections feel underdeveloped
The image adds visual interest but limited value
Some stylistic quirks disrupt the professional tone
LLaMA 3 (803 words)
LLaMA 3 delivered a short but relatively poorly structured article covering basic perspectives on the conflict. The writing remains professional but lacks depth compared to top performers. The abrupt ending mid-sentence suggests technical limitations or token constraints. While it presents multiple viewpoints, the analysis remains surface-level without the nuance found in stronger entries.
Strengths:
Maintains a professional tone throughout
Clear presentation of basic perspectives
Includes some source attributions
Logical structure in the portions completed
Professional vocabulary
Weaknesses:
Abruptly ends mid-sentence (technical limitation)
Lacks analytical depth
Surface-level treatment of complex issues
Minimal detail compared to stronger entries
Would require substantial expansion for professional use
Reka AI (818 words)
Reka AI produced a conversational, informal take that adopts a first-person perspective with expressions like "let's dive into the maelstrom" and references to "our journalistic wits." While the writing has personality, it lacks the professional distance expected in serious journalism. The piece ends abruptly mid-word, suggesting technical limitations.
Strengths:
Conversational, accessible style
Some engaging turns of phrase
Clear structure in the sections completed
Attempts to engage the reader directly
Covers basic perspectives on the conflict
Weaknesses:
Too informal for serious professional journalism
First-person perspective inappropriate for objective reporting
Ends abruptly mid-word ("messy, multif")
Lacks professional distance and objectivity
Surface-level analysis without substantial insight
Contextual Commentary: Model Types and Business Applications
Understanding the different types of models and their appropriate business applications is crucial for making informed implementation decisions:
Proprietary Cloud-Based Models
Claude 3.7 Sonnet, GPT-4.5, and Gemini models represent state-of-the-art proprietary models accessed through API calls or cloud platforms. Their performance confirms their status as premium options for businesses with demanding content needs:
High-end Professional Use: These models are suitable for external-facing communications, thought leadership content, and situations where quality cannot be compromised.
Cost Consideration: Their superior performance comes with premium pricing, making them better suited for high-value content rather than bulk production.
Open-Source Models
DeepSeek, LLaMA 3, Qwen, and Reka AI represent various tiers of open-source or locally runnable models:
DeepSeek's Surprising Performance: Its strong showing (nearly on par with proprietary leaders) suggests open-source options are becoming viable alternatives for professional content.
Implementation Tradeoffs: While potentially more cost-effective, these models require technical expertise to deploy and may offer less consistent quality.
Size Matters: Larger open-source models (DeepSeek, Qwen) significantly outperformed smaller ones (LLaMA 3, Reka), highlighting the importance of model scale for quality content.
Consumer-Focused Integrations
The Google Docs Gemini represents AI integrated directly into productivity software:
Workflow Integration: Its primary advantage is convenience—being built directly into existing document workflows.
Quality Compromise: Performance lags behind specialized models, representing a tradeoff between convenience and maximum quality.
Business Implications
The performance disparities we observed translate to specific business considerations:
Content Tiers and Appropriate Use Cases
Premium Tier (8.2+ Overall):
Claude 3.7, GPT-4.5, DeepSeek, Qwen 2.5-Max, Gemini 2.5 Pro
Best for: External communications, thought leadership, high-stakes documents
Business value: Content that could be published with minimal editing
Professional Tier (7.0-8.1 Overall):
Grok 3, Gemini Advanced, Google Docs Gemini
Best for: Internal communications, first drafts of important documents
Business value: Solid foundation requiring moderate human refinement
Basic Tier (Below 7.0):
LLaMA 3, Reka AI
Best for: Simple content, personal use, casual communications
Business value: Requires significant editing for professional contexts
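The tier cutoffs above amount to a simple threshold lookup. The sketch below is illustrative only: the `tier_for` and `group_by_tier` names are assumptions, while the thresholds come from the tier definitions and the scores from the final ranking in this article.

```python
# Overall scores (out of 10) as reported in this article's final ranking.
OVERALL_SCORES = {
    "Claude 3.7 Sonnet": 9.2,
    "GPT-4.5": 8.7,
    "DeepSeek R1": 8.5,
    "Alibaba Qwen 2.5-Max": 8.4,
    "Gemini 2.5 Pro": 8.2,
    "Grok 3": 7.5,
    "Gemini Advanced 2.0": 7.3,
    "Google Docs Gemini": 7.1,
    "LLaMA 3": 6.4,
    "Reka AI": 6.0,
}


def tier_for(score: float) -> str:
    """Map an overall score to one of the article's three content tiers."""
    if score >= 8.2:
        return "Premium"
    if score >= 7.0:
        return "Professional"
    return "Basic"


def group_by_tier(scores: dict) -> dict:
    """Group model names by tier, preserving the order of the input."""
    tiers = {"Premium": [], "Professional": [], "Basic": []}
    for model, score in scores.items():
        tiers[tier_for(score)].append(model)
    return tiers


if __name__ == "__main__":
    for tier, models in group_by_tier(OVERALL_SCORES).items():
        print(f"{tier}: {', '.join(models)}")
```

A team running its own evaluation could swap in its own scores and cutoffs; the numbers here only reproduce the grouping described in this post.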
Implementation Strategies
Quality-Tiered Approach: Deploy premium models for high-visibility content and mid-tier models for internal or draft content.
Hybrid Workflows: Use AI for initial drafting and structural organization, and have human editors focus on refinement rather than creation.
Context-Specific Selection: Choose models based on specific content needs—DeepSeek for engaging content, Qwen or Gemini 2.5 Pro for detailed analysis, etc.
Final Verdict: Business-Ready Content Generation
Based on our comprehensive evaluation, here's our ranking for business content generation:
Claude 3.7 Sonnet (9.2/10) — Exceptional professionalism and structure with proper attribution. The gold standard for business-critical content.
GPT-4.5 (8.7/10) — Outstanding human-like writing with elegant prose. Ideal when engagement and style matter alongside professionalism.
DeepSeek R1 (8.5/10) — Surprisingly strong with a distinctive voice. Perfect for content that needs to stand out while maintaining credibility.
Alibaba Qwen 2.5-Max (8.4/10) — Most comprehensive and detailed. Best for in-depth analysis of complex topics.
Google Gemini 2.5 Pro Experimental (8.2/10) — Excellent structure and detailed analysis with sophisticated writing. Strong for formal business content and reports.
Grok 3 (7.5/10) — Competent but undermined by its self-referential meta-commentary. Better suited to internal drafts that will be revised.
Gemini Advanced 2.0 (7.3/10) — Academic rather than journalistic in style. Better for technical content than general business communications.
Google Docs Gemini (7.1/10) — Convenient integration with adequate quality. Appropriate for routine documents with human oversight.
LLaMA 3 (6.4/10) — Basic professional tone but incomplete. Requires significant editing for business use.
Reka AI (6.0/10) — Too informal for business contexts. Not recommended without substantial reworking.
LLM Content Performance Comparison
Model | Overall (rank) | Professional | Engaging | Human-Like | Structure | Detail |
---|---|---|---|---|---|---|
Claude 3.7 Sonnet | 9.2 (1) | 5 | 4 | 5 | 5 | 5 |
GPT-4.5 | 8.7 (2) | 5 | 4 | 5 | 4 | 4 |
DeepSeek | 8.5 (3) | 4 | 5 | 4 | 4 | 4 |
Alibaba Qwen 2.5-Max | 8.4 (4) | 4 | 4 | 4 | 4 | 5 |
Gemini 2.5 Pro | 8.2 (5) | 4 | 4 | 4 | 5 | 5 |
Grok 3 | 7.5 (6) | 4 | 4 | 4 | 4 | 4 |
Gemini Advanced | 7.3 (7) | 4 | 3 | 4 | 4 | 4 |
Google Docs Gemini | 7.1 (8) | 4 | 3 | 4 | 4 | 4 |
LLaMA 3 | 6.4 (9) | 4 | 3 | 3 | 3 | 3 |
Reka AI | 6.0 (10) | 3 | 4 | 3 | 3 | 3 |
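For readers who want to pick a model by a single dimension rather than by overall score, the table can be queried directly. This is a rough sketch: the `Row` type and the `leaders` helper are hypothetical names introduced here, and the rows are transcribed from the table above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    overall: float
    professional: int
    engaging: int
    human_like: int
    structure: int
    detail: int


# Rows transcribed from the comparison table above.
TABLE = [
    Row("Claude 3.7 Sonnet", 9.2, 5, 4, 5, 5, 5),
    Row("GPT-4.5", 8.7, 5, 4, 5, 4, 4),
    Row("DeepSeek", 8.5, 4, 5, 4, 4, 4),
    Row("Alibaba Qwen 2.5-Max", 8.4, 4, 4, 4, 4, 5),
    Row("Gemini 2.5 Pro", 8.2, 4, 4, 4, 5, 5),
    Row("Grok 3", 7.5, 4, 4, 4, 4, 4),
    Row("Gemini Advanced", 7.3, 4, 3, 4, 4, 4),
    Row("Google Docs Gemini", 7.1, 4, 3, 4, 4, 4),
    Row("LLaMA 3", 6.4, 4, 3, 3, 3, 3),
    Row("Reka AI", 6.0, 3, 4, 3, 3, 3),
]


def leaders(dimension: str) -> list[str]:
    """Return every model tied for the top score in one dimension."""
    best = max(getattr(row, dimension) for row in TABLE)
    return [row.model for row in TABLE if getattr(row, dimension) == best]


if __name__ == "__main__":
    for dimension in ("professional", "engaging", "human_like", "structure", "detail"):
        print(f"{dimension}: {', '.join(leaders(dimension))}")
```

Returning all tied models, rather than just the first, matters here because several dimensions have more than one model at the top score.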
Key Takeaways and Future Outlook
Our evaluation revealed several surprising insights with implications for the future of AI in business content:
Open-source models are more competitive than expected: DeepSeek's third-place finish and Qwen's fourth-place showing demonstrate that open-source models are rapidly closing the gap with proprietary leaders. This could dramatically change price-performance calculations for businesses within the next year.
Google's latest model shows significant improvement: Gemini 2.5 Pro's fifth-place ranking shows substantial advancement over earlier Gemini versions, particularly in structure and detail, making it newly competitive for formal business content.
Quality doesn't correlate with length: While Qwen and Gemini 2.5 Pro delivered the longest articles at 2,686 words each, GPT-4.5 achieved exceptional quality in just 1,046 words, highlighting that efficiency in communication remains valuable.
Different models have distinct specializations: DeepSeek excels at engaging content, Claude at balanced professionalism, and Qwen and Gemini 2.5 Pro at detailed analysis. Businesses should select models based on specific content needs.
Integration vs. specialization tradeoffs matter: Google Docs Gemini offers workflow advantages but quality compromises. Businesses must determine whether convenience or maximum quality is their priority.
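Since every model was asked for 2,500 words, it can help to see how close each one came. The sketch below uses the word counts reported in each model's section heading; the helper names `percent_of_target` and `met_target` are assumptions made for illustration.

```python
TARGET_WORDS = 2500  # the length requested from every model

# Word counts as reported in each model's section heading.
WORD_COUNTS = {
    "Claude 3.7 Sonnet": 1900,
    "GPT-4.5": 1046,
    "DeepSeek R1": 1297,
    "Alibaba Qwen 2.5-Max": 2686,
    "Gemini 2.5 Pro Experimental": 2686,
    "Grok 3": 1526,
    "Gemini Advanced 2.0": 800,
    "Google Docs Gemini": 1297,
    "LLaMA 3": 803,
    "Reka AI": 818,
}


def percent_of_target(words: int, target: int = TARGET_WORDS) -> float:
    """Express a word count as a percentage of the requested length."""
    return 100.0 * words / target


def met_target(counts: dict, target: int = TARGET_WORDS) -> list:
    """List the models whose output reached the requested length."""
    return [model for model, words in counts.items() if words >= target]


if __name__ == "__main__":
    # Longest output first, with its share of the 2,500-word request.
    for model, words in sorted(WORD_COUNTS.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {words} words ({percent_of_target(words):.0f}% of target)")
    print("Reached target:", ", ".join(met_target(WORD_COUNTS)))
```

Only the two longest entries reach the requested length, which is worth keeping in mind when reading the length takeaway above.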
As AI content generation continues to evolve, several open questions will shape business adoption:
How quickly will open-source models reach full parity? The current trajectory suggests the gap is closing faster than expected.
Will specialized models emerge for specific content types? Purpose-built models for journalism, marketing, or technical content could further raise quality bars.
How will human-AI collaboration roles evolve? The most effective implementation will likely involve AI handling first drafts with humans focusing on refinement and strategic direction.
For business leaders, the key conclusion is clear: AI content generation has reached a level of sophistication where top models can produce genuinely professional-quality writing. The question isn't whether to use AI for business content but which models best fit specific organizational needs and how to integrate them most effectively into existing workflows.
Claude is still my go-to LLM for content writing. It finally has some competition from the other major players, though. Keep an eye on DeepSeek. Its style of writing was unusually fresh. It's not as polished, so I wouldn't immediately use it for anything business-critical, but I wonder what it would deliver with instructions tailored specifically for it. Maybe that is the next test…
Let me know what you think. What are you using for your content?