AI Content Showdown: Which LLM Writes Like a Pro Journalist? (The Results Will Shock You)

We tested 10 leading AI models on the Ukraine conflict — Claude demolished the competition while two popular models completely flopped

In today's business landscape, the ability to generate high-quality content efficiently is increasingly valuable. To assess which AI models are truly business-ready, I've evaluated outputs from leading LLMs tasked with writing a journalistic article about the origins of the Ukraine war—a complex topic requiring nuance, structure, and balanced perspective.

This analysis focuses on practical business applications, specifically which models can produce content that would satisfy professional standards in corporate communications, journalism, or strategic analysis.

The Instruction Set: Crafting AI Journalism

The instruction set was generated in Perplexity using the o3-mini reasoning model, based on an analysis of top journalists like Matt Taibbi and Glenn Greenwald to identify what makes their writing compelling. It includes seven key principles:

  • Structural precision: Three-act narrative structure

  • Ethical transparency: Proper source attribution

  • Lexical consistency: Strategic use of repeated key phrases

  • Rhetorical rhythm: Calculated sentence pacing patterns

  • Narrative dynamism: Adapting to emerging contexts

  • Implicative framing: Connecting details to larger themes

  • Style maintenance: Self-auditing protocols

Keep in mind that I asked each model to deliver 2,500 words. This is clearly a stretch goal because output token limits vary across the models, but I wanted to see how close each one could come to the ask.

The Contenders

I analyzed outputs from Claude 3.7 Sonnet, GPT-4.5, Gemini 2.5 Pro, Gemini 2.0 Advanced, Google Docs-native Gemini, LLaMA 3, DeepSeek R1, Qwen 2.5-Max, Reka AI, and Grok 3. Each model received identical instructions in a clean project instance with no prior context or custom instructions, and was asked to produce a journalistic article exploring multiple perspectives on why the Ukraine war started.

Let's examine how each performed, with specific strengths and weaknesses:

Claude 3.7 Sonnet (1900 words)

Claude 3.7 produced a well-structured piece that reads like professional journalism from a respected international affairs publication. The article maintains proper attribution throughout and presents multiple viewpoints with appropriate context. Its particular strength is organization—using clear sections that build upon each other logically while maintaining journalistic integrity.

Strengths:

  • Exceptional journalistic structure with clear section headings

  • Meticulous source attribution that builds credibility (e.g., "[Source: Independent Investigation]")

  • A balanced presentation of multiple perspectives without bias

  • Professional tone consistent with high-quality publications

  • Logical progression that guides readers through complex information

  • Clean transitions between sections

Weaknesses:

  • Occasionally repetitive in reinforcing the "competing narratives" theme

  • Some source attributions feel formulaic

  • Less stylistic flair than some competitors

ChatGPT-4.5 (1046 words)

GPT-4.5 delivered a concise yet surprisingly nuanced take with polished, flowing prose. The writing demonstrates a sophisticated understanding of the complexities, reading like a thoughtful op-ed from a quality publication. Despite being shorter than some competitors, it compensates with a density of insight and elegant expression, making each paragraph count.

Strengths:

  • Elegant, sophisticated prose with excellent rhythm and flow

  • Natural transitions that create a cohesive narrative

  • Highly human-like writing style with an authentic journalistic voice

  • Effective use of quotations and source attributions

  • Dense with insight despite the shorter length

  • Avoids redundancy and maintains reader engagement

Weaknesses:

  • Lacks explicit section headings for easier navigation

  • Less comprehensive than longer submissions

  • Some quotations lack specific attribution

DeepSeek R1 (1297 words)

DeepSeek produced perhaps the most distinctive article, adopting a bold, conversational tone with clever section headings and colorful metaphors. The piece stands out for its personality and willingness to employ phrases like "Putin treats history like a drunk uses lampposts—for support, not illumination." The writing has a voice that would feel at home in publications like The Atlantic—informed and analytical but with personality.

Strengths:

  • Distinctive, engaging voice with memorable metaphors

  • Creative section headings (e.g., "The Kremlin's Greatest Hits")

  • Effectively uses wit and colorful language without sacrificing credibility

  • Good use of numbered source citations

  • Takes creative risks that mostly pay off

  • A clear structure that avoids feeling formulaic

Weaknesses:

  • Occasionally crosses the line from wry to flippant

  • Some metaphors may be too casual for conservative business contexts

  • A few sections feel slightly rushed

Alibaba Qwen 2.5-Max (2686 words)

Qwen 2.5-Max delivered the longest and most thorough article, demonstrating impressive depth and breadth. The piece methodically explores various dimensions of the conflict with sophisticated analysis. The writing is professional and exhibits strong organization, though its comprehensiveness occasionally comes at the expense of concision and engagement.

Strengths:

  • The most comprehensive and in-depth analysis

  • Meticulously structured with thematic sections

  • Strong paragraph-level organization

  • Excellent coverage of multiple dimensions (historical, geopolitical, informational)

  • Sophisticated analysis of complex factors

  • Professional, authoritative tone throughout

Weaknesses:

  • Length leads to occasional repetition

  • Some sections could be tightened without losing substance

  • Less engaging than top performers

  • Occasionally academic rather than journalistic in tone

Google Gemini 2.5 Pro Experimental (2,686 words)

This 2,686-word article demonstrates a strong journalistic structure and a sophisticated writing style. It provides a nuanced, chronological analysis of the Ukraine conflict with careful attribution of sources and a balanced presentation of multiple perspectives. The writing is notably more literary and academic in tone than some of the other samples, with rich metaphors and complex sentence structures. However, this occasionally results in moments of overwrought language.

Strengths:

  • Sophisticated, literary writing style with strong metaphors ("labyrinth of competing narratives")

  • Excellent structure with a clear three-act format following chronological progression

  • Thorough historical context and multiple perspectives presented

  • A balanced presentation of competing viewpoints

  • Effective use of source attributions throughout

  • Strong paragraph-level organization with a logical flow

Weaknesses:

  • Some unnecessarily complex sentence structures and vocabulary

  • Occasional overwrought language ("the persistent echoes of unresolved historical traumas")

  • A few tangential references that disrupt the narrative flow

  • Source attributions sometimes feel artificially inserted

  • Some repetition in the exploration of competing narratives

Grok 3 (1526 words)

Grok 3 produced a meta-aware article that explicitly references its own structure and methodology. While the content itself is reasonably well-organized and covers the necessary ground, the self-referential approach and "audit" section significantly diminish its professional quality. It reads like a solid first draft that needs editorial refinement to remove the meta-commentary.

Strengths:

  • Clear three-act structure with distinct sections

  • Good overall organization of information

  • Includes source attributions

  • Competent writing at the sentence level

  • Covers multiple perspectives on the conflict

Weaknesses:

  • Self-referential meta-commentary breaks immersion

  • Source attributions feel artificial rather than organic

  • "Post-Generation Audit" section entirely breaks the fourth wall

  • A somewhat mechanical approach to the structure

Gemini Advanced 2.0 (800 words)

Gemini Advanced produced a citation-heavy article that methodically presents multiple perspectives. The piece is more academic than journalistic in tone, with extensive numbered references that feel awkward for a news article. While professionally written, it lacks the narrative flow of top-tier journalistic writing, reading more like a well-researched but somewhat dry overview.

Strengths:

  • Structured, systematic presentation

  • Heavy use of citations (numbered 14-31)

  • Logical organization of perspectives

  • Professional language and vocabulary

  • Solid factual coverage of the topic

Weaknesses:

  • The citation style feels academic rather than journalistic

  • Lacks narrative flow and engaging transitions

  • Rigid, formulaic structure

  • Limited stylistic variation

Google Docs-native Gemini (1297 words)

The Google Docs Gemini sample presents a traditional opinion piece with an unusual header image of barbed wire fences. The article shows decent organization but exhibits stylistic quirks that diminish its professional impact. The embedded image is a unique feature compared to other outputs but doesn't substantially enhance the content.

Strengths:

  • Includes visual element (header image)

  • Clear section headings

  • Logical organization of content

  • Professional vocabulary

  • Solid coverage of main perspectives

Weaknesses:

  • Overreliance on rhetorical questions

  • Occasionally overwrought tone

  • Some sections feel underdeveloped

  • The image adds visual interest but limited value

  • Some stylistic quirks disrupt the professional tone

LLaMA 3 (803 words)

LLaMA 3 delivered a short but relatively poorly structured article covering basic perspectives on the conflict. The writing remains professional but lacks depth compared to top performers. The abrupt ending mid-sentence suggests technical limitations or token constraints. While it presents multiple viewpoints, the analysis remains surface-level without the nuance found in stronger entries.

Strengths:

  • Maintains a professional tone throughout

  • Clear presentation of basic perspectives

  • Includes some source attributions

  • Logical structure in the portions completed

  • Professional vocabulary

Weaknesses:

  • Abruptly ends mid-sentence (technical limitation)

  • Lacks analytical depth

  • Surface-level treatment of complex issues

  • Minimal detail compared to stronger entries

  • Would require substantial expansion for professional use

Reka AI (818 words)

Reka AI produced a conversational, informal take that adopts a first-person perspective with expressions like "let's dive into the maelstrom" and references to "our journalistic wits." While the writing has personality, it lacks the professional distance expected in serious journalism. The piece ends abruptly mid-word, suggesting technical limitations.

Strengths:

  • Conversational, accessible style

  • Some engaging turns of phrase

  • Clear structure in the sections completed

  • Attempts to engage the reader directly

  • Covers basic perspectives on the conflict

Weaknesses:

  • Too informal for serious professional journalism

  • First-person perspective inappropriate for objective reporting

  • Ends abruptly mid-word ("messy, multif")

  • Lacks professional distance and objectivity

  • Surface-level analysis without substantial insight

Contextual Commentary: Model Types and Business Applications

Understanding the different types of models and their appropriate business applications is crucial for making informed implementation decisions:

Proprietary Cloud-Based Models

Claude 3.7 Sonnet, GPT-4.5, and Gemini models represent state-of-the-art proprietary models accessed through API calls or cloud platforms. Their performance confirms their status as premium options for businesses with demanding content needs:

  • High-end Professional Use: These models are suitable for external-facing communications, thought leadership content, and situations where quality cannot be compromised.

  • Cost Consideration: Their superior performance comes with premium pricing, making them better suited for high-value content rather than bulk production.

Open-Source Models

DeepSeek, LLaMA 3, Qwen, and Reka AI represent various tiers of open-source or locally runnable models:

  • DeepSeek's Surprising Performance: Its strong showing (nearly on par with proprietary leaders) suggests open-source options are becoming viable alternatives for professional content.

  • Implementation Tradeoffs: While potentially more cost-effective, these models require technical expertise to deploy and may offer less consistent quality.

  • Size Matters: Larger open-source models (DeepSeek, Qwen) significantly outperformed smaller ones (LLaMA 3, Reka), highlighting the importance of model scale for quality content.

Consumer-Focused Integrations

The Google Docs Gemini represents AI integrated directly into productivity software:

  • Workflow Integration: Its primary advantage is convenience—being built directly into existing document workflows.

  • Quality Compromise: Performance lags behind specialized models, representing a tradeoff between convenience and maximum quality.

Business Implications

The performance disparities we observed translate to specific business considerations:

Content Tiers and Appropriate Use Cases

Premium Tier (8.2+ Overall):

  • Claude 3.7, GPT-4.5, DeepSeek, Qwen 2.5-Max, Gemini 2.5 Pro

  • Best for: External communications, thought leadership, high-stakes documents

  • Business value: Content that could be published with minimal editing

Professional Tier (7.0-8.1 Overall):

  • Grok 3, Gemini Advanced, Google Docs Gemini

  • Best for: Internal communications, first drafts of important documents

  • Business value: Solid foundation requiring moderate human refinement

Basic Tier (Below 7.0):

  • LLaMA 3, Reka AI

  • Best for: Simple content, personal use, casual communications

  • Business value: Requires significant editing for professional contexts
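
The tier cutoffs above can be expressed as a small lookup. The following is a minimal Python sketch, assuming the overall scores from the final ranking in this post. The names `content_tier` and `models_by_tier` are hypothetical, chosen only for illustration, and the handling of scores between 8.1 and 8.2 is an assumption, since the tiers as written leave that gap undefined.

```python
# Overall scores (out of 10) as reported in this post's final ranking.
OVERALL_SCORES = {
    "Claude 3.7 Sonnet": 9.2,
    "GPT-4.5": 8.7,
    "DeepSeek R1": 8.5,
    "Alibaba Qwen 2.5-Max": 8.4,
    "Gemini 2.5 Pro": 8.2,
    "Grok 3": 7.5,
    "Gemini Advanced 2.0": 7.3,
    "Google Docs Gemini": 7.1,
    "LLaMA 3": 6.4,
    "Reka AI": 6.0,
}


def content_tier(score: float) -> str:
    """Map an overall score to one of the three tier names used in this post."""
    if score >= 8.2:
        return "Premium"
    # Assumption: anything below 8.2 but at least 7.0 counts as Professional.
    if score >= 7.0:
        return "Professional"
    return "Basic"


def models_by_tier(scores: dict) -> dict:
    """Group model names by tier, preserving the ranking order of the input."""
    tiers = {"Premium": [], "Professional": [], "Basic": []}
    for model, score in scores.items():
        tiers[content_tier(score)].append(model)
    return tiers


if __name__ == "__main__":
    for tier, models in models_by_tier(OVERALL_SCORES).items():
        print(f"{tier}: {', '.join(models)}")
```

Keeping the cutoffs in one function makes it easy to re-tier the models if the scores are updated after a later round of testing.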

Implementation Strategies

  1. Quality-Tiered Approach: Deploy premium models for high-visibility content and mid-tier models for internal or draft content.

  2. Hybrid Workflows: Use AI for initial drafting and structural organization, and have human editors focus on refinement rather than creation.

  3. Context-Specific Selection: Choose models based on specific content needs—DeepSeek for engaging content, Qwen or Gemini 2.5 Pro for detailed analysis, etc.
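
Context-specific selection can be sketched as a lookup over the per-dimension scores. This is a minimal Python sketch, assuming the sub-scores from the comparison table later in this post. The helper name `best_for` and the tie-breaking rule (higher overall score first) are illustrative assumptions, not part of the evaluation itself.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    overall: float
    professional: int
    engaging: int
    human_like: int
    structure: int
    detail: int


# Values copied from the comparison table in this post.
SCORES = {
    "Claude 3.7 Sonnet": Scores(9.2, 5, 4, 5, 5, 5),
    "GPT-4.5": Scores(8.7, 5, 4, 5, 4, 4),
    "DeepSeek": Scores(8.5, 4, 5, 4, 4, 4),
    "Alibaba Qwen 2.5-Max": Scores(8.4, 4, 4, 4, 4, 5),
    "Gemini 2.5 Pro": Scores(8.2, 4, 4, 4, 5, 5),
    "Grok 3": Scores(7.5, 4, 4, 4, 4, 4),
    "Gemini Advanced": Scores(7.3, 4, 3, 4, 4, 4),
    "Google Docs Gemini": Scores(7.1, 4, 3, 4, 4, 4),
    "LLaMA 3": Scores(6.4, 4, 3, 3, 3, 3),
    "Reka AI": Scores(6.0, 3, 4, 3, 3, 3),
}


def best_for(dimension: str, top_n: int = 3) -> list[str]:
    """Return the top_n models for a dimension, breaking ties by overall score."""
    ranked = sorted(
        SCORES.items(),
        key=lambda item: (getattr(item[1], dimension), item[1].overall),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_n]]


if __name__ == "__main__":
    print("Engaging:", best_for("engaging", top_n=1))
    print("Detail:", best_for("detail"))
```

With these numbers, DeepSeek is the only model with a top score for engagement, while Claude, Qwen, and Gemini 2.5 Pro share the top score for detail.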

Final Verdict: Business-Ready Content Generation

Based on our comprehensive evaluation, here's our ranking for business content generation:

  1. Claude 3.7 Sonnet (9.2/10) — Exceptional professionalism and structure with proper attribution. The gold standard for business-critical content.

  2. GPT-4.5 (8.7/10) — Outstanding human-like writing with elegant prose. Ideal when engagement and style matter alongside professionalism.

  3. DeepSeek R1 (8.5/10) — Surprisingly strong with a distinctive voice. Perfect for content that needs to stand out while maintaining credibility.

  4. Alibaba Qwen 2.5-Max (8.4/10) — Most comprehensive and detailed. Best for in-depth analysis of complex topics.

  5. Google Gemini 2.5 Pro Experimental (8.2/10) — Excellent structure and detailed analysis with sophisticated writing. Strong for formal business content and reports.

  6. Grok 3 (7.5/10) — Competent but undermined by its self-referential meta-commentary. Better suited to internal drafts requiring revision.

  7. Gemini Advanced 2.0 (7.3/10) — Academic rather than journalistic in style. Better for technical content than general business communications.

  8. Google Docs Gemini (7.1/10) — Convenient integration with adequate quality. Appropriate for routine documents with human oversight.

  9. LLaMA 3 (6.4/10) — Basic professional tone but incomplete. Requires significant editing for business use.

  10. Reka AI (6.0/10) — Too informal for business contexts. Not recommended without substantial reworking.

LLM Content Performance Comparison

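| Model | Overall (rank) | Professional | Engaging | Human-Like | Structure | Detail |
|---|---|---|---|---|---|---|
| Claude 3.7 Sonnet | 9.2 (1) | 5 | 4 | 5 | 5 | 5 |
| GPT-4.5 | 8.7 (2) | 5 | 4 | 5 | 4 | 4 |
| DeepSeek | 8.5 (3) | 4 | 5 | 4 | 4 | 4 |
| Alibaba Qwen 2.5-Max | 8.4 (4) | 4 | 4 | 4 | 4 | 5 |
| Gemini 2.5 Pro | 8.2 (5) | 4 | 4 | 4 | 5 | 5 |
| Grok 3 | 7.5 (6) | 4 | 4 | 4 | 4 | 4 |
| Gemini Advanced | 7.3 (7) | 4 | 3 | 4 | 4 | 4 |
| Google Docs Gemini | 7.1 (8) | 4 | 3 | 4 | 4 | 4 |
| LLaMA 3 | 6.4 (9) | 4 | 3 | 3 | 3 | 3 |
| Reka AI | 6.0 (10) | 3 | 4 | 3 | 3 | 3 |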

Key Takeaways and Future Outlook

Our evaluation revealed several surprising insights with implications for the future of AI in business content:

  1. Open-source models are more competitive than expected: DeepSeek's third-place finish and Qwen's fourth-place showing demonstrate that open-source models are rapidly closing the gap with proprietary leaders. This could dramatically change price-performance calculations for businesses within the next year.

  2. Google's latest model shows significant improvement: Gemini 2.5 Pro's fifth-place ranking shows substantial advancement over earlier Gemini versions, particularly in structure and detail, making it newly competitive for formal business content.

  3. Quality doesn't correlate with length: While Qwen and Gemini 2.5 Pro delivered the longest articles at 2,686 words each, GPT-4.5 achieved exceptional quality in just 1,046 words, highlighting that efficiency in communication remains valuable.

  4. Different models have distinct specializations: DeepSeek excels at engaging content, Claude at balanced professionalism, and Qwen and Gemini 2.5 Pro at detailed analysis. Businesses should select models based on specific content needs.

  5. Integration vs. specialization tradeoffs matter: Google Docs Gemini offers workflow advantages but quality compromises. Businesses must determine whether convenience or maximum quality is their priority.
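
To make the length comparison in takeaway 3 concrete, the word counts reported in each model's section heading can be set against the 2,500-word target. This is a minimal Python sketch under that assumption; the helper name `percent_of_target` is hypothetical and used only for illustration.

```python
TARGET_WORDS = 2500  # the length requested from every model

# Word counts as reported in each model's section heading in this post.
WORD_COUNTS = {
    "Claude 3.7 Sonnet": 1900,
    "GPT-4.5": 1046,
    "DeepSeek R1": 1297,
    "Alibaba Qwen 2.5-Max": 2686,
    "Gemini 2.5 Pro": 2686,
    "Grok 3": 1526,
    "Gemini Advanced 2.0": 800,
    "Google Docs Gemini": 1297,
    "LLaMA 3": 803,
    "Reka AI": 818,
}


def percent_of_target(words: int, target: int = TARGET_WORDS) -> float:
    """Return the delivered length as a percentage of the requested length."""
    return round(100 * words / target, 1)


if __name__ == "__main__":
    # List models from longest to shortest output.
    for model, words in sorted(WORD_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{model}: {words} words ({percent_of_target(words)}% of target)")
```

Only Qwen and Gemini 2.5 Pro reached the requested length, while GPT-4.5 placed second overall with well under half of it.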

As AI content generation continues to evolve, several open questions will shape business adoption:

  1. How quickly will open-source models reach full parity? The current trajectory suggests the gap is closing faster than expected.

  2. Will specialized models emerge for specific content types? Purpose-built models for journalism, marketing, or technical content could further raise quality bars.

  3. How will human-AI collaboration roles evolve? The most effective implementation will likely involve AI handling first drafts with humans focusing on refinement and strategic direction.

For business leaders, the key conclusion is clear: AI content generation has reached a level of sophistication where top models can produce genuinely professional-quality writing. The question isn't whether to use AI for business content but which models best fit specific organizational needs and how to integrate them most effectively into existing workflows.

Claude is still my go-to LLM for content writing. It finally has some competition from the other major players, though. Keep an eye on DeepSeek. Its style of writing was unusually fresh. It’s not as polished, so I wouldn’t immediately use it for anything business-critical, but I wonder what it would deliver with instructions tailored specifically for it. Maybe that is the next test…

Let me know what you think. What are you using for your content?