E-Commerce Trust Simulation with LLM-Powered Agents
Agent-based simulation using the MESA framework with 7,580 LLM-powered autonomous agents to quantify the impact of fake review manipulation on e-commerce conversion rates, demonstrating a +54-72pp conversion increase for targeted low-quality products with rigorous statistical validation (Chi-Square = 121-177, p < 0.0001).
Role: Simulation Engineer & Data Scientist
Client: Academic Project - Simulation and Modeling Course
Team: 3-person team
Timeline: 3 months • 2025

Challenges
Simulating realistic consumer behavior at scale while maintaining statistical rigor was complex. Generating 1,730+ realistic natural-language reviews without cloud API costs, coordinating adaptive fake-review campaigns (burst + maintenance strategy), and preventing LLM context overflow required sophisticated prompt engineering and dynamic context-window management.
Solutions
Implemented MESA-based multi-agent system with 3 behavioral personas (Impulsive, Careful, Skeptical) using prompt-engineered decision logic. Deployed local Llama 3.1 8B via Ollama with dynamic context windows (2048-8192 tokens) and automatic retry mechanisms. Applied rigorous statistical validation using Chi-Square tests, ANOVA (F = 540.28), and Cramér's V to demonstrate manipulation impact.
Impact
Successfully quantified fake review manipulation effectiveness with publication-grade statistical evidence. Demonstrated that coordinated campaigns increase low-quality product conversion by +54-72pp (BudgetBeats 0% to 54%, ClearSound 0% to 72%), with Careful/Impulsive personas 2.3x more vulnerable than Skeptical persona. Findings provide empirical foundation for e-commerce fraud detection research.
Project Overview
This agent-based modeling (ABM) research project simulates an e-commerce marketplace with 7,580 autonomous agents to quantify how coordinated fake review campaigns manipulate consumer trust and purchasing behavior. Using the MESA framework and local Llama 3.1 8B inference via Ollama, the simulation generates realistic review behavior and purchasing decisions to provide statistical evidence of market manipulation patterns.
Developed as the final project for the Simulation and Modeling course at Hasanuddin University, this work combines agent-based modeling, large language models, and rigorous statistical analysis to answer critical research questions about e-commerce fraud.
Research Questions
RQ1: Fake Review Impact on Conversion Rates
How much does conversion rate increase for low-quality products targeted by fake review campaigns?
RQ2: Consumer Persona Vulnerability
Which consumer persona is most vulnerable to fake reviews? (Impulsive vs Careful vs Skeptical)
Key Findings
| Research Question | Key Result |
|---|---|
| RQ1: BudgetBeats (Low Quality) | 0% to 54% conversion (+54pp, Chi-Square = 121.30, p < 0.0001) |
| RQ1: ClearSound (Low-Medium) | 0% to 72% conversion (+72pp, Chi-Square = 177.35, p < 0.0001) |
| RQ2: Most Vulnerable | Careful: +95pp, Impulsive: +92.5pp |
| RQ2: Least Vulnerable | Skeptical: +40.44pp (2.3x less vulnerable) |
Simulation Architecture
MESA Framework Design
Multi-agent system built on MESA, a Python framework for agent-based modeling:
FakeReviewModel (MESA Model)
├── Products (5 headphone models)
│ ├── Quality Attributes (sound, build, battery, comfort)
│ └── Review Storage
├── Agent Scheduler (RandomActivation)
├── Data Collector (metrics tracking)
└── Phase Execution
├── Review Phase (genuine + fake review generation)
└── Shopping Phase (purchase decisions)
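The sketch below shows how this structure could map onto the pre-3.0 MESA API; apart from FakeReviewModel and RandomActivation, the class and method names are illustrative rather than taken from the project code.

```python
# Minimal MESA skeleton of the two-phase model; helper names and the
# round-robin persona assignment are illustrative assumptions.
from mesa import Agent, Model
from mesa.time import RandomActivation
from mesa.datacollection import DataCollector


class ShopperAgent(Agent):
    def __init__(self, unique_id, model, persona):
        super().__init__(unique_id, model)
        self.persona = persona  # "Impulsive" | "Careful" | "Skeptical"

    def step(self):
        # Shopping phase: sample reviews, query the LLM, record BUY / SKIP.
        pass


class FakeReviewModel(Model):
    def __init__(self, products, n_shoppers):
        super().__init__()
        self.products = products  # each product carries its own review list
        self.schedule = RandomActivation(self)
        for i in range(n_shoppers):
            persona = ("Impulsive", "Careful", "Skeptical")[i % 3]
            self.schedule.add(ShopperAgent(i, self, persona))
        self.datacollector = DataCollector(
            model_reporters={
                "n_reviews": lambda m: sum(len(p["reviews"]) for p in m.products)
            }
        )

    def review_phase(self):
        # Genuine reviewers post every iteration; fake reviewers only during
        # the burst and maintenance windows (see Experimental Design).
        pass

    def step(self):
        self.review_phase()      # Review Phase: genuine + fake review generation
        self.schedule.step()     # Shopping Phase: purchase decisions
        self.datacollector.collect(self)
```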
Agent Population Breakdown (7,580 Total)
Reviewer Agents (1,730 total)
- Genuine Reviewers (1,200): 20 iterations x 5 products x 12 reviews
  - 4 Critical-personality reviewers (harsh ratings)
  - 4 Balanced-personality reviewers (objective ratings)
  - 4 Lenient-personality reviewers (generous ratings)
- Fake Reviewers (530): Coordinated campaign
  - Burst phase: 80 reviews (iterations 4-5)
  - Maintenance: 450 reviews (iterations 6-20)
Shopper Agents (6,000 total)
- 20 iterations x 5 products x 3 personas x 20 shoppers per group
- Impulsive (33%): Reads 3 reviews, fast decisions
- Careful (33%): Reads 10 reviews, balanced analysis
- Skeptical (33%): Reads 15 reviews, pattern detection
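As a compact reference, the persona parameters above can be captured in a configuration table; the Impulsive trust threshold comes from the shopper decision rules described later, while the remaining fields are assumptions.

```python
# Persona parameters from the breakdown above; "trust_threshold" (Impulsive)
# is taken from the shopper decision rules below, other fields are assumed.
PERSONAS = {
    "Impulsive": {"reviews_read": 3,  "trust_threshold": 3.8,  "checks_patterns": False},
    "Careful":   {"reviews_read": 10, "trust_threshold": None, "checks_patterns": False},
    "Skeptical": {"reviews_read": 15, "trust_threshold": None, "checks_patterns": True},
}
```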
Product Configuration
Five headphone products with realistic quality attributes:
| ID | Product | Quality Tier | Price (IDR) | Avg Quality | Targeted? |
|---|---|---|---|---|---|
| 1 | SoundMax Pro | High | 450,000 | 8.5/10 | No |
| 2 | AudioBlast Wireless | Med-High | 350,000 | 7.5/10 | No |
| 3 | BudgetBeats | Low | 150,000 | 4.25/10 | YES |
| 4 | TechWave Elite | Premium | 650,000 | 9.25/10 | No |
| 5 | ClearSound Basic | Low-Med | 250,000 | 5.25/10 | YES |
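For reference, the catalogue above maps directly onto a small data structure (field names are illustrative):

```python
# Product catalogue from the table above; avg_quality is the mean of the
# sound / build / battery / comfort scores.
PRODUCTS = [
    {"id": 1, "name": "SoundMax Pro",        "price_idr": 450_000, "avg_quality": 8.50, "targeted": False, "reviews": []},
    {"id": 2, "name": "AudioBlast Wireless", "price_idr": 350_000, "avg_quality": 7.50, "targeted": False, "reviews": []},
    {"id": 3, "name": "BudgetBeats",         "price_idr": 150_000, "avg_quality": 4.25, "targeted": True,  "reviews": []},
    {"id": 4, "name": "TechWave Elite",      "price_idr": 650_000, "avg_quality": 9.25, "targeted": False, "reviews": []},
    {"id": 5, "name": "ClearSound Basic",    "price_idr": 250_000, "avg_quality": 5.25, "targeted": True,  "reviews": []},
]
```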
LLM Integration Architecture
Local Inference System
Deployment Configuration:
- Model: Llama 3.1 8B
- Platform: Ollama (local inference)
- Context Windows: 2048-8192 tokens (dynamic)
- Temperature: 0.6 (reviewers), 0.3 (shoppers), 0.7 (fake reviewers)
- Cost: $0 (100% API cost elimination)
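A minimal sketch of the inference wrapper, assuming the official ollama Python client; the retry policy and the context-scaling rule are illustrative rather than the project's exact implementation.

```python
# Local Llama 3.1 8B inference via Ollama with a dynamic context window and
# simple retries; the scaling heuristic and retry count are assumptions.
import ollama


def generate_text(prompt: str, temperature: float, max_retries: int = 3) -> str:
    # Grow the context window with prompt length, clamped to 2048-8192 tokens.
    num_ctx = min(8192, max(2048, len(prompt) // 3))
    for attempt in range(max_retries):
        try:
            result = ollama.generate(
                model="llama3.1:8b",
                prompt=prompt,
                options={"temperature": temperature, "num_ctx": num_ctx},
            )
            return result["response"].strip()
        except Exception:
            if attempt == max_retries - 1:
                raise
    return ""


# Temperatures per agent type, as listed above:
# genuine reviewers 0.6, shoppers 0.3, fake reviewers 0.7.
```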
Prompt Engineering Strategy
Genuine Reviewer Prompts: Quality-aware rating system with personality-based guidance
Example for Critical personality + Low quality product:
Product: BudgetBeats (Rp 150,000)
Personality: Critical
ACTUAL QUALITY SCORES:
- Sound: 4.0/10
- Build: 3.5/10
- Battery: 5.5/10
- Comfort: 4.5/10
Average: 4.25/10
YOUR RATING MUST BE: 1-2 stars (poor quality)
CRITICAL RULES:
1. Rating MUST reflect actual quality
2. DO NOT give 4-5 stars to quality < 7.0
3. Write 2-3 sentences mentioning specific features
4. Use natural language, no templates
Fake Reviewer Prompts: Variation strategy to avoid detection patterns
CRITICAL RULES:
1. ALWAYS give 5 stars
2. Vary opening - DO NOT always start with "I've been using"
3. Sound natural and human-like
4. Include ONE specific detail
5. Keep 1-4 sentences
OPENING VARIATIONS:
- Direct opinion: 'Really happy with this...'
- Time context: 'After two weeks...'
- Comparison: 'Better than expected...'
- Situation: 'Bought for commute...'
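At runtime, these variation rules could be assembled into a prompt like the hypothetical builder below (the wording and helper name are assumptions):

```python
# Hypothetical fake-review prompt builder that rotates opening styles to
# avoid detectable uniformity, following the rules above.
import random

OPENING_HINTS = [
    "Direct opinion: 'Really happy with this...'",
    "Time context: 'After two weeks...'",
    "Comparison: 'Better than expected...'",
    "Situation: 'Bought for commute...'",
]


def build_fake_review_prompt(product_name: str, price_idr: int) -> str:
    hint = random.choice(OPENING_HINTS)
    return (
        f"Write a glowing 5-star review for {product_name} (Rp {price_idr:,}).\n"
        f"Open in this style: {hint}\n"
        "Sound natural and human-like, include ONE specific detail, "
        "and keep it to 1-4 sentences."
    )
```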
Shopper Decision Prompts: Persona-specific logic with measurable metrics
- Impulsive: trusts any rating >= 3.8 and buys immediately (HIGH vulnerability)
- Careful: analyzes 10 reviews, weighing positive against negative points (MEDIUM vulnerability)
- Skeptical: checks for burst patterns, rating jumps, and 5-star surges, as sketched below (LOW vulnerability)
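A minimal illustration of those Skeptical checks, reusing the burst-detection signals cited later under Practical Implications (a rating jump above 1.0 stars, or more than 60% five-star reviews in the recent window); the window size and exact thresholds used in the project are assumptions here.

```python
# Illustrative burst/surge check for the Skeptical persona; window size and
# thresholds are assumptions aligned with the detection signals noted later.
def looks_manipulated(reviews: list[dict], window: int = 15) -> bool:
    recent = reviews[-window:]
    if len(recent) < 5:
        return False                      # not enough evidence either way
    older = reviews[:-window] or recent   # fall back if history is short
    avg_recent = sum(r["rating"] for r in recent) / len(recent)
    avg_older = sum(r["rating"] for r in older) / len(older)
    five_star_ratio = sum(1 for r in recent if r["rating"] == 5) / len(recent)
    return (avg_recent - avg_older) > 1.0 or five_star_ratio > 0.6
```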
Experimental Design
Attack Timeline
Iterations 1-3: BASELINE → Iterations 4-5: BURST → Iterations 6-20: MAINTENANCE
Baseline (1-3): only genuine reviews, 180 total (3 iterations x 60 reviews)
Burst (4-5): 80 fake reviews injected across the two targeted products (40 per iteration)
Maintenance (6+): Adaptive volume based on rating:
- AGGRESSIVE (15): if rating < 4.0
- MODERATE (11): if rating < 4.3
- NORMAL (7): if rating >= 4.3
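Encoded directly, the adaptive schedule above reduces to a small lookup on the product's current average rating (the per-target interpretation of the volumes is assumed):

```python
# Adaptive maintenance volume of fake reviews for a targeted product,
# using the thresholds listed above.
def maintenance_volume(current_rating: float) -> int:
    if current_rating < 4.0:
        return 15   # AGGRESSIVE
    if current_rating < 4.3:
        return 11   # MODERATE
    return 7        # NORMAL
```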
Data Collection Pipeline
Per-Iteration Metrics:
- Reviews CSV: product_id, rating, text, is_fake, iteration, personality
- Transactions CSV: product_id, persona, decision, reasoning, iteration
- Model Metrics CSV: ratings, review counts, fake counts per product
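A sketch of the per-iteration export step, assuming pandas; the column contents follow the lists above, while the file names are hypothetical.

```python
# Write the three per-iteration CSV streams described above; record dicts hold
# the listed columns, file names are hypothetical.
import pandas as pd


def export_iteration(reviews: list[dict], transactions: list[dict],
                     metrics: list[dict], iteration: int) -> None:
    pd.DataFrame(reviews).to_csv(f"reviews_iter{iteration:02d}.csv", index=False)
    pd.DataFrame(transactions).to_csv(f"transactions_iter{iteration:02d}.csv", index=False)
    pd.DataFrame(metrics).to_csv(f"model_metrics_iter{iteration:02d}.csv", index=False)
```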
Statistical Analysis Methodology
RQ1: Chi-Square Test for Fake Review Impact
Hypothesis:
- H0: Fake reviews have no effect on conversion rates
- H1: Fake reviews significantly increase conversion rates
Results for Targeted Products:
| Product | Baseline Conv | Attack Conv | Increase | Chi-Square | p-value | Cramér's V |
|---|---|---|---|---|---|---|
| BudgetBeats | 0.00% | 54.17% | +54.17pp | 121.30 | < 0.0001 | 0.636 |
| ClearSound | 0.00% | 71.67% | +71.67pp | 177.35 | < 0.0001 | 0.769 |
Interpretation:
- p < 0.0001: Extremely strong statistical significance (reject H0)
- Cramér's V > 0.6: Large effect size (strong association)
- Practical significance: 54-72pp increase is massive in real-world context
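The BudgetBeats result can be reproduced with SciPy. The 2x2 counts below are reconstructed from the reported percentages (0 of 180 baseline shoppers purchased; 65 of 120 burst-period shoppers purchased), so treat this as a worked illustration rather than the raw pipeline.

```python
# Chi-square test of independence (Yates-corrected, SciPy's default for 2x2)
# plus Cramér's V for effect size; counts reconstructed from reported rates.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[0, 180],    # baseline: purchases, non-purchases
                  [65, 55]])   # burst:    purchases, non-purchases
chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, Cramér's V = {cramers_v:.3f}")
# -> approximately chi2 = 121.30, p < 0.0001, V = 0.636
```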
RQ2: ANOVA for Persona Vulnerability
Hypothesis:
- H0: All personas equally vulnerable to manipulation
- H1: Significant differences exist between personas
Vulnerability Ranking (during attack on targeted products):
| Rank | Persona | Baseline | Attack Period | Impact |
|---|---|---|---|---|
| 1 | Careful | 0.0% | 95.0% | +95.00pp |
| 2 | Impulsive | 0.0% | 92.5% | +92.50pp |
| 3 | Skeptical | 0.0% | 40.4% | +40.44pp |
ANOVA Test Results:
- F-statistic = 540.28
- p-value < 0.0001 (***)
- Sample sizes: 680 per persona
Key Finding: Skeptical persona 2.3x less vulnerable (40pp vs 92-95pp for others)
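The RQ2 procedure can be sketched with SciPy's one-way ANOVA over per-shopper binary purchase outcomes (680 per persona) during the attack. The arrays below are placeholders simulated at the reported conversion rates, so the resulting F-statistic only approximates the reported value.

```python
# One-way ANOVA across persona groups on binary purchase outcomes
# (1 = bought, 0 = skipped); placeholder data drawn at the reported rates.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
careful   = rng.binomial(1, 0.950, 680)
impulsive = rng.binomial(1, 0.925, 680)
skeptical = rng.binomial(1, 0.404, 680)

f_stat, p_value = f_oneway(careful, impulsive, skeptical)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
```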
Experimental Results
RQ1 Detailed Results
BudgetBeats (Low Quality, ID = 3)
| Phase | Iterations | Avg Rating | Conversion | Change |
|---|---|---|---|---|
| Baseline | 1-3 | 2.1 stars | 0.00% | - |
| Burst | 4-5 | 3.8 stars | 54.17% | +54.17pp |
| Post-Burst | 6-20 | 4.2 stars | 79.22% | +79.22pp |
ClearSound Basic (Low-Medium, ID = 5)
| Phase | Iterations | Avg Rating | Conversion | Change |
|---|---|---|---|---|
| Baseline | 1-3 | 2.3 stars | 0.00% | - |
| Burst | 4-5 | 4.1 stars | 71.67% | +71.67pp |
| Post-Burst | 6-20 | 4.4 stars | 76.22% | +76.22pp |
Control Products (Non-Targeted):
| Product | Quality | Baseline Conv | Final Conv | Change |
|---|---|---|---|---|
| SoundMax Pro | High | 75% | 78% | +3pp (natural) |
| AudioBlast | Med-High | 58% | 62% | +4pp (natural) |
| TechWave Elite | Premium | 82% | 85% | +3pp (natural) |
Key Insight: Only targeted products show massive jumps; control products show natural small fluctuations.
RQ2 Detailed Results
Persona Vulnerability on Targeted Products:
| Persona | Baseline | Attack | Impact | Rank |
|---|---|---|---|---|
| Careful | 0.0% | 95.0% | +95.0pp | #1 |
| Impulsive | 0.0% | 92.5% | +92.5pp | #2 |
| Skeptical | 0.0% | 40.4% | +40.4pp | #3 |
Key Findings:
- Careful Persona Most Vulnerable: Paradoxically, deeper analysis (10 reviews) increases susceptibility when fake reviews dominate the sample.
- Skeptical Persona 2.3x More Resistant: Pattern detection and burst ratio analysis provide some protection.
- Impulsive Nearly As Vulnerable: Despite reading only 3 reviews, high trust in star ratings makes them susceptible.
Key Insights and Contributions
Scientific Contributions
Quantitative Evidence of Manipulation Effectiveness
First simulation study to demonstrate +54-72pp conversion increase with publication-grade statistical rigor (p < 0.0001).
Persona-Specific Vulnerability Profiles
Novel finding: Careful persona MOST vulnerable (95pp impact) despite deeper review analysis. Challenges assumption that more information = better decisions when information is manipulated.
LLM-Powered Agent-Based Modeling Methodology
Demonstrates feasibility of local LLM inference (Llama 3.1 8B) for realistic natural language generation in ABM research with zero API costs.
Temporal Dynamics of Trust Manipulation
Quantifies burst + maintenance attack strategy effectiveness, showing sustained elevation of ratings and conversion even as genuine reviews accumulate.
Detection-Resistant Campaign Design
Adaptive maintenance strategy (AGGRESSIVE/MODERATE/NORMAL) maintains ratings without obvious spikes that detection algorithms might flag.
Practical Implications
For E-Commerce Platforms:
- Burst Detection: Monitor for sudden rating jumps > 1.0 stars
- Temporal Analysis: Flag products with 60%+ positive reviews in short timespan
- Reviewer Patterns: Detect coordinated timing of 5-star reviews
- Consumer Education: Warn Careful/Impulsive users they're most vulnerable
For Consumers:
- Skeptical Mindset: 2.3x more protective than trusting approach
- Pattern Recognition: Look for rating jumps and review bursts
- Quality Signals: Focus on specific product details vs generic praise
- Baseline Comparison: Check pre-campaign ratings if available
For Researchers:
- ABM + LLM Synergy: Realistic behavior simulation without manual scripting
- Cost-Effective Research: Local inference enables large-scale experiments
- Reproducible Methodology: Open-source framework for replication
- Statistical Rigor: Publication-ready analysis pipeline
Technical Implementation Highlights
Performance Metrics
LLM Inference:
- Average: 1.8 seconds per review generation
- Throughput: approximately 1,730 reviews in approximately 52 minutes
- Memory: approximately 8 GB (Ollama server)
Simulation Runtime:
- 20 iterations: approximately 4-6 hours on consumer hardware
- Bottleneck: LLM inference
Reproducibility:
- Fixed random seed: 42
- Pinned dependencies: ollama==0.1.0, mesa<3.0
- CSV exports for independent analysis
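For completeness, the reproducibility setup amounts to seeding every randomness source and pinning the two key dependencies; the snippet below is a sketch of that setup, not the project's exact bootstrap code.

```python
# Seed the randomness sources used by the simulation (fixed seed 42, as noted).
# Pinned dependencies per the notes above: ollama==0.1.0, mesa<3.0
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
```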
Lessons Learned
Key Insights
LLM Integration:
- Local inference viable for research (zero cost)
- Dynamic context windows prevent overflow
- Quality-speed tradeoff acceptable (1.8s per review)
Prompt Engineering:
- Quality-aware prompts generate realistic genuine reviews
- Variation templates prevent fake review uniformity
- Persona-specific rules create emergent behavior
Statistical Rigor:
- Multiple tests (Chi-Square, ANOVA) strengthen claims
- Effect sizes (Cramér's V) show practical significance
- Clear baseline/attack comparison isolates causality
Consumer Behavior:
- Surprising: Careful persona MOST vulnerable (not least)
- Intuitive: Skeptical pattern detection provides 2.3x protection
- Actionable: Skeptical-style pattern checks (burst and rating-jump detection) protect far more than simply reading additional reviews
Future Work
Model Extensions
Network Effects:
- Social influence between consumers
- Viral review sharing
- Influencer-driven campaigns
Platform Interventions:
- Detection algorithm simulation
- Reviewer reputation scores
- Verified purchase badges
Statistical Rigor:
- Monte Carlo simulation (100+ runs)
- Sensitivity analysis on all parameters
- Bayesian inference for uncertainty quantification
Conclusion
This agent-based modeling study provides empirical evidence that coordinated fake review campaigns are highly effective at manipulating consumer trust, increasing conversion rates by +54-72 percentage points for low-quality products with p < 0.0001 statistical significance.
Key Achievements:
- Publication-grade statistical evidence (Chi-Square = 121-177, p < 0.0001)
- Novel vulnerability findings (Careful persona MOST vulnerable, +95pp impact)
- Technical innovation (Local Llama 3.1 8B, 100% cost elimination)
- Methodological contribution (LLM-powered ABM for realistic simulation)
Impact: Demonstrates that fake reviews work (72%+ conversion for low-quality products), consumers are vulnerable (even "Careful" shoppers), detection is possible (burst patterns provide signals), and skepticism helps (2.3x more resistant).
Future work should focus on real-world validation with platform data, detection algorithm evaluation, and long-term temporal dynamics.
Project Repository: GitHub - fake-review-abm-llm
Authors:
- Ikrar Gempur Tirani (D121231015) - Lead Developer
- Muhammad Irgi Abayl Marzuku (D121231102) - Statistical Analysis
- Ahnaf Fauzan Zaki (D121231033) - Data Visualization
Course: Simulation and Modeling, Hasanuddin University, 2025
Project Metrics
7,580 autonomous agents (1,200 genuine + 530 fake reviewers + 6,000 shoppers)
1,730 LLM-generated reviews (1,200 genuine + 530 fake)
+54-72pp conversion rate increase in targeted products
Chi-Square = 121.30-177.35, p < 0.0001 statistical significance
100% API cost elimination via local Llama 3.1 8B deployment
Credits & Acknowledgments
MESA Framework for agent-based modeling
Ollama for local LLM inference
Llama 3.1 8B by Meta AI
SciPy for statistical testing