Full Visibility into AI Training Costs: How PwC Optimized a Custom LLM Pipeline Spanning 8 AWS Regions

PwC used PointFive to map and optimize a 32-billion parameter LLM training pipeline across 8 AWS regions and 5 NVIDIA GPU architectures, identifying cost reduction opportunities of 11–19%.

  • ~$180K in annual identified savings
  • ~19% cost reduction
  • 5 NVIDIA GPU architectures optimized

Overview

Client: PwC

Industry: Professional Services / AI Research

Cloud Provider: AWS

Challenge: PwC's AI research team is building a custom 32-billion parameter LLM across a complex, multi-region training pipeline using cutting-edge GPU infrastructure — from NVIDIA H200s to preview-stage B300 Blackwell GPUs. With workloads spread across 8 AWS regions, traditional FinOps tools couldn't map the full pipeline or pinpoint where waste was hiding.

Solution: PointFive's DeepWaste™ Detection Engine mapped PwC's entire LLM training pipeline end-to-end, surfacing actionable inefficiencies across compute, storage, and data transfer that traditional tools cannot see.

Results at a Glance

  • $78K/month in AI/ML infrastructure fully mapped and attributed across 8 regions
  • $9K–$15K/month in savings identified (11–19% cost reduction)
  • 5 NVIDIA GPU architectures optimized (Blackwell, Hopper, Ampere, Turing)
  • Continuous monitoring established for NVIDIA Blackwell GA pricing transition
  • Prioritized remediation plan delivered in actionable, phased engineering steps
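
The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the monthly spend and the monthly savings range stated above; the function and variable names are illustrative and not part of any PointFive tooling.

```python
# Figures quoted in the results above; names are illustrative assumptions.
MONTHLY_SPEND_USD = 78_000
MONTHLY_SAVINGS_RANGE_USD = (9_000, 15_000)


def annualize(monthly_usd):
    """Convert a monthly dollar amount to an annual one."""
    return monthly_usd * 12


def share_of_spend(savings_usd, spend_usd):
    """Express savings as a percentage of spend."""
    return savings_usd / spend_usd * 100


annual_savings = tuple(annualize(m) for m in MONTHLY_SAVINGS_RANGE_USD)
savings_pct = tuple(
    share_of_spend(m, MONTHLY_SPEND_USD) for m in MONTHLY_SAVINGS_RANGE_USD
)

print(annual_savings)                      # (108000, 180000)
print([round(p, 1) for p in savings_pct])  # [11.5, 19.2]
```

The monthly range annualizes to $108K–$180K, and dividing by the $78K monthly spend gives roughly 11.5% to 19.2%, which is what the case study rounds to 11–19%.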

Background

PwC, one of the world's leading professional services firms, is investing heavily in custom AI capabilities. The firm's AI research team operates a platform for training a custom 8-billion parameter large language model built on NVIDIA MegatronLM 2.0 and Amazon SageMaker HyperPod.

The training platform is a full ML pipeline: high-memory CPU instances handle data preparation and tokenization, GPU instances power instruction fine-tuning and model evaluation, and SageMaker HyperPod clusters run distributed training — all coordinated across 8 AWS regions with FSx for Lustre providing high-throughput storage and S3 managing checkpoint distribution.

At approximately $78K/month, with 99.6% of spend directly tied to AI/ML workloads, even modest percentage improvements translate into meaningful savings. The team needed confidence that optimizations wouldn't disrupt a training pipeline where a single misconfiguration could waste days of GPU time.

Objectives

  • Map the full AI training pipeline across all compute, storage, and data transfer components spanning 8 AWS regions
  • Identify waste in GPU and ML infrastructure across SageMaker, EC2 GPU instances, FSx storage tiers, and cross-region data flows
  • Prepare for Blackwell cost impact before NVIDIA Blackwell (P6) instances transition from AWS Preview to GA pricing
  • Deliver actionable, prioritized recommendations with engineering-ready remediation steps

Challenges

Multi-region pipeline complexity: The training pipeline spans us-east-1 (checkpoint hub), ap-south-1 (primary GPU training), us-east-2 (HyperPod + data prep), us-west-2 (Blackwell testing), and 4 additional regions. A checkpoint transfer cost in us-east-1 is meaningless without knowing that it feeds a training job in ap-south-1.
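
To make the attribution problem concrete, here is a minimal sketch of charging a cross-region transfer to the pipeline stage that consumes the data rather than to the region where the line item is billed. The region roles come from the paragraph above; the `TransferLineItem` type, the function name, and every dollar amount are hypothetical, and this is not a description of how PointFive's engine works.

```python
from collections import defaultdict
from dataclasses import dataclass

# Region roles as described in the case study (4 further regions are unnamed).
REGION_ROLES = {
    "us-east-1": "checkpoint hub",
    "ap-south-1": "primary GPU training",
    "us-east-2": "HyperPod + data prep",
    "us-west-2": "Blackwell testing",
}


@dataclass(frozen=True)
class TransferLineItem:
    """Hypothetical billing record for one cross-region transfer."""
    source_region: str
    destination_region: str
    cost_usd: float


def attribute_to_consumer(items):
    """Sum transfer costs by the role of the region that receives the data."""
    totals = defaultdict(float)
    for item in items:
        role = REGION_ROLES.get(item.destination_region, "unmapped region")
        totals[role] += item.cost_usd
    return dict(totals)


# Made-up amounts, for illustration only.
line_items = [
    TransferLineItem("us-east-1", "ap-south-1", 120.0),
    TransferLineItem("us-east-1", "us-east-2", 45.0),
    TransferLineItem("us-east-1", "ap-south-1", 30.0),
]
attributed = attribute_to_consumer(line_items)
print(attributed)
# {'primary GPU training': 150.0, 'HyperPod + data prep': 45.0}
```

All three line items are billed in us-east-1, yet grouping by destination shows that most of the cost belongs to the training stage in ap-south-1.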

Mixed compute paradigms: GPU workloads run across standalone EC2 instances, SageMaker HyperPod clusters, SageMaker Training Plans, Capacity Block reservations, and AWS Preview instances — each with different pricing models. No single AWS tool provides a unified view.

Cutting-edge hardware with no pricing history: PwC is among the earliest adopters of NVIDIA Blackwell GPUs, currently running 689 hours/month at $0 during AWS Preview. When GA pricing takes effect, this becomes a significant new cost center with no historical data to plan around.

High stakes, low tolerance for disruption: Training an 8B-parameter model is a multi-week process. The team needed optimization recommendations they could trust.

Solution

PwC adopted PointFive to bring structure and visibility to their LLM training infrastructure.

End-to-End Pipeline Mapping

The DeepWaste™ Detection Engine identified and mapped every component of the training pipeline, attributing costs and data flows across all 8 regions.

Multi-Layer Cost Analysis

| Pipeline Stage | Annual Cost | Key Resources |
|---|---|---|
| Data Preparation & Tokenization | $139K | r6a.48xlarge, r8i.metal-96xl, c6id |
| GPU Training & Fine-Tuning | $123K | P5.4xlarge (H100), P4de (A100), G5 (A10G) |
| SageMaker HyperPod | $193K | ml.g5.12xlarge, ml.m5.12xlarge/16xlarge |
| High-Performance Storage | $156K | FSx for Lustre (1000 MB/s and 250 MB/s tiers) |
| Checkpoint Storage & Distribution | $78K | S3 + cross-region transfer |
| Development Environments | $58K | SageMaker notebooks, Studio |

Key Discoveries

  • $33K/year in dormant snapshot storage — ideal for archive tier at 75% savings
  • GPU notebooks running 24/7 at ~35% utilization, a straightforward lifecycle policy fix
  • Over-provisioned FSx throughput when actual I/O patterns could be served by a lower tier
  • Cross-region checkpoint transfer waste without S3 Cross-Region Replication
  • Development instances running outside working hours
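
The snapshot finding can be sketched as a simple filter plus a savings estimate. The 75% rate is the figure quoted above. The `Snapshot` record, the 90-day age threshold used to define "dormant", and the sample inventory are assumptions made for illustration, with the sample costs chosen so that the dormant total matches the $33K/year in the first bullet.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

ARCHIVE_SAVINGS_RATE = 0.75  # savings rate quoted in the case study


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical inventory record, not an AWS API type."""
    snapshot_id: str
    created_at: datetime
    annual_cost_usd: float


def dormant_snapshots(snapshots, now, min_age_days=90):
    """Return snapshots at least min_age_days old (assumed dormancy rule)."""
    cutoff = now - timedelta(days=min_age_days)
    return [s for s in snapshots if s.created_at <= cutoff]


def archive_savings(snapshots):
    """Estimate annual savings from moving the snapshots to the archive tier."""
    return sum(s.annual_cost_usd for s in snapshots) * ARCHIVE_SAVINGS_RATE


# Illustrative inventory; IDs, ages, and costs are made up.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
inventory = [
    Snapshot("snap-old-a", now - timedelta(days=400), 20_000.0),
    Snapshot("snap-old-b", now - timedelta(days=120), 13_000.0),
    Snapshot("snap-recent", now - timedelta(days=10), 5_000.0),
]

candidates = dormant_snapshots(inventory, now)
print([s.snapshot_id for s in candidates])  # ['snap-old-a', 'snap-old-b']
print(archive_savings(candidates))          # 24750.0
```

A real rule for dormancy would likely also check whether a snapshot is still referenced by an AMI or a restore workflow, since archived snapshots take longer to restore.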

Results

$108K–$180K/year in Identified Savings (11–19% of Total Spend)

| Priority | Optimization | Annual Savings | Effort |
|---|---|---|---|
| Critical | EBS snapshot archival | $24K–$33K | 1 hour |
| High | SageMaker notebook lifecycle policies | $18K–$30K | 2 hours |
| High | S3 Intelligent-Tiering for checkpoints | $10K–$14K | 1 hour |
| High | Data transfer optimization (VPC endpoints + CRR) | $10K–$18K | 4 hours |
| Medium | EBS volume type modernization (gp2 → gp3) | $10K–$18K | 2 hours |
| Medium | Instance scheduling for dev/eval resources | $12K–$24K | 4 hours |
| Strategic | Regional consolidation assessment | $18K–$30K | 2 weeks |
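
As a rough illustration of why instance scheduling appears in the plan, the sketch below computes the share of weekly instance-hours removed when an always-on resource is restricted to a working-hours schedule. The 12-hour, 5-day schedule is an assumed example; the case study does not state which schedule was recommended.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def scheduled_savings_fraction(hours_per_day, days_per_week):
    """Fraction of always-on instance-hours avoided by a fixed weekly schedule."""
    on_hours = hours_per_day * days_per_week
    if not 0 <= on_hours <= HOURS_PER_WEEK:
        raise ValueError("schedule must fit within one week")
    return 1 - on_hours / HOURS_PER_WEEK


# Assumed schedule: 12 hours a day, 5 days a week.
fraction = scheduled_savings_fraction(hours_per_day=12, days_per_week=5)
print(f"{fraction:.1%}")  # 64.3%
```

The dollar savings depend on how much of the dev and evaluation fleet can tolerate being stopped, which is why the table reports a range rather than a single figure.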

Validated existing good practices: PointFive confirmed that PwC's use of SageMaker Training Plans and EC2 Capacity Blocks for H100 reservations was already well optimized.

Blackwell cost preparedness: With 689 hours/month of P6 Blackwell GPU usage currently at $0, PointFive established a monitoring baseline for when GA pricing takes effect, at which point the cost is estimated at $50K–$100K/month.
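
Dividing the estimated monthly cost by the current usage gives the blended hourly rate that the estimate implies. This is a back-of-the-envelope derivation from the two figures above, assuming usage stays at 689 hours/month; it is not a published AWS price.

```python
# Figures quoted in the case study.
BLACKWELL_HOURS_PER_MONTH = 689
ESTIMATED_MONTHLY_COST_USD = (50_000, 100_000)


def implied_hourly_rate(monthly_cost_usd, hours_per_month):
    """Blended cost per usage hour implied by a monthly estimate."""
    return monthly_cost_usd / hours_per_month


low_rate, high_rate = (
    implied_hourly_rate(cost, BLACKWELL_HOURS_PER_MONTH)
    for cost in ESTIMATED_MONTHLY_COST_USD
)
print(f"${low_rate:.2f} to ${high_rate:.2f} per hour")
# $72.57 to $145.14 per hour
```

Once GA pricing is announced, comparing the actual hourly rate against this implied range is one quick way to tell whether the $50K–$100K/month estimate still holds.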

Full pipeline visibility: For the first time, PwC's AI research team had a single view connecting data preparation costs to training costs to checkpoint distribution costs across regions.

Conclusion

PwC's LLM training platform represents a new class of cloud workload: multi-region, multi-architecture, rapidly evolving, and mission-critical. Traditional FinOps tools see services and line items, not training pipelines and data flows.

PointFive mapped a full LLM training pipeline across 8 AWS regions, 5 NVIDIA GPU generations, and multiple compute paradigms into a coherent cost picture with prioritized, engineering-ready optimizations.

With $9K–$15K/month in savings identified and continuous monitoring in place for the coming Blackwell cost transition, PwC is positioned to scale its custom AI capabilities with cost efficiency built into the foundation.

About PointFive

PointFive redefines how enterprises continuously optimize cloud, infrastructure, and AI environments. By combining a real-time cloud and infrastructure data fabric with AI-driven detection and guided remediation, PointFive transforms efficiency from a reporting exercise into an operational discipline. Customers achieve sustained improvements in cost, performance, reliability, and engineering accountability at scale.

To learn more, book a demo.

Savings by Service

  • ~$33K/yr: EBS Snapshot Archival
  • ~$30K/yr: SageMaker Notebook Lifecycle Policies
  • ~$18K/yr: Data Transfer Optimization
  • ~$30K/yr: Regional Consolidation
