
Odd Lots

How to Build the Ultimate GPU Cloud to Power AI

Thu Jul 20 2023
Bank of America, AI chip market, CoreWeave, NVIDIA, AI infrastructure, GPU compute, Data centers, Supply chain, AI boom

Description

This episode covers Bank of America's expansion in Paris, the AI chip market, CoreWeave's infrastructure and services, the challenges of building AI infrastructure, CoreWeave's relationship with NVIDIA and market demand, NVIDIA's dominance, CoreWeave's efficiency and competitive advantage, the difficulty of repurposing crypto GPUs for AI workloads, supply-chain constraints on meeting demand, and uncertainty about the AI boom.

Insights

Bank of America's Expansion in Paris

Bank of America expanded operations in Paris to serve European clients and be a one-stop shop for their needs.

CoreWeave's Role in the AI Chip Market

CoreWeave is a specialized cloud services provider that offers high-volume compute to AI companies, addressing the demand for infrastructure in the AI chip market.

Challenges in Building AI Infrastructure

Building infrastructure for next-generation AI models requires a different type of compute, and CoreWeave specializes in providing this compute to end consumers for AI development.

NVIDIA's Dominance in the AI Market

NVIDIA chips dominate the AI market thanks to their established ecosystem and support within the machine learning community, an advantage that extends beyond GPUs to components like InfiniBand fabric.

CoreWeave's Efficiency and Competitive Advantage

CoreWeave's cloud is 40-60% more efficient than hyperscalers on a workload-adjusted basis, giving it a competitive advantage.

Challenges in Repurposing Crypto GPUs for AI Workloads

Repurposing crypto-mining GPUs for enterprise AI workloads is challenging and requires expertise, because the retail-grade GPUs used for crypto mining are not suited to enterprise-grade workloads.

Challenges in Meeting Demand and Supply Chain

The supply chain for accessing chips has become increasingly challenging, making it difficult to meet the growing demand for AI infrastructure.

Uncertainty about the AI Boom

There is uncertainty about whether companies building AI models will make money, raising questions about the longevity of the AI boom and the monetization of AI products.

Chapters

  1. Bank of America's Expansion in Paris
  2. AI Chip Market and CoreWeave's Role
  3. CoreWeave's Infrastructure and Services
  4. Challenges in Building AI Infrastructure
  5. CoreWeave's Relationship with NVIDIA and Market Demand
  6. Infrastructure Challenges and NVIDIA's Dominance
  7. CoreWeave's Efficiency and Competitive Advantage
  8. Challenges in Repurposing Crypto GPUs for AI Workloads
  9. Challenges in Meeting Demand and Supply Chain
  10. Uncertainty and Questions about AI Boom

Bank of America's Expansion in Paris

00:00 - 06:38

  • Bank of America expanded operations in Paris in 2019 following the Brexit vote.
  • Vanessa Holtz, CEO Bank of America Securities Europe SA and France Country Executive, ensured seamless service for European clients.
  • Bank of America aimed to be a one-stop shop for clients and bring its full spectrum of services to Paris.

AI Chip Market and CoreWeave's Role

00:00 - 06:38

  • How advanced AI chips are traded and allocated remains something of a mystery worth exploring.
  • NVIDIA aims to provide a holistic approach by offering not just hardware but also services.
  • CoreWeave, a specialized cloud services provider, offers high-volume compute to AI companies.
  • The demand for infrastructure in the AI chip market has created a massive supply-demand imbalance.
  • The adoption curve for AI software is one of the fastest ever seen, driving increased demand for infrastructure buildout.
  • CoreWeave recently raised over $400 million and is valued at around $2 billion.

CoreWeave's Infrastructure and Services

06:24 - 13:30

  • CoreWeave is a specialized cloud service provider focused on highly parallelizable workloads.
  • They build and operate the world's most performant GPU infrastructure at scale.
  • They serve the artificial intelligence, media and entertainment, and computational chemistry sectors.
  • Their infrastructure allows entities to train next-generation machine learning models faster than anyone else in the market.
  • AI infrastructure involves making data centers intelligent by expanding throughput and compute capability.
  • CoreWeave designs its infrastructure to NVIDIA's DGX reference architecture, which maximizes performance with NVIDIA infrastructure components.
  • They co-locate within tier 3 or tier 4 data centers that guarantee high uptime through power redundancy, internet redundancy, security, and connectivity to the network backbone.
  • There has been a recent increase in demand for data center space due to the shift towards GPU compute instead of CPU compute.
  • GPU compute is about four times more power dense than CPU compute, leading to inefficient use of data center space and cooling issues.
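The power-density point above can be illustrated with back-of-envelope arithmetic. The figures below (a 1 MW hall, a 10 kW CPU rack) are assumed for illustration and do not come from the episode; only the roughly 4x multiplier does:

```python
# Back-of-envelope sketch: if GPU racks draw ~4x the power of CPU racks,
# a data hall provisioned for a fixed power budget hosts far fewer of them,
# leaving floor space stranded and concentrating heat. Figures are assumed.

HALL_POWER_KW = 1_000            # assumed power budget for one data hall
CPU_RACK_KW = 10                 # assumed draw of a conventional CPU rack
GPU_RACK_KW = 4 * CPU_RACK_KW    # "about four times more power dense"

cpu_racks = HALL_POWER_KW // CPU_RACK_KW   # racks the hall supports with CPUs
gpu_racks = HALL_POWER_KW // GPU_RACK_KW   # racks the hall supports with GPUs

print(cpu_racks, gpu_racks)  # → 100 25
```

Under these assumed numbers, the same hall supports a quarter as many GPU racks as CPU racks, which is why cooling and power delivery, not floor space, become the binding constraints.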

Challenges in Building AI Infrastructure

13:13 - 20:28

  • The power density issue in data centers has become a major problem for the industry.
  • There is a limited number of data centers designed to accommodate the new power density requirements.
  • Connecting newer types of chips in data centers requires high data throughput and a different type of network fabric.
  • NVIDIA's InfiniBand technology is used to connect devices in supercomputers.
  • Building infrastructure for next-generation AI models requires a completely different type of compute.
  • CoreWeave specializes in building this type of compute and aims to provide it to end consumers for AI development.

CoreWeave's Relationship with NVIDIA and Market Demand

20:09 - 27:18

  • CoreWeave has a good relationship with NVIDIA, allowing them quick access to the latest chips and infrastructure.
  • NVIDIA values CoreWeave's ability to deliver on promises and provide performant configurations.
  • The demand for AI training is currently high, but the real demand will be in the inference market.
  • A company that trained their model using 10,000 A100 GPUs may need a million GPUs within one to two years of launch to support the entire inference demand.
  • There is a shortage of GPU infrastructure globally, posing a challenge for scaling AI models.
  • Hyperscalers like Amazon, Google, and Microsoft are slowly ramping up their infrastructure with H100 chips.
  • Access to H100 chips for hyperscalers is expected to begin in late Q3.
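The training-versus-inference example above implies a striking multiplier, sketched below. The 10,000 and 1,000,000 figures are the guest's illustration, not a forecast:

```python
# Rough ratio implied by the episode's example: a model trained on
# 10,000 A100 GPUs may need on the order of 1,000,000 GPUs within a
# year or two of launch to serve full inference demand.

training_gpus = 10_000       # GPUs used to train the model (episode's example)
inference_gpus = 1_000_000   # GPUs needed to serve inference (episode's example)

scale_up = inference_gpus // training_gpus

print(scale_up)  # → 100
```

In other words, by this illustration serving a successful model demands roughly 100x the GPU fleet that training it did, which is why the guest argues inference, not training, is where the real demand lies.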

Infrastructure Challenges and NVIDIA's Dominance

26:55 - 33:55

  • Infrastructure for the H100 chipset is being built across 10 data centers in the US.
  • Hyperscalers are expected to start delivering scale access to the H100 chipset in late Q3 or even later due to the complexity of building a different type of compute infrastructure.
  • The adoption rate of AI software may outpace the ability to scale infrastructure, leading to a supply-demand imbalance.
  • NVIDIA chips still dominate the AI market due to their established ecosystem and support within the machine learning community.
  • NVIDIA's advantage extends beyond GPUs to components like the InfiniBand fabric, making it difficult for other companies to displace NVIDIA as the standard for AI infrastructure.
  • While hyperscalers have significant resources, other factors like expertise and ecosystem play a role in determining market position.

CoreWeave's Efficiency and Competitive Advantage

33:28 - 40:52

  • CoreWeave's cloud is 40-60% more efficient than hyperscalers on a workload-adjusted basis.
  • Trillion dollar companies like hyperscalers have the budget and personnel to offer AI, but it requires foundational changes in their business.
  • It would take years for hyperscalers to catch up to CoreWeave's market share and product differentiation.
  • CoreWeave started in Ethereum mining but pivoted into the AI space due to lack of competitive advantage in mining.

Challenges in Repurposing Crypto GPUs for AI Workloads

40:25 - 47:27

  • In Ethereum mining, there was no moat or durable advantage CoreWeave could build relative to competitors.
  • Producing its own chips could have provided an edge, but this path was not pursued.
  • GPU compute was explored as a way to develop uncorrelated optionality into multiple high growth markets.
  • The company aimed for 100% utilization rate across infrastructure by switching between AI workloads and cryptocurrency mining.
  • Ethereum mining effectively ended in Q3 of 2022.
  • Running a CSP (cloud service provider) is complex, requiring software development and enterprise-grade GPU chipsets.
  • Retail grade GPUs used for crypto mining are not suitable for enterprise-grade workloads.
  • Enterprise-grade workloads require low failure rates and continuous operation, which retail grade GPUs cannot provide.
  • Data centers used by crypto miners are not compatible with the requirements of enterprise AI workloads.
  • Repurposing crypto GPUs for enterprise AI workloads is challenging and requires expertise in converting data centers from tier 0/tier 1 to tier 3/tier 4.
  • Estimates suggest only a small percentage (5-15%) of crypto GPUs can be repurposed.
  • Access to chips requires building relationships with suppliers like NVIDIA and maintaining long-term commitments.

Challenges in Meeting Demand and Supply Chain

46:59 - 54:21

  • The supply chain for accessing chips has become increasingly challenging, requiring long-term relationships and planning.
  • NVIDIA is fully allocated and unable to fulfill additional compute chip requests.
  • Building a GPU cluster now requires significant lead time, with Q1 and Q2 already booked up.
  • Critical components need to be ordered ahead of time to reduce infrastructure build time.
  • The pace of demand growth has caught everyone by surprise, making it difficult for infrastructure to keep up.

Uncertainty and Questions about AI Boom

54:01 - 57:37

  • Switching traditional cloud or data center providers over to the different connectors and power delivery that GPU compute requires is not trivial.
  • The big questions are how quickly other hyperscalers can adapt and how big a moat NVIDIA can build around this business.
  • There is uncertainty about whether companies building AI models will make money, which raises questions about the longevity of the AI boom.
  • Tech companies should make money, but the monetization of AI products may be trickier than expected.