April 25 (Reuters) – As the summer of 2022 drew to a close, Meta CEO Mark Zuckerberg brought together his top lieutenants for a five-hour dissection of the company’s computing capacity, focusing on its ability to perform cutting-edge artificial intelligence work, according to a Sept. 20 memo reviewed by Reuters.
They had a thorny problem: Despite massive investments in AI research, the social media giant had been slow to adopt expensive, AI-ready hardware and software systems for its core business, hampering its ability to keep pace with large-scale innovation even as it has increasingly relied on AI to support its growth, according to the memo, company statements and interviews with 12 people familiar with the changes, who spoke on condition of anonymity to discuss internal company matters.
“We have a significant gap in our AI development tools, workflows and processes. We need to invest heavily here,” said the memo, written by Meta’s new head of infrastructure, Santosh Janardhan, which was posted on the company’s internal message board in September and is being reported here for the first time.
Supporting AI work would require Meta (META.O) to “fundamentally change the design of our physical infrastructure, our software systems and our approach to provide a stable platform,” he added.
For more than a year, Meta has been engaged in a massive project to get its AI infrastructure in shape. Although the company has publicly acknowledged “playing a bit of catch-up” on AI hardware trends, details of the redesign – including capacity issues, management changes and an abandoned AI chip project – have not been previously reported.
Asked about the memo and the restructuring, Meta spokesman Jon Carvill said the company “has a proven track record of building and deploying state-of-the-art infrastructure at scale, combined with deep expertise in AI research and engineering”.
“We are confident in our ability to continue expanding the capabilities of our infrastructure to meet our near- and long-term needs as we bring new AI-powered experiences to our family of apps and consumer products,” Carvill said. He declined to say whether Meta had scrapped its AI chip.
Janardhan and other executives were not made available for interviews requested through the company.
The overhaul has boosted Meta’s capital spending by about $4 billion a quarter, according to the company’s disclosures — nearly double its spending in 2021 — and caused it to pause or cancel previously planned data center builds at four sites.
These investments coincided with a period of severe financial restraint for Meta, which has been laying off employees since November on a scale not seen since the dotcom collapse.
Meanwhile, OpenAI’s ChatGPT, backed by Microsoft, became the fastest-growing consumer app in history after its November 30 debut, sparking an arms race among tech giants to launch products using so-called generative AI, which goes beyond recognizing patterns in data, as other AI does, to create human-like written and visual content in response to prompts.
Generative AI consumes enormous amounts of computing power, amplifying the urgency of Meta’s capacity scramble, five of the sources said.
FALLING BEHIND
According to these five sources, a key source of the problem can be traced to Meta’s late adoption of the graphics processing unit, or GPU, for AI work.
GPU chips are particularly well suited to AI processing because they can perform large numbers of tasks simultaneously, reducing the time it takes to crunch through billions of pieces of data.
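As a rough illustration of that difference — a minimal sketch assuming PyTorch is installed, not anything from Meta’s own stack — the matrix multiplication at the heart of most AI models can be timed on a CPU and, when one is available, on a GPU, where thousands of cores work on the same operation in parallel:

```python
# Minimal sketch: time one large matrix multiplication, the core operation in
# most AI models, on a CPU and (if present) on a GPU. Assumes PyTorch.
import time
import torch

x = torch.randn(4096, 4096)
w = torch.randn(4096, 4096)

start = time.time()
y_cpu = x @ w  # executed by a handful of CPU cores
print(f"CPU matmul: {time.time() - start:.3f}s")

if torch.cuda.is_available():
    x_gpu, w_gpu = x.cuda(), w.cuda()
    torch.cuda.synchronize()   # wait for the copies to the GPU to finish
    start = time.time()
    y_gpu = x_gpu @ w_gpu      # thousands of GPU cores work on this in parallel
    torch.cuda.synchronize()   # wait for the multiply itself to finish
    print(f"GPU matmul: {time.time() - start:.3f}s")
```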
However, GPUs are also more expensive than other chips, with chipmaker Nvidia Corp (NVDA.O) controlling 80% of the market and maintaining a sizable lead in accompanying software, the sources said.
Nvidia did not respond to a request for comment for this story.
Instead, until last year, Meta largely ran AI workloads on its fleet of central processing units (CPUs), the workhorse chips of the computing world, which have filled data centers for decades but perform AI work poorly.
According to two of the sources, the company had also begun using a custom chip it designed in-house for inference, an AI process in which algorithms trained on massive amounts of data make judgments and generate responses to prompts.
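In practice, inference simply means running an already-trained model on new input to produce a judgment. A minimal sketch, using the open-source Hugging Face transformers library rather than anything related to Meta’s chip:

```python
# Minimal sketch of inference: a model already trained on large amounts of
# data is run on new input to produce a judgment. Assumes the open-source
# Hugging Face transformers library; unrelated to Meta's in-house chip.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a small pretrained model
print(classifier("This short video is actually pretty funny"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```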
By 2021, that two-pronged approach had proved slower and less efficient than one built around GPUs, which were also more flexible in running different types of models than Meta’s chip, the two people said.
Meta declined to comment on the performance of its AI chip.
As Zuckerberg pivoted the company toward the metaverse — a collection of digital worlds enabled by augmented and virtual reality — the capacity crunch was slowing Meta’s ability to deploy AI in response to threats such as the rise of rival social media app TikTok and Apple-led ad privacy changes, said four of the sources.
The stumbles caught the attention of former Meta board member Peter Thiel, who resigned in early 2022, without explanation.
In a board meeting before he left, Thiel told Zuckerberg and his executives that they were complacent about Meta’s core social media business while focusing too much on the metaverse, which, he said, made the company vulnerable to a challenge from TikTok, according to two sources familiar with the exchange.
Meta declined to comment on the conversation.
PLAYING CATCH-UP
After scrapping the large-scale rollout of Meta’s own custom inference chip, which had been planned for 2022, executives reversed course and placed orders that year for billions of dollars worth of Nvidia GPUs, one source said.
Meta declined to comment on the order.
By then, Meta was already several steps behind peers such as Google, which had begun deploying its own custom-built AI chips, known as TPUs, in 2015.
Executives also set out that spring to revamp Meta’s AI units, appointing two new engineering chiefs in the process, including Janardhan, the author of the September memo.
More than a dozen executives left Meta during the months-long upheaval, according to their LinkedIn profiles and a source familiar with the departures, amounting to a nearly wholesale change of leadership for its AI infrastructure.
Meta then began to rearrange its data centers to accommodate incoming GPUs, which consume more power and produce more heat than CPUs, and which must be tightly clustered with a dedicated network between them.
The facilities needed 24 to 32 times the networking capacity and new liquid cooling systems to handle the clusters’ heat, requiring them to be “completely redesigned”, according to the memo from Janardhan and four sources familiar with the project, the details of which have not previously been reported.
Early in the work, Meta laid out internal plans to begin developing a new, more ambitious in-house chip that, like a GPU, would be capable of both training AI models and performing inference. The previously unreported project is slated to be finished around 2025, two sources said.
Carvill, the Meta spokesman, said data center construction that had been paused during the transition to the new designs would resume later this year. He declined to comment on the chip project.
TRADE-OFFS
Even as it scales up its GPU capacity, Meta for now has little to show for it, while competitors such as Microsoft and Google promote public launches of commercial generative AI products.
CFO Susan Li acknowledged in February that Meta was not devoting much of its current computing power to generative work, saying that “basically all of our AI capacity goes into ads, feeds and Reels,” its TikTok-like short-video format that is popular with younger users.
According to four of the sources, Meta only prioritized building generative AI products after ChatGPT launched in November. Even though its research lab FAIR, or Facebook AI Research, had been releasing prototypes of the technology since late 2021, the company had not focused on converting that well-regarded research into products, they said.
As investor interest skyrockets, that is changing. Zuckerberg announced in February the creation of a new high-level generative AI team that he said would “energize” the company’s work in this area.
CTO Andrew Bosworth also said this month that generative AI was the area where he and Zuckerberg were spending the most time, and he forecast that Meta would release a product this year.
Two people familiar with the new team said its work is in its early stages and focused on building a foundation model, a core program that can later be fine-tuned and adapted for different products.
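As an illustration of what that means in practice — a minimal sketch assuming the Hugging Face transformers library and PyTorch, with the small public “gpt2” model standing in for whatever foundation model such a team might actually use — a generic pretrained generative model is updated on task-specific examples:

```python
# Minimal sketch of fine-tuning a foundation model: a generic pretrained
# generative model is updated on task-specific text. Assumes Hugging Face
# transformers and PyTorch; "gpt2" is a stand-in, not Meta's model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A hypothetical product-specific example the base model is adapted to.
text = "Write a friendly caption for a photo of friends at the beach."
batch = tokenizer(text, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One fine-tuning step: the loss measures how well the model predicts the text.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
```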
Carvill, the spokesperson for Meta, said the company has been building generative AI products on different teams for more than a year. He confirmed that the work accelerated in the months following the arrival of ChatGPT.
Reporting by Katie Paul, Krystal Hu, Stephen Nellis and Anna Tong; additional reporting by Jeffrey Dastin; edited by Kenneth Li and Claudia Parsons