It is an exciting time to build products powered by LLMs (large language models). Some of the best the commercial market has to offer (such as OpenAI Davinci and Cohere Generate) exhibit surprising properties, such as picking up the CAP theorem (in distributed computing) from a very vague description (https://twitter.com/LiaoPeiyuan/status/1617019242561568768) or explaining the Black-Scholes formula in an uwu voice (https://twitter.com/goodside/status/1597438176980660225). I'd like to refer to these as "breakthrough capabilities" (in the same vein as Jason Wei's "emergent abilities") that are not yet widely reflected in academic benchmarks. In this short opinion piece, I will assess the feasibility of rapidly replicating these capabilities within a six-month timeframe and reason about the most effective approach for product builders working with large language models.
In short, to get breakthrough capabilities, we need internet-scale data and collaborative training power.
Data collection relies on near-data-center storage, scraping, and curation capabilities. The former requires specialized skills and a complex setup, while the latter two are even more challenging: scraping is inherently adversarial, and curation has a high barrier to entry and remains underexplored in both industry and academia.

Collaborative training in this context refers to synchronized gradient descent across a mostly static training set. It demands significant raw computing power and bandwidth; the former is readily available on the retail market, but training a GPT-3.5-level model without adequate bandwidth has yet to be achieved (the best attempt so far, as far as I am aware, being Together's GPT-JT-6B-v1). Techniques similar to those of MosaicML have not yet reached broad adoption, which means a partnership with cloud providers or colocation centers is necessary. Colocation centers typically require bespoke relationships and contract values of at least single-digit millions, and may take up to three months to set up. Meanwhile, cloud providers such as Google lack high-bandwidth clusters, while Amazon's Elastic Fabric Adapter (EFA) stack is difficult to set up and inferior to Mellanox. Azure is reportedly only interested in deals with double-digit millions of dedicated spending, and Oracle has limited capacity beyond 64 nodes of compute. This leaves second-tier providers such as Lambda Labs and CoreWeave, but finding enough capacity to train GPT-3.5-level models may still be a challenge.
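To make the bandwidth requirement concrete, here is a rough back-of-envelope sketch, assuming fp16 gradients, pure data parallelism, and a standard ring all-reduce; the numbers are illustrative rather than measurements of any particular cluster, and real large-model runs also use tensor and pipeline parallelism, which changes the traffic pattern:

```python
# Rough estimate of per-step gradient traffic for synchronized data-parallel training.
# Assumes fp16 gradients (2 bytes/param) and a ring all-reduce, where each worker
# sends and receives roughly 2 * (workers - 1) / workers times the gradient size per step.

def allreduce_bytes_per_worker(num_params: float, bytes_per_grad: int = 2, workers: int = 64) -> float:
    grad_bytes = num_params * bytes_per_grad
    return 2 * (workers - 1) / workers * grad_bytes

for name, n_params in [("6B-parameter model", 6e9), ("175B-parameter model", 175e9)]:
    gb = allreduce_bytes_per_worker(n_params) / 1e9
    # At ~100 Gbit/s (~12.5 GB/s) of effective per-node bandwidth, this is seconds of
    # pure communication per optimizer step unless it is overlapped with compute.
    print(f"{name}: ~{gb:,.0f} GB of gradient traffic per worker per optimizer step")
```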

As for effectively utilizing that collaborative computing power, PyTorch is arguably the most commonly used framework, and its distributed training component is already challenging (as evidenced by the slow growth of projects such as Megatron-LM, Horovod, and DeepSpeed, which suggests the supply of engineers familiar with them remains flat). This means that even startups using PyTorch may not attract enough infrastructure engineers to keep breakthrough-scale LLM training jobs running. Specialized hardware (such as Tenstorrent, Graphcore, Cerebras, and Google TPUs) requires even more specialized expertise. And while academic institutions have access to supercomputers with ample computing power, it is harder to convert that into proprietary models readily useful for product building.
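For a sense of what even the "easy" end of this looks like, below is a minimal multi-node data-parallel sketch in PyTorch, with a toy model standing in for a real LM and assumed to be launched with something like `torchrun --nnodes=2 --nproc_per_node=8 train.py` (plus rendezvous settings for multi-node); everything beyond this, such as tensor and pipeline parallelism or sharded optimizer states, is where the real engineering effort goes:

```python
# Minimal DistributedDataParallel sketch; the Linear layer is a toy stand-in for a real LM.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # torchrun supplies rank/world size via env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across all workers each step
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 4096, device=local_rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                          # this is where the interconnect gets hammered
        opt.step()
        if dist.get_rank() == 0:
            print(step, loss.item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```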
A preliminary conclusion is that most startups claiming to build new models will likely only be able to train with a limited number of hardware nodes (probably just one), mainly by fine-tuning existing large models or training smaller models from scratch. The most likely candidates for fine-tuning in this category are GPT-J-6B, GPT-NeoX-20B, OPT, BART, BLOOM, and Flan-T5, among others. Per scaling laws, it is hard for these startups to compete with "breakthrough" models like Bard, Claude, and ChatGPT, whose builders have better access to highly curated data, more collaborative computing power, and talented teams to ensure consistent development progress (https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf gives a glimpse into what that process looks like).
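To see why "per scaling laws" is doing real work in that sentence, here is a rough sketch using the parametric loss fit from the Chinchilla paper (Hoffmann et al., 2022); the constants are their reported fits, and the absolute numbers matter less than the persistent gap between a single-node-scale model and a 175B-scale one trained on the same data:

```python
# Parametric scaling-law fit from Hoffmann et al. (2022): L(N, D) = E + A/N^alpha + B/D^beta,
# where N is parameter count and D is training tokens. Constants are the paper's reported fits.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

tokens = 300e9  # roughly GPT-3-scale training data
for name, n in [("6B (single-node fine-tuning territory)", 6e9), ("175B (GPT-3 scale)", 175e9)]:
    print(f"{name}: predicted loss ~{predicted_loss(n, tokens):.2f}")
```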
Another minor comment is that scaling and serving large language models remains a challenge in its own right: while frameworks like Ray Serve, Alpa, and Colossal-AI exist, there is limited collective knowledge on how to use them effectively in production. This is why most of the new models released in reaction to ChatGPT are still in the "research preview" stage.
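As an illustration of how thin the serving layer can look before the hard problems show up, here is a minimal sketch using Ray Serve with a Hugging Face pipeline; the model choice, single-GPU replica, and file name are assumptions for the example, and batching, KV-cache management, autoscaling, and multi-GPU sharding are exactly the parts this sketch leaves out:

```python
# Minimal single-replica text-generation endpoint with Ray Serve.
# Model choice and resource settings are illustrative assumptions.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self):
        from transformers import pipeline
        # GPT-J-6B as a stand-in for whichever open model is being served.
        self.pipe = pipeline("text-generation", model="EleutherAI/gpt-j-6B", device=0)

    async def __call__(self, request: Request) -> dict:
        prompt = (await request.json())["prompt"]
        out = self.pipe(prompt, max_new_tokens=64)[0]["generated_text"]
        return {"text": out}

app = Generator.bind()
# Assuming this file is saved as serve_llm.py, deploy with:
#   serve run serve_llm:app
# which exposes an HTTP endpoint on port 8000.
```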
This evaluation is likely inaccurate in three ways:
- Vision models and other multimodal models may have more favorable scaling laws, allowing unique properties to emerge with just 1-8 nodes of compute, a scale that is far more readily available on the market today.
- There are companies that have been privately training language models for specific use cases for a very long time (such as Character AI and Cresta), and their expertise and first-mover advantage will be valuable.
- Specialized training companies like MosaicML may have discovered, or will discover, how to consistently replicate breakthrough capabilities in limited use cases with reduced computing power. Open-source data initiatives like the Pile or LAION may also challenge this conclusion, though at present their data quality is still lacking.
Given the difficulty of moving along the scaling-laws curve under labor and hardware constraints, a more practical approach for product builders may be to assemble existing open- and closed-source LLMs rather than training from scratch or even fine-tuning. This is frequently called the "composability" hypothesis in recent discussions. LangChain is a noteworthy example of this effort, with non-LLM APIs such as Google Search and Wolfram Alpha also becoming involved. Automated and learned prompt engineering, with either black-box or white-box models, is another exciting possibility. It is important to be transparent about whether products are built on models trained from scratch, fine-tuned models, or composable APIs, and we should move away from the notion that only models trained from scratch are "impressive" or "technically defensible." LLMs still produce false information, have limited memory, struggle with multi-modality, and their outputs are hard to formalize. We should be cool-headed about making LLMs perform tasks that require high precision, and instead think more deeply about how to exploit their eloquence, cross-lingual contextual understanding, and exceptional fuzzy retrieval abilities.
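To make the composability idea concrete, here is a minimal sketch of chaining a non-LLM API into an LLM call; the `web_search()` helper is a hypothetical stand-in for a tool like Google Search or Wolfram Alpha, and the completion call uses the pre-1.0 openai Python client as just one possible backend (LangChain packages this same pattern behind its tool and chain abstractions):

```python
# A minimal "compose APIs around an LLM" sketch: retrieve context from a non-LLM
# tool, then let the LLM answer grounded in that context. web_search() is a
# hypothetical helper standing in for a real search or Wolfram Alpha client.
import openai  # pre-1.0 client; assumes OPENAI_API_KEY is set in the environment

def web_search(query: str) -> str:
    # Placeholder: call your search API of choice and return a text snippet.
    raise NotImplementedError

def answer(question: str) -> str:
    context = web_search(question)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    completion = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=128, temperature=0
    )
    return completion.choices[0].text.strip()
```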