Unleashing Machine Learning For All 

How AWS created Inferentia and Trainium for powerful and cost-effective machine learning.
Photograph: Weiquan Lin

Machine learning (ML) is powering breakthroughs that impact every aspect of our lives: cars that drive themselves, voice assistants that hold conversations, and even software that writes messages and creates photorealistic images with only a text prompt. From legacy enterprises to startups in every industry, companies are using this fast-evolving technology to solve problems in new ways. 

This evolution, however, presents a challenge. As ML becomes increasingly sophisticated, models are growing exponentially more complex. Over the last few years, for instance, these models have gone from hundreds of millions of parameters to hundreds of billions. In addition, the high cost of training and deploying these increasingly complex ML models is locking out companies, especially smaller startups, and hindering innovation.

Until recently, Scott Lightner, the CTO and Co-Founder of Finch Computing, a software startup that makes human-generated text machine-readable, found himself in a similar predicament. Finch provides tools to analyze informational assets in various languages and deliver insights. Since its inception, Finch had been using ML infrastructure solutions from Amazon Web Services (AWS). Its products use natural language processing (NLP), a subset of ML algorithms that can understand the nuances of human language, including deciphering tone and intent. 

Scott and his team started with English and wanted to expand their products to more languages. But the high infrastructure costs needed to run these algorithms made this expansion impractical. 

Until they found AWS Inferentia, that is.

Democratizing Inference

In 2017, a team of AWS engineers was monitoring ML trends and noted that, with the rising cost of compute, customers could benefit from a powerful but less expensive alternative. As a result, AWS designed Inferentia to deliver high performance and lower cost for ML applications.

Creating Inferentia was a journey.

The AWS team had over a decade of experience designing and building silicon, such as the AWS Graviton Processor for general-purpose computing, and they believed they could build innovative silicon for ML workloads as well. The team decided to first tackle ML inference because it accounted for a majority of ML infrastructure costs, and a purpose-built accelerator could make a big difference in increasing performance, lowering costs, and reducing carbon footprint for large ML workloads. 

ML inference is the process of running new data points through existing models to generate a prediction, such as whether an expense is fraudulent or not. Depending on the application, there are requirements around how fast a model needs to generate a prediction (latency) and how many predictions need to be generated per second (throughput). The AWS engineers set out to design and build a chip that would deliver high performance on these metrics while also reducing costs and lowering power consumption.
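
To make those two metrics concrete, here is a minimal sketch in Python of how a team might measure them. The model object, its predict() method, and the batches are illustrative assumptions, not part of any AWS tooling.

    import time

    def measure_inference(model, batches, warmup=10):
        """Roughly estimate average latency per request and overall throughput."""
        # Warm-up requests so one-time costs (model loading, caching) don't skew the numbers.
        for batch in batches[:warmup]:
            model.predict(batch)

        latencies = []
        n_predictions = 0
        start = time.perf_counter()
        for batch in batches[warmup:]:
            t0 = time.perf_counter()
            model.predict(batch)
            latencies.append(time.perf_counter() - t0)   # seconds for this request
            n_predictions += len(batch)
        elapsed = time.perf_counter() - start

        avg_latency = sum(latencies) / len(latencies)    # average seconds per request
        throughput = n_predictions / elapsed             # predictions served per second
        return avg_latency, throughput

Purpose-built inference hardware aims to improve both numbers at once: each request completes faster, and more of them fit through the chip every second.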

Launched in 2019, AWS Inferentia, a chip purpose-built for ML inference, and the servers it powers delivered higher throughput and lower latency than comparable GPU-based servers, at significantly lower cost. By delivering powerful performance at a fraction of the cost, AWS Inferentia democratized ML inference, opening up new paths of innovation for customers. 

Finch’s Scott Lightner, for example, immediately realized Inferentia’s value. “Given the cost of GPUs, we simply couldn’t have offered our customers additional languages while keeping our product profitable,” he says. “AWS Inferentia changed that equation for us.” 

Finch migrated its compute-heavy models from GPUs to AWS Inferentia and reduced its inference costs by more than 80 percent. With this reduction in infrastructure costs, the company added support for three additional languages, which not only attracted new customers interested in gaining insights from those languages but also drew positive feedback from existing ones.

Accelerating Training 

For a model to be able to “infer,” or make a prediction after processing a new data point, it first needs to be trained to do so. For instance, to teach a model to identify a fraudulent transaction, you first have to feed it examples of valid and fraudulent transactions from which it can learn. This learning happens as the model tunes its own parameters, increasing its prediction accuracy with every data point. The more parameters a model has, the longer that tuning takes, and longer training times mean higher costs from the prolonged use of compute infrastructure.
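
To make “tuning parameters” concrete, here is a minimal training loop in Python/NumPy for the fraud example above. The synthetic data, learning rate, and model (a simple logistic regression) are illustrative assumptions, not how any production fraud model is built.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-in for labeled transactions: one feature vector per example,
    # labeled 1 for fraudulent and 0 for valid.
    X = rng.normal(size=(1000, 8))
    true_w = rng.normal(size=8)
    y = (X @ true_w + rng.normal(scale=0.5, size=1000) > 0).astype(float)

    w = np.zeros(8)   # the model's parameters, tuned a little more with every pass
    b = 0.0
    lr = 0.1

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for epoch in range(200):
        p = sigmoid(X @ w + b)             # current predictions
        grad_w = X.T @ (p - y) / len(y)    # direction in which each parameter should move
        grad_b = np.mean(p - y)
        w -= lr * grad_w                   # tune the parameters
        b -= lr * grad_b

    accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
    print(f"training accuracy: {accuracy:.2f}")

A model with eight parameters trains in milliseconds; a model with hundreds of billions of them runs the same kind of loop over enormous datasets and fleets of accelerators, which is where training time and cost explode.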

Costs are only part of the picture; longer training times also slow innovation. Engineering teams usually need to train, test, and validate different models over several iterations before they can pick the one that best serves their business purpose. If engineering teams can’t iterate fast enough, they can’t innovate fast enough. And this is the sort of thing that can keep startup founders up at night.

Jian Peng, CEO of Helixon, is familiar with this problem. Helixon builds next-generation AI solutions for protein-based therapeutics, developing AI tools that empower scientists to decipher protein function and interaction, interrogate large-scale genomic datasets for target identification, and design therapeutics such as antibodies and cell therapies. Today, Helixon parallelizes model training over many GPU-based servers, but even so, it takes the company weeks to train a single model. 

To help Helixon (and thousands of other customers) accelerate training, AWS applied the lessons it learned building Inferentia to create AWS Trainium, a chip purpose-built to accelerate ML training.

ML training is demanding, and sometimes needs more processing than any single server can handle. To distribute training across multiple servers, Trainium-based servers are deployed in EC2 UltraClusters, where tens of thousands of Trainium accelerators are connected with a petabit-scale, nonblocking network. As a result, customers can reduce training times for the most complex deep-learning models with hundreds of billions of parameters while also reducing infrastructure costs. 
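
At the framework level, the most common way to spread that work across many accelerators is data parallelism: each device trains on its own slice of the data, and gradients are averaged after every step. The sketch below shows the idea with PyTorch's DistributedDataParallel running on CPU; it illustrates the general technique, not the Neuron-specific tooling used on Trainium, and the toy model and launch command are assumptions.

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train():
        # Each worker joins a process group; in practice there is one worker per accelerator.
        dist.init_process_group(backend="gloo")   # "gloo" runs on CPU for illustration
        model = torch.nn.Linear(16, 1)
        ddp_model = DDP(model)                    # wraps the model so gradients sync across workers
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for step in range(100):
            x = torch.randn(32, 16)               # each worker sees its own shard of data
            y = torch.randn(32, 1)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()                        # gradients are all-reduced across workers here
            optimizer.step()

        if dist.get_rank() == 0:
            print("final loss:", loss.item())
        dist.destroy_process_group()

    if __name__ == "__main__":
        # Typically launched with: torchrun --nproc_per_node=4 this_script.py
        train()

The faster and less congested the network between workers, the less time each step spends waiting on that gradient exchange, which is why the UltraClusters' network matters as much as the chips themselves.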

Jian Peng says, “We are excited to utilize Amazon EC2 Trn1 instances [Trainium-powered servers] featuring the highest networking bandwidth available in AWS to improve the performance of our distributed training jobs and reduce our model training times, while also reducing our training costs.” 

AWS Trainium also features innovations that address challenges unique to machine learning. Stochastic rounding, for instance, rounds a value up or down at random, with the probability of each outcome set by how close the value is to its neighbors. Over a large number of computations this delivers better accuracy, because rounding errors average out rather than accumulate, but it is compute-intensive in software. To solve for this, Trainium adds hardware support for stochastic rounding, unlocking better accuracy and training times up to 20 percent faster for some models. 
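
Here is a minimal sketch of the idea in Python/NumPy, rounding to whole numbers for simplicity; on Trainium the same principle is applied in hardware to low-precision floating-point values.

    import numpy as np

    rng = np.random.default_rng(0)

    def stochastic_round(x):
        """Round down or up at random, with probability given by the fractional part."""
        floor = np.floor(x)
        frac = x - floor                       # e.g. 0.1 rounds up 10% of the time
        return floor + (rng.random(x.shape) < frac)

    # Summing many small values: round-to-nearest loses them all,
    # while stochastic rounding keeps the right total on average.
    values = np.full(10_000, 0.1)
    print(np.round(values).sum())              # 0.0 -- every 0.1 rounds down to 0
    print(stochastic_round(values).sum())      # roughly 1,000, the true sum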

In addition, AWS Trainium features several other hardware optimizations to accelerate mathematical operations commonly used in deep-learning training. And for developers who want to define and use their own operators, Trainium has embedded general-purpose processors deep inside its core. This unique feature allows customers to run their custom operators on the Trainium chip without having to move data back and forth to the CPU in the server. 
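
For readers unfamiliar with the term, a custom operator is simply an operation the framework doesn't ship with, defined by the developer along with its gradient. The sketch below shows what that looks like in PyTorch; it is purely an illustration of the concept, not the Neuron SDK's mechanism for placing such an operator on Trainium's embedded processors.

    import torch

    class LeakyClip(torch.autograd.Function):
        """An illustrative custom operator: clamp to [-1, 1], but keep a small gradient outside."""

        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return x.clamp(-1.0, 1.0)

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            inside = (x.abs() <= 1.0).to(grad_output.dtype)
            # Full gradient inside the clamp, a small "leaky" gradient outside it.
            return grad_output * (inside + 0.01 * (1.0 - inside))

    x = torch.randn(4, requires_grad=True)
    y = LeakyClip.apply(x)
    y.sum().backward()
    print(x.grad)    # 1.0 where |x| <= 1, 0.01 elsewhere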

Purpose-Built for End-to-End Machine Learning

While Inferentia and Trainium each offer unique capabilities, perhaps their greatest strength lies in how they work together, offering a range of innovations in machine learning. With just a few lines of code, developers can extract the full performance and benefits of these novel chips, regardless of the ML frameworks they use. Together, they provide an end-to-end solution for training and deploying ML models across a broad set of application types, such as speech recognition, recommendation, fraud detection, and image and video classification.
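
What “a few lines of code” looks like in practice depends on the framework and the Neuron SDK release; the sketch below is one plausible shape of it for PyTorch, where a standard model is traced once for the Neuron device and then used like any other module. The package name (torch_neuronx) and exact call are assumptions that should be checked against the current Neuron documentation.

    import torch
    import torchvision
    import torch_neuronx   # AWS Neuron SDK's PyTorch integration

    # A standard framework model; nothing accelerator-specific so far.
    model = torchvision.models.resnet50(weights=None).eval()
    example = torch.rand(1, 3, 224, 224)

    # Compile the model for the Neuron device, then save and use it like any TorchScript module.
    neuron_model = torch_neuronx.trace(model, example)
    torch.jit.save(neuron_model, "resnet50_neuron.pt")
    print(neuron_model(example).shape)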

Amazon’s own product search engine is an example of an application powered by several ML models. This search engine indexes billions of products, serves billions of customer queries daily, and is one of the most heavily used services in the world. 

“We build models for what we call shopping sessions,” says Trishul Chilimbi, Amazon’s Vice President and Distinguished Scientist who leads Search M5, a team within Amazon Search. “Trainium helps us train the models more efficiently, and then Inferentia allows us to deploy these models to production and run them in real time. We need both pieces of the story to provide a delightful shopping experience for our customers.”

Building advanced ML silicon is part of AWS’s overall commitment to providing customers with the best price performance for their workloads. “Making machine learning accessible for all has always been our goal,” says Dave Brown, VP of Amazon EC2. “Inferentia and Trainium are unlocking a lot of innovation for companies that didn’t have access to powerful and affordable ML hardware. I think we’ll start to see many companies using this technology and creating things we only dreamed of.”

Unlike GPUs, which are built and used for a wide range of applications, such as high-performance computing, graphics, and machine learning, Inferentia and Trainium were built by AWS for a single purpose: accelerating machine learning. 

And with this specialization, AWS is helping shape a future where increasingly powerful and sophisticated ML need not come with a hefty price tag.

This story was produced by WIRED Brand Lab for Amazon Web Services.