NVIDIA Enhances Llama 3.1 405B Performance with TensorRT Model Optimizer

Lawrence Jengar
Aug 29, 2024 16:10

NVIDIA’s TensorRT Model Optimizer significantly boosts the performance of Meta’s Llama 3.1 405B large language model on H200 GPUs.





Meta’s Llama 3.1 405B large language model (LLM) is reaching new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer, according to the NVIDIA Technical Blog. The enhancements have resulted in up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.

Outstanding Llama 3.1 405B Inference Throughput with TensorRT-LLM

TensorRT-LLM has already delivered remarkable inference throughput for Llama 3.1 405B since the model’s release. This was achieved through a variety of optimizations, including in-flight batching, KV caching, and optimized attention kernels. These techniques have accelerated inference performance while relying on lower-precision compute.
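
To make the workflow concrete, here is a minimal sketch of driving such an engine from Python with TensorRT-LLM’s high-level LLM API, which applies optimizations like in-flight batching and KV caching in the runtime. The model identifier and prompt are illustrative placeholders, and the exact API surface can vary between TensorRT-LLM releases.

    # Hedged sketch: text generation via TensorRT-LLM's high-level API.
    # In-flight batching and paged KV caching are handled by the runtime.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-405B-Instruct")  # illustrative checkpoint

    params = SamplingParams(temperature=0.8, max_tokens=128)
    outputs = llm.generate(["Explain KV caching in one sentence."], params)
    for output in outputs:
        print(output.outputs[0].text)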

TensorRT-LLM added support for the official Llama FP8 quantization recipe, which computes static and dynamic scaling factors to preserve maximum accuracy. Additionally, user-defined kernels such as the matrix multiplications from FBGEMM are optimized via plug-ins inserted into the network graph at compile time.

Boosting Performance Up to 1.44x with TensorRT Model Optimizer

NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available through the TensorRT Model Optimizer library, enhances Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. The recipe incorporates FP8 KV cache quantization and self-attention static quantization, reducing inference compute overhead.
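
As a rough illustration of what FP8 PTQ looks like with the Model Optimizer library (nvidia-modelopt), the sketch below follows the library’s published pattern of calibrating on a small dataset and then quantizing; the model name and calibration loader are assumptions, not the exact recipe NVIDIA used.

    # Hedged sketch: FP8 post-training quantization with
    # NVIDIA TensorRT Model Optimizer (pip install nvidia-modelopt).
    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-405B")  # illustrative

    def forward_loop(model):
        # Run a small calibration set through the model so static
        # scaling factors can be collected; calib_dataloader is assumed.
        for batch in calib_dataloader:
            model(**batch)

    # FP8_DEFAULT_CFG quantizes weights and activations to FP8; the
    # deployed recipe additionally quantizes the KV cache and self-attention.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)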

Table 1 shows the maximum throughput performance, with significant improvements across various input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs with 141 GB of HBM3e memory each and four NVLink Switches, providing 900 GB/s of GPU-to-GPU bandwidth.




Maximum Throughput Performance – Output Tokens/Second (8 NVIDIA H200 Tensor Core GPUs)

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8         463.1          320.1            71.5
Official Llama FP8 Recipe            399.9          230.8            49.6
Speedup                              1.16x          1.39x            1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B, NVIDIA internal measurements
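
The speedup row is simply the ratio of the two throughput figures; a quick check with the values copied from Table 1:

    # Verify Table 1 speedups: Model Optimizer FP8 vs. official FP8 recipe.
    modelopt_fp8 = [463.1, 320.1, 71.5]
    official_fp8 = [399.9, 230.8, 49.6]
    for opt, base in zip(modelopt_fp8, official_fp8):
        print(f"{opt / base:.2f}x")  # -> 1.16x, 1.39x, 1.44x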

Similarly, Table 2 presents the minimum latency performance using the same input and output sequence lengths.




Batch Size = 1 Performance – Output Tokens/Second (8 NVIDIA H200 Tensor Core GPUs)

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8         49.6           44.2             27.2
Official Llama FP8 Recipe            37.4           33.1             22.8
Speedup                              1.33x          1.33x            1.19x

Table 2. Minimum latency performance of Llama 3.1 405B, NVIDIA internal measurements

These results indicate that H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer deliver superior performance in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieved accuracy comparable to the official Llama 3.1 FP8 recipe on the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Fitting Llama 3.1 405B on Just Two H200 GPUs with INT4 AWQ

For developers with hardware resource constraints, the INT4 AWQ technique in TensorRT Model Optimizer compresses the model, allowing Llama 3.1 405B to fit on just two H200 GPUs. The method significantly reduces the required memory footprint by compressing the weights down to 4-bit integers while keeping activations in FP16.
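
Back-of-envelope arithmetic shows why two GPUs suffice, and the quantization call mirrors the FP8 sketch above; the config name comes from the Model Optimizer library, while the surrounding setup is assumed.

    # Weights-only memory estimate for INT4:
    # 405e9 params * 0.5 bytes/param ≈ 203 GB < 2 * 141 GB = 282 GB,
    # leaving headroom for FP16 activations and the KV cache.
    import modelopt.torch.quantization as mtq

    # INT4_AWQ_CFG applies activation-aware weight quantization:
    # 4-bit integer weights with higher-precision activations.
    model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)  # model/forward_loop as above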

Tables 4 and 5 show the maximum throughput and minimum latency performance measurements; the INT4 AWQ method also delivers accuracy scores comparable to Meta’s official Llama 3.1 FP8 recipe.




Maximum Throughput Performance – Output Tokens/Second (2 NVIDIA H200 Tensor Core GPUs)

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    75.6           28.7             16.2

Table 4. Maximum throughput performance of Llama 3.1 405B, NVIDIA internal measurements




Batch Size = 1 Performance – Output Tokens/Second (2 NVIDIA H200 Tensor Core GPUs)

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    21.6           18.7             12.8

Table 5. Minimum latency performance of Llama 3.1 405B, NVIDIA internal measurements

NVIDIA’s advancements in TensorRT Model Optimizer and TensorRT-LLM are paving the way for improved performance and efficiency when running large language models like Llama 3.1 405B. These improvements offer developers greater flexibility and cost-efficiency, whether they have extensive hardware resources or more constrained environments.

Image source: Shutterstock


