
NVIDIA Introduces Nemotron-CC: A Massive Dataset for LLM Pretraining

Iris Coleman
Jan 10, 2025 14:13

NVIDIA debuts Nemotron-CC, a 6.3-trillion-token English dataset, enhancing pretraining for large language models with innovative data curation methods.





NVIDIA has announced the release of Nemotron-CC, a 6.3-trillion-token English language dataset designed to advance the pretraining of large language models (LLMs). The dataset, derived from Common Crawl, aims to raise the accuracy and efficiency of LLMs through innovative data curation techniques, including the use of 1.9 trillion tokens of synthetically generated data, according to NVIDIA.
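
For a sense of scale, the two figures above imply how much of the dataset is real web text and how much is synthetic. The short Python sketch below only does that arithmetic; the variable names are illustrative, and the counts are the rounded figures quoted in this article, not exact token counts.

```python
# Rounded figures quoted in the article, expressed in billions of tokens.
TOTAL_TOKENS_B = 6300      # 6.3 trillion tokens in the full dataset
SYNTHETIC_TOKENS_B = 1900  # 1.9 trillion synthetically generated tokens

# Whatever is not synthetic is treated here as real (web-derived) text.
real_tokens_b = TOTAL_TOKENS_B - SYNTHETIC_TOKENS_B

# Share of the dataset that is synthetic, as a fraction of the total.
synthetic_share = SYNTHETIC_TOKENS_B / TOTAL_TOKENS_B

print(f"Real tokens: about {real_tokens_b / 1000:.1f} trillion")
print(f"Synthetic share: about {synthetic_share:.1%}")
```

On these rounded numbers, roughly 4.4 trillion tokens are real text and about 30% of the dataset is synthetic.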

Enhancing LLM Pretraining

NVIDIA’s initiative addresses a critical need in LLM training, where the quality of pretraining datasets plays a pivotal role. While recent models like Meta’s Llama series have been trained on datasets comprising up to 15 trillion tokens, the exact composition of those datasets remains largely undisclosed. Nemotron-CC seeks to fill this gap by providing the wider community with a high-quality dataset capable of supporting both short and long token horizon training.

Traditional datasets often sacrifice up to 90% of their data to improve benchmark accuracy, which limits their usefulness for extended training. Nemotron-CC, however, demonstrates how to transform Common Crawl data into a superior dataset, surpassing even the Llama 3.1 8B model through advanced methods such as classifier ensembling and synthetic data rephrasing.

Significant Results

Nemotron-CC’s efficacy is evidenced by its performance on various benchmarks. When training 8B parameter models for one trillion tokens, the high-quality subset Nemotron-CC-HQ outperforms leading datasets like DCLM, increasing MMLU scores by 5.6 points. Furthermore, the complete 6.3-trillion-token dataset matches DCLM on MMLU while offering four times more unique real tokens. This enables effective training over long token horizons, with Nemotron-CC-trained models surpassing Llama 3.1 8B on multiple metrics, including a 5-point increase in MMLU and a 3.1-point rise in ARC-Challenge scores.

Innovative Data Curation Techniques

The development of Nemotron-CC involved several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader array of high-quality tokens. Additionally, rephrasing techniques reduced noise and errors, yielding diverse and valuable data variants. The decision to disable traditional heuristic filters further boosted the dataset’s quality without compromising accuracy.
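
To illustrate why an ensemble can select more high-quality text than any single classifier, here is a minimal Python sketch. The article does not say how the classifier scores are combined, so the sketch assumes one plausible rule: keep a document if its maximum score across classifiers clears a threshold. The two scorers (`length_scorer`, `vocabulary_scorer`) and the threshold are toy, hypothetical stand-ins, not NVIDIA's model-based classifiers.

```python
from typing import Callable, List

Scorer = Callable[[str], float]


def length_scorer(text: str) -> float:
    """Toy stand-in classifier: longer documents score higher, capped at 1.0."""
    return min(len(text.split()) / 50.0, 1.0)


def vocabulary_scorer(text: str) -> float:
    """Toy stand-in classifier: fraction of distinct words in the document."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(text: str, scorers: List[Scorer]) -> float:
    """Assumed combination rule: the highest score any classifier assigns."""
    return max(scorer(text) for scorer in scorers)


def select_high_quality(docs: List[str], scorers: List[Scorer],
                        threshold: float = 0.8) -> List[str]:
    """Keep documents whose ensemble score reaches the threshold."""
    return [doc for doc in docs if ensemble_score(doc, scorers) >= threshold]


doc_long = "data " * 60                                  # long but repetitive
doc_varied = "curated web text improves model accuracy"  # short but varied
doc_spam = "spam spam spam spam"                         # short and repetitive
docs = [doc_long, doc_varied, doc_spam]

only_length = select_high_quality(docs, [length_scorer])
ensembled = select_high_quality(docs, [length_scorer, vocabulary_scorer])

print(len(only_length), "document kept by one classifier")
print(len(ensembled), "documents kept by the ensemble")
```

In this toy setup each classifier on its own keeps only the document that matches its notion of quality, while the ensemble keeps both and still rejects the one that neither rates highly.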

NVIDIA used its NeMo Curator tool to extract and refine data from Common Crawl, applying filters for language, deduplication, and quality classification. This process was complemented by synthetic data generation, which contributed approximately two trillion tokens to the dataset.
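
The three filtering stages named above can be sketched as a small pipeline. This is plain Python and does not use the NeMo Curator API. The helpers `looks_english` and `quality_score` are hypothetical, deliberately crude placeholders for a real language identifier and a real quality classifier, and deduplication is assumed to be exact matching on normalized text.

```python
import hashlib
from typing import Iterable, Iterator, Set


def looks_english(text: str) -> bool:
    """Hypothetical language filter: share of ASCII letters among non-space characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    ascii_letters = sum(1 for c in chars if c.isascii() and c.isalpha())
    return ascii_letters / len(chars) >= 0.8


def quality_score(text: str) -> float:
    """Hypothetical quality classifier: fraction of distinct words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def curate(docs: Iterable[str], min_quality: float = 0.5) -> Iterator[str]:
    """Apply the language filter, exact deduplication, then the quality filter."""
    seen: Set[str] = set()
    for doc in docs:
        if not looks_english(doc):
            continue
        # Normalize case and whitespace so trivially different copies collide.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if quality_score(doc) < min_quality:
            continue
        yield doc


raw_docs = [
    "Common Crawl is a public archive of web pages.",
    "common crawl is a public  archive of web pages.",  # duplicate after normalization
    "これは日本語の文章です。",                              # not English
    "buy now buy now buy now buy now",                   # repetitive, low quality
]
curated = list(curate(raw_docs))
print(curated)
```

Only the first document survives all three stages here. A production pipeline would replace each placeholder with a trained model and typically add fuzzy deduplication as well.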

Future Prospects

Nemotron-CC is positioned as a vital resource for pretraining state-of-the-art LLMs over varying token horizons. NVIDIA plans to expand its offerings by releasing more specialized datasets, including ones focused on specific domains like mathematics, to further enhance LLM capabilities.

Image source: Shutterstock


