JetMoE: Reaching Llama2 Performance with 0.1M Dollars

11 April 2024 · arXiv:2404.07413
Yikang Shen, Zhen Guo, Tianle Cai, Zengyi Qin
MoE · ALM

Papers citing "JetMoE: Reaching Llama2 Performance with 0.1M Dollars"

21 papers shown

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
MoE · VLM
96 · 1 · 0 · 01 May 2025

Continual Pre-training of MoEs: How robust is your router?
Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Shri Kiran Srinivasan, Eugene Belilovsky, Irina Rish
MoE
75 · 0 · 0 · 06 Mar 2025

CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
Jiashun Suo, Xiaojian Liao, Limin Xiao, Li Ruan, Jinquan Wang, Xiao Su, Zhisheng Huo
67 · 0 · 0 · 04 Mar 2025

Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
55 · 5 · 0 · 21 Feb 2025

MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
MoE
32 · 4 · 0 · 09 Oct 2024

Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices
Andres Potapczynski, Shikai Qiu, Marc Finzi, Christopher Ferri, Zixi Chen, Micah Goldblum, Bayan Bruss, Christopher De Sa, Andrew Gordon Wilson
39 · 1 · 0 · 03 Oct 2024

On-Device Language Models: A Comprehensive Review
Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling
44 · 27 · 0 · 26 Aug 2024

Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda
31 · 11 · 0 · 23 Aug 2024

BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, ..., Jakob N. Foerster, Phil Blunsom, Sebastian Ruder, A. Ustun, Acyr F. Locatelli
MoMe · MoE
50 · 5 · 0 · 15 Aug 2024

Compact Language Models via Pruning and Knowledge Distillation
Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, M. Patwary, M. Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
SyDa · MQ
39 · 37 · 0 · 19 Jul 2024

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu
MoE · ALM
40 · 2 · 0 · 02 Jul 2024

GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory
Haoze Wu, Zihan Qiu, Zili Wang, Hang Zhao, Jie Fu
MoE
45 · 3 · 0 · 18 Jun 2024

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro von Werra, Martin Jaggi
38 · 34 · 0 · 28 May 2024

Zamba: A Compact 7B SSM Hybrid Model
Paolo Glorioso, Quentin G. Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge
30 · 45 · 0 · 26 May 2024

Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study
Chi Ma, Mincong Huang, Chao Wang, Yujie Wang, Lei Yu
29 · 2 · 0 · 15 May 2024

Granite Code Models: A Family of Open Foundation Models for Code Intelligence
Mayank Mishra, Matt Stallone, Gaoyuan Zhang, Yikang Shen, Aditya Prasad, ..., Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri, Rameswar Panda
AI4TS
56 · 55 · 0 · 07 May 2024

Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis
MoE
51 · 15 · 0 · 06 May 2024

More Compute Is What You Need
Zhen Guo
56 · 0 · 0 · 30 Apr 2024

Instruction Tuning with GPT-4
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, Jianfeng Gao
SyDa · ALM · LM&MA
159 · 579 · 0 · 06 Apr 2023

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
AIMat
253 · 1,996 · 0 · 31 Dec 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
MoE
245 · 1,821 · 0 · 17 Sep 2019