JetMoE: Reaching Llama2 Performance with 0.1M Dollars

11 April 2024 · arXiv:2404.07413
Yikang Shen, Zhen Guo, Tianle Cai, Zengyi Qin
MoE · ALM

Papers citing "JetMoE: Reaching Llama2 Performance with 0.1M Dollars"

21 papers shown

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
MoE · VLM
96 · 1 · 0 · 01 May 2025

Continual Pre-training of MoEs: How robust is your router?
Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Shri Kiran Srinivasan, Eugene Belilovsky, Irina Rish
MoE
75 · 0 · 0 · 06 Mar 2025

CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
Jiashun Suo, Xiaojian Liao, Limin Xiao, Li Ruan, Jinquan Wang, Xiao Su, Zhisheng Huo
67 · 0 · 0 · 04 Mar 2025

Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
55 · 5 · 0 · 21 Feb 2025

MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
MoE
32 · 4 · 0 · 09 Oct 2024

Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices
Andres Potapczynski, Shikai Qiu, Marc Finzi, Christopher Ferri, Zixi Chen, Micah Goldblum, Bayan Bruss, Christopher De Sa, Andrew Gordon Wilson
39 · 1 · 0 · 03 Oct 2024

On-Device Language Models: A Comprehensive Review
Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling
44 · 27 · 0 · 26 Aug 2024

Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda
31 · 11 · 0 · 23 Aug 2024

BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, ..., Jakob N. Foerster, Phil Blunsom, Sebastian Ruder, A. Ustun, Acyr F. Locatelli
MoMe · MoE
50 · 5 · 0 · 15 Aug 2024

Compact Language Models via Pruning and Knowledge Distillation
Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, M. Patwary, M. Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
SyDa · MQ
39 · 37 · 0 · 19 Jul 2024

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu
MoE · ALM
40 · 2 · 0 · 02 Jul 2024

GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory
Haoze Wu, Zihan Qiu, Zili Wang, Hang Zhao, Jie Fu
MoE
45 · 3 · 0 · 18 Jun 2024

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro von Werra, Martin Jaggi
38 · 34 · 0 · 28 May 2024

Zamba: A Compact 7B SSM Hybrid Model
Paolo Glorioso, Quentin G. Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge
30 · 45 · 0 · 26 May 2024

Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study
Chi Ma, Mincong Huang, Chao Wang, Yujie Wang, Lei Yu
29 · 2 · 0 · 15 May 2024

Granite Code Models: A Family of Open Foundation Models for Code Intelligence
Mayank Mishra, Matt Stallone, Gaoyuan Zhang, Yikang Shen, Aditya Prasad, ..., Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri, Rameswar Panda
AI4TS
56 · 55 · 0 · 07 May 2024

Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis
MoE
51 · 15 · 0 · 06 May 2024

More Compute Is What You Need
Zhen Guo
56 · 0 · 0 · 30 Apr 2024

Instruction Tuning with GPT-4
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, Jianfeng Gao
SyDa · ALM · LM&MA
159 · 579 · 0 · 06 Apr 2023

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
AIMat
253 · 1,996 · 0 · 31 Dec 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
MoE
245 · 1,821 · 0 · 17 Sep 2019