ResearchTrend.AI
Attention as a Hypernetwork

9 June 2024
Simon Schug, Seijin Kobayashi, Yassir Akram, João Sacramento, Razvan Pascanu
Topics: GNN

Papers citing "Attention as a Hypernetwork"

23 papers

Function Vectors in Large Language Models
  Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau (23 Oct 2023)
Faith and Fate: Limits of Transformers on Compositionality [ReLM, LRM]
  Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, ..., Sean Welleck, Xiang Ren, Allyson Ettinger, Zaïd Harchaoui, Yejin Choi (29 May 2023)
How Do In-Context Examples Affect Compositional Generalization?
  Shengnan An, Zeqi Lin, Qiang Fu, B. Chen, Nanning Zheng, Jian-Guang Lou, Dongmei Zhang (08 May 2023)
Measuring and Narrowing the Compositionality Gap in Language Models [ReLM, KELM, LRM]
  Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, M. Lewis (07 Oct 2022)
In-context Learning and Induction Heads
  Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah (24 Sep 2022)
Is a Modular Architecture Enough?
  Sarthak Mittal, Yoshua Bengio, Guillaume Lajoie (06 Jun 2022)
Dynamic Inference with Neural Interpreters
  Nasim Rahaman, Muhammad Waleed Gondal, S. Joshi, Peter V. Gehler, Yoshua Bengio, Francesco Locatello, Bernhard Schölkopf (12 Oct 2021)
Multi-head or Single-head? An Empirical Comparison for Transformer Training
  Liyuan Liu, Jialu Liu, Jiawei Han (17 Jun 2021)
RoFormer: Enhanced Transformer with Rotary Position Embedding
  Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu (20 Apr 2021)
Linear Transformers Are Secretly Fast Weight Programmers
  Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber (22 Feb 2021)
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
  Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret (29 Jun 2020)
Stratified Rule-Aware Network for Abstract Visual Reasoning
  Sheng Hu, Yuqing Ma, Xianglong Liu, Yanlu Wei, Shihao Bai (17 Feb 2020)
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [AIMat]
  Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu (23 Oct 2019)
Are Sixteen Heads Really Better than One? [MoE]
  Paul Michel, Omer Levy, Graham Neubig (25 May 2019)
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
  Elena Voita, David Talbot, F. Moiseev, Rico Sennrich, Ivan Titov (23 May 2019)
Modular Networks: Learning to Decompose Neural Computation
  Louis Kirsch, Julius Kunze, David Barber (13 Nov 2018)
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
  Taku Kudo, John Richardson (19 Aug 2018)
Relational inductive biases, deep learning, and graph networks [AI4CE, NAI]
  Peter W. Battaglia, Jessica B. Hamrick, V. Bapst, Alvaro Sanchez-Gonzalez, V. Zambaldi, ..., Pushmeet Kohli, M. Botvinick, Oriol Vinyals, Yujia Li, Razvan Pascanu (04 Jun 2018)
Routing Networks: Adaptive Selection of Non-linear Functions for Multi-Task Learning
  Clemens Rosenbaum, Tim Klinger, Matthew D. Riemer (03 Nov 2017)
Graph Attention Networks [GNN]
  Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio (30 Oct 2017)
Pointer Sentinel Mixture Models [RALM]
  Stephen Merity, Caiming Xiong, James Bradbury, R. Socher (26 Sep 2016)
SGDR: Stochastic Gradient Descent with Warm Restarts [ODL]
  I. Loshchilov, Frank Hutter (13 Aug 2016)
Neural Machine Translation of Rare Words with Subword Units
  Rico Sennrich, Barry Haddow, Alexandra Birch (31 Aug 2015)