Pix2seq: A Language Modeling Framework for Object Detection

22 September 2021
Ting Chen
Saurabh Saxena
Lala Li
David J. Fleet
Geoffrey E. Hinton
    MLLM
    ViT
    VLM

Papers citing "Pix2seq: A Language Modeling Framework for Object Detection"

26 / 76 papers shown
A Generalist Framework for Panoptic Segmentation of Images and Videos
Ting Chen
Lala Li
Saurabh Saxena
Geoffrey E. Hinton
David J. Fleet
VGen
MLLM
43
102
0
12 Oct 2022
Machine Translation between Spoken Languages and Signed Languages Represented in SignWriting
Zifan Jiang
Amit Moryossef
Mathias Müller
Sarah Ebling
18
22
0
11 Oct 2022
VIMA: General Robot Manipulation with Multimodal Prompts
Yunfan Jiang
Agrim Gupta
Zichen Zhang
Guanzhi Wang
Yongqiang Dou
Yanjun Chen
Li Fei-Fei
Anima Anandkumar
Yuke Zhu
Linxi Fan
LM&Ro
28
335
0
06 Oct 2022
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Mohit Shridhar
Lucas Manuelli
D. Fox
LM&Ro
163
457
0
12 Sep 2022
Visual Recognition by Request
Chufeng Tang
Lingxi Xie
Xiaopeng Zhang
Xiaolin Hu
Qi Tian
VLM
16
15
0
28 Jul 2022
EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
Jiangning Zhang
Xiangtai Li
Yabiao Wang
Chengjie Wang
Yibo Yang
Yong Liu
Dacheng Tao
ViT
34
32
0
19 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
53
392
0
17 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
27
124
0
15 Jun 2022
QASem Parsing: Text-to-text Modeling of QA-based Semantics
Ayal Klein
Eran Hirsch
Ron Eliav
Valentina Pyatkin
Avi Caciularu
Ido Dagan
36
12
0
23 May 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM
MLLM
26
38
0
23 May 2022
A Generalist Agent
Scott E. Reed
Konrad Zolna
Emilio Parisotto
Sergio Gomez Colmenarejo
Alexander Novikov
...
Yutian Chen
R. Hadsell
Oriol Vinyals
Mahyar Bordbar
Nando de Freitas
LM&Ro
LLMAG
AI4CE
56
787
0
12 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
46
3,334
0
29 Apr 2022
Transformers Meet Visual Learning Understanding: A Comprehensive Review
Yuting Yang
Licheng Jiao
Xuantong Liu
F. Liu
Shuyuan Yang
Zhixi Feng
Xu Tang
ViT
MedIm
27
28
0
24 Mar 2022
InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding
Hanrong Ye
Dan Xu
ViT
21
84
0
15 Mar 2022
Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking
Boyu Chen
Peixia Li
Lei Bai
Leixian Qiao
Qiuhong Shen
Bo-wen Li
Weihao Gan
Wei Wu
Wanli Ouyang
ViT
VOT
22
182
0
10 Mar 2022
DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting
Seonghyeon Kim
Seung Shin
Yoonsik Kim
Han-Cheol Cho
Taeho Kil
Jaeheung Surh
Seunghyun Park
Bado Lee
Youngmin Baek
20
8
0
10 Mar 2022
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising
Feng Li
Hao Zhang
Shi-guang Liu
Jian Guo
L. Ni
Lei Zhang
ViT
44
647
0
02 Mar 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
53
850
0
07 Feb 2022
WebUAV-3M: A Benchmark for Unveiling the Power of Million-Scale Deep UAV Tracking
Chunhui Zhang
Guanjie Huang
Li Liu
Shan Huang
Yinan Yang
Xiang Wan
Shiming Ge
Dacheng Tao
36
23
0
19 Jan 2022
SPTS: Single-Point Text Spotting
Dezhi Peng
Xinyu Wang
Yuliang Liu
Jiaxin Zhang
Mingxin Huang
...
Jing Li
Dahua Lin
Chunhua Shen
Xiang Bai
Lianwen Jin
ViT
30
63
0
15 Dec 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
27
111
0
23 Nov 2021
A Survey of Visual Transformers
Yang Liu
Yao Zhang
Yixin Wang
Feng Hou
Jin Yuan
Jiang Tian
Yang Zhang
Zhongchao Shi
Jianping Fan
Zhiqiang He
3DGS
ViT
77
330
0
11 Nov 2021
Contrastive Proposal Extension with LSTM Network for Weakly Supervised Object Detection
Pei Lv
Suqi Hu
Tianran Hao
27
5
0
14 Oct 2021
Zero-Shot Text-to-Image Generation
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
255
4,781
0
24 Feb 2021
Transformers in Vision: A Survey
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
F. Khan
M. Shah
ViT
227
2,430
0
04 Jan 2021
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
Golnaz Ghiasi
Huayu Chen
A. Srinivas
Rui Qian
Nayeon Lee
E. D. Cubuk
Quoc V. Le
Barret Zoph
ISeg
252
969
0
13 Dec 2020