Table 1: Perplexity of 355-million-parameter models trained for 10 billion tokens on the Pile.

Yet some sub-quadratic gated convolutions match attention on the non-AR slice! Can we capture the strengths of both gated convolutions and attention in one purely sub-quadratic architecture? We find the AR gap arises because gated-convolution models (e.g., Hyena, H3, RWKV, RetNet) need model dimension that scales with sequence length to solve associative recall, while attention can solve it with model dimension independent of sequence length.
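To make the associative recall (AR) task concrete, here is a minimal sketch in Python of how one such example can be constructed: a sequence of key-value pairs followed by a query key, where the target is that key's value. The token format and the helper name `make_ar_example` are illustrative assumptions, not the exact benchmark setup.

```python
import random

def make_ar_example(num_pairs=4, keys_vocab="abcdefgh", values_vocab="01234567"):
    """Build one associative-recall (AR) example: a sequence of key-value
    pairs followed by a query key; the target is that key's value."""
    keys = random.sample(list(keys_vocab), num_pairs)
    kv = {k: random.choice(values_vocab) for k in keys}
    # Interleave keys and values, e.g. ["a", "3", "f", "0", "c", "7", ...]
    context = [tok for k in keys for tok in (k, kv[k])]
    query = random.choice(keys)
    return context + [query], kv[query]

seq, target = make_ar_example()
print(" ".join(seq), "->", target)  # e.g. "a 3 f 0 c 7 h 2 c -> 7"
```

Attention can answer the query by comparing it directly against every earlier key, regardless of how far back the matching pair appeared; a fixed-size convolutional or recurrent state must instead carry all the pairs forward, which is one intuition for why the required model dimension grows with sequence length.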