Ulme Wennberg <firstname.lastname@example.org>
Kungliga Tekniska högskolan
2022-02-14 – 2022-09-01
Original self-attention, as proposed by Vaswani et al., is quadratic in sequence length. This puts a practical cap on the sequence lengths that can be used, as it makes training large language models on long sequences cumbersome.
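The quadratic cost comes from materializing the full n-by-n score matrix. A minimal NumPy sketch of standard scaled dot-product attention makes this explicit (all names here are illustrative, not from Pruneformer):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V have shape (n, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (n, n): memory and time grow as n^2
    return softmax(scores, axis=-1) @ V  # (n, d)

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = standard_attention(Q, K, V)
print(out.shape)  # (512, 64)
```

Doubling n quadruples the size of `scores`, which is what caps usable sequence lengths in practice.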
Pruneformer is, to our knowledge, the first transformer variant that simultaneously models both local and global attention patterns while keeping memory usage linear in sequence length and its parameter count independent of sequence length.
The idea is to combine two approaches:
- O(n) self-attention
- length extrapolation, i.e. generalizing to sequences longer than those seen during training (something almost all language models fail at badly)
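For the first ingredient, one well-known way to get O(n) self-attention is kernelized linear attention (e.g. the elu(x)+1 feature map of Katharopoulos et al.): replacing the softmax with a positive feature map lets the matrix product be reassociated so the n-by-n score matrix is never formed. This is a sketch of that general technique, not necessarily Pruneformer's specific mechanism:

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: keeps features strictly positive, a common choice
    # for kernelized linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Linear-time attention; Q, K, V have shape (n, d)."""
    Qp, Kp = feature_map(Q), feature_map(K)
    # Reassociate (Qp @ Kp.T) @ V as Qp @ (Kp.T @ V):
    # memory is O(n*d + d^2) instead of O(n^2).
    KV = Kp.T @ V                 # (d, d) summary of keys and values
    Z = Qp @ Kp.sum(axis=0)       # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]  # (n, d)
```

For positive feature maps this reassociated form is exactly equal to the quadratic computation with kernel scores, so the saving is purely in how the product is evaluated.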