Swin Transformer V2: Scaling Up Capacity and Resolution
We present techniques for scaling Swin Transformer [35] up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution. By scaling up c...
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concat...
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal c...