
1. Introduction 
Pretrained backbones with fine-tuning have been widely 
applied to various 2D vision and NLP tasks [13, 2, 10, 3], 
where a backbone network pretrained on a large dataset is 
combined with a task-specific back-end and then fine-tuned 
for different downstream tasks. This approach delivers 
superior performance and greatly reduces the workload of 
network design and training, as well as the amount of 
labeled data required for different vision tasks. 

*Interns at Microsoft Research Asia. †Contact person. 

In this work, we present a pretrained 3D backbone, named 
SWIN3D, for 3D indoor scene understanding tasks. Our 
method represents the 3D point cloud of an input 3D scene as 
sparse voxels in 3D space and adapts the Swin Transformer 
[30] designed for regular 2D images to unorganized 3D 
points as the 3D backbone. We analyze the key issues that 
prevent the naïve 3D extension of Swin Transformer from 
exploring large models and achieving high performance, 
namely its high memory complexity and its neglect of signal 
irregularity. Based on our analysis, we develop a novel 
3D self-attention operator that computes the self-attention 
of sparse voxels within each local window. This operator 
reduces the memory cost of self-attention from quadratic to 
linear with respect to the number of sparse voxels within a 
window, computes efficiently, and enhances self-attention by 
capturing various signal irregularities through our 
generalized contextual relative positional embedding [48, 26]. 
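As a rough illustration of the sparse-voxel and local-window 
representation described above, the following Python sketch 
quantizes a toy point cloud into occupied voxels and groups 
them into non-overlapping cubic windows. The function names 
(voxelize, group_by_window), the voxel size, and the window 
size are assumptions chosen for illustration and are not 
taken from the SWIN3D implementation. The sketch does not 
implement the memory-efficient attention operator itself. It 
only shows the data layout, plus the size that a dense 
per-window attention matrix would have. 

```python
from collections import defaultdict


def voxelize(points, voxel_size):
    """Map each 3D point to an integer voxel index; only occupied voxels are stored."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)


def group_by_window(voxel_keys, window_size):
    """Group occupied voxels into non-overlapping windows of window_size^3 cells."""
    windows = defaultdict(list)
    for i, j, k in voxel_keys:
        window = (i // window_size, j // window_size, k // window_size)
        windows[window].append((i, j, k))
    return dict(windows)


# Toy point cloud; coordinates and sizes are arbitrary illustrative values.
points = [
    (0.1, 0.2, 0.3),
    (0.4, 0.1, 0.2),   # falls into the same voxel as the first point
    (1.2, 0.1, 0.0),
    (2.6, 0.1, 0.0),
    (13.2, 9.1, 1.2),
]

voxels = voxelize(points, voxel_size=0.5)
windows = group_by_window(voxels.keys(), window_size=5)

# A dense attention matrix over a window with n occupied voxels has n * n
# entries; this is the quadratic per-window cost the text refers to.
dense_entries = {w: len(v) ** 2 for w, v in windows.items()}
```

Because only occupied voxels are stored, the number of voxels 
per window varies from window to window, which is one way the 
sparse 3D setting differs from regular 2D image windows. 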
The novel design of our SWIN3D backbone enables us to 
scale up the backbone model and the amount of data used 
for pretraining. To this end, we pretrained a large SWIN3D 
model with 60M parameters via a 3D semantic segmentation 
task over a synthetic 3D indoor scene dataset [60] that 
includes 21K rooms and is about ten times larger than the 
ScanNet dataset. After pretraining, we cascade the pretrained 
SWIN3D backbone with task-specific back-end decoders 
and fine-tune the models for various downstream 3D indoor 
scene understanding tasks.
 
                