Motivation
Based on the observation that stuff categories usually share similar appearances across images from different domains, while things (i.e., object instances) exhibit much larger appearance differences, we propose to improve semantic-level alignment with different strategies for stuff regions and for things:
1) for the stuff categories, we generate feature representation for each class and conduct the alignment operation from the target domain to the source domain;
2) for the thing categories, we generate feature representation for each individual instance and encourage the instance in the target domain to align with the most similar one in the source domain.
Method
Stuff and instance matching (SIM)
Stuff Matching (SM)
First, we discuss the matching process for the background (stuff) classes such as road, sidewalk, and sky. These classes usually cover a large area of the image and show little appearance variation, so we extract only an image-level stuff feature representation for them.
For each source-domain image, we obtain the correctly classified label map by keeping only the predicted labels that match the ground-truth labels:
$$L_{P_i}^s=\mathop{argmax}\limits_{k \in N}(C(f_i^s)^{(k)})$$
$$L_{c_i}^s=L_{G_i}^s \cap L_{P_i}^s$$
where $L_{c_i}^s$ is the correctly classified label map, $L_{G_i}^s$ is the ground truth label map, $L_{P_i}^s$ is the predicted label map, and $i \in \{1, \dots, |X^S|\}$.
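To make the masking step concrete, here is a minimal PyTorch-style sketch (the function name and the `ignore_index` convention are illustrative assumptions, not from the paper):

```python
import torch

def correctly_classified_label_map(logits, gt_labels, ignore_index=255):
    """Keep ground-truth labels only where the prediction agrees with them.

    logits:    (N, K, H, W) classifier outputs C(f_i^s)
    gt_labels: (N, H, W)    ground-truth label map L_G^s
    returns:   (N, H, W)    correct map L_c^s; mismatches set to ignore_index
    """
    pred = logits.argmax(dim=1)          # L_P^s = argmax_k C(f_i^s)^(k)
    correct = pred.eq(gt_labels)         # intersection of L_G^s and L_P^s
    return torch.where(correct, gt_labels,
                       torch.full_like(gt_labels, ignore_index))
```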
Instance Matching (IM)
Second, we discuss the instance matching process for the foreground (thing) classes such as car and person. Because the ground truth does not provide instance-level annotations, we generate foreground instance masks by finding the disconnected regions of each foreground class in the label map $L$. This coarsely segments each intra-class semantic region into multiple instances, so several instance-level feature representations can be generated from one image, as sketched below.
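A small sketch of this pseudo-instance generation using connected-component labeling from SciPy (the function name and `thing_classes` argument are illustrative assumptions):

```python
from scipy import ndimage

def instance_masks(label_map, thing_classes):
    """Split each foreground class into pseudo-instances by finding
    the disconnected regions of that class in the label map L.

    label_map:     (H, W) integer label map
    thing_classes: iterable of foreground class ids (e.g. car, person)
    yields:        (class_id, boolean instance mask) pairs
    """
    for cls in thing_classes:
        # connected-component labeling of this class's binary mask
        components, n = ndimage.label(label_map == cls)
        for inst in range(1, n + 1):
            yield cls, components == inst
```

Each boolean mask can then be used to pool the feature map over that region, yielding one instance-level representation per disconnected component.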
Loss Function
We follow a two-step training procedure to improve the performance of the generator $G$ on the semantic segmentation task for the target-domain dataset. First, we train our model without the self-supervised learning module and optimize the following objective with $G$ and $D$ in an adversarial training strategy:
$$\mathop{min}\limits_{G,D} \mathcal{L}_{step1} = \mathop{min}\limits_{G}\big(\lambda_{seg}\mathcal{L}_{seg} + \lambda_{adv}\mathop{max}\limits_{D}\mathcal{L}_{adv}\big)$$
where $\mathcal{L}_{seg}$ is the supervised segmentation loss on the source domain, $\mathcal{L}_{adv}$ is the adversarial alignment loss, and $\lambda_{seg}$, $\lambda_{adv}$ are the corresponding weights.
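As a rough illustration of this adversarial alternation (not the paper's exact losses or hyperparameters; $D$ is assumed to be an output-space discriminator producing one logit map per prediction, and the $\lambda$ values are placeholders):

```python
import torch
import torch.nn.functional as F

def step1_iteration(G, D, opt_G, opt_D, src_img, src_lbl, tgt_img,
                    lambda_seg=1.0, lambda_adv=0.001):
    """One schematic step-1 iteration: supervised segmentation on the
    source domain plus output-space adversarial alignment on the target."""
    # --- update G: segmentation loss + fooling the discriminator ---
    opt_G.zero_grad()
    src_out = G(src_img)                                  # (N, K, H, W)
    seg_loss = F.cross_entropy(src_out, src_lbl, ignore_index=255)
    tgt_out = G(tgt_img)
    d_tgt = D(F.softmax(tgt_out, dim=1))
    # target predictions should look source-like to D (source label = 1)
    adv_loss = F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt))
    (lambda_seg * seg_loss + lambda_adv * adv_loss).backward()
    opt_G.step()

    # --- update D: separate source predictions from target predictions ---
    opt_D.zero_grad()
    d_src = D(F.softmax(src_out.detach(), dim=1))
    d_tgt = D(F.softmax(tgt_out.detach(), dim=1))
    d_loss = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
              + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    d_loss.backward()
    opt_D.step()
```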