We introduce a framework for learning to segment objects from a collection of
images without any human annotation. Our method
builds on the observation that the location of object segments can be perturbed
locally relative to a given background without affecting the realism of a
scene. Our approach is to first train a generative model of a layered scene.
The layered representation consists of a background image, a foreground image
and the mask of the foreground. A composite image is then obtained by
overlaying the masked foreground image onto the background. The generative
model is trained in an adversarial fashion against a discriminator, which
forces the generative model to produce realistic composite images. To force the
generator to learn a representation where the foreground layer corresponds to
an object, we perturb the output of the generative model by introducing a
random shift of both the foreground image and mask relative to the background.
Because the generator is unaware of the shift before computing its output, it
must produce layered representations that are realistic for any such random
perturbation. Finally, we learn to segment an image by defining an autoencoder
consisting of an encoder, which we train, and the pre-trained generator as the
decoder, which we freeze. The encoder maps an image to a feature vector, which
is fed as input to the generator to give a composite image matching the
original input image. Because the generator outputs an explicit layered
representation of the scene, the encoder learns to detect and segment objects.
We demonstrate this framework on real images of several object categories.
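
The compositing and perturbation steps described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names `shift2d` and `composite`, the zero-filled borders, the integer pixel shifts and the shift range `max_shift` are all assumptions made for illustration, and the random arrays merely stand in for the outputs of a generator.

```python
import numpy as np


def shift2d(x, dy, dx):
    """Translate x (H, W, ...) by (dy, dx) pixels; vacated pixels are set to 0.

    Zero filling at the border is an assumption of this sketch.
    """
    h, w = x.shape[:2]
    out = np.zeros_like(x)
    if abs(dy) >= h or abs(dx) >= w:
        return out  # everything is shifted out of the frame
    # Windows of the source and destination that overlap after the shift.
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = x[src_y, src_x]
    return out


def composite(bg, fg, mask, dy=0, dx=0):
    """Overlay the masked foreground on the background.

    bg, fg: (H, W, 3) images; mask: (H, W) with values in [0, 1].
    The foreground and its mask are shifted together, the background is not.
    """
    fg_s = shift2d(fg, dy, dx)
    m_s = shift2d(mask, dy, dx)[..., None]  # add a channel axis for broadcasting
    return m_s * fg_s + (1.0 - m_s) * bg


# Stand-ins for generator outputs (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
bg = rng.random((8, 8, 3))
fg = rng.random((8, 8, 3))
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1.0

# The shift is sampled after the layers are produced, so the layers cannot
# depend on it; max_shift is an assumed hyperparameter.
max_shift = 2
dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
perturbed = composite(bg, fg, mask, dy, dx)
```

In an adversarial setup of the kind described, an image such as `perturbed` would be what the discriminator sees, so a foreground layer that only looks plausible at one fixed position would be penalised.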
