The 3D Ken Burns effect is an image processing technique that mimics real camera motion to create a video from a static picture. To perform this effect, one must build a 3D scene from the input image and then hallucinate the parts of the scene disoccluded by the camera motion. We therefore propose improved techniques for depth estimation and image inpainting. Depth estimation often suffers from object inconsistency: very different depth values are estimated for a single object. To address this, we propose an unsupervised mask loss function that encourages the network to produce depth maps that are smooth over the objects detected by a pre-trained segmentation network. Ground-truth depth is not required for its computation, so any image dataset can be used, which adds variety to the training. Supervised disocclusion inpainting requires RGB-D ground truth from multiple views, and such datasets are rarely available. We introduce a supervised pre-training stage alongside an unsupervised refinement method that only needs RGB-D data from a single view. This method uses an adversarial discriminator based on perceptual features and a multi-scale architecture, and it achieves state-of-the-art results for color and depth inpainting. We also show that our refinement technique is compatible with already-trained networks and improves their performance.
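
To make the mask loss concrete, the following is a minimal PyTorch sketch of one plausible formulation: penalizing the depth variance inside each segmented object so the predicted depth is smooth over that object. The function name, tensor layout, and the variance-based penalty are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def mask_smoothness_loss(depth: torch.Tensor, masks: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """Penalize depth variance inside each segmented object (sketch).

    depth: predicted depth, shape (B, 1, H, W)
    masks: binary object masks from a segmentation network, shape (B, K, H, W)
    """
    area = masks.sum(dim=(2, 3)).clamp(min=eps)        # (B, K) pixels per object
    # Mean depth of each object, computed under its mask.
    mean = (masks * depth).sum(dim=(2, 3)) / area      # (B, K)
    # Squared deviation from the per-object mean, restricted to the mask.
    dev = (depth - mean[..., None, None]) ** 2         # (B, K, H, W)
    return ((masks * dev).sum(dim=(2, 3)) / area).mean()
```

Because the masks come from a segmentation network rather than from depth annotations, this term can be evaluated on any RGB image collection, which is what allows the training set to be diversified without ground-truth depth.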
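
For the refinement stage, below is a hedged sketch of a discriminator operating on perceptual (VGG) features at several scales, in the spirit of the adversarial, multi-scale setup described above. The choice of VGG-19, the layer cut points, and the channel sizes of the heads are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualMultiScaleDiscriminator(nn.Module):
    """Discriminates real vs. inpainted images from frozen VGG features (sketch)."""

    def __init__(self):
        super().__init__()
        features = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in features.parameters():
            p.requires_grad_(False)          # frozen perceptual extractor
        # Slices ending at relu1_2, relu2_2, relu3_2 (64/128/256 channels).
        self.slices = nn.ModuleList([features[:4], features[4:9], features[9:14]])
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, 64, 3, 2, 1), nn.LeakyReLU(0.2),
                          nn.Conv2d(64, 1, 3, 1, 1))
            for c in (64, 128, 256)
        ])

    def forward(self, image: torch.Tensor) -> list[torch.Tensor]:
        feats, scores = image, []
        for slice_, head in zip(self.slices, self.heads):
            feats = slice_(feats)            # perceptual features at this scale
            scores.append(head(feats))       # patch-level real/fake logits
        return scores
```

A discriminator of this shape can be attached to the output of any pre-trained inpainting network, which is consistent with the claim that the refinement is compatible with already-trained models.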