We propose to tame visually guided sound generation by shrinking a training dataset to a set of representative vectors, a.k.a. a codebook. These codebook vectors can then be controllably sampled to ...