The researchers compared their model to other deep learning methods that were given text descriptions and tasked with generating images that displayed the corresponding objects and their relationships. In each instance, their model outperformed the baselines.
They also asked human participants to evaluate whether the generated images matched the original scene descriptions. In the most complex examples, where descriptions contained three relationships, 91 percent of participants concluded that the new model's images better matched the descriptions.
“One interesting thing we found is that for our model, we can increase our sentence from having one relation description to having two, or three, or even four descriptions, and our approach continues to be able to generate images that are correctly described by those descriptions, while other methods fail,” Du says.