I've tried to find the answer but could not. Why is there still no model (AFAIK) capable of correct compositional image generation from text? I.e., "simple" (but uncommon) compositions of objects connected by prepositions such as under/in/on/...
P.S. I suspect some might argue that this question is unanswerable and therefore not allowed here. But papers discussing the difficulties of compositional image generation might have been published, even though failures are usually not published in the current scientific community. E.g., AI alignment is not solved, but there is a lot of material on that topic.