CarSim proposed an architecture in which a system takes a natural-language description, converts it into a formal description, and instantiates the corresponding scene elements. This was done in 2002, before Microsoft applied for this patent. Reference: http://doc.utwente.nl/36659/1/00000051.pdf
Not to mention that WordsEye (from AT&T Labs) works similarly, as does Confucius, and there are probably more.
In reference to the patent: US20060217979
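
To make the prior-art comparison concrete, here is a minimal sketch of the three-stage pipeline described above (text, then a formal description, then scene elements). It is only an illustration under assumptions: the names `FormalDescription`, `SceneElement`, `parse`, `instantiate`, and the tiny vocabulary are hypothetical, and the keyword matching stands in for real language processing. It is not CarSim's code and not the method claimed in the patent.

```python
import re
from dataclasses import dataclass, field

# Hypothetical vocabulary: illustrative only, not taken from CarSim or the patent.
KNOWN_OBJECTS = {"car", "truck", "tree"}
KNOWN_EVENTS = {"hits": "collision", "overtakes": "overtake"}


@dataclass
class FormalDescription:
    """Intermediate representation sitting between the text and the scene."""
    objects: list = field(default_factory=list)
    events: list = field(default_factory=list)  # tuples of (kind, actor, target)


@dataclass
class SceneElement:
    """Placeholder for something a renderer would draw."""
    name: str
    model: str


def parse(text: str) -> FormalDescription:
    """Stage 1: natural language -> formal description (toy keyword matching)."""
    desc = FormalDescription()
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        nouns = [w for w in words if w in KNOWN_OBJECTS]
        for noun in nouns:
            if noun not in desc.objects:
                desc.objects.append(noun)
        for word in words:
            # Record an event only when the sentence names two known objects.
            if word in KNOWN_EVENTS and len(nouns) >= 2:
                desc.events.append((KNOWN_EVENTS[word], nouns[0], nouns[1]))
    return desc


def instantiate(desc: FormalDescription) -> list:
    """Stage 2: formal description -> scene elements."""
    return [SceneElement(name=obj, model=f"{obj}.mesh") for obj in desc.objects]
```

For example, `parse("A car hits a tree.")` yields a formal description with the objects `car` and `tree` and one `collision` event, and `instantiate` turns those objects into two scene elements. The point of the intermediate step is that the scene builder never sees the raw text, which is the same separation the CarSim paper describes.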