The Current State of Text-to-3D
In the third installment of our AI for Game Development series, we tackle 3D asset generation. While text-to-image tools like Stable Diffusion have revolutionized game art, text-to-3D remains a nascent technology.
Recent advances include:

- DreamFusion, which uses 2D diffusion to generate 3D assets.
- CLIPMatrix and CLIP-Mesh-SMPLX, which generate textured meshes directly.
- CLIP-Forge, which uses language to generate voxel-based models.
- CLIP-NeRF, which drives neural radiance fields (NeRFs) with text.
- Point-E, which generates point clouds.

Many of these approaches rely on view synthesis via NeRFs, which produce 2D views of a subject rather than the meshes used in game engines.
Why It Isn't Useful (Yet)
To a game developer, these technologies currently offer little practical value. Converting NeRFs to meshes is possible (e.g., with NVlabs' instant-ngp), but the result resembles a photogrammetry scan: a dense, high-polygon mesh that needs significant manual cleanup before it's game-ready. For our farming game, it was faster to use colored cubes as placeholder crops than to run a NeRF-to-mesh pipeline and clean up its output.
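To make the placeholder approach concrete, here is a minimal Unity C# sketch. The `CropPlaceholder` class name, grid dimensions, and colors are illustrative assumptions, not code from the project; the script simply spawns a grid of tinted cubes where crop models would eventually go.

```csharp
using UnityEngine;

// Minimal placeholder-crop sketch (hypothetical names and colors).
// Spawns a grid of colored cubes standing in for crop models
// until game-ready 3D assets are available.
public class CropPlaceholder : MonoBehaviour
{
    [SerializeField] int rows = 4;
    [SerializeField] int columns = 4;
    [SerializeField] float spacing = 1.5f;

    // Assumed stand-in colors; swap for whatever reads well in your scene.
    static readonly Color[] CropColors =
    {
        new Color(0.9f, 0.8f, 0.2f), // wheat
        new Color(0.2f, 0.7f, 0.3f), // cabbage
        new Color(0.9f, 0.3f, 0.2f), // tomato
    };

    void Start()
    {
        for (int r = 0; r < rows; r++)
        {
            for (int c = 0; c < columns; c++)
            {
                var cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
                cube.transform.SetParent(transform);
                cube.transform.localPosition = new Vector3(c * spacing, 0.5f, r * spacing);
                cube.transform.localScale = Vector3.one * 0.8f;

                // Tint via a material instance; this creates one material per cube,
                // which is fine for placeholders but wasteful in production.
                var renderer = cube.GetComponent<Renderer>();
                renderer.material.color = CropColors[(r * columns + c) % CropColors.Length];
            }
        }
    }
}
```

Attach the script to an empty GameObject in the scene; once real crop meshes exist, they can replace the cubes without changing any of the surrounding farming logic.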
The Future of Text-to-3D
The gap between current text-to-3D and a truly game-ready solution may be closed in two ways:
- Better NeRF-to-mesh conversion, reducing post-processing effort.
- New rendering techniques that allow NeRFs to be used directly in game engines (NVIDIA and Google have both published early work in this direction).
Until then, game developers may still prefer traditional low-poly modeling. Stay tuned for Part 4, where we'll use AI for 2D assets.
Note: This tutorial assumes familiarity with Unity and C#. If you're new, check out the Unity for Beginners series.