This paper introduces Hunyuan-DiT, a text-to-image diffusion transformer that generates detailed, high-quality images from both English and Chinese text prompts. Its key contributions are a tailored transformer architecture, a data pipeline that supports iterative model optimization, refined image captions, and multi-turn dialog for prompt refinement.