Research Domain Expert - AI/ML/NLP
A pioneer in AI-driven knowledge automation
Full-time | Contract
Atlanta, GA
Job description
The Role
Cantai is developing a high-fidelity DiffSinger engine. We need an end-to-end AI Audio Specialist to take raw vocal recordings and turn them into production-ready singing voice models.
You will be responsible for executing the full technical pipeline—from data strategy to final acoustic model inference—ensuring state-of-the-art output.
Core Responsibilities
- End-to-End Pipeline Execution: Independently manage the full workflow: Data Prep → Training → Inference.
- Advanced Data Engineering: Implement and refine automated alignment tools (e.g., Montreal Forced Aligner) to label phonemes, pitch, and duration. Note: This role requires hands-on verification of data accuracy.
- Model Training: Train and fine-tune DiffSinger acoustic models and variance adaptors for specific target voices.
- Vocoder Integration: Train or adapt neural vocoders (HiFi-GAN, NSF, etc.) for maximum audio fidelity.
- Optimization: Troubleshoot common synthesis issues (slurring, robotic pitch, metallic artifacts) and iterate on hyperparameters to fix them.
Requirements
- Proven DiffSinger Experience: You must have successfully trained DiffSinger or a closely related singing voice synthesis (SVS) architecture before.
- Data Processing Mastery: Expert proficiency with Python, MFA (Montreal Forced Aligner), Praat, and phoneme dictionaries.
- Deep Learning Stack: Strong grasp of PyTorch and GPU environment management.
- Audio Domain Knowledge: You understand DSP basics (spectrograms, mel bands, F0 estimation) and musical concepts (pitch, vibrato, timing).
Bonus Points
- Experience with OpenVPI, NNSVS, or Fish Diffusion.
- Background in music theory or vocal pedagogy.
- Experience exporting models for production (ONNX/C++).
Job Types: Full-time, Contract
Pay: $80.00 - $150.00 per hour
Expected hours: 30 – 40 per week
Benefits:
- Flexible schedule
- Professional development assistance
Work Location: Remote