Voice/Speech Generative AI Specialist

Job Description

Posted on: December 29, 2024

We are seeking a Voice/Speech Generative AI Specialist to join our innovative AI team. In this role, you will design, develop, and optimize generative AI models focused on speech synthesis and voice generation technologies. Your expertise will drive the development of cutting-edge applications such as text-to-speech (TTS), voice cloning, voice assistants, and other voice-based AI solutions that deliver high-quality, natural-sounding, and emotionally aware speech outputs.

The ideal candidate will have strong expertise in deep learning techniques for speech synthesis, with a solid understanding of natural language processing (NLP), audio signal processing, and generative models. You will work with large-scale datasets, experiment with state-of-the-art architectures, and innovate within the rapidly evolving field of voice AI.

Key Responsibilities:
  • Voice/Speech Synthesis Model Development:
    • Design and develop state-of-the-art speech synthesis models, such as neural text-to-speech (TTS), voice cloning, and voice transformation models.

    • Experiment with and implement advanced architectures like Tacotron, WaveNet, FastSpeech, and others to improve the naturalness, expressiveness, and diversity of generated speech.

    • Develop and fine-tune models for various use cases, including virtual assistants, voice interfaces, audiobooks, and accessibility tools.

  • Speech Data Processing & Feature Engineering:
    • Collect, preprocess, and augment large datasets of speech and text data to train generative speech models.

    • Apply techniques in audio signal processing, feature extraction (MFCC, spectrograms, etc.), and prosody modeling to enhance speech quality and intelligibility.

    • Implement strategies for handling noisy data, speaker diversity, and emotional tone in speech synthesis.

  • Model Evaluation & Optimization:
    • Define and implement robust evaluation frameworks for assessing the quality and naturalness of generated speech.

    • Use both objective metrics (e.g., signal-to-noise ratio) and subjective listening tests (e.g., mean opinion score, or MOS, ratings) to validate model outputs.

    • Continuously optimize models for better performance, faster inference times, and lower resource usage.

  • Research & Innovation:
    • Stay at the forefront of research in voice and speech generative AI, particularly with new architectures and methodologies like GANs, variational autoencoders (VAEs), and unsupervised learning techniques.

    • Experiment with speech emotion recognition, voice style transfer, and multi-lingual/multi-accent speech generation.

    • Contribute to research papers, patents, or open-source projects in the field of voice/speech generative AI.

  • Deployment & Integration:
    • Collaborate with software engineers to deploy generative speech models into production environments, ensuring scalability, performance, and seamless integration with existing systems.

    • Develop and maintain APIs for speech synthesis and voice-based applications, optimizing them for real-time use cases.

  • Collaboration & Communication:
    • Work closely with product managers, designers, and business teams to ensure voice AI models meet the specific requirements of different products and platforms.

    • Clearly communicate technical results and insights to non-technical stakeholders and contribute to strategic decision-making in AI product development.

  • Ethical AI & Responsible Voice AI Development:
    • Ensure that all voice-generation models are developed with safety, fairness, and transparency in mind.

    • Implement safeguards against misuse, such as deepfakes and malicious voice cloning, and contribute to the development of ethical guidelines for voice-based AI technologies.

Required Skills & Qualifications:
  • Education:
    • Bachelor’s or Master’s degree in Computer Science, AI, Machine Learning, Electrical Engineering, or a related field. Ph.D. is a plus.

  • Experience:
    • 3+ years of experience in developing and deploying voice or speech systems, such as TTS, speech recognition, or voice cloning.

    • Strong experience with neural network-based speech synthesis models like Tacotron, WaveNet, FastSpeech, Deep Voice, or similar architectures.

    • Proficiency with deep learning frameworks such as TensorFlow, PyTorch, or JAX, specifically for speech generation tasks.

    • Solid understanding of audio signal processing, feature extraction, and voice synthesis techniques.

    • Familiarity with reinforcement learning, generative adversarial networks (GANs), or variational autoencoders (VAEs) applied to speech generation.

  • Technical Skills:
    • Proficiency in Python and related libraries (e.g., NumPy, SciPy, librosa) for speech signal processing and model training.

    • Experience with cloud platforms (AWS, GCP, Azure) and tools for model training and deployment.

    • Knowledge of speech evaluation metrics (MOS, signal quality, prosody evaluation) and hands-on experience with subjective and objective speech quality evaluation.

    • Familiarity with large-scale data pipelines and distributed training techniques, especially in the context of training complex generative models.

  • Soft Skills:
    • Strong problem-solving skills and a passion for working on innovative AI technologies.

    • Excellent communication and collaboration skills, with the ability to explain complex concepts to non-technical stakeholders.

    • A creative mindset and the ability to think outside the box when tackling complex AI challenges.

    • Ability to work effectively in a cross-functional team, balancing technical rigor with product-oriented goals.

Preferred Skills:
  • Experience with voice cloning or speaker adaptation technologies.

  • Knowledge of multi-lingual and multi-accent voice generation.

  • Experience with real-time speech synthesis for interactive applications (e.g., virtual assistants, voice bots).

  • Expertise in multi-modal AI, integrating speech with text or visual data (e.g., lip-syncing).

  • Experience with speech emotion recognition and generating speech with emotional expression.

  • Knowledge of ethical AI principles related to voice AI and combating misuse (e.g., deepfakes, synthetic voice generation for fraud).

  • Contributions to open-source projects in speech synthesis or generative AI.

Working Environment:
  • Flexible working hours with remote work options available.

  • Collaborative, innovative, and fast-paced team culture.

  • Opportunities for continuous learning and professional development in AI and speech technologies.

  • Competitive salary and benefits package.

Originally posted on Himalayas
