About the job

We are looking for a talented ML Performance Engineer with a focus on Deep Learning and High-Performance Computing to work with a growing multidisciplinary team of research scientists and machine learning engineers to improve and scale the efficiency of our computing capacity.

Responsibilities: 

Optimizing Deep Learning Workflows: 

    • Monitor reports and dashboards to detect low-utilization jobs, projects, and users
    • Partner with researchers to review their workflows when performance falls short
    • Identify bottlenecks and suggest script optimisations
    • For high-scale jobs, introduce AWS proprietary profilers and libraries to boost performance
    • Scale-up gating process: check script performance and vet requests to scale up
    • Build a knowledge base / best practices documentation for all researchers
    • Implement monitoring of CPU usage levels for our CPU clusters; identify users who need help writing code that maximizes CPU usage
    • Train researchers on best practices for implementing automation strategies that minimize human oversight of jobs

Develop and Test Strategies for Future Workloads:

    • Benchmark the capabilities of new systems and identify strategies to utilize them properly (H100, TRN2, TPUv5, Intel Gaudi)
    • Define the minimum storage speed requirements and find better data loading strategies to support the high processing demands of the new accelerators

High-Performance Computing (the responsibilities below are optional; only those above are required):

    • Maintain HPC cluster operations
    • Monitor for dead nodes and recover them; document dead nodes and their fixes
    • Monitor shared volume health, usage, and clean-up needs
    • Monitor the HPC Help Center and solve user problems
    • Assist users in properly launching their jobs
    • Monitor all CPU clusters for users
    • Create and maintain processes around authentication, authorization, and accounting for cluster usage
    • Develop processes around security aspects of the HPC clusters, including tools to use in case security risks are identified (globally, by user, by team, by location, etc.)
    • Convert to and deploy SLURM scheduling across all clouds and all resource types; integrate TPUs into our larger enterprise approach once SLURM becomes available for them
    • Maintain AWS resources associated with the HPC clusters (login nodes, S3 buckets, FSx volumes, VPCs, subnets, NAT Gateways, S3 VPC Endpoints, routing tables)

Qualifications: 

  • At least 8 years of relevant experience
  • Applied programming experience in Python, C, and/or C++
  • Experience with libraries and tools like PyTorch and CUDA
  • Experience in building, productizing, and monitoring orchestration pipelines for AI and Machine Learning
  • Experience with training frameworks such as NVIDIA Megatron or similar
  • Experience in leading more junior engineers
  • Experience with AWS and/or GCP
  • Experience with or exposure to CI and infrastructure tools (e.g. Kubernetes) is a nice-to-have
  • Experience with Linux-based environments and scripting (Shell Scripting, Python, PowerShell)
  • Ability to work well as an individual contributor as well as within a multidisciplinary team environment
  • Strong communicator with excellent interpersonal skills and a can-do attitude, able to work and thrive in a fast-paced team environment

Equal Employment Opportunity:

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.

 

Tagged as: AWS, H100, Intel Gaudi, PyTorch, TPUv5, TRN2
