Summary
The mission of the AI and Systems Co-Design team is to explore, develop, and help productize high-performance software and hardware technologies for AI. Our team defines and drives the AI software and hardware roadmap at Meta. We are seeking a candidate to work on a foundational tool for our internal workloads on current and next-generation AI platforms. Specifically, this position focuses on collecting, processing, storing, and analyzing performance data for various operators and workloads.
Job Responsibilities
Extract operators (e.g., ATen, Triton) from AI/ML models.
Run operators on multiple devices and collect performance data.
Process collected data and store it in a database while maintaining data integrity.
Implement, improve, and maintain programmatic and web interfaces to query and analyze performance data stored in the database.
Collaborate as part of a project team to coordinate development and determine project scope and limitations.
Review project requests to estimate the time and cost required to complete each project.
Skills
Must-have skills
Hands-on experience with production-level Python programming
Proficiency in PyTorch, including Kineto traces and the dispatcher
Hands-on experience with CUDA and Triton kernels
Hands-on experience with database management and SQL
Proficiency in Linux and Bash
Ability to work independently
Good-to-have skills
Experience with LLMs, especially Llama
Knowledge of CI-based testing and automation
Education/Experience
At least three years of experience with the above-mentioned skills is required for this role.