Lead Infrastructure Engineer

Capital One, Mountain View, CA
April 19, 2024

Job Summary:

Capital One is hiring a Senior Lead Engineer for Generative AI Infrastructure. The role focuses on building the organization's AI capabilities from the ground up. Duties include collaborating across projects to build distributed training clusters, deploying large language models (LLMs) on GPU instances, and supporting AI research and development on public cloud infrastructure.

Job Duties and Responsibilities:

  • Build large-scale distributed training clusters on public clouds, with an emphasis on optimizing storage and networking infrastructure and applying multiple parallelism techniques.
  • Develop a robust system for demanding, long-running training jobs, using checkpointing libraries and containers to improve fault tolerance.
  • Develop an optimized server architecture for the computational demands of LLMs and foundation models (FMs) that remains highly scalable and reliable.
  • Devise a framework for deploying search indexes and embeddings to vector databases easily, consistently, and reliably.
  • Work with the cloud infrastructure, container infrastructure, and AI teams to design and integrate key capabilities.

Qualifications and Experience:

  • Bachelor's degree in a technical field such as Computer Science, Computer Engineering, or a related area.
  • At least 8 years of experience designing and building data-intensive solutions using distributed computing.
  • At least 8 years of programming experience in at least one of Python, Go, Scala, or Java.
  • At least 1 year of experience in one or more of: high-performance computing (HPC), semantic search technologies, or vector embeddings.
  • At least 1 year of experience building, scaling, or optimizing training or inference for deep neural network models.

Preferred Qualifications:

  • Master's or doctoral degree in Computer Science, Computer Engineering, Electrical Engineering, Mathematics, or a related discipline.
  • Experience training and deploying deep neural networks and transformer architectures at large scale.
  • Experience implementing machine learning projects with frameworks such as TensorFlow, PyTorch, and Lightning.
  • Ability to thrive in an environment with ambiguity and multiple competing deadlines.
  • Prior experience at technology- and product-oriented companies or startups.
  • Ability to work collaboratively with researchers and engineers to quickly build foundational capabilities and improve the product experience through rapid iteration.
  • Expertise in deploying large-scale neural network models in critical production contexts.
  • In-depth understanding of the GPU architecture, networking, and storage technologies needed to build high-performing, scalable, and cost-effective GPU clusters in the public cloud.

Salary:

  • New York City (hybrid on-site): $234,700 to $267,900 for a Senior Lead Machine Learning Engineer.

Benefits of the Position:

  • Performance-based compensation, including cash bonuses and long-term incentives.
  • A comprehensive range of well-being benefits that support physical, financial, and emotional needs.
  • Continuous training and mentorship programs that keep employees current with industry trends and practices.
  • A workplace that empowers individuals to develop personally and professionally.

About Company:

If you're excited about innovative technology and its potential to transform banking, join Capital One in its commitment to creating AI systems that are trustworthy, dependable, and people-first.

