Site Reliability Engineer - US Government
Company: xAI
Location: Palo Alto
Posted on: February 14, 2026
Job Description:
About xAI

xAI's mission is to create AI systems that can accurately
understand the universe and aid humanity in its pursuit of
knowledge. Our team is small, highly motivated, and focused on
engineering excellence. This organization is for individuals who
appreciate challenging themselves and thrive on curiosity. We
operate with a flat organizational structure. All employees are
expected to be hands-on and to contribute directly to the company's
mission. Leadership is given to those who show initiative and
consistently deliver excellence. Work ethic and strong
prioritization skills are important. All employees are expected to
have strong communication skills. They should be able to concisely
and accurately share knowledge with their teammates.

About the Role

We are seeking a highly skilled Senior Infrastructure Engineer to
join our US Government Team, focused on designing, building, and
operating secure, scalable infrastructure for critical government
projects. In this role, you will develop and manage training and
inference clusters, as well as highly reliable applications, across
bare metal, classified cloud, and hybrid cloud architectures. You
will leverage your expertise in Kubernetes and GPU hardware to
deliver robust, secure systems that support large-scale AI
workloads while meeting stringent federal compliance requirements.
This role demands a passion for automation, observability, and
ensuring system integrity in a fast-paced, high-security
environment.

Responsibilities

- Develop and optimize software to provision and manage xAI's
  infrastructure across on-premise, virtual machine, and classified
  cloud environments, enabling efficient scaling for US government
  initiatives.
- Enhance the reliability, performance, and cost-effectiveness of
  infrastructure to support large-scale AI and application
  workloads in secure, classified settings.
- Collaborate with xAI engineers to understand workload
  requirements and design tailored solutions that meet
  government-specific needs and compliance standards.
- Implement robust observability, monitoring, and security
  practices to ensure the integrity, availability, and
  confidentiality of critical systems, adhering to federal
  protocols.
- Manage storage infrastructure using Infrastructure-as-Code (IaC)
  tools such as Pulumi, Terraform, or Ansible, with a focus on
  secure data handling.
- Drive system reliability through incident management,
  postmortems, and the definition of clear SLAs and SLOs, while
  maintaining security and compliance.

This is an in-person role based in Palo Alto, CA or Washington, DC,
with up to 50% travel required.

Required Qualifications

- Active Top Secret (TS) security clearance.
- 5 years of experience as an Infrastructure Engineer, Site
  Reliability Engineer, or similar role, with a focus on building
  and maintaining reliable, scalable systems, preferably in secure
  or government environments.
- Proficiency in managing storage infrastructure with IaC tools
  such as Pulumi, Terraform, or Ansible.
- Deep understanding of the Kubernetes stack, including CNI, CRI,
  CSI, and related components.
- Demonstrated ability to improve system reliability through
  incident management, postmortems, and defining SLAs/SLOs.
- Excellent communication and documentation skills, with the
  ability to handle sensitive information concisely and accurately.

Preferred Qualifications

- Deep familiarity with installing and using GPU hardware,
  including setting up drivers, debugging issues, and ensuring
  reliability.
- Experience with high-traffic web or mobile application workloads,
  including optimizing Kubernetes for large-scale deployments in
  classified or federal settings.
- Familiarity with chaos engineering, capacity planning, or similar
  practices for ensuring system resilience in government projects.
- Proficiency with tools such as Kyverno, ArgoCD, or Go programming
  for infrastructure automation.
- Strong sense of ownership, curiosity, and enthusiasm for tackling
  complex technical challenges in secure environments.
- Passion for problem-solving and a proactive drive to deliver
  impactful results while adhering to security protocols.
- Certifications in security-related fields (e.g., CISSP) or
  experience in secure federal environments.

Interview Process

After submitting your application, our team will review your CV and
statement of exceptional work. If your application advances, you
will be invited to a 15-minute phone interview to discuss basic
qualifications. Successful candidates will proceed to the main
process, which includes:

- Technical deep-dive: discussing your infrastructure and secure
  systems experience.
- A hands-on challenge focused on designing or troubleshooting
  infrastructure for secure environments.
- A meet-and-greet with the wider team.

Our goal is to complete the main interview process within one week.

Annual Salary Range

$180,000 - $440,000 USD

Benefits

Base salary is just one part of our total rewards package at xAI,
which also includes equity, comprehensive medical, vision, and
dental coverage, access to a 401(k) retirement plan, short- and
long-term disability insurance, life insurance, and various other
discounts and perks.

xAI is an equal opportunity employer. For details on data
processing, view our Recruitment Privacy Notice.