Senior Infrastructure Engineer

Remote

Full-time

Permanent employee

Your mission

Remote (CET)
About Us:
At Graswald AI, we’re building the AI operating system for fashion brands and retailers, starting with AI image and content generation. Our engineering team tackles rapid scaling challenges, GPU-intensive workloads, and enterprise-grade infrastructure to deliver fast, pixel-perfect results for global brands. In the past year alone, we’ve signed 50 enterprise fashion clients who rely on us to reduce costs and accelerate creative timelines. Backed by leading VCs and strategic investors including Lakestar and Orendt Studios, and preparing for a Series A later this year, we’re growing fast and building technology that is already reshaping how fashion content is produced.

Overview
The Senior Infrastructure Engineer designs and operates the systems that power Graswald’s platform. This role focuses on reliability, scalability, security, and cost efficiency, while enabling product teams to move quickly and safely. As a senior member of the team, you’ll own core infrastructure architecture, make long-term technical trade-offs, and mentor others through example.

Your profile

Key Responsibilities

Infrastructure Design & Development: Design, build, and maintain scalable, resilient, and secure infrastructure systems. Implement automation and Infrastructure-as-Code (IaC) practices to ensure consistency, reliability, and maintainability of environments.
Technical Contribution & Operational Excellence: Actively contribute to the architecture, deployment, and ongoing improvement of cloud infrastructure and platform services. Perform rigorous peer reviews of infrastructure code, CI/CD pipelines, and system configurations to uphold quality, efficiency, and adherence to best practices.
Reliability & Operations: Own the stability, performance, and observability of production systems. Lead incident response, root cause analysis, and long-term improvements to prevent recurrence. Help define a sustainable on-call culture.
Performance & Cost Optimization: Regularly review resource usage and optimize infrastructure for performance and cost efficiency. Propose architectural improvements where needed.
Collaboration & Enablement: Partner closely with product and engineering teams to design reliable infrastructure solutions, participate in architectural discussions and postmortems, and provide guidance on best practices for scalability, cost optimization, and security.
Continuous Learning & AI-Driven Operations: Stay current with evolving cloud, DevOps, and infrastructure technologies. Explore and apply AI-driven capabilities in areas like monitoring, incident detection, and automated remediation to enhance operational excellence and productivity. Experiment with and champion modern practices to drive innovation within the infrastructure team.
Documentation & Knowledge Sharing: Create and maintain clear, comprehensive documentation for infrastructure designs, operational runbooks, and processes. Ensure that knowledge is easily accessible for current and future team members, reducing operational risk and onboarding time.
Security & Compliance: Implement and enforce security controls, access management policies, and compliance requirements across infrastructure environments.

Qualifications - skills, abilities and experience

Experience & Background:
- Several years of professional experience in infrastructure engineering, DevOps, or site reliability engineering (SRE) roles.
- Experience working in agile software development teams and applying modern DevOps practices.
- Bachelor’s degree in Computer Science or Engineering, or equivalent professional experience.
Technical Expertise:
- Extensive hands-on experience with at least one major cloud provider (AWS or GCP).
- Proven ability to design and implement Infrastructure-as-Code (IaC) using tools such as Terraform.
- Proficiency in scripting and automation (e.g., Python, Bash, Go) to streamline operations and reduce manual tasks.
- Solid understanding of Linux systems, containerization (Docker), and orchestration platforms (Kubernetes, ECS, or similar).
Nice to Have:
- Experience operating ML inference or training infrastructure at scale.
- Familiarity with MLOps tooling (SageMaker, Vertex AI, Kubeflow, MLflow, Argo Workflows).
Operational Excellence:
- Experience building and operating highly available, reliable, and scalable systems in production environments.
- Strong background in monitoring, observability, and incident response, with tools such as Prometheus, Grafana, Datadog, ELK, or similar.
- Knowledge of security best practices, including identity and access management, secrets management, compliance, and secure system design.
Collaboration & Leadership:
- Demonstrated ability to work effectively in cross-functional teams, partnering with product engineers, security, and data teams.
- Strong communication skills with the ability to explain complex technical concepts clearly to both technical and non-technical audiences.
Problem-Solving & Adaptability:
- Track record of diagnosing and resolving complex infrastructure issues under pressure.
- Ability to balance short-term fixes and long-term architectural improvements.
- Proactive and curious mindset, with a drive for continuous improvement and innovation.

Why Join Us

Impact: Build and own the core infrastructure that powers AI experiences for global brands.
Scale & Performance: Tackle challenging reliability and performance problems across training and inference.
Autonomy: High ownership to define standards, tooling, and best practices for reliability.
Growth: Work with a high-caliber team in a fast-scaling environment with significant career upside.
Culture: Pragmatic, collaborative, and quality-focused engineering culture.