SRE Manager
TradingView is the world’s largest financial analysis platform with more than 100M users across 180+ countries.
We build tools that help traders and investors make informed decisions — from advanced charting and market data to collaboration and publishing features. Our products are used daily by millions of individuals and trusted by companies like Revolut, Binance, and CME Group.
We’re continuing to grow and scale our platform, and we’re looking for people who care about product quality, take ownership of their work, and want to build systems used by a global audience.
About the team
We are the HUB SRE team — the group responsible for the reliability, availability, and performance of some of the most critical and heavily loaded services in the company.
Our infrastructure is a hybrid environment: a mix of cloud services and our own bare-metal servers, each with its own operational model and failure domains. We don't just keep things running — we engineer reliability into the system.
Responsibilities
Lead and manage the HUB SRE team, building a culture grounded in SRE principles: SLOs as contracts, error budgets as decision-making tools, toil reduction as a continuous practice.
Define, implement, and evangelize SLOs/SLIs/error budgets for the company's most critical services — make reliability measurable and actionable.
Drive toil reduction: identify repetitive operational work, set toil budgets, and ensure the team spends the majority of its time on engineering, not firefighting.
Own and evolve incident management processes: on-call rotations, structured incident response, blameless post-mortems, and follow-through on action items.
Build and improve observability across the stack: metrics, alerting, distributed tracing, and dashboards that give teams real-time understanding of system behavior — not just system status.
Drive capacity planning and performance engineering: ensure critical services handle growth without degradation, model capacity needs, and prevent outages before they happen.
Collaborate with HUB backend teams as a reliability partner: review architectures for failure modes, advocate for reliability improvements, and push back when error budgets are exhausted.
Build and evolve CI/CD pipelines toward one-click deployments with automated rollbacks and progressive delivery — make deploying safe and boring.
Champion runbook-driven operations: ensure every critical procedure is documented, tested, and ready for execution under pressure.
Mentor engineers in SRE practices and thinking, help them grow, and build a team that balances operational excellence with engineering ambition.
What makes you the perfect fit
Proven experience as an Engineering Manager, SRE Lead, or Reliability Engineering Lead managing a team of engineers.
Deep understanding of SRE as a discipline: SLOs/SLIs, error budgets, toil classification, capacity planning, incident management — not just tooling, but the philosophy and organizational practices.
Strong technical background in backend systems, Linux, networking, and distributed systems — you understand the services your team is responsible for at a deep level.
Experience working with hybrid infrastructure: cloud providers and bare-metal servers, understanding the reliability trade-offs of each.
Solid experience building and improving observability: monitoring, alerting strategies, distributed tracing, and meaningful dashboards.
Experience building and optimizing CI/CD pipelines for complex, multi-service environments.
Strong incident management skills: structured response, blameless post-mortems, driving systemic improvements from incidents.
Excellent communication and people management skills, and the ability to influence engineering teams you don't directly manage.
Nice to have
Background with high-load systems serving millions of requests with strict latency and availability requirements.
Experience with bare-metal server operations: provisioning, networking, hardware failure handling.
Familiarity with chaos engineering or proactive reliability testing (game days, fault injection).
Experience defining on-call compensation models, sustainable on-call rotations, and escalation frameworks.
Background in performance engineering: profiling, load testing, bottleneck analysis.
Knowledge of Infrastructure-as-Code tools (Terraform, Ansible).
What we offer you
Flexible working hours and a hybrid work format
Well-equipped offices for focused and collaborative work
A global, distributed team of 500+ professionals
Learning, mentorship, and long-term career growth
Relocation support and private health insurance
Performance-based bonuses
TradingView Premium access
Regular team events and company-wide meetups
Join the TradingView team and help us build a product used by millions of traders and investors worldwide. We look forward to hearing from you!
TradingView is an equal opportunity employer. We embrace diversity and are dedicated to fostering a diverse and inclusive workplace. Our success is driven by 600+ professionals from 40+ countries who speak nearly 20 languages.
Department: Product Development
Locations: Tbilisi, Cyprus
About TradingView
We are TradingView, the world's most popular charting platform and a leading provider of financial visualization solutions. More than 100M traders worldwide use our platform to chart, chat, and trade financial markets.