Summary
Romania’s largest broadband internet service, mobile telecommunications and GSM network operator serves one of the most dynamic mobile telephony markets in southeastern Europe, with close to 14 million mobile phone users and active-user mobile penetration around 115%.
For this case study, we spoke with an Enterprise Infrastructure Manager for the telecoms provider and Calin Damian, CEO of Temperfield, a valued member of the Runecast Partner Network.
Challenge
As a cloud provider with some infrastructure exposed to the Internet, the company must keep up with the requirements of providing such services to millions of customers. It was running 10 vCenters and about 100 ESXi servers, and was in the process of expanding the latter by another 60, divided between two major centers of operation in two of the country’s biggest cities. Its stack comprised vSphere, NSX-V, vROps, SRM, vCloud Director, Log Insight, and vRealize Orchestrator.
The virtualization team was dealing with a large infrastructure, demanding applications, and limited personnel, which made for an increasingly difficult situation in trying to keep up with software and firmware. A number of incidents had already occurred, resulting in a definite need for a new solution or approach.
There were typically two types of problems: those the team could solve themselves and those that required help from a third party. In both cases, the approach was reactive and required extensive troubleshooting, checking logs, and searching online Knowledge Base (KB) articles for solutions. In some instances there was downtime, and to ensure service continuity for customers the team had to move applications, which created additional work for other teams. Log collection also took time, and VMware support needed time to understand the infrastructure and find what might be wrong.
In firefighting mode, a proactive approach was not yet an option for their IT team. They were struggling to find and fix issues, so areas where they knew they could be proactive were, by necessity, a lower priority. This made it especially difficult to undergo security audits with confidence.
“They also had an ambitious roadmap to plan and perform storage upgrades a few times per year,” said Mr. Damian, “with multiple storage systems, which of course required a lot of work to ensure compatibility with new vSphere versions.” The team also knew that not all of their BIOS, driver, and firmware levels were up to date and compatible with their VMware servers.
The consensus of the team was that they were operating in firefighting mode, just looking after incidents, with little time to do any analysis required for a proactive stance. Mr. Damian stated, “They would have needed another team just to make the analysis to fix potential issues proactively.”
Solution
To address these challenges, the company’s IT team consulted engineers at Bucharest-based IT solutions provider Temperfield. They had previously worked with Temperfield, which had a proven track record with VMware products. This time, Temperfield engineers brought Runecast engineers into the conversation as well.
The telecom’s Enterprise Infrastructure Manager saw a presentation by Runecast CTO and Co-Founder Aylin Sali at a VMUG Romania meetup and arranged for a PoC internally. Temperfield helped to run the PoC for a couple of weeks. After a couple of calls between Temperfield engineers and a Runecast account executive, the telecom giant’s IT team was already able to run a few internal tests and solve some issues.
“Runecast shows configuration drift. One server showed a different amount of memory, raising a flag that there was a malfunction – when nobody else, including the vendor, had detected it,” said Mr. Damian.
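The kind of drift check described in that quote can be illustrated with a minimal sketch. This is not Runecast's implementation: the function name, the host names, and the inventory values are all hypothetical, and the sketch simply assumes that the value shared by most hosts in a cluster is the intended one.

```python
from collections import Counter


def find_drift(hosts, keys):
    """Return (host, key, actual, expected) for each value that differs
    from the value most hosts in the cluster share."""
    findings = []
    for key in keys:
        counts = Counter(props[key] for props in hosts.values())
        expected, _ = counts.most_common(1)[0]  # majority value for this key
        for name, props in hosts.items():
            if props[key] != expected:
                findings.append((name, key, props[key], expected))
    return findings


# Hypothetical inventory: three hosts that should be identical.
cluster = {
    "esx-01": {"memory_gb": 512, "esxi_version": "7.0.3"},
    "esx-02": {"memory_gb": 512, "esxi_version": "7.0.3"},
    "esx-03": {"memory_gb": 448, "esxi_version": "7.0.3"},
}

drift = find_drift(cluster, ["memory_gb", "esxi_version"])
for host, key, actual, expected in drift:
    print(f"{host}: {key} is {actual}, cluster majority is {expected}")
```

A host reporting less memory than its peers, as in the sample data, is the sort of signal that can point to a hardware fault rather than a deliberate configuration choice.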
There were no issues in obtaining budget. The team received good feedback from the budget owner, who knew about their reactive struggles with the IT infrastructure and was supportive of transitioning to a proactive approach to ensure business continuity.
Deployment and first analysis were quick. Mr. Damian emphasized humorously, “It took only 30 minutes to deploy and see discoveries – less time than this interview for the case study.”
The Runecast platform immediately identified improperly configured servers, standard port groups missing from some of the physical servers in a cluster, and potentially critical issues that could cause a PSOD (purple screen of death). The latter were identified by correlating specific log entries with driver/firmware and vSphere versions, and would otherwise have been very hard to spot.
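The port-group finding above amounts to a set comparison across the hosts of a cluster. The sketch below shows the idea under stated assumptions: the host and port-group names are invented, and the expected set is taken to be the union of everything defined on any host in the cluster, which is a simplification and not a description of how Runecast works.

```python
def missing_port_groups(host_port_groups):
    """Map each host to the port groups that exist elsewhere in the
    cluster but are absent on that host."""
    # Assumption: every port group seen on any host should exist on all hosts.
    expected = set().union(*host_port_groups.values())
    gaps = {}
    for host, groups in host_port_groups.items():
        missing = expected - set(groups)
        if missing:
            gaps[host] = sorted(missing)
    return gaps


# Hypothetical cluster: esx-03 lacks one standard port group.
cluster = {
    "esx-01": {"mgmt", "vmotion", "tenant-a"},
    "esx-02": {"mgmt", "vmotion", "tenant-a"},
    "esx-03": {"mgmt", "vmotion"},
}

gaps = missing_port_groups(cluster)
for host, groups in gaps.items():
    print(f"{host} is missing: {', '.join(groups)}")
```

A gap like this matters because a virtual machine moved to the affected host could lose network connectivity.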
Bonus aspects for the team included being able to analyze their hybrid cloud while running Runecast securely on-premises, with no data needing to be sent outside the organization to a third party, an increasingly important consideration for securing sensitive data in a hybrid cloud environment. Also, as the Romanian telecom’s IT team frequently performs upgrades, Mr. Damian said that “the Runecast ESXi upgrade simulation feature is important for the team,” as it simulates upgrades against VMware’s Hardware Compatibility List (HCL) to reveal avoidable issues before they occur.
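Conceptually, simulating an upgrade against a compatibility list means checking each host component's driver and firmware pair against what is listed for the target release. The following is only a sketch of that idea: the `Component` type, the `compat` table layout, and every device name and version string are hypothetical, and neither the real HCL data format nor Runecast's feature is represented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    host: str
    device: str
    driver: str
    firmware: str


def simulate_upgrade(components, compat, target):
    """Return components whose driver/firmware pair is not listed
    as supported for the target release."""
    blockers = []
    for c in components:
        supported = compat.get((c.device, target), set())
        if (c.driver, c.firmware) not in supported:
            blockers.append(c)
    return blockers


# Hypothetical data: (device, target release) -> supported (driver, firmware) pairs.
compat = {
    ("nic-a", "8.0"): {("2.1.0", "14.2")},
    ("hba-b", "8.0"): {("4.0.1", "9.1"), ("4.0.2", "9.1")},
}
inventory = [
    Component("esx-01", "nic-a", "2.1.0", "14.2"),
    Component("esx-01", "hba-b", "3.9.0", "9.1"),
    Component("esx-02", "nic-a", "2.1.0", "14.2"),
]

target = "8.0"
blockers = simulate_upgrade(inventory, compat, target)
for c in blockers:
    print(f"{c.host}: {c.device} driver {c.driver} / firmware {c.firmware} "
          f"is not listed for {target}")
```

Running such a check before the maintenance window shows which drivers or firmware need updating first, instead of discovering the mismatch after the upgrade.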
Benefits
Mr. Damian stated that the high-level benefit of using the Runecast platform is that “Runecast has saved the telecom giant from needing an additional team.”
In addition to such major human-cost savings, the company sees greater uptime thanks to a proactive stance toward identifying and eliminating potential issues in its mission-critical infrastructure. Mr. Damian estimated that Runecast helped them discover hundreds of issues in the first year alone.
Before Runecast, the company’s virtualization team was reactively dealing with around 15-20 issues per month. With Runecast, they now see closer to half that number and are able to resolve them proactively, before they can cause an outage.
As the team runs Runecast in parallel with VMware support, they have had far fewer cases where that external support was needed. “They save around 20% of our time within the team now,” said Mr. Damian, “but other savings get factored in when you consider that some issues were previously taking half a year and the involvement of external teams to fix.”
When asked if he had any advice for peers about using Runecast, Mr. Damian stated, “Runecast does half the work, it shows what issues you have and what you need to do to comply with best practices, security, etc… the other half is to allocate yourself time to decide what is good to focus on first for your specific org or not.”
Costs Saved with Runecast Analyzer
- Operational costs of not having to hire an additional team
- Downtime and associated reactive costs (to fix and to business reputation)
- Third-party support costs, as the team relies on external support less and less