Client:
Travel Operator from the US
[ Detailed information about the client cannot be disclosed under the provisions of the NDA ]
Project workflow
Challenge
The client had many microservices developed by many different teams across various time zones. Response and resolution times for infrastructure issues were so long that developers could be blocked for days (e.g., can’t access build machines, don’t have permissions assigned).
Releases were usually delayed by non-development issues.
Solution — Preliminary Investigation
We were part of the DevOps team, which consisted of 2 DevOps engineers assigned to this challenge.
Discovery showed that:
- Requests from the Application Development Team (ADT) were addressed to the Infrastructure Development Team (IDT), so both teams were blocked until the issue was resolved.
- IDT had no defined workflow or assigned person for handling tickets, and requests arrived simultaneously from different sources (Slack, Microsoft Teams, and Jira). As a result, requests were processed only when someone from IDT had spare time, and application developers had to ping IDT members constantly, which led to endless unnecessary communication and strained relationships between the teams.
- Because there was no centralized solution wiki, many similar requests arrived daily, and IDT members kept reinventing solutions to problems that another team member had already solved but not documented.
- Build environments were poorly automated; developers had to ask manually for more build machines or parameter changes.
The decision was to create a separate Support Team (ST) for the Application Development Team. ST initially consisted of 2 DevOps engineers: 1 for India and 1 for US time zone coverage.
Solution — What was done
- Creation of a centralized Confluence wiki: documenting possible issues and requests along with their solutions for both developers and the support team eliminated unnecessary requests that could be resolved without the support team’s involvement;
- Automation of routine, time-consuming requests such as permission grants or resource creation, which reduced manual effort and streamlined the process;
- A single point of entry for requests, where responses, resolution suggestions, and links to automated requests were provided. Developers were required to use the Jira ticket system and could not seek solutions from the support team through other channels. Creating a Jira ticket involved filling in predefined forms, and Jira suggested possible solutions from the Confluence wiki articles before the ticket was submitted;
- Key Performance Indicators (KPIs) were implemented to track the Support Team’s performance. Developers could rely on the Initial Response Time (IRT) and Average Resolution Time metrics.
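As an illustration of how the two KPIs above can be computed, here is a minimal Python sketch. It assumes each ticket carries a creation time, an optional first-response time, and an optional resolution time; the `Ticket` class and its field names are hypothetical and do not reflect the Jira schema or the tooling actually used on the project.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Ticket:
    """Hypothetical ticket record; field names are illustrative only."""
    created: datetime
    first_response: Optional[datetime] = None
    resolved: Optional[datetime] = None


def average_initial_response_time(tickets: List[Ticket]) -> Optional[timedelta]:
    """Mean time from ticket creation to the first support response.

    Tickets that have not been answered yet are skipped.
    """
    deltas = [t.first_response - t.created for t in tickets if t.first_response]
    return sum(deltas, timedelta()) / len(deltas) if deltas else None


def average_resolution_time(tickets: List[Ticket]) -> Optional[timedelta]:
    """Mean time from ticket creation to resolution.

    Tickets that are still open are skipped.
    """
    deltas = [t.resolved - t.created for t in tickets if t.resolved]
    return sum(deltas, timedelta()) / len(deltas) if deltas else None
```

Both functions return `None` when no ticket qualifies, so a dashboard can tell "no data yet" apart from a zero-length interval.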
In addition, proactive actions were taken:
- Development infrastructure was covered with monitoring and proactive alerts so that possible issues could be fixed before they blocked the Application Development Team;
- Build environments were made more flexible using tools like Jenkins and ECS. Developers gained control over various deployment options, including scaling build containers up and down automatically and customizing job configurations.
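The scaling of build containers mentioned above can be pictured with a small Python sketch of a sizing rule. It is an assumption for illustration, not the project's actual logic: the function name, the `jobs_per_container` ratio, and the default limits are all hypothetical, and the sketch only computes a target count without calling Jenkins or ECS.

```python
import math


def desired_build_containers(
    queued_jobs: int,
    jobs_per_container: int = 2,   # hypothetical capacity of one build container
    min_containers: int = 1,       # hypothetical floor so builds can always start
    max_containers: int = 10,      # hypothetical ceiling to cap cost
) -> int:
    """Return how many build containers to run for the current queue length."""
    if jobs_per_container < 1:
        raise ValueError("jobs_per_container must be at least 1")
    # Containers needed to serve every queued job, rounded up.
    needed = math.ceil(max(queued_jobs, 0) / jobs_per_container)
    # Clamp the result between the configured floor and ceiling.
    return max(min_containers, min(max_containers, needed))
```

With the defaults, an empty queue keeps one container running, and a very long queue is capped at ten. The returned number would then be handed to whatever mechanism sets the container count in the build environment.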
Results
Decrease of IRT
Decrease in the number of requests
Timings
The project implementation took 3 months, followed by 3 months of gathering result data.
In 6 months, ST achieved the following results:
- Response time decreased drastically, down to 15 minutes at most;
- The number of requests decreased from 30-50 per day to 10 per week;
- Request automation, Confluence guidelines, and runbooks were created, which enabled the transition from the Support Team to Level 1 Engineers who could efficiently maintain the system.
Benefits for the client:
- Increased development team efficiency: By removing non-development tasks and addressing infrastructure issues promptly, the development team’s efficiency improved.
- Eliminated release delays due to infrastructure issues: The client achieved zero releases postponed because of infrastructure issues.
- Increased feature deployment rate: The average number of features per release significantly increased, allowing the client to deliver more value to its customers.
Technologies used
- AWS
- Jenkins
- Node.js
- Chef
- Terraform
- GitHub
- Vault
- PostgreSQL
- Splunk
- RabbitMQ
- Python
- Bash