I feel the best way to learn something is by doing, ideally by picking up a project you are passionate about. You can watch all the courses and MOOCs you like, but they are of little use if you don’t put them into practice. So recently I started working on a fun project: analyzing open source software releases and trying to predict whether a release will turn out to have many bugs or few. This is a straightforward binary classification problem (0 or 1). Such a machine learning model could help release managers and project managers plan their software releases better based on the resources available.
Unfortunately, I couldn’t find any dataset of software releases with information such as the number of commits, the files changed, and the bugs raised after the release, so I decided to create one. My first choice was to fetch the data from GitHub, which has a nice API. You can find my GitHub repository here; it will help you get started exporting all the releases for a project, along with the issues raised, into a CSV file.
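To give a feel for the export step, here is a minimal sketch that is not the code from my repository. It assumes GitHub's REST endpoint for listing releases and an unauthenticated request (which is rate limited); the helper names `fetch_releases` and `releases_to_csv` and the choice of columns are hypothetical.

```python
import csv
import io
import json
import urllib.request

# Hypothetical column choice; these keys appear on GitHub release objects.
FIELDS = ["tag_name", "name", "published_at", "prerelease"]


def fetch_releases(owner, repo, per_page=100):
    """Fetch one page of releases for owner/repo (unauthenticated request)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/releases?per_page={per_page}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def releases_to_csv(releases):
    """Flatten a list of release dicts into CSV text with the FIELDS columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for release in releases:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({field: release.get(field, "") for field in FIELDS})
    return buffer.getvalue()
```

Calling `releases_to_csv(fetch_releases(owner, repo))` would then give CSV text for the first page of releases. Commits, changed files and issues would need further API calls that this sketch leaves out.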
One great way to understand your data better is exploratory data analysis (EDA). Plot as many charts as you can, both between pairs of independent variables (features) and between each independent variable and the target. For binary classification it is also important to check whether your data has a class imbalance problem. I prefer to use Tableau for EDA; to learn more about how to perform EDA, please refer to this article.
To see the EDA I performed for this exercise, please check this link.
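If you want a quick imbalance check outside Tableau, a few lines of Python are enough. This is only a sketch: the function name `class_balance` and the example labels are made up for illustration and are not taken from my dataset.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: share of rows} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels: 1 = release with many bugs, 0 = release with few bugs.
example_labels = [0, 0, 0, 1, 0, 1, 0, 0]
print(class_balance(example_labels))
```

If one class holds a large majority of the rows, plain accuracy becomes a misleading metric, because always predicting that class already scores well.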
Now on to the most fun part of the exercise: feature generation and machine learning. I started with the following features, which I felt could predict the quality of a release.
I trained an XGBoost classifier on these 5 features to get a baseline and, as I expected, it wasn’t that good.
Obviously, the baseline makes a lot of implicit assumptions about things like the complexity and size of the project and the contributors involved, so I decided to dig for more features that could help capture the complexity and popularity of the project. I added the following features, which are predictors of the project’s popularity.
GitHub social links
Let’s check whether the accuracy improves. It looks promising: a jump of 4%. Let’s keep going.
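The add-a-feature-and-recheck routine I follow throughout this post can be written down as a small loop. The sketch below is an assumption about how one might automate it, not the code from my notebook: `select_feature_groups` is a hypothetical helper, and `score_fn` stands in for whatever trains the model on a list of features and returns its accuracy.

```python
def select_feature_groups(baseline, candidates, score_fn):
    """Greedily keep each candidate feature group only if it raises the score.

    baseline   -- list of feature names to start from
    candidates -- list of (group_name, [feature names]), tried in order
    score_fn   -- callable taking a feature list and returning a score
    """
    kept = list(baseline)
    best = score_fn(kept)
    history = [("baseline", best, True)]
    for name, features in candidates:
        trial = kept + list(features)
        score = score_fn(trial)
        improved = score > best
        history.append((name, score, improved))
        if improved:
            # Keep the group; otherwise it is rolled back implicitly.
            kept, best = trial, score
    return kept, best, history
```

The returned `history` records every attempt, so you can see which groups were kept and which were dropped. Keep in mind that small differences in accuracy may just be noise from the train/test split.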
One of the best practices in software development is unit testing. But how do we quantify unit testing as a number? How about counting the modified files that sit under the tests folder, or the modified files that have the word “test” in their name? Each project has its own naming convention for unit tests, so I manually checked how each project in my dataset organized its tests and wrote patterns to count all the test files changed during a release. These are the features I ended up with.
Let’s see if adding a feature for unit testing improves our accuracy score. It does.
Software releases are an ongoing process: features get added and bugs get fixed from one version to the next. Does the order of a release have an impact on the number of bugs raised? Are the initial releases buggier? Let’s see if adding the order of the release to the dataset improves the accuracy in any way.
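As an illustration of the pattern idea, here is a minimal sketch. The three regular expressions are hypothetical examples of common conventions, not the actual per-project patterns I wrote, and `count_test_files` is an illustrative name.

```python
import re

# Hypothetical patterns; real projects need their own after manual inspection.
TEST_PATTERNS = [
    re.compile(r"(^|/)tests?/"),      # files under a test/ or tests/ folder
    re.compile(r"(^|/)test_[^/]*$"),  # file names starting with test_
    re.compile(r"_test\.[^/.]+$"),    # file names ending with _test.<ext>
]


def count_test_files(changed_paths, patterns=TEST_PATTERNS):
    """Count the changed files whose path matches at least one test pattern."""
    return sum(
        1
        for path in changed_paths
        if any(pattern.search(path) for pattern in patterns)
    )
```

Anchoring the patterns on path separators avoids false positives such as `src/latest/file.py` or `docs/testing.md`, which merely contain the letters "test".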
Looks like it does have an impact on the bugs raised.
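One simple way to derive such an order feature is to number each project's releases by publication date. The sketch below assumes hypothetical `project` and `published_at` keys, with timestamps in one consistent ISO 8601 format so that sorting them as strings gives chronological order.

```python
from collections import defaultdict


def add_release_order(releases):
    """Return copies of the release dicts with a 1-based 'release_order'."""
    by_project = defaultdict(list)
    for release in releases:
        by_project[release["project"]].append(release)

    ordered = []
    for project_releases in by_project.values():
        # ISO 8601 timestamps in the same format sort chronologically as text.
        project_releases.sort(key=lambda r: r["published_at"])
        for position, release in enumerate(project_releases, start=1):
            ordered.append({**release, "release_order": position})
    return ordered
```

Numbering by date rather than by version string is a design choice: version tags do not always sort cleanly, while publication dates do.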
How would you quantify someone’s experience and skills as a number? It is very subjective. I decided to calculate the total number of prior contributions made by the contributors to a particular release, add it as a feature, and check whether it improves the accuracy. As you can see below, adding this feature does not look like a good idea, so I decided to roll it back.
You can find the Python notebook that I created here, along with the sample dataset. The next step would be to perform hyperparameter tuning to get even better accuracy.
This exercise proved to me the value of domain knowledge in data science: because I knew a few things about software development, I could add new features that increased the accuracy of the model. Overall, it was a fun exercise for learning the concepts and putting them into practice. I really feel this is a fun way to enhance your skills, and I hope everyone who is learning machine learning picks up a project they are passionate about and creates something beautiful.
For any feedback or questions, feel free to connect with me on LinkedIn.
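For the curious, here is one possible reading of that feature as code. It is a sketch under stated assumptions: a "contribution" is taken to mean a commit in the same project, releases are given in chronological order, and the name `prior_contribution_totals` is hypothetical.

```python
from collections import Counter


def prior_contribution_totals(releases):
    """Sum, per release, the earlier commits made by that release's contributors.

    releases -- list of (release_name, [author of each commit]) tuples,
                in chronological order (an assumption of this sketch)
    """
    commits_so_far = Counter()
    totals = {}
    for name, authors in releases:
        contributors = set(authors)
        # Only commits from earlier releases count as prior experience.
        totals[name] = sum(commits_so_far[author] for author in contributors)
        commits_so_far.update(authors)
    return totals
```

With this definition, the first release of a project always scores 0, since nobody has prior commits yet.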