Machine Learning by Clustering
On the back of the relative success of my last machine learning lottery post, I’ve decided to have a crack at another famous National Lottery game, Set For Life, this time using clustering for a different approach.
It’s worth noting that machine learning is used to find patterns. If something is truly random then there won’t be any real pattern to discern. We can still hope though, right?
What Is Clustering?
Clustering is a type of unsupervised algorithm that groups similar observations together, without being told in advance what the groups are. Any element can then be matched to the group it fits best. There are a few cool tricks to getting things right and I’ll go into a bit more detail on each one as I go through the project.
Getting and Cleaning the Data
Unfortunately, getting the data wasn’t as easy as I’d hoped. At my first port of call, “The National Lottery” website, I could only get the last 3 months’ worth of draws, which isn’t really enough. I did, however, find every draw, machine used and ball set used on another website, so I scraped it all and stored it in a CSV for future use.
It didn’t all fall into place as easily as it’s laid out here. I needed to figure out how the pages were constructed to be able to get the data that I required. One of the more difficult parts was that the numbers were on one page, while the ball set and ball machine were on a separate page behind a link, so I had to follow that link for each draw and add those details to the same row.
Overall, the web scraping took about a minute to fetch everything and convert it into the formats I needed.
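To give a feel for the "follow the link and add it to the same row" step, here is a minimal sketch of the merge and the CSV write. It is not the scraper I actually ran: the page parsing depends entirely on the site, so it is left out, and the column names, the `fetch_detail` callable and the sample values are all made up for illustration.

```python
import csv
import io

# Hypothetical column order for the CSV.
FIELDS = ["date", "ball_1", "ball_2", "ball_3", "ball_4", "ball_5",
          "life_ball", "ball_machine", "ball_set"]


def build_rows(draws, fetch_detail):
    """Merge each draw with the machine and ball set from its detail page."""
    rows = []
    for draw in draws:
        detail = fetch_detail(draw["detail_url"])  # one extra lookup per draw
        row = {"date": draw["date"], "life_ball": draw["life_ball"]}
        # Store the main balls in ascending order, one column per position.
        for position, number in enumerate(sorted(draw["balls"]), start=1):
            row[f"ball_{position}"] = number
        row["ball_machine"] = detail["ball_machine"]
        row["ball_set"] = detail["ball_set"]
        rows.append(row)
    return rows


def write_csv(rows, handle):
    """Write the merged rows with a fixed column order."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up stand-ins for the scraped pages, purely for illustration.
fake_draws = [
    {"date": "2020-01-02", "balls": [31, 4, 17, 9, 42], "life_ball": 7,
     "detail_url": "/draw/1"},
]
fake_details = {"/draw/1": {"ball_machine": "M1", "ball_set": "S3"}}

rows = build_rows(fake_draws, fake_details.__getitem__)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

In a real run, `fetch_detail` would download and parse the linked page, and `handle` would be an open file rather than an in-memory buffer.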
Step one, complete.
Exploring The Data
I was always curious as to why different machines and ball sets were used to achieve a random result. I was, and still am, convinced that the balls have different weights. For example, ball set 1 would be heavier towards the low end, so ball 1 might weigh 1.2g more than ball 49, with the opposite being true for the later ball sets. I also thought the machines would run at different speeds.
Plotting the ball machines and ball sets would be a perfect opportunity to explore this idea.
As I mentioned before, I was always interested to see if there was a pattern in the way the balls were pulled. By colouring the points in the plot by machine (adding a hue), we can see a pattern emerge.
Going from left to right each Ball number is plotted showing the range of the overall position. It’s natural for the ranges to flow from left to right as each ball is in ascending order. The thing that surprises me is that I can see condensed areas of ball machines, which leans towards my initial theory.
Let’s look at the ball sets as well to see if there’s anything there.
There’s the natural flow from left to right as the subplots progress along. I can see some slight examples of all ball sets being used very closely with each other too. I would say that having 9 different ball sets dilutes the pattern here a bit, but it’s good that something shows up at all.
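A plot along these lines can be sketched roughly as follows. This is an illustration rather than my exact plotting code: it uses plain matplotlib with one scatter call per group to get the same effect as a hue, the draws are randomly generated placeholders, and the column names, machine labels and the pool size of 47 are all assumptions.

```python
import numpy as np
import pandas as pd

BALL_COLS = ["ball_1", "ball_2", "ball_3", "ball_4", "ball_5"]
MAX_BALL = 47  # assumed pool size; adjust to the real game

rng = np.random.default_rng(1)
n_draws = 60
# Placeholder draws: five distinct numbers per row, sorted ascending.
balls = np.sort(
    np.array([rng.choice(np.arange(1, MAX_BALL + 1), size=5, replace=False)
              for _ in range(n_draws)]),
    axis=1,
)
draws = pd.DataFrame(balls, columns=BALL_COLS)
draws["ball_machine"] = rng.choice(["M1", "M2", "M3"], size=n_draws)
draws["ball_set"] = rng.choice(["S1", "S2", "S3"], size=n_draws)


def plot_positions(draws, hue):
    """One subplot per ball position, coloured by the `hue` column."""
    import matplotlib.pyplot as plt  # imported lazily; only needed to plot

    fig, axes = plt.subplots(1, len(BALL_COLS), sharey=True, figsize=(15, 4))
    for ax, col in zip(axes, BALL_COLS):
        # One scatter call per group gives each group its own colour.
        for label, group in draws.groupby(hue):
            ax.scatter(group.index, group[col], s=12, alpha=0.7,
                       label=str(label))
        ax.set_title(col)
        ax.set_xlabel("draw")
    axes[0].set_ylabel("number drawn")
    axes[-1].legend(title=hue)
    return fig
```

Calling `plot_positions(draws, "ball_machine")` gives the machine view, and `plot_positions(draws, "ball_set")` gives the ball set view.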
Now that I’ve got 2 different features I can crack on with the clustering.
Figuring Out How Many Clusters I’ll Need
I know that there are 6 balls, so I’ll more than likely need 6 clusters. Things don’t always pan out quite how we want them to, so I’m going to apply the elbow method to discover how many clusters I should ideally include.
From here I’ll exclude the Life Ball from the predictive clustering:
- The range is only 1-10
- If it is included, it will dilute the results for the rest of the prediction.
Within-Clusters Sum of Squares (WCSS)
The within-cluster sum of squares is a measure of the variability of the observations within each cluster. In general, a cluster that has a small sum of squares is more compact than a cluster that has a large sum of squares.
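To make the definition concrete, here is a small sketch that computes the WCSS by hand with NumPy. The function name `wcss` and the four sample points are my own illustration, not part of the project data.

```python
import numpy as np


def wcss(points, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for cluster in np.unique(labels):
        members = points[labels == cluster]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)


points = [[0, 0], [2, 0], [10, 0], [12, 0]]
print(wcss(points, [0, 0, 1, 1]))  # 4.0   (two tight clusters)
print(wcss(points, [0, 1, 0, 1]))  # 100.0 (same points, badly grouped)
```

The same points give a much larger value when they are grouped badly, which is exactly the compactness the measure is meant to capture.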
What we’re looking for is a drastic change in the values as the number of clusters goes up. If we don’t have enough clusters we won’t be able to capture the relative numbers properly; too many and we’ll end up with a useless analysis.
I’ve created two variables: an empty list, and a shortened DataFrame that only includes the observations I want to use, which are:
- Balls 1-5
- Life Ball
- Ball Machine
- Ball Set
You’ll see why it’s called the elbow method.
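The loop behind the elbow plot looks something like the sketch below, assuming scikit-learn's `KMeans`. The features here are random placeholder numbers standing in for my shortened DataFrame (categorical columns such as the machine would need encoding as numbers first), and the range of 1 to 10 clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Placeholder features standing in for the shortened DataFrame.
features = rng.integers(1, 48, size=(300, 5)).astype(float)
features.sort(axis=1)

wcss = []  # the empty list that collects one WCSS value per cluster count
for k in range(1, 11):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(features)
    wcss.append(model.inertia_)  # inertia_ is scikit-learn's name for WCSS


def plot_elbow(values):
    """Plot WCSS against the number of clusters."""
    import matplotlib.pyplot as plt  # imported lazily; only needed to plot

    fig, ax = plt.subplots()
    ax.plot(range(1, len(values) + 1), values, marker="o")
    ax.set_xlabel("Number of clusters")
    ax.set_ylabel("WCSS")
    ax.set_title("The elbow method")
    return fig
```

Calling `plot_elbow(wcss)` draws the curve; the point where it stops dropping sharply and bends is the elbow.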
According to the graph, the optimal number of clusters for this dataset is somewhere between 3 and 7. I think 5 would be perfect, but it’s always worth checking out what each point comes up with.
I’ll reshape the data so that the ball frequency and the machine are present in the features so I can plot the results properly.
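As a rough idea of what that reshape involves, here is a sketch using pandas `melt` and `groupby`. The three draws, the column names and the machine labels are invented for the example.

```python
import pandas as pd

# Three invented draws, purely for illustration.
draws = pd.DataFrame({
    "ball_1": [4, 2, 4],
    "ball_2": [9, 9, 11],
    "ball_3": [17, 20, 17],
    "ball_4": [31, 31, 30],
    "ball_5": [42, 40, 42],
    "ball_machine": ["M1", "M2", "M1"],
})
ball_cols = ["ball_1", "ball_2", "ball_3", "ball_4", "ball_5"]

# Wide to long: one row per (draw, ball position).
long = draws.melt(id_vars="ball_machine", value_vars=ball_cols,
                  var_name="position", value_name="number")

# Count how often each number came out of each machine.
frequency = (
    long.groupby(["ball_machine", "number"])
        .size()
        .reset_index(name="frequency")
)
print(frequency)
```

Each row of `frequency` is now a ball number, the machine it came from, and how many times that pairing occurred.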
Now that I have all the balls, their frequency, and the machine they came from, it’s time to start some clustering.
I’ll loop through a few different cluster counts in the same block, calling fit_predict for each one. Then I’ll create a 4×4 grid of subplots to plot the results on, to get an understanding of which clustering solution is the best one to go for.
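The loop can be sketched like this, again assuming scikit-learn's `KMeans`. The features are random placeholder pairs of ball number and frequency, the candidate cluster counts of 3 to 7 come from the elbow range above, and the grid is sized to the number of candidates rather than fixed.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder (ball number, frequency) pairs.
features = np.column_stack([
    rng.integers(1, 48, size=200),
    rng.integers(1, 30, size=200),
]).astype(float)

candidate_ks = [3, 4, 5, 6, 7]
# fit_predict fits the model and returns one cluster label per row.
labels_by_k = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    for k in candidate_ks
}


def plot_cluster_grid(features, labels_by_k, ncols=3):
    """Draw one scatter plot per cluster count, coloured by label."""
    import matplotlib.pyplot as plt  # imported lazily; only needed to plot

    nrows = -(-len(labels_by_k) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, squeeze=False,
                             figsize=(4 * ncols, 3.5 * nrows))
    flat_axes = axes.ravel()
    for ax, (k, labels) in zip(flat_axes, labels_by_k.items()):
        ax.scatter(features[:, 0], features[:, 1], c=labels,
                   cmap="tab10", s=12)
        ax.set_title(f"{k} clusters")
    for ax in flat_axes[len(labels_by_k):]:
        ax.set_visible(False)  # hide unused panels
    return fig
```

`plot_cluster_grid(features, labels_by_k)` then puts the candidate solutions side by side for comparison.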
It’s great that we’ve got the clusters all sorted. I’m going with 5 clusters in this instance as I have 5 balls in this collection of observations.
We’ve got the clusters, but now what? How are we supposed to get a definitive number from a cluster? Well, we’d use statistics to find the centroid of each cluster. To find the centroid, you sum up all the points in the cluster and then divide by the number of observations: essentially the average. Each centroid gives us one number to put on the ticket.
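Here is a tiny worked example of that calculation with pandas, giving both the mean and the median per cluster. The nine numbers and their cluster labels are invented, and I have used three clusters rather than five to keep it short.

```python
import pandas as pd

# Invented ball numbers with the cluster each one was assigned to.
clustered = pd.DataFrame({
    "number":  [1, 2, 3, 10, 12, 20, 40, 41, 45],
    "cluster": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})

# Mean (the centroid in one dimension) and median for each cluster.
summary = clustered.groupby("cluster")["number"].agg(["mean", "median"])

# Round to whole ball numbers, one pick per cluster.
ticket_from_mean = sorted(summary["mean"].round().astype(int).tolist())
ticket_from_median = sorted(summary["median"].round().astype(int).tolist())

print(summary)
print(ticket_from_mean)    # [2, 14, 42]
print(ticket_from_median)  # [2, 12, 41]
```

The middle cluster shows why the two can differ: the outlying 20 pulls the mean up to 14, while the median stays at 12.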
And The Results Are In
Almost disappointingly, the results are combined into a very small table. I’ve gone for the mean and I’ve also calculated the median, then added in the Life Ball at the end.
Based on this quick example, I’d get two tickets, one for the mean and one for the median. Then I’d be Set For Life!
As I reached the end of this clustering project I realised I was doing a cluster for everything to try to predict the most likely outcome for all ball sets and all machines. A better idea, and also more costly, would be to create clusters for just the machines, then another for just the ball sets. Then more combinations of just the ball sets for each machine.
It would take a fair bit of time splitting up the datasets to fit around a number of different possibilities. I might just do that one day.