Google Trends (GT) is an under-utilized superweapon that harvests a massive amount of search data. But it hasn't been possible to use GT for real-time machine learning tasks, such as predicting stock price or cryptocurrency movements, until now.
In this blog post, we'll explain the problem with GT for machine learning, our fix for GT data and the edge we've built in crypto trading models at edgebase.io.
We are currently looking for experienced crypto traders as beta testers for our product - please reach out to [email protected]! Edgebase.io is a no-code platform for building your own AI trading signals (initially cryptos only).
What's the problem?
Let's assume you want to test whether Google Trends keywords, such as "bitcoin", are leading indicators of the price movement for BTCUSD on the Kraken Exchange. You'll need a method to grab historical data for building ML models and the equivalent real-time data to evaluate those models and make predictions. Easy? Not quite.
GT gives different results across different time windows:
Bitcoin for past 30 days on 26th April 2021
Bitcoin for past 90 days on 26th April 2021
Above we can see the "interest over time" of "bitcoin" for the past 30 days and the past 90 days. Looking back 30 days, the value on 23rd April 2021 is 100, whereas looking back 90 days the value on 23rd April is 76.
Google Trends reports search volume relative to the peak within the chosen time window. In the past 90 days, the day with the highest search count was 9th February 2021, and thus 23rd April gets reduced from 100 to 71. This is a problem for machine learning because historical GT data cannot be aligned with new real-time data.
In fact, the problem is even worse! GT also uses a sample of the true search count. Even if we query the same date range, the results can suffer from huge variance due to small sample sizes. A group of German researchers made headlines in 2020 when they published a paper showing huge discrepancies for "kurzarbeit", the German word for placing workers on partial furlough. This was particularly concerning because German politicians were actively using "kurzarbeit" GT data for setting policy during the coronavirus pandemic.
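To make the reindexing effect concrete, here is a minimal sketch in Python. It assumes a simplified model in which GT just rescales raw daily counts so that the busiest day in the window equals 100; the real GT normalization has more steps, and the counts below are made-up numbers, not real search data.

```python
def gt_index(counts):
    """Rescale counts so the window's peak is 100 (simplified model of GT)."""
    peak = max(counts)
    return [round(100 * c / peak) for c in counts]

# Hypothetical raw daily search counts, oldest first (illustrative only).
raw = [50, 200, 80, 120, 150]

long_window = gt_index(raw)       # window that includes the early peak
short_window = gt_index(raw[2:])  # window covering only the last three days

print(long_window)   # [25, 100, 40, 60, 75]
print(short_window)  # [53, 80, 100]
```

The most recent day scores 75 in the long window and 100 in the short one, even though the underlying count is identical. This is the same kind of mismatch as in the two charts above.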
What's the fix?
Our fix at edgebase.io is to use our proprietary correction algorithm for GT data. Starting from a 'seed date', we iteratively track changes in the GT reindexing and correct any reindexing over time. This means that different GT sample windows can be stitched together and provide a consistent view of the relative search count over time.
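The edgebase.io algorithm itself is proprietary and is not reproduced here. As a rough illustration of the general idea of stitching windows, the sketch below uses a simple, widely known approach: rescale an older window by the ratio of values on the days it shares with a newer window. The function name `stitch`, the overlap length and the index values are all hypothetical.

```python
def stitch(older, newer, overlap):
    """Put `older` on the scale of `newer` using their shared days.

    Both lists hold index values, oldest first. The last `overlap` values
    of `older` cover the same days as the first `overlap` values of `newer`.
    Assumes the overlapping values are not all zero.
    """
    old_tail = older[-overlap:]
    new_head = newer[:overlap]
    factor = sum(new_head) / sum(old_tail)  # how much `older` must shrink or grow
    rescaled = [v * factor for v in older[:-overlap]]
    return rescaled + list(newer)

# Hypothetical windows built from raw counts [50, 200, 80, 120, 200, 400]:
window_a = [25, 100, 40, 60]   # days 0-3, peak of 200 mapped to 100
window_b = [20, 30, 50, 100]   # days 2-5, peak of 400 mapped to 100

stitched = stitch(window_a, window_b, overlap=2)
print(stitched)  # [12.5, 50.0, 20, 30, 50, 100]
```

After stitching, every day is expressed relative to the same peak, so the early spike reads 50 rather than 100. A naive ratio like this is sensitive to sampling noise in the overlap, which is one reason a more careful correction is needed in practice.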
We also correct for low sample sizes in GT data by taking multiple samples from the same time window and creating an average sample using the magic of the central limit theorem. The sampling error of the average shrinks by a factor of sqrt(n), where n is the number of GT queries taken for the same window.
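The sqrt(n) effect can be checked with a small simulation. The sketch below does not call GT; it assumes, purely for illustration, that each query returns a hypothetical true value plus independent Gaussian noise, and compares the spread of a single query with the spread of an average of 25 queries.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

TRUE_VALUE = 60.0  # hypothetical true index value for one day
NOISE_SD = 10.0    # hypothetical sampling noise of a single query

def noisy_query():
    """Simulate one GT query as the true value plus Gaussian noise."""
    return random.gauss(TRUE_VALUE, NOISE_SD)

def averaged_query(n):
    """Average n independent simulated queries for the same window."""
    return statistics.fmean([noisy_query() for _ in range(n)])

def spread(n, trials=2000):
    """Standard deviation of the n-query average across many trials."""
    return statistics.stdev([averaged_query(n) for _ in range(trials)])

single = spread(1)
averaged = spread(25)
ratio = single / averaged
print(f"single query sd: {single:.2f}")
print(f"25-query average sd: {averaged:.2f}")
print(f"ratio: {ratio:.2f} (theory: sqrt(25) = 5)")
```

The trade-off is cost: halving the error takes four times as many queries. Real GT samples may not be fully independent, so the actual improvement could be smaller than the simulation suggests.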
Google doesn't provide daily data beyond 90 days, and if you try to stitch together rolling 30 day windows of "bitcoin" GT data you get:
Above there are many peaks in search volume. Some do correspond to bitcoin movements, but this is very noisy and unreliable data. However, if you use the edgebase correction algorithm you get:
Visually, you can see that our corrected series has significantly fewer spikes at 100, closely follows the Bitcoin price and scales over time. But can this be used to predict bitcoin price movements?
What's the edge?
At edgebase.io we quickly train ML models at scale with new datasets, such as our new corrected GT bitcoin data.
Let's start with a simple bitcoin price model. We can drag and drop datasets and create features from these datasets using the Edgebase model builder. In the example below, we take the 11 daily log difference of our new "Bitcoin" GT data and use it to predict the daily bitcoin Kraken closing price:
Now that we have constructed our simple model, let's hit "Train Model":
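For readers unfamiliar with log differences, here is a minimal sketch of how such a feature can be computed outside the model builder. The helper `log_diff` and the sample values are hypothetical, and the lag is left as a parameter rather than asserting the exact setting used in the Edgebase model.

```python
import math

def log_diff(series, lag=1):
    """Return log(x[t] / x[t - lag]) for every t >= lag.

    Values must be strictly positive; a GT value of 0 would need to be
    handled (for example by clipping) before calling this.
    """
    return [math.log(series[t] / series[t - lag])
            for t in range(lag, len(series))]

# Hypothetical corrected GT values, oldest first (illustrative only).
gt_values = [50.0, 100.0, 50.0]

features = log_diff(gt_values, lag=1)
print(features)  # about [0.693, -0.693]: a doubling and a halving are symmetric
```

Log differences treat a doubling and a halving as moves of equal size and opposite sign, which tends to make them better behaved as model inputs than raw differences.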
Pretty impressive! The F1 score is 52, which loosely means we predicted the direction of the bitcoin price movement correctly about 52% of the time (strictly, F1 is the harmonic mean of precision and recall rather than plain accuracy). Our profit is about 40% over 3 months and our Sharpe ratio is 1.24, which approximately means we earned 1.24 units of reward for every unit of trading risk taken.
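Both metrics are straightforward to compute by hand. The sketch below shows the textbook definitions, not necessarily the exact formulas used on the Edgebase platform. It assumes a risk-free rate of zero and annualizes over 365 days since crypto trades every day; the labels and returns are made up.

```python
import math
import statistics

def f1_score(actual, predicted):
    """F1 for binary labels (1 = price up, 0 = price down)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def sharpe_ratio(returns, periods_per_year=365):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    mean = statistics.mean(returns)
    sd = statistics.stdev(returns)
    return mean / sd * math.sqrt(periods_per_year)

# Hypothetical labels and daily strategy returns (illustrative only).
actual = [1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
daily_returns = [0.01, -0.005, 0.02, 0.0]

f1 = f1_score(actual, predicted)
sharpe = sharpe_ratio(daily_returns)
print(f"F1: {f1:.3f}")  # F1: 0.667
print(f"Sharpe: {sharpe:.2f}")
```

Neither number alone says whether a strategy is profitable. A model can be right only slightly more than half the time and still earn well if its correct calls coincide with the larger moves.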
We can improve the model by feeding the logarithmic 3 day (weighted) difference of bitcoin prices into our model as well:
Even better! The F1 score is still 52, but with an improved Sharpe ratio of 2.12 and higher returns of 75%.
This is only one of the datasets we provide to predict cryptocurrencies at edgebase.io. We also have volume data, futures data and other proprietary datasets for bitcoin, Ethereum, dogecoin and other altcoins.
Thank you for reading! If you are an experienced crypto trader, please reach out to [email protected], as we are looking for beta testers.