Gen AI for data privacy
I set out several months ago to deeply understand and engage with the modern AI tooling that is in the process of revolutionizing (or at least sensationalizing!) the world of Web Development as we know it. I had a single purpose: to build a theoretically scalable system that could leverage this plethora of new technologies. And one that wouldn't bankrupt me in the process.
I picked a use case that interests me and one I felt was ripe for Generative AI: the world of data privacy. I built a tool that can scan any public-facing web URL, navigate it, and register all the network requests it makes. We then analyze these requests and process the information using Generative AI.
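The scanner itself drives a headless browser (not shown here); once the requests are captured, the first analysis question is who actually receives the data. A minimal sketch of that classification step, assuming the captured URLs are already in hand (all domain names are invented):

```python
from urllib.parse import urlparse

def classify_requests(page_url: str, request_urls: list[str]) -> dict[str, list[str]]:
    """Bucket captured network requests into first-party vs third-party."""
    site_host = urlparse(page_url).hostname or ""
    buckets: dict[str, list[str]] = {"first_party": [], "third_party": []}
    for url in request_urls:
        host = urlparse(url).hostname or ""
        # Naive suffix match: cdn.example.com counts as first-party for example.com.
        if host == site_host or host.endswith("." + site_host):
            buckets["first_party"].append(url)
        else:
            buckets["third_party"].append(url)
    return buckets

captured = [
    "https://example.com/app.js",
    "https://cdn.example.com/logo.png",
    "https://tracker.adnetwork.io/pixel?uid=123",  # the interesting one
]
print(classify_requests("https://example.com", captured)["third_party"])
```

The third-party bucket is what gets handed to the LLMs for analysis; in reality domain matching needs a public-suffix list, but the suffix check conveys the idea.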
Gen AI is a valid use case here because data privacy is complicated: it is difficult to understand the meaning and consequences of compliance. I believe that Generative AI is mostly useful as a data synthesizer and reducer. Much of the current frustration comes from the erroneous use of LLMs: inputting a nugget of data and expecting Gen AI to mass-produce gold. It inevitably churns out rubbish. But if you input a high concentration of quality data and ask Gen AI to condense it into something useful, that's when you get valuable output.
So, what do I want Gen AI to do? Simple, really. As someone who has been doing Web Development for nearly 20 years, I still find myself unable to answer seemingly trivial questions:
- Do I need permission to track user data? Which types? What is "user" data anyway? An IP address?
- I'm not saving that PII, so it's legal, right? Right?
- When can I set cookies?
- But I use a cookie to track my cookie consent. That's allowed, I guess? But what is a "functional" cookie anyway?
- What kind of user tracking is permitted with and without consent?
- How do I need to ask for consent?
- Can I save consent? How do I even persist negative consent?
- What is the difference between consent for tracking and cookies? Do I need both? Is that one button or two?
- What are the consequences of not respecting consent?
- How does this vary by geography?
- Is Google Analytics legal for use in Europe?
Quite honestly, a lot of the confusion above arises because data privacy consent is a grey area with room for interpretation. This isn't helped by differences in legislation between geographies. But the key is that GPT-4o and Llama 3 excel at interpreting vast amounts of data and explaining it in simple language. Perfect, thank you.
So, I set out to gather as much hard evidence as I could about what actually happens during a simple navigation to a public-facing document (i.e., a website!). We map this evidence against our understanding of the legislation, and we arrive at a system capable of testing the data processing flow of any public website.
Woohoo. But you aren't here for the cookies, are you? You're here for the AI…
One little system, one bucket load of AI.
- OpenAI (GPT-4o / DALL-E 3): We use this to analyze the COMPANIES that are the ultimate processors of the ingested data.
- Groq (Llama3-70b-8192): We use this to analyze the REQUESTS where the data is transmitted from the public document.
- Grok (https://developers.x.ai/): We use this to analyze sentiment and trends to inform our content generation strategy.
- Brave API (https://brave.com/search/api/): We use this to research public information on actors identified within our system.
- Algolia search (https://www.algolia.com/): We use this to intelligently map unstructured data into a lovely SQL database.
Did you say 5 APIs?
Five different AI products?
How was my experience with this mesh of AI? Very, very hard…
It nearly broke me. So, how did we end up with a platform that has five different AI integrations anyway? Experimentation, repetition, and a fair bit of lunacy. It's a pattern. But when we untangle the system we have created, each component part makes sense.
The first trade-off is Groq vs ChatGPT. ChatGPT is, of course, the flagship product of OpenAI, the first worm out of the proverbial can. And their first-mover advantage shows: their API and models are more refined, and this is clear from the quality of the output. So I use ChatGPT for long-form content, and the quality of the results is indisputable.
BUT.
It's expensive. I woke up in sweats several times a week worrying what would happen if somebody, anybody, actually used this platform I'd built. A great experiment, but one I'm willing to bankrupt myself for? Not likely.
Groq changed everything. Their API costs 100x less. It's fair to say that had I not discovered Groq, I would probably never have released this blog, simply out of fear of the cost. The quality of GPT-4o over Llama 3 is noticeable. But the price of Llama 3 on Groq is quite literally 1% of the price of GPT-4o.
I use OpenAI when we need content of the absolute highest quality. I use Groq when I need to process lots of information.
I have built a killswitch to turn everything to Groq at a second's notice. This switch is the difference between being able to launch or not.
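The post doesn't show the killswitch itself, so here is a minimal sketch of how such a router could look. The `AI_KILLSWITCH` variable name and the task labels are my own inventions; only the routing idea comes from the text:

```python
import os

def pick_provider(task: str) -> str:
    """Route a task to a provider. The AI_KILLSWITCH env var (a made-up name)
    forces everything onto the cheap provider at a second's notice."""
    if os.environ.get("AI_KILLSWITCH") == "1":
        return "groq"        # cheap bulk model: Llama 3 on Groq
    if task == "long_form":
        return "openai"      # highest quality: GPT-4o
    return "groq"            # default: process lots of information cheaply

print(pick_provider("long_form"))    # openai: normal quality routing
os.environ["AI_KILLSWITCH"] = "1"
print(pick_provider("long_form"))    # groq: the switch overrides everything
```

A single environment variable means the downgrade can be flipped in production without a deploy, which is what makes it a genuine killswitch rather than a config option.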
gpt-4o: $5.00 / $15.00 per 1M tokens (input/output)
llama3-8b: $0.05 / $0.08 per 1M tokens (input/output)
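A quick sanity check on those rates makes the gap concrete. The token counts below are invented purely for illustration:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_per_m: float, out_per_m: float) -> float:
    """Dollar cost of one call at per-million-token prices."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# A single 2,000-tokens-in / 1,000-tokens-out call at the prices above:
gpt4o = call_cost(2000, 1000, 5.00, 15.00)   # $0.025
llama = call_cost(2000, 1000, 0.05, 0.08)    # $0.00018
print(f"${gpt4o:.5f} vs ${llama:.5f}: {gpt4o / llama:.0f}x cheaper")
```

At these rates a few cents per call sounds harmless, until the call counts reach the hundreds of thousands.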
So our AI count is at two out of the door…
Where does Brave AI Search come in?
You could easily interchange this with Perplexity AI or something similar. I was very impressed with the API offering. Brave found its way into the stack as I was building my own SERP crawler and researcher. Mine was rubbish and consumed a lot of time; Brave's was excellent. Mine worked 50% of the time; Brave's, 95%. To generate high-quality content for people, we need to solve several puzzles: we need thorough research, and we also need to know what is interesting to the user.
Brave's search API is excellent for AI content research. It provides links and references and surfaces high-traffic suggestions so users can follow the content rabbit trail. Without the research from Brave, the results from ChatGPT and Groq would be spam. It is a wonderful AI that feeds research and data into our AI. That's a 2024 phrase if I've ever heard one.
Three down…
Onto the most controversial selection. Grok by Twitter (X) is an LLM with a difference: it has built-in social media retrieval (I imagine it has some kind of proprietary RAG). How does this help?
This helps us understand content and topics that are trending and new. So before we research and generate content, we need to understand the hot topics.
I'm not yet convinced of the viability of Grok, but the potential to plug into real-time sentiment and use this as a search and content generation strategy is an exciting one for me. Put this one down as experimental. I'll keep you posted.
So we end up with Algolia. Why do we need an AI-powered search on top of our AI-powered research and generative AI?
This comes down to how I've structured my platform. We'll go deeper into the how and the why later in this article, but to build a world-class platform, we need to cover some of the basics. In my old-school paradigm, you can't have world-class content without a world-class CMS. A world-class CMS requires clean, structured data. SQL.
We use Algolia to weave and map together the content from our different systems. It's hard to strictly constrain the output of text generation models (the same company might come back as Shopify, Shopify's App, or Shop App). Getting JSON output is more or less stable these days. But converting JSON output to SQL with references between content types is tricky because of the unstructured nature of text generation. Algolia bridges this gap by condensing "similar" content into unique SQL rows that can be consumed by a website.
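Algolia's matching is its own product; as a stand-in to illustrate the idea of collapsing near-duplicate names into one canonical row, here is a toy version using stdlib string similarity (the 0.5 threshold is tuned purely for this example):

```python
from difflib import SequenceMatcher

def condense(names: list[str], threshold: float = 0.5) -> dict[str, str]:
    """Map each raw name to the first canonical name it resembles,
    so near-duplicates collapse into a single SQL row."""
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for name in names:
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, name.lower(), c.lower()).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(name)   # first of its kind becomes the canonical name
            match = name
        mapping[name] = match
    return mapping

# The Shopify example from the text: three spellings, one entity.
print(condense(["Shopify", "Shopify's App", "Shop App"]))
```

A real pipeline would use a proper entity-resolution service rather than character similarity, but the shape of the problem is the same: many noisy strings in, one foreign key out.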
It's not perfect. But it works (95% of the time).
So here we are: five AIs, forged in the hype boom, with one simple platform to rule them all.
It was hard.
It nearly broke me.
We move from theoretical to engineering concerns: chaining AI API calls to create a tolerable product. So why is using AI so hard?
Fundamentally, there is one simple reason. The Internet is now fast. We expect things to be fast. Even AWS API Gateway HTTP requests time out after a maximum of 30 seconds.
But generative AI?
Just crafting an image with researched content can require up to five chained calls to various APIs:
- Identify the content (sentiment analysis w/ Grok)
- Research the content (AI search w/ Brave)
- Generate the content (Gen AI w/ Llama 3 / GPT-4o)
- Generate the image (Gen AI w/ DALL-E 3)
- Save the content and image into SQL (Algolia)
Example content: HotJar data privacy analysis on privacytrek.com
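The five steps above can be sketched as a chain in which each output feeds the next call. Every function here is a stub standing in for the real API named in its comment:

```python
def find_trends(topic: str) -> str:              # Grok: trending angle / sentiment
    return f"trending angle on {topic}"

def research(angle: str) -> list[str]:           # Brave: search results for the angle
    return [f"source A about {angle}", f"source B about {angle}"]

def write_article(sources: list[str]) -> str:    # Llama 3 / GPT-4o: long-form draft
    return "article based on " + "; ".join(sources)

def make_image(article: str) -> str:             # DALL-E 3: illustration
    return f"image for: {article[:40]}..."

def save(article: str, image: str) -> dict:      # Algolia -> SQL: persist the result
    return {"article": article, "image": image}

def generate_page(topic: str) -> dict:
    angle = find_trends(topic)                   # call 1
    sources = research(angle)                    # call 2
    article = write_article(sources)             # call 3
    image = make_image(article)                  # call 4
    return save(article, image)                  # call 5

page = generate_page("cookie consent")
```

Because each call blocks on the previous one, the latencies add up; five sequential API round-trips is exactly why a single page can take minutes to produce.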
It's very hard to build something fast when the underlying APIs are so slow. You won't reliably get quality output generated in under a minute, especially as you need to knit together disparate APIs to build anything resembling quality content.
The perfectionist in me refuses to wait so long to deliver results on a website. We've come too far.
What's the solution? Streaming? Websockets? Background processes? It's complicated…
I tried every single one of the above. I hated every single one for different reasons; we could write a blog about each…
I spent almost a week building and tweaking a RabbitMQ broker so that my platform could subscribe to content from the backend responsible for negotiating with this mesh of AI APIs. I was so proud of myself; it was wonderful. It was also absolute rubbish. I deleted it. You know the saying: "Every person has a book in them. Most should keep it there." The same applies to Software Engineers and their AI ideas.
It's so easy to go off on a tangent and build around the problems inherent in artificial intelligence tools. I did it many times until I reluctantly accepted that you can't make an elephant run, and that we needed a different approach. You should, too. Like horse-and-carriage congestion in the early 20th century, eventually it won't be a problem. But until it isn't, it is.
Users expect fast web experiences; a sprinkle of AI will only sate their patience for so long before those experiences become onerous and frustrating. So, to use AI at scale, we need to fetch our data before the user has even arrived. The number one rule for leveraging AI: derive the value from our business logic long before the user arrives.
The key to the kingdom is to use every word and every image. Every scrap of expensive generated content should be treated like proverbial gold. This means vigilant control of both inputs and outputs.
And so I save every API request, and I thoroughly research every API call I make. I test, and I tweak, I iterate, and I learn until I can bend the tool to my will.
AI Costs $$$
Treat AI API calls with the respect they deserve.
AI is prohibitively expensive. Imagine paying 5c for every API call you make to your CMS. I challenge you to do some matchstick math in your observability platform: just look at the logs of any modern software system; requests are typically measured in the hundreds of thousands, or millions…
To make AI valuable at scale, we can't treat its output as transient or ephemeral. The first thing I learned about working with AI APIs is to save every response, output, and image. It can be used later. And one of the ironic properties of AI output is that the more you refine and reuse it (condensing), the more valuable and realistic it becomes. Just make sure we conserve the building blocks, or you will quite literally pay the price.
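"Save every response" can be as simple as a content-addressed cache in front of the generation call. A minimal sketch, with an in-memory dict standing in for whatever store the real platform uses:

```python
import functools
import hashlib

_store: dict[str, str] = {}   # stand-in for a real database or object store

def cached_llm(fn):
    """Persist every generation keyed by its exact input; never pay for
    the same prompt twice."""
    @functools.wraps(fn)
    def wrapper(prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in _store:
            _store[key] = fn(prompt)   # the only place money is spent
        return _store[key]
    return wrapper

calls = 0

@cached_llm
def generate(prompt: str) -> str:     # stub for the real (expensive) API call
    global calls
    calls += 1
    return f"output for: {prompt}"

generate("summarize HotJar's data flows")
generate("summarize HotJar's data flows")
print(calls)   # 1: the second call was free
```

Keying on the exact prompt also preserves the building blocks, since every saved response can later be fed back in as input for condensing.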
It's funny how paying for, and being on the hook for, your own system really takes you back to basics as an engineer. Nothing strikes fear into a developer more than a faulty API call that could accidentally cost thousands of dollars. Nothing will make me optimize my API fallback strategy like the fear that an accidental loop could bankrupt me. Frankly, we should treat normal APIs with the same respect, but caching, cheap processing, and laziness have made that discipline feel redundant.
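The post doesn't show the fallback strategy itself, but one safety net it implies is a hard cap on cumulative spend, so a runaway loop trips a guard instead of the credit card. A sketch with invented names and prices:

```python
class BudgetExceeded(RuntimeError):
    pass

class SpendGuard:
    """Hard ceiling on cumulative API spend."""
    def __init__(self, limit_usd: float):
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, cost_usd: float) -> None:
        # Refuse the call BEFORE spending, so the cap is never breached.
        if self.spent + cost_usd > self.limit:
            raise BudgetExceeded(f"call would exceed the ${self.limit:.2f} cap")
        self.spent += cost_usd

guard = SpendGuard(limit_usd=1.00)
completed = 0
try:
    for _ in range(1000):        # an accidental retry loop...
        guard.charge(0.25)       # ...of calls costing $0.25 each
        completed += 1
except BudgetExceeded as exc:
    print(f"stopped after {completed} calls: {exc}")
```

In production the counter would live in shared storage and reset on a billing cycle, but the principle is the same: the guard fails closed.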
The founding principle of working with these APIs is to treat every output from an LLM with respect. Spend time considering the inputs and the outputs. Prompt engineering, RAG, and vector DBs are the buzzwords; the principle is far simpler. Every question or input to a Gen AI system costs you real dollars. Have you optimized that input to ensure that what comes out is actually valuable? Or are you simply pounding away at a broken slot machine, throwing your money down the drain?
I spent a long time crafting every user and system prompt, optimizing the inputs and the outputs to ensure that what comes out of the LLM is of value. I failed more than I succeeded. It took me a long time to get beautifully crafted, artisan image representations of my companies. I spent days trying to use ChatGPT to create an icon library (bad idea). The more you work with these APIs, the easier it is to see the cracks. It's so easy to get rubbish output; if you haven't carefully automated and scaled the input, rubbish is the most likely outcome.
But when the robot gets it right, it becomes something very special indeed.
Ask ChatGPT
Just ask ChatGPT?
In my experience, the inverse is true; these LLMs aren't generalists at all but specialists. I don't know why this is surprising. Machine learning algorithms have always been thus. We have object detection models to detect objects in images. We have structured data extraction algorithms to extract data from text. We wouldn't expect our object detection algorithm to extract structured data from text, right? But that is exactly what we expect from our LLMs. One superhuman AGI robot to rule them all. Absolute popsicle…
Nothing screams amateur to me more loudly than companies building a wrapper around ChatGPT and assuming that "AI" will solve their problem. LLMs have no AGI at all, not even intelligence. They are capable of processing extremely large datasets. The very thought that they are a silver bullet for every problem shows me that not much thought has been given at all.
What are they specialists at? Condensing large amounts of information into valuable, smaller, intelligible versions of the same. Ironically, this is the exact opposite of the majority of use cases. Welcome to the trough of disillusionment.
LLMs are a tool in the armory that can solve problems in new and inventive ways. They have opened doors that we didnāt even know existed. So what next for this great experiment?
We are at the beginning.
I guess I've got a proof of concept, and my goal is to convert this into a functional, modern platform. There are challenges to overcome.
The Field of Dreams conundrum: I've built it. Will they come? Experience tells me probably not.
I need to turn this platform into a self-aware, SEO-optimized monster. I'm going to use AI and the tools I've woven together to craft and scale human-consumable content and bring Data Privacy analysis to the world…
I'm not sure how far I'll get, but it's turning out to be a wonderful adventure…
Do come along for the ride.