paint-brush
Can You Use OpenAI's ChatGPT Without Leaking Your Business's IP?by@artyfishle
963 reads
963 reads

Can You Use OpenAI's ChatGPT Without Leaking Your Business's IP?

by Arty FishleJuly 19th, 2023
Read on Terminal Reader
Read this story w/o Javascript

Too Long; Didn't Read

ChatGPT and OpenAI’s Completion APIs are used by developers to create applications and use state of the art language models. If not used properly, these tools could inadvertently expose your company's intellectual property (IP) in future generative AI models. We’ll talk about the potential risks of using ChatGPT with internal company data and how you can reduce the risk for your company.
featured image - Can You Use OpenAI's ChatGPT Without Leaking Your Business's IP?
Arty Fishle HackerNoon profile picture

In the era of AI, tools like ChatGPT have become a go-to solution for many organizations, bringing improved efficiency and productivity. The data doesn’t lie: odds are, you or your employees are using ChatGPT to draft emails, generate content, perform data analysis, and even assist in coding.


However, if not used properly, these tools could inadvertently expose your company’s intellectual property (IP) in future generative AI models such as GPT-3.5, GPT-4, and eventually GPT-5, meaning any ChatGPT user can access that information.


Case in point: Samsung

Samsung engineers used ChatGPT to assist with source code checking, but The Economist Korea reported three separate instances of Samsung employees unintentionally leaking sensitive information via the tool. This led to confidential source code and recorded meeting contents ending up in the public domain, usable by future iterations of ChatGPT (Source).


Sure enough, OpenAI’s ChatGPT privacy policy is very clear:


When you use our non-API consumer services ChatGPT or DALL-E, we may use the data you provide us to improve our models.


How your data is used to improve model performance


In this post, we’ll talk about the potential risks of using ChatGPT and OpenAI’s APIs with internal company data, and how you can reduce the risk for your company as much as possible. We’ll also discuss other options for your company, like training your own language model that replicates ChatGPT’s functionality or using an open source model. Both of these options offer avenues to get the productivity benefits of ChatGPT without sending data to OpenAI.

Use OpenAI’s Completion APIs

OpenAI’s Completion APIs are used by developers to create applications and use OpenAI’s state of the art language models like GPT-3 and GPT-4, the models that power ChatGPT. These APIs offer an additional level of protection out of the box. Unlike ChatGPT, your data is only viewed by a contracted moderation team and not recycled into future training of OpenAI’s models. Their APIs follow a data policy that doesn’t allow information submitted to be used for training future models (their API data usage policy states your data is only retained for 30 days for abuse and misuse monitoring. Then it’s removed.)


However, depending on the nature of your data submitted to the API, you may decide that using OpenAI’s API is still too risky. Eventually, an OpenAI employee or contractor will look at some of the data you send to the API, and if it contains sensitive, personally identifiable, or personal health information, that could mean loads of trouble.

Disable Chat History & Training

Chat History & Training button on ChatGPT's settings page

At the end of April 2023, ChatGPT released a way to manage your data, a “Chat History & training” button in the ChatGPT settings. With this feature off, any data shared on the platform is not used to train future models. Below the button, there is a note: “Unsaved chats will be deleted from our systems within 30 days”. This 30 days note is likely referring to the abuse and misuse monitoring policy. This brings the same risks as using OpenAI’s APIs as noted above.

Training your own model

Some companies might consider training their own models as an alternative, following the path Samsung reportedly embarked on after their data leakage incident. This approach might seem like a silver bullet: you’d maintain full control over your data, avoid potential IP leaks, and gain a tool tailored to your specific needs.


But let’s pause for a moment. Training your own language model is no small task. It’s resource-intensive, requiring significant expertise, computational power, and high-quality data. Even after developing a model, you’d face the continuous challenges of maintaining, improving, and adapting it to your evolving needs.


Moreover, the quality of language models largely depends on the amount and diversity of data they’re trained on. Given the vast datasets used by companies like OpenAI to train their models, it’s challenging for individual companies to match that level of sophistication and versatility. The companies that do succeed are companies like Bloomberg, which created BloombergGPT from their 40 years of financial data and documents (Source). Sometimes, the data is just not attainable for small companies trying to get a leg up.

Use open source or self-hosted models

The state of the art of open-source models is advancing rapidly. An open-source model can be downloaded and run on your machine, making it self-hostable and eliminating the need for a company like OpenAI to be involved.


Models trained by organizations like Open Assistant are producing remarkable results and are fully open source. Their community is actively collecting data to engage in the same reinforcement learning human feedback (RLHF) loop that OpenAI utilized with ChatGPT. The model’s performance is impressive, especially considering its reliance on the open source community (including my own contributions). However, Open Assistant is transparent about the limitations of their model, acknowledging that their data is biased towards a male, 26-year-old demographic. They only recommend using their model in research settings, demonstrating responsible behavior in disclosing these demographics. Kudos to Open Assistant!


Orca is a promising, unreleased open- source model trained by Microsoft. It is smaller than GPT-3, yet produces on par and sometimes better results than GPT-3. There’s a great video by AI explained on Orca if you’re interested. However, you cannot use OpenAI’s models to train your own models, as this would constitute a violation of OpenAI’s Terms of Service. Orca is explicitly trained on outputs from GPT-3.5 and GPT-4, so Microsoft claims they will be releasing this model only for “research”.


Both of these models are specifically designed for research purposes, making them unsuitable for business applications. After reviewing other open-source models as alternatives, I found that most of them are either derived from Meta’s LLAMA model (thus subject to the same “research” limitations) or too large to run efficiently.


An encouraging option is to leverage a company such as MosaicML to host your inference privately. MosaicML stands out as one of the few commercially available open-source language models. They assert that their MPT-30b model achieves comparable quality to GPT-3. While they do not provide specific benchmarks, I am inclined to trust their claim, as a friend and I began testing one of their smaller models (MPT-7b), and the initial results are promising!

MPT-7b-Chat model answering a question about the differences between nuclear fission and fusion. It provides a cogent and complete response!

Conclusion

Depending on the nature of your data and use cases, using ChatGPT or OpenAI’s API may be unsuitable for your company. If your company doesn’t have policies for what data can be sent or saved in ChatGPT, now is the time to start those conversations.


Misuse of these tools in private business settings can lead to IP leakage. The implications of such exposure are massive, ranging from loss of competitive advantage to potential legal issues.

If you are interested in further exploration of MosaicML’s models, which are among the limited options that are both open source and commercially available for large language models, please let us know! We share the same interest and are excited to further explore this topic together.


If you’re interested in a solution that offers secure, retrieval augmented generation using your own company data, we are developing a tool specifically designed to safeguard your data with SOC2 compliance, integrate with your SSO providers, enable conversation sharing within your organization, and enforce policies on data inputs. Our ultimate objective is to provide ChatGPT quality for your data without any risk of IP leakage. If you’re interested in such a tool, we encourage you to fill out our survey or visit mindfuldataai.com.


Thank you for taking the time to read this post!