Using Machine Learning Technologies to Identify Confidential Documents
Data confidentiality has always been an important issue, and it is even more relevant today. A series of serious data breaches has shown cybersecurity professionals that protecting personal information needs special attention.
To understand the seriousness of the situation, recall the Marriott International breach, in which more than 25 million passport numbers were exposed. A breach of that scale is alarming.
That is why developers are constantly working to mitigate storage vulnerabilities. At the moment, one of the most promising approaches is artificial intelligence. In this article, I will discuss what AI can do in this area and how it works.
Artificial Intelligence Capabilities
An artificial intelligence (AI) system can independently identify whether documents are confidential or not, and then do the following with them:
- Prevent sending confidential documents to external email addresses.
- Encrypt confidential documents before sending them.
- Restrict access automatically, or recommend restrictions on viewing, copying, forwarding, and similar actions.
- Record the sending of a document, along with its relevant parameters, in the audit log.
For example, the recipient may be allowed to open a message only for reading, without the ability to copy the data or forward it further. Once the check is complete, the information security team receives a notification that a confidential document was detected, along with any access restrictions that were applied.
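The enforcement flow described above can be sketched as a small Python routine. The `classify` function is a stand-in for a trained model, and the keyword list, domain name, and action strings are illustrative assumptions rather than a real product API:

```python
# Minimal sketch of the enforcement flow: classify an outgoing document,
# then block, encrypt, or allow it, and record the decision in an audit log.
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

def classify(text: str) -> bool:
    """Placeholder for a trained model: True if the text looks confidential."""
    keywords = ("passport", "salary", "ssn", "confidential")
    return any(k in text.lower() for k in keywords)

def handle_outgoing(document: str, recipient: str,
                    internal_domain: str = "corp.example") -> str:
    if not classify(document):
        return "allow"
    if not recipient.endswith("@" + internal_domain):
        action = "block"    # prevent sending outside the company
    else:
        action = "encrypt"  # encrypt before internal delivery
    # Record the event with its parameters for the security team.
    audit_log.info("doc sent to %s at %s, action=%s",
                   recipient, datetime.now(timezone.utc).isoformat(), action)
    return action

print(handle_outgoing("Attached: passport scans", "partner@other.com"))  # block
```

In a real deployment the keyword check would be replaced by the trained model, and the action handlers would hook into the mail gateway and rights-management system.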
How To Train an AI Assistant
The assistant learns to distinguish confidential documents from non-confidential ones. The starting point is an untrained model that must be configured. Step by step:
- Collecting training data. First, assemble a corpus containing confidential and non-confidential documents in roughly equal proportions, so the model can learn to distinguish the two classes. The more documents the better; tens of thousands is a reasonable target. This is not an easy task, but it is a must.
- Data preparation. Extract the text from all documents and sort it if required. Then convert the resulting text into vectors (for example, TF-IDF features).
- Model development. Using a framework such as scikit-learn for Python or ML.NET for C#, create a model ready to receive the prepared data.
- Model training. Load the data from the first two steps into the model and train it on the labelled confidential and non-confidential examples.
- Deployment. Embed the trained model in a module that receives, processes, and analyzes incoming data, then test that module.
- Retraining. To keep the model effective and relevant, retrain it periodically. A group of company experts maintains an augmented register of new confidential and non-confidential documents, which is then used to retrain the model.
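The steps above can be sketched in Python with scikit-learn (the article mentions ML.NET as one option; this is an equivalent sketch, not the only way). The tiny inline dataset is purely illustrative; a real corpus would contain tens of thousands of labelled documents:

```python
# Sketch of the training pipeline: vectorize labelled documents with TF-IDF,
# train a classifier, then use it to score new text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: a balanced set of labelled examples (toy data for illustration).
docs = [
    "employee salary report and passport numbers",   # confidential
    "internal audit of customer payment details",    # confidential
    "company picnic announcement for all staff",     # public
    "weekly newsletter about office events",         # public
]
labels = [1, 1, 0, 0]  # 1 = confidential, 0 = not

# Steps 2-3: text vectorization and model creation in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Step 4: train on the labelled examples.
model.fit(docs, labels)

# Step 5: the deployed module calls predict() on incoming text.
print(model.predict(["quarterly passport audit for employees"]))
```

Retraining (step 6) is the same `fit` call repeated on the accumulated register of new labelled documents.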
In practice, one possible architecture consists of a model hosted on a separate server and a module that receives documents, extracts their text, and transmits it to the model. The module can be implemented as an add-in for the Outlook mail client or another resource.
If the model flags a document as confidential, the RMS client can be used to change its access settings. New documents are accumulated in a separate database and later used to retrain the model.
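The accumulation side of this architecture can be sketched as follows. The in-memory SQLite database stands in for the separate retraining database, and `on_message` is a hypothetical hook called by the mail-client add-in after each classification:

```python
# Sketch: classified documents are queued in a separate store so they can
# be used in the next retraining run.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the separate database
conn.execute("CREATE TABLE retrain_queue (text TEXT, label INTEGER)")

def on_message(text: str, is_confidential: bool) -> None:
    # The add-in extracts text and forwards it to the model server;
    # here we just record the labelled result for later retraining.
    conn.execute("INSERT INTO retrain_queue VALUES (?, ?)",
                 (text, int(is_confidential)))
    conn.commit()

on_message("contract with passport scans", True)
on_message("lunch menu", False)
rows = conn.execute("SELECT COUNT(*) FROM retrain_queue").fetchone()
print(rows[0])  # number of queued examples: 2
```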
Note that some of the tools you may want to use for document processing have regional restrictions. To work around them, consider a VPN, which can provide access while keeping the connection secure.
Specific Data Privacy Concerns
From a confidentiality standpoint, a dataset consists of n rows of personal records and m columns of associated attributes (gender, religion, state, etc.). Attributes fall into two types: personally identifiable information (PII), which characterizes a single person, and quasi-identifiers (QI), which place a person in a broad category, for example, gender.
In machine learning, various types of anonymization are used to protect large amounts of data, the main ones being k-anonymity, l-diversity, and t-closeness. With these techniques, certain values are deleted or generalized. For example, in a QI column, an age range is inserted instead of an exact age; some columns can be deleted entirely.
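A minimal illustration of the generalization step used by k-anonymity, assuming a toy dataset: exact ages in a quasi-identifier column are replaced with ranges, and the direct identifier (`name`) is dropped entirely:

```python
# Generalize a quasi-identifier (age -> age range) and drop a PII column.
records = [
    {"name": "Alice", "age": 34, "state": "TX"},
    {"name": "Bob",   "age": 37, "state": "TX"},
    {"name": "Carol", "age": 52, "state": "NY"},
]

def generalize_age(age: int, width: int = 10) -> str:
    # Map an exact age to a decade-wide bucket, e.g. 34 -> "30-39".
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

anonymized = [
    {"age": generalize_age(r["age"]), "state": r["state"]}  # "name" removed
    for r in records
]
print(anonymized)
```

After generalization, Alice and Bob share the same quasi-identifier values ("30-39", "TX"), so neither can be singled out by age and state alone.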
Large companies using AI often feed it vast amounts of unstructured training data. This leads to three problems:
- The likelihood of re-identifying anonymized confidential information increases.
- The model can memorize the collected data as the dimensionality of the training set grows.
- De-identification and data protection become harder to implement in tiered architectures.
In other words, AI can recognize and reconstruct confidential information even from a minimum of personal data. At the same time, the model's analysis results are easy to extract because they are stored locally on the server, which further exposes confidential information.
Current Trends in Privacy-Preserving Machine Learning
Federated Learning (FL)
In FL, the training process never accesses the original data; it uses only decentralized data held on end-user devices. However, this approach is particularly vulnerable to attacks by cybercriminals, for example GAN-based reconstruction attacks.
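A toy federated-averaging round, sketched with NumPy under simplifying assumptions (a linear model and a single gradient step per client): each client computes an update on its private data, and the server averages only the updates, never seeing the data itself.

```python
# One round of federated averaging on a toy linear-regression model.
import numpy as np

rng = np.random.default_rng(0)
w_global = np.zeros(3)  # shared model weights

def client_update(w, X, y, lr=0.1):
    # One gradient-descent step on the squared-error loss over local data.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

# Three clients, each with private local data that never leaves the device.
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]

# Each client trains locally; the server averages only the resulting weights.
local_models = [client_update(w_global, X, y) for X, y in clients]
w_global = np.mean(local_models, axis=0)
print(w_global.shape)  # (3,)
```

In practice many such rounds are run, and only weight updates cross the network; the GAN-style attacks mentioned above try to reconstruct private data from exactly these shared updates.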
Differential Privacy (DP)
DP helps guarantee the security of confidential data. It lets analysts learn aggregate information about a population without revealing any individual's identity. The protection is based on computations that add carefully calibrated noise to query responses: formally, the query result is statistically close to the true one, but it is impossible to tell from it whether a particular person possesses a particular attribute.
DP works best for queries with low sensitivity, where simple Gaussian or Laplace noise suffices. If the data contains high-sensitivity attributes such as income, more sophisticated computation mechanisms are required, which increases the risk of inference attacks and subsequent information retrieval.
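A minimal sketch of the Laplace mechanism, the simplest differential-privacy building block, applied to a low-sensitivity count query (the dataset and epsilon value are illustrative):

```python
# Laplace mechanism: answer a count query with noise scaled to
# sensitivity / epsilon, hiding any single person's contribution.
import numpy as np

rng = np.random.default_rng(42)

def dp_count(values, predicate, epsilon=1.0):
    # A count query has sensitivity 1: adding or removing one person
    # changes the result by at most 1.
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative data: how many people earn more than 50,000?
incomes = [30_000, 85_000, 52_000, 120_000, 47_000]
print(dp_count(incomes, lambda x: x > 50_000, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; the true count here is 3, and the noisy answer stays close to it while masking whether any one person is above the threshold.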
Should You Use Machine Learning?
Despite the great potential of artificial intelligence in the struggle to maintain confidentiality, it also creates new problems that threaten the security of data storage. Relatively new machine learning approaches cannot yet solve the problem completely, so the online world needs better ways to implement them more than ever.
Nevertheless, AI is currently one of the most effective ways to protect confidential information, so it can and should be adopted across many areas. Combined with encryption, machine learning in the hands of an experienced specialist can help defend against hacker attacks.