AI Mesh Overview
Getvisibility offers cybersecurity AI products, specifically aimed at Data Security Posture Management (DSPM). In a broader sense, we also provide solutions for Data Governance. Our flagship product, DSPM+, is a sophisticated file classification pipeline. It seamlessly integrates with various data sources through a range of connectors including, but not limited to, Samba for network file servers, Windows file servers, Google Cloud, AWS, Dropbox, and SharePoint. The process involves downloading the files from these sources, passing them through a pipeline that applies our cutting-edge artificial intelligence technology to analyse the context of each file, and then classifying them under multiple criteria.
At the heart of this classification pipeline lies an artificial intelligence classification service designed to work on unstructured text. Once the text is extracted from files sourced through various connectors, it undergoes classification by diverse machine learning algorithms.
The classification process utilizes an AI mesh - a network composed of different AI components.
The typical mesh deployment is inhomogeneous and contains the following types of nodes (a minimal code sketch follows the list):
LLM-like miniature language models, with 10-30 million parameters, that transform text into salient document vectors;
deep neural network classifiers for sentiment analysis, with fewer than 100,000 parameters, that use the document vectors to produce classification outcomes;
bag-of-words models for topic detection;
filters based on regular expressions or fuzzy text searches;
other types of evaluators (e.g. text complexity), implemented as Python code segments;
mappings that combine the outputs of multiple input models via Bayesian inference.
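As a minimal sketch, assuming hypothetical class and field names rather than the product's actual API, these node types could be represented roughly as follows:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MeshNode:
    name: str
    inputs: List[str] = field(default_factory=list)  # names of upstream nodes


@dataclass
class EncoderNode(MeshNode):
    """Miniature language model (10-30 million parameters) producing document vectors."""
    embed: Callable[[str], List[float]] = None


@dataclass
class ClassifierNode(MeshNode):
    """Small neural classifier (under 100,000 parameters) working on document vectors."""
    predict: Callable[[List[float]], float] = None  # returns P(signal is true)


@dataclass
class FilterNode(MeshNode):
    """Regular-expression or fuzzy keyword filter applied to the raw text."""
    pattern: str = ""


@dataclass
class BayesianMappingNode(MeshNode):
    """Mapping that combines several upstream signals via Bayesian inference."""
    combine: Callable[[Dict[str, float]], float] = None

# Bag-of-words (CBOW) topic detectors and Python evaluator nodes follow the same shape.
```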
Our deployments are 10 times smaller than even the smallest and most efficient Large Language Model (LLM)-based classifier deployments. This scale allows us to classify a file within 200 milliseconds, relying solely on a normal CPU without the need for specific GPU deployment. Additionally, given that our models are 10,000 times smaller than typical large AI deployments, we are not subject to regulations that apply to large AI deployments, such as the EU AI Act.
This network typically generates a multitude of classification outcomes, or signals. Each classification decision is generally binary—true or false—indicating whether the text viewed by the AI mesh is related to a specific signal. Furthermore, each outcome is accompanied by a confidence value, which is a number between zero and one. In rare instances, constituting less than 5% of the cases, the mesh outputs a categorical signal. Unlike the binary true/false, it classifies the text into one of three, four, or possibly even five mutually exclusive categories.
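As an illustration, a single signal could be represented along these lines (the field names are assumptions, not the product's schema):

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Signal:
    name: str                                   # e.g. "confidential", "financial"
    value: object                               # True/False, or a category label
    confidence: float                           # 0.0 (random guess) .. 1.0 (maximum certainty)
    categories: Optional[Sequence[str]] = None  # populated only for categorical signals


binary_signal = Signal(name="confidential", value=True, confidence=0.83)
categorical_signal = Signal(
    name="classification",
    value="Internal",
    confidence=0.67,
    categories=("Public", "Internal", "Confidential", "Highly Confidential"),
)
```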
The Service Level Agreement (SLA) for the accuracy of the ML components used in the AI mesh stipulates no less than 80% accuracy on a balanced dataset—comprising 50% positive examples and 50% negative examples. This accuracy rate is measured on an out-of-sample basis, meaning the data used for this accuracy assessment is not employed in training the machine learning model. This approach provides insights into the model's ability to generalize.
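A minimal sketch of how such an out-of-sample accuracy check could look, assuming a placeholder model interface and a balanced held-out set:

```python
# Sketch of the SLA check on a balanced, held-out dataset; `model` and
# `held_out` are placeholders, not the actual evaluation harness.
def sla_accuracy(model, held_out):
    """held_out: list of (text, label) pairs, 50% positive and 50% negative,
    none of which were used to train the model."""
    correct = sum(1 for text, label in held_out if model.predict(text) == label)
    return correct / len(held_out)


# The SLA requires at least 80% accuracy on this out-of-sample set:
# assert sla_accuracy(model, held_out) >= 0.80
```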
The confidence level associated with each classification outcome in machine learning models, whether binary or categorical, varies between zero and one and indicates the certainty of the prediction. A confidence of 0 suggests that the classifier views the prediction as no better than a random guess, implying a 50% probability of accuracy. On the other hand, a confidence of 1 indicates maximum certainty, meaning the input data closely matches the training data for the given classification. Confidence values between 0 and 1 scale linearly and are distributed uniformly with respect to the training data, so a confidence of 0.5 represents a median level of certainty.
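One plausible way to read this scale, sketched below under the assumption of an empirical-quantile calibration (not necessarily the exact calibration scheme used in the product):

```python
import bisect


def raw_confidence(p_true: float) -> float:
    """Distance of the predicted probability from a coin flip, scaled to [0, 1]:
    0.5 maps to confidence 0 (random guess), 0.0 or 1.0 maps to confidence 1."""
    return abs(p_true - 0.5) * 2


def calibrated_confidence(p_true: float, training_raw_confidences: list) -> float:
    """Empirical-quantile calibration (an assumption about the exact scheme):
    the result is the fraction of training examples with a lower raw confidence,
    so confidences are roughly uniform over the training data and 0.5 is the median."""
    sorted_refs = sorted(training_raw_confidences)
    rank = bisect.bisect_left(sorted_refs, raw_confidence(p_true))
    return rank / len(sorted_refs)
```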
The AI Mesh functions as a Bayesian network, where results are propagated forward. This process involves using machine learning models, such as a Continuous Bag Of Words (CBOW) model and various filters, to determine whether a file is confidential. Both outcomes (true and false) are considered with their respective probabilities, which are then propagated forward to influence the confidence score. Users utilizing this confidence score will take into account its value, leading to situations where a strong classification signal might be overshadowed by other signals if, collectively, they provide stronger evidence. In Bayesian networks, this sampling technique is known as forward sampling or ancestral sampling. The AI mesh employs a highly efficient implementation of this technique by constraining the distributions of the internal nodes to either categorical or binary distributions.
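A minimal sketch of forward (ancestral) sampling over binary nodes, with a hypothetical two-input "confidential" mapping; the production implementation is more general and also handles categorical distributions:

```python
import random


def forward_sample(parents, cpt, evidence, n_samples=1000):
    """parents: dict node -> list of parent nodes, keys in topological order.
    cpt: dict node -> function(parent_values) -> P(node is True).
    evidence: observed input nodes (e.g. filter and ML outcomes).
    Returns the estimated probability that each node is True."""
    counts = {node: 0 for node in parents}
    for _ in range(n_samples):
        values = dict(evidence)
        for node in parents:  # ancestral order: parents are sampled before children
            if node not in values:
                parent_values = {p: values[p] for p in parents[node]}
                values[node] = random.random() < cpt[node](parent_values)
        for node, value in values.items():
            counts[node] += bool(value)
    return {node: count / n_samples for node, count in counts.items()}


# Hypothetical fragment: "confidential" depends on a CBOW topic signal and a keyword filter.
parents = {"cbow_topic": [], "keyword_filter": [], "confidential": ["cbow_topic", "keyword_filter"]}
cpt = {
    "confidential": lambda pv: (
        0.95 if pv["cbow_topic"] and pv["keyword_filter"]
        else 0.70 if pv["cbow_topic"] or pv["keyword_filter"]
        else 0.05
    ),
}
print(forward_sample(parents, cpt, evidence={"cbow_topic": True, "keyword_filter": False}))
```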
The typical token window analysed is 512 tokens, which corresponds to roughly a page of text. For larger texts, the results from multiple passes are integrated by the mesh. For shorter texts, the mesh composition can be adjusted to accommodate them.
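A sketch of how windowing and integration might look, assuming a placeholder tokenizer and a simple "most confident outcome wins" aggregation rule:

```python
def windows(tokens, size=512):
    """Split a token sequence into consecutive windows of roughly one page each."""
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]


def classify_long_text(text, classify_window, tokenize):
    """classify_window(tokens) -> (is_positive, confidence) for one window.
    Windows are integrated by taking the most confident positive outcome,
    falling back to the most confident negative one."""
    results = [classify_window(w) for w in windows(tokenize(text))]
    positives = [r for r in results if r[0]]
    pool = positives if positives else results
    return max(pool, key=lambda r: r[1])
```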
For example, in order to determine if a document is confidential, in a rudimentary setting, a machine learning model that works on document vectors is involved. This model performs sentiment analysis on the original document to understand if it sounds confidential. Additionally, a simpler model searches for words like "confidential" or words similar to "confidential" syntactically as part of topic detection. There are filters and detectors designed to pick up specific keywords, such as the word "confidential" itself, which may be stamped by another application, included as part of a watermark, or in the context of certification and compliance policies. Finally, a Bayesian network of all these models is used to infer the outcome and associated confidence level.
We show below the functional diagram of the classification pipeline around the AI Mesh.
The AI mesh features a stereotypical structure designed to facilitate easy reasoning and training for individuals involved in proofing, training, and selling the mesh. Since the mesh is a directed acyclic graph, it allows for the definition of inputs, intermediary nodes, and outputs.
The inputs, or entry points, take in raw information about the file, which is analysed to produce a signal that other nodes in the mesh interpret. Inputs include components that transform the input text into document vectors or word vectors, elements that collect statistical information about the input text or pre-process it for other statistical information collectors, and filters that provide a signal indicating whether certain keywords or patterns of keywords are present in the input text.
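For illustration, an input filter node could be as simple as the following regular-expression detector (the pattern and confidence values are placeholders):

```python
import re

# Placeholder pattern; real deployments tune the keywords per use case.
CONFIDENTIAL_STAMP = re.compile(r"\b(confidential|do not distribute)\b", re.IGNORECASE)


def confidential_stamp_filter(text):
    """Emits (signal, confidence) for the presence of a 'confidential' stamp."""
    return (True, 0.9) if CONFIDENTIAL_STAMP.search(text) else (False, 0.3)
```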
In an effort to streamline the deployment of the AI mesh and make it more user-friendly, there is an emphasis on reducing the number of filters that are directly relevant to the AI mesh. For example, when detecting banking information, a straightforward approach might involve creating detectors for words like "bank" or "account." However, the relevance of such words to the classification can vary significantly between use cases, making it challenging to establish a universally understandable policy for managing these detectors to meet expectations.
To overcome this challenge, information is organized within the network using CBOW models. This allows for ongoing tweaking of signal sensitivity based on user feedback. The strategy also involves restricting filters to use case-specific information. For instance, to identify confidential information on a specific premises, CBOW models are deployed to detect text indicating confidentiality or secrecy. Machine learning models assess the likelihood of text containing trade secrets or intellectual property. Users are encouraged to input filters relevant to the confidential signal, using specific keywords related to their technology, such as internal product names, codewords, or internal product IDs, which would not be known externally.
Intermediary nodes function by utilizing information provided by the inputs or other intermediary nodes, yet they are not visible in the user interface (UI). This can be attributed either to the irrelevance of the information processed by these nodes to the user—such as computation of reading ease scores and document complexity, which could clutter the user's view—or to the inaccuracy of intermediary signals. Efforts are made to furnish a more accurate signal by combining various intermediary signals.
Examples of intermediary nodes comprise machine learning classifiers that employ document level vectors to determine if the text aligns with a certain type of signal, CBOW classifiers that ascertain whether a specific topic is being discussed in the document, and Bayesian mappings that integrate several signals into a conclusive output signal.
Intermediary or output mappings often exhibit a stereotypical structure where multiple input signals are consolidated to create a more robust and accurate output signal. For instance, to determine whether a file is an HR document, input signals might include a machine learning model that assesses whether the file reads as an HR document, a CBOW model that detects topics relevant to the HR sector present in the file, and several filters searching for HR-specific terminology. While there are numerous methods to combine these signals into an output signal, a standardized approach, referred to as the "standard mapping," is typically employed to ensure consistency and efficiency in the process.
The standard mapping process outputs a true or false value based on inputs from three types of true/false signals, which can either be filters or machine learning models.
Hard Signals: These are decisive signals that set the standard mapping to true whenever any one of them is true, regardless of the status of other signals. For instance, the detection of a highly specific and unique identifier like a Social Security number in certain contexts immediately indicates the presence of private identifiable information, irrespective of other detectors' output.
Soft Signals: These signals set the standard mapping to true only if one of them is true and is also supported by other true signals. This is used in cases where broad criteria need further verification. For example, detecting the word "account" may flag a text potentially as financial information. However, it requires additional corroborative evidence from other sources or models to be classified definitively as financial information.
Supporting Evidence: These signals influence the standard mapping's truth value either if all are true with high confidence, providing strong evidence that the mapping should be true, or if they are true with low confidence but a soft signal is also true. This layered approach ensures a nuanced decision-making process that accounts for evidence strength and relevance.
This structured approach to output mapping ensures accurate and reliable determinations based on the nature and strength of the input signals. This approach is outlined in pseudocode below:
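A Python sketch of the standard mapping follows; the confidence threshold and the grouping of arguments are illustrative assumptions rather than the shipped values:

```python
HIGH = 0.8  # assumed "high confidence" threshold


def standard_mapping(hard, soft, supporting):
    """Each argument is a list of (value, confidence) pairs.
    hard:       decisive signals
    soft:       broad signals that need corroboration
    supporting: corroborating evidence"""
    # 1. Any true hard signal decides the outcome on its own.
    if any(value for value, _ in hard):
        return True

    support_true = [conf for value, conf in supporting if value]

    # 2. All supporting signals true with high confidence is strong evidence on its own.
    if supporting and len(support_true) == len(supporting) and all(c >= HIGH for c in support_true):
        return True

    # 3. A true soft signal counts only when corroborated by other true signals,
    #    even if that supporting evidence has low confidence.
    if any(value for value, _ in soft) and support_true:
        return True

    return False
```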
Output nodes utilize information from intermediary nodes to generate signals that are directly presented in the user interface (UI). These signals encompass:
Data Attributes: Important for characterizing the data or data asset attributes, such as whether the data is financial, HR-related, etc.
Compliance Labels: These labels indicate whether the data may be subject to specific compliance regulations, including PII (Personal Identifiable Information), PHI (Protected Health Information), etc.
Classifications: Define the kind of actionable results that should be derived after classifying the file, providing a clear directive for subsequent actions.
Notably, especially in the context of compliance and classification, these output nodes can also be used to stamp information directly onto the file. This ensures that important data about compliance and classification is visibly and immediately associated with the file, facilitating easy access to this critical information through the UI.
The typical classification system categorizes the level of sensitivity of a file. This can range from a binary flag indicating whether the file is sensitive or not, to a more nuanced classification with three to five labels, such as:
Public
Internal
Confidential
Highly Confidential
Secret/Top Secret
However, it is recommended to avoid using more than four or five mutually exclusive outcomes for classifying a file. This is because having too many categories can complicate implementation on the customer's side and pose challenges in verifying the accuracy of the classifier. Simplifying the classification spectrum helps both in ease of use and ensuring a more straightforward validation of classification results.
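As an illustration, the final classification could be chosen from the per-label mesh outputs along these lines (the label set and threshold are examples, not fixed product values):

```python
# Ordered from most to least sensitive; illustrative four-label scheme.
LABELS = ["Highly Confidential", "Confidential", "Internal", "Public"]


def choose_classification(label_signals, threshold=0.5):
    """label_signals: dict label -> (is_true, confidence) from the mesh output nodes.
    Returns the most sensitive label asserted with sufficient confidence,
    defaulting to the least sensitive one."""
    for label in LABELS:
        value, confidence = label_signals.get(label, (False, 0.0))
        if value and confidence >= threshold:
            return label
    return LABELS[-1]
```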
We show below a visualization of a large AI mesh (80 nodes), with input nodes at the bottom and output nodes at the top.
Nodes are colour coded as follows:
yellow - document and word vectors
blue - ML classifiers
green - light ML (CBOW) classifiers
red - Python / engineered signals
black - forward mappings
Notice how few input filters are entangled with knowledge collected by ML models, and how the classification output node (top) integrates information from all these nodes.
The AI mesh is designed to be multilingual, catering to the requirements of machine learning models that depend on word vectors or document vectors derived from unstructured text. The strategy to achieve multilingual capability involves generating the same document or word vectors for the same text translated into multiple languages (language-agnostic representations). This approach compresses the text into sentence or document vectors, and the language model itself has a certain capability to translate between the languages it supports.
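A sketch of the language-agnostic property, with a placeholder embed function standing in for the miniature bilingual language model:

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def is_language_agnostic(embed, text_en, text_translated, min_similarity=0.9):
    """embed(text) -> document vector. A language-agnostic encoder should map
    translations of the same document to nearby vectors."""
    return cosine_similarity(embed(text_en), embed(text_translated)) >= min_similarity
```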
For the sake of classification speed and accuracy, the deployment is typically restricted to bilingual models, where one of the languages is English and the other could be Arabic, French, or any other language. Although the solution has been tested with up to 12 different languages, in practice, a more focused bilingual approach is preferred.
For other types of nodes within the mesh, such as filter nodes or complexity detectors, adequate adjustments are necessary to account for language-specific differences. This ensures that the AI mesh can efficiently and accurately process information across different languages, maintaining its effectiveness and utility in multilingual environments.
The design of the AI mesh carefully balances exposing a reasonable number of signals and accurately characterizing a block of text of a certain size. Limiting the number of relevant signals to no more than 100 is very important for maintaining the explainability of the mesh in relation to the analysed content. This approach ensures that users can understand how and why certain analytical outcomes were reached without being overwhelmed by too much information.
When the AI mesh produces a classification outcome, we also store in the database the prerequisites for that outcome within the mesh. This includes which models contributed, in what way, and with what confidence scores to the ancestral sampling of that classification outcome and its specific confidence score. This rich signal provides substantial information about the unstructured text that the mesh processes.
These prerequisite signals are essential for explaining the classification outcome that the user observes. Explanations can be provided on a per-file basis by examining the outputs of intermediary nodes in the mesh or on a population basis by identifying which factors lead to particular decisions for specific file populations. Natural language synthesis can be employed to translate these intermediary outcomes into understandable natural language, further enhancing the explainability of the mesh's analytical processes.
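A minimal sketch of turning the stored intermediary signals into a per-file explanation; the field names and phrasing are illustrative:

```python
def explain_outcome(outcome_name, outcome_value, confidence, contributing_signals):
    """contributing_signals: list of (node_name, value, confidence) tuples stored
    in the database alongside the classification outcome."""
    reasons = [
        f"{name} was {'detected' if value else 'not detected'} (confidence {conf:.2f})"
        for name, value, conf in contributing_signals
    ]
    return (
        f"'{outcome_name}' was set to {outcome_value} with confidence {confidence:.2f} "
        "because: " + "; ".join(reasons) + "."
    )


print(explain_outcome(
    "confidential", True, 0.86,
    [("cbow_confidentiality_topic", True, 0.91), ("keyword_filter_confidential", True, 0.90)],
))
```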
The target quality for the user experience with the AI mesh aims to mirror the Service Level Agreement (SLA) for the ML classifiers, where around 80% of the predictions are expected to be perceived as accurate by the user. Adjustments to the mesh will be made if the user's perception significantly deviates from this standard. Specifically, for any given file analysed by the mesh, approximately 8 out of 10 data attributes collected should be correct or flagged with low confidence. Similarly, for any specific data attribute, about 8 out of 10 files should yield a correct prediction or a prediction marked with low confidence.
After a file is evaluated, the per-file outcomes from the classification network within the AI mesh are stored in a database, making them accessible to GQL enabled filters and reports. This approach leverages the rich signal derived from the unstructured content to generate a wide array of actionable reports. Moreover, the classification pipeline incorporates Active Directory information about who has access to the files. This integration is important for assessing the risk associated with highly confidential files being accessed by trustees, as part of the DSPM+ suite.
Characterization of data (static or in-flight) with an AI mesh of narrow models has a series of advantages compared to using Large Language Model (LLM) AI technology.
The overall compute required to run the AI mesh is 100x-1000x less than that of a classification LLM with similar accuracy. Because of that, it can be successfully productized without requiring specialized hardware such as GPUs.
Owing to the way the AI mesh is constructed, tweaking it towards providing expected outcomes for different use cases entails modifying a small number of nodes, which lowers the cost of adapting the mesh to expectations.
Since the mesh relies on specialized detectors which are associated with intuitive concepts, it can be used natively to build robust explanations regarding the classification outcomes, with or without language synthesis by LLM.
The mesh uses narrow AI classifiers that are trained on small synthetic datasets (1-10M tokens) compared to LLM corpora (trillions of tokens). These datasets are available for review and audit, and can be used to completely characterize the behaviour of the AI system and to ascertain its regulatory liability.
The layout of the mesh natively allows integration with any sources or 3rd party signals via its mapping mechanism.