
Machine learning principles. Secure development

07 August 2024
We continue to analyze the article from the National Cyber Security Centre on principles that help make informed decisions about the design, development, deployment, and operation of machine learning (ML) systems. Let’s move on to the second section, "Secure Development".

Secure Your Supply Chain

An ML model is only as good as the data it is trained on, and collecting, cleaning, and labeling data is costly. For that reason, third-party datasets are often used, which carries certain risks: data may be poorly labeled or intentionally corrupted by malicious actors (a technique known as "data poisoning"). Training a model on corrupted data can degrade its performance, with potentially serious consequences.
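To make the risk concrete, here is a toy sketch (pure Python, with made-up numbers, not a real attack) of how a few deliberately mislabeled training points can drag a learned decision threshold far enough to break a classifier:

```python
# Toy 1-D "spam score" classifier: learn a threshold as the midpoint of the
# two class means, then predict class 1 for scores above the threshold.
def fit_threshold(scores, labels):
    m0 = sum(s for s, l in zip(scores, labels) if l == 0) / labels.count(0)
    m1 = sum(s for s, l in zip(scores, labels) if l == 1) / labels.count(1)
    return (m0 + m1) / 2

def accuracy(scores, true_labels, threshold):
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, true_labels)) / len(true_labels)

scores = [1, 2, 3, 4, 7, 8, 9, 10]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

clean_t = fit_threshold(scores, labels)
print(accuracy(scores, labels, clean_t))     # 1.0 on clean data

# Poisoning: an attacker slips two extreme scores into the training set,
# deliberately mislabelled as class 0, dragging the class-0 mean upward.
poisoned_scores = scores + [100, 100]
poisoned_labels = labels + [0, 0]
poisoned_t = fit_threshold(poisoned_scores, poisoned_labels)
print(accuracy(scores, labels, poisoned_t))  # 0.5: every class-1 item missed
```

Two poisoned points out of ten are enough here because the attacker controls their magnitude; real attacks are subtler, but the mechanism is the same.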

What can help implement this principle?

Verify any third-party inputs are from sources you trust

When acquiring assets (models, data, software components), it is important to consider supply chain security. Be aware of your suppliers' security levels and inform them of your security requirements. Understanding external dependencies will simplify the process of identifying vulnerabilities and risks.

Use untrusted data only as a last resort

If you must use "untrusted" data, keep in mind that methods for detecting poisoned instances in datasets have mostly been tested only in academic settings. Carefully assess their applicability, and their impact on your system and development process, before relying on them.

Consider using synthetic data or limited data

There are several techniques for training ML systems on limited data instead of relying on untrusted sources. However, each comes with its own challenges:
  • Generative adversarial networks (GANs) can generate synthetic data that recreates the statistical distribution of the original data.
  • Game engines can generate synthetic data, but may introduce simulation-specific characteristics.
  • Data augmentation can expand a dataset, but is limited to manipulating the original data.
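As a rough illustration of that last limitation, a jitter-based augmentation sketch (the helper and parameters are illustrative, standard library only) can only produce points close to the originals; it cannot introduce genuinely new modes:

```python
import random

random.seed(0)

def augment(samples, jitter=0.1, copies=3):
    """Expand a numeric dataset by adding jittered copies of each sample.

    The jittered points stay within +/- jitter of the originals, so
    augmentation remixes existing data rather than creating new coverage.
    """
    out = list(samples)
    for s in samples:
        for _ in range(copies):
            out.append(s + random.uniform(-jitter, jitter))
    return out

data = [1.0, 2.0, 3.0]
augmented = augment(data)
print(len(augmented))  # 12 = 3 originals + 3 * 3 jittered copies
```

The same idea generalizes to images (crops, flips, noise), where the augmented set is still bounded by what the original photos contain.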

Reduce the risk of insider attacks on datasets from intentional mislabeling

The level of scrutiny for labelers should correspond to the seriousness of the consequences that incorrect labeling may have. Industry security and personnel vetting recommendations can assist with this. If labeler vetting is not possible, use processes that limit the impact of a single labeler, such as segmenting the dataset so that one labeler does not have access to the entire set.
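A minimal sketch of such segmentation (the round-robin scheme and function name are illustrative):

```python
def shard_for_labelling(items, n_labellers):
    """Assign items round-robin to disjoint shards, one per labeller,
    so that no single labeller can see, or mislabel, the entire dataset."""
    shards = [[] for _ in range(n_labellers)]
    for i, item in enumerate(items):
        shards[i % n_labellers].append(item)
    return shards

shards = shard_for_labelling(list(range(12)), 3)
print([len(s) for s in shards])  # [4, 4, 4]
```

In practice you might also assign a small overlap between labellers, so that disagreements on shared items flag potential mislabeling.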

Secure Your Development Infrastructure

The security of the infrastructure is especially important in ML, as a compromise at this stage can affect the entire lifecycle of the system. It is essential to keep software and operating systems up to date, restrict access to those who need it, and log and monitor access and changes to components.

What can help implement this principle?

Follow Cyber Security Best Practices and Recognized IT Security Standards

Adhering to cyber security best practices is not limited to ML development. Key measures include protecting data at rest and in transit, updating software, implementing multi-factor authentication, maintaining security logs, and using dual control. Pay attention to the globally recognized standard ISO/IEC 27001, which offers a comprehensive approach to managing cyber risks.

Use Secure Software Development Practices

Secure software development practices help protect the development lifecycle and enhance the security of your machine learning models and systems. It is important to track common vulnerabilities associated with the software and code libraries being used.

Be Aware of Legal and Regulatory Requirements

Ensure that decision-makers are aware of the composition of your data and the legal regulations regarding its collection. It is also important for your developers to understand the consequences of data breaches, their responsibilities when handling information, and the necessity of creating secure software.

Manage the Full Life Cycle of Models and Datasets

Data is the foundation for developing ML models and influences their behavior. Changes to the dataset can undermine the integrity of the system. However, during the model development process, both data and models often change, making it difficult to identify malicious alterations.

Therefore, it is important for system owners to implement a monitoring system that records changes to assets and their metadata throughout the entire lifecycle. Good documentation and monitoring help effectively respond to instances of data or model compromise.

What can help implement this principle?

Use version-control tools to track and control changes to your software, dataset, and resulting model.

Use a standard solution for documenting and tracking your models and data.

Track dataset metadata in a format that is both human-readable and machine-parsable.

To track datasets, you can use a data catalog or integrate them into a larger solution such as a data warehouse. The necessary metadata depends on the specific application and may include: a description of data collection, sensitivity level, key metrics, the dataset creator, intended use and restrictions, retention time, destruction methods, and aggregated statistics.
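For example, a dataset record stored as JSON satisfies both requirements at once; the schema and field values below are invented for illustration, not a standard:

```python
import json

# Illustrative dataset metadata record; the field names are an example
# schema, not a standard.
dataset_card = {
    "name": "customer-reviews-v3",
    "creator": "data-team@example.com",
    "collection": "exported from internal feedback forms, 2023-2024",
    "sensitivity": "internal",
    "intended_use": "sentiment classification; not for extracting PII",
    "retention": "delete 24 months after collection",
    "row_count": 120000,
    "class_balance": {"positive": 0.55, "negative": 0.45},
}

# JSON keeps the record human-readable *and* machine-parsable.
serialized = json.dumps(dataset_card, indent=2)
restored = json.loads(serialized)
print(restored["sensitivity"])  # internal
```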

Track model metadata in a format that is both human-readable and machine-parsable.

For security purposes, it is useful to track the following metadata:
• The dataset on which the model was trained.
• The creator of the model and their contact information.
• The intended use case and limitations of the model.
• Secure hashes or digital signatures of the trained models.
• Retention time for the dataset.
• The recommended method for disposing of the model.
Storing model cards in an indexed format makes it easier for developers to search for and compare models.
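A sketch of assembling such a model card, using a SHA-256 digest of the serialized model file so that later consumers can verify it has not been swapped or tampered with (the field names and dummy "model" bytes are illustrative):

```python
import hashlib
import json
import tempfile

def build_model_card(model_path, trained_on, owner, intended_use):
    """Assemble a machine-readable model card; the SHA-256 digest lets anyone
    who later loads the model verify its integrity."""
    with open(model_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "model_file": model_path,
        "sha256": digest,
        "trained_on": trained_on,
        "owner": owner,
        "intended_use": intended_use,
    }

# Stand-in for a serialized model file.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    f.write(b"weights go here")
    path = f.name

card = build_model_card(path, "customer-reviews-v3",
                        "ml-team@example.com",
                        "sentiment classification only")
print(json.dumps(card, indent=2))
```

For production use, a digital signature over the card itself (not just the model file) also prevents the metadata from being silently edited.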

Ensure each dataset and model has an owner.

Development teams should include roles responsible for managing risks and digital assets. Each created artifact should have an owner who ensures secure management of its lifecycle. Ideally, this should be a person involved in creating the asset, with their contact information recorded in the metadata, taking personal information security into account.

Choose a Model that Maximizes Security and Performance

When selecting a model, it is important to consider both security and performance. An inappropriate model can lead to reduced performance and new vulnerabilities. Assess the trade-offs of using external pre-trained models: at first glance they are appealing because they reduce technical and economic costs, but they carry their own risks and can make your system vulnerable.

What can help implement this principle?

Consider a Range of Model Types on Your Data

Evaluate the performance of various architectures, starting with classical ML and interpretable methods before moving on to modern deep learning models. The choice of model should be based on the task requirements, without bias towards the newest algorithms.

Consider Supply Chain Security When Choosing Whether to Develop In-House or Use External Components

Use pre-trained models only from trusted sources, and apply vulnerability scanning tools to check them for potential threats.
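One concrete check, sketched here with only the standard library, is scanning a pickle-serialized model for opcodes that can execute code at load time, which is the mechanism malicious model files typically abuse (real scanning tools do considerably more than this):

```python
import pickle
import pickletools

# Opcodes that can import modules and call functions during unpickling.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE"}

def scan_pickle(data: bytes):
    """Return the set of code-running opcodes found in a pickle stream,
    without ever unpickling (and thus executing) it."""
    return {op.name for op, _, _ in pickletools.genops(data)
            if op.name in SUSPICIOUS}

safe = pickle.dumps({"weights": [0.1, 0.2]})
print(scan_pickle(safe))   # set(): plain data, no callables

class Evil:
    def __reduce__(self):
        return (print, ("pwned",))

risky = pickle.dumps(Evil())
print(scan_pickle(risky))  # flags opcodes such as REDUCE
```

Safer still is to avoid pickle-based model formats from untrusted sources entirely and prefer weights-only formats.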

Review the Size and Quality of Your Dataset Against Your Requirements

Key metrics for assessing dataset quality include completeness, relevance, consistency, integrity, and class balance. If there is insufficient data to train the model, you can collect additional data, use transfer learning, augment existing data, or generate synthetic data. It also helps to choose an algorithm that works well with limited data.
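A quick class-balance check along these lines (the 3:1 ratio threshold is an arbitrary example, not a recommendation):

```python
from collections import Counter

def class_balance(labels, max_ratio=3.0):
    """Count examples per class and flag imbalance beyond max_ratio
    between the most and least frequent classes."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio, ratio <= max_ratio

labels = ["cat"] * 90 + ["dog"] * 10
counts, ratio, ok = class_balance(labels)
print(counts, ratio, ok)  # 9:1 imbalance, flagged (ok=False)
```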

Consider Pruning and Simplifying Your Model During the Development Process

Simple models can be quite effective, but there are also methods for reducing the size of complex neural network models. Most of these fall under "pruning": removing unnecessary neurons or weights after training. However, pruning can affect performance and has primarily been studied in academic settings.
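The most common variant is magnitude pruning; a minimal pure-Python sketch on a flat weight list (real implementations operate on tensors, layer by layer):

```python
def prune_weights(weights, fraction):
    """Magnitude pruning: zero out the given fraction of weights with the
    smallest absolute values, keeping the rest unchanged."""
    k = int(len(weights) * fraction)
    # Indices of the k smallest-magnitude weights.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.002]
print(prune_weights(w, 0.5))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

After pruning, accuracy should be re-evaluated, since removing weights is exactly the kind of change that can silently degrade performance.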

Halfway through, we have two sections left to cover: secure deployment and secure operation. If you haven't read the article on secure design yet, we recommend doing so as soon as possible!
