
Data Science Teams and the EU AI Act’s Advance

Written by Andy Reese | June 20, 2023

Last week the European Parliament advanced the proposed EU AI Act to final negotiations within the bloc, where the act’s language will be settled in talks among EU bodies.

While the act is likely months away from passage, data science teams already face a looming conundrum: The act’s provisions are worryingly at odds with current data science practices in several key ways.

As a result, data science and ML practitioners at companies leveraging AI/ML should start preparing now for compliance with key provisions of the act, including those around data and data governance.

What’s in the Act

The act establishes requirements for those developing and deploying AI systems depending on the level of risk assigned to the AI system. The act would outright ban systems deemed to pose an unacceptable risk, such as social scoring, real-time remote biometric ID systems in public spaces, predictive policing, and scraping facial images from the internet or CCTV footage to create facial recognition databases.

Much of the act, however, is devoted to so-called “high-risk” systems: those posing a significant risk of harm to people’s health, safety, fundamental rights or the environment. For these systems, the act lays out an extensive set of requirements, including around data and data governance.

Under the act, companies must ensure that they apply “appropriate data governance and management practices” to training, validation and testing datasets. They must also ensure that their data meet the quality criteria laid out in Article 10 and that they adhere to procedural and documentation requirements, including:

  • Following relevant governance practices for data use
  • Using appropriate design choices and data collection methods
  • Analyzing datasets for gaps and shortcomings, and making assessment plans for addressing those issues
  • Reviewing datasets for potential biases
  • Using appropriate data preparation, labeling, cleaning, enrichment and aggregation procedures
  • Formulating assumptions related to the data
  • Assessing dataset availability, quantity and suitability for the intended AI application
  • Ensuring datasets are relevant, representative, error-free and complete
  • Ensuring datasets have appropriate statistical properties, both overall and for the groups of people the system will be used on
  • Using datasets representative of specific geographical, behavioral or functional settings where the system will be used

Impact on Data Science

A recent analysis of foundation models from researchers at Stanford University noted that the EU AI Act would present challenges for organizations leveraging these models, specifically around reporting on copyrighted materials.

But the act’s requirements will create broader compliance challenges for data scientists working with all kinds of AI systems because, frankly, those requirements are at odds with current data science practices in several areas:

1) Requiring Thorough Data Analysis: Many data science teams today do not spend adequate time performing exploratory data analysis (EDA) or analyzing data distributions before they start modeling. Even when they do, it’s typically only when building a model for the first time, not when updating it.

For data scientists and data engineers to meet the requirements of the act, they will need to regularly perform data distribution and bias checks on every dataset they use. This will necessitate automation and tooling, including tools for EDA, data monitoring and bias detection.
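To make that concrete, here is a minimal sketch of what an automated distribution and representation check might look like. The column names (“age”, “income”, “gender”), thresholds and file names are hypothetical placeholders; teams would substitute their own datasets and fairness metrics.

```python
# Minimal sketch: compare a new training dataset against a reference
# dataset and flag distribution drift and under-represented groups.
# Column names, thresholds and file names are hypothetical placeholders.
import pandas as pd
from scipy.stats import ks_2samp

def check_distribution_drift(reference: pd.DataFrame, current: pd.DataFrame,
                             numeric_cols, p_threshold: float = 0.05) -> dict:
    """Two-sample Kolmogorov-Smirnov test per numeric column."""
    results = {}
    for col in numeric_cols:
        stat, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
        results[col] = {"ks_stat": float(stat), "p_value": float(p_value),
                        "drift_flag": p_value < p_threshold}
    return results

def check_group_representation(df: pd.DataFrame, group_col: str,
                               min_share: float = 0.05) -> dict:
    """Flag groups whose share of the dataset falls below a minimum."""
    shares = df[group_col].value_counts(normalize=True)
    return {str(group): {"share": float(share), "under_represented": share < min_share}
            for group, share in shares.items()}

# Example usage with hypothetical files and columns:
# reference = pd.read_csv("train_v1.csv")
# current = pd.read_csv("train_v2.csv")
# drift = check_distribution_drift(reference, current, ["age", "income"])
# representation = check_group_representation(current, "gender")
```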

2) Documenting Data: Documentation in the data science space is generally lacking, and documentation of training, test or validation data is no exception. Under the act, it’s not sufficient to run tests; teams must also systematically document the results and any subsequent analysis to demonstrate due diligence.

(Providers of foundation models used for generative AI systems like ChatGPT would have additional requirements to make publicly available detailed summaries of any copyrighted data used for training.)

Note that, as a side benefit, all this documentation, if implemented well, can also help new data scientists joining a project, consumers of the model and governance teams.
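To make that documentation systematic rather than ad hoc, check results can be appended to a persistent, timestamped record as they are produced. A minimal sketch follows, assuming a hypothetical JSON Lines audit log and the check outputs from the previous sketch.

```python
# Minimal sketch: append each data-quality check run to a timestamped
# JSON Lines audit log so results and follow-up analysis are documented.
# The file path and record fields are hypothetical.
import json
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("data_quality_audit.jsonl")

def record_check(dataset_name: str, dataset_version: str, check_name: str,
                 results: dict, notes: str = "") -> None:
    """Persist one check run, its results and any analyst commentary."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset_name,
        "dataset_version": dataset_version,
        "check": check_name,
        "results": results,
        "analyst_notes": notes,  # e.g., the plan for addressing flagged issues
    }
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(record, default=str) + "\n")

# Example usage with outputs from the previous sketch:
# record_check("census_income", "v2", "distribution_drift", drift,
#              notes="Age drift flagged; review scheduled before retraining.")
```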

3) Documenting Data for Application and Gaps: Finally, this may sound obvious, but it’s frequently overlooked that a particular distribution of data may have one set of ramifications for Application A but very different implications for Application B. 

For example, the popular census income dataset may be good for estimating income for US adults. However, it would be a poor dataset to study trends in teenage social media usage. As a result, analyzing and documenting the analysis of the datasets with respect to the application is critical.
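One way to make that application-specific analysis repeatable is to encode each application’s target population as explicit requirements and evaluate the dataset against them. Here is a minimal sketch, using hypothetical age-range requirements and thresholds that echo the census-income example above.

```python
# Minimal sketch: check whether a dataset covers the population an
# application targets. The age ranges and thresholds are hypothetical;
# an adult-income dataset would plausibly pass the first application
# below and fail the teen-focused one.
import pandas as pd

def assess_suitability(df: pd.DataFrame, application: str,
                       required_age_range: tuple, min_rows: int = 1000) -> dict:
    low, high = required_age_range
    in_range = df["age"].between(low, high)   # hypothetical "age" column
    coverage = float(in_range.mean())         # share of rows in the target range
    return {
        "application": application,
        "rows": len(df),
        "target_age_coverage": round(coverage, 3),
        "suitable": len(df) >= min_rows and coverage >= 0.8,
    }

# Example usage with a hypothetical file:
# census = pd.read_csv("adult_income.csv")
# assess_suitability(census, "US adult income estimation", (18, 90))
# assess_suitability(census, "teen social media trends", (13, 17))  # likely unsuitable
```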

Getting Ready for Compliance

Data science teams can start preparing for compliance with the EU AI Act now by reviewing and understanding the act’s provisions in conjunction with their colleagues from data teams, governance, risk, legal and IT. Additional steps:

  • Identify datasets currently in use (including training, validation and testing datasets) to assess the quality, characteristics, gaps and shortcomings of these datasets related to relevance, representativeness, errors, biases and statistical properties.
  • Ensure adherence to guidelines and protocols for data collection, storage, sharing and usage. 
  • Document the entire data lifecycle, and maintain clear and detailed documentation of the assumptions made regarding the data and any data-related decisions.
  • Implement processes for assessing and mitigating biases in the data, as well as mechanisms to ensure the quality of datasets used.
  • Regularly assess the availability, quantity and suitability of datasets to ensure they adequately cover the target use cases and are appropriate for the intended AI system's purpose.
  • Employ tools like a model catalog with robust governance capabilities, such as configurable checklists, documentation support, and support for collaboration among diverse stakeholders.
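For teams that have not yet adopted such tooling, even a lightweight, structured record of the steps above beats scattered notes. Below is a minimal sketch of one possible checklist structure; the fields, statuses and item wording are illustrative, not a prescribed or official format.

```python
# Minimal sketch: a lightweight, structured compliance record for one
# dataset, covering the steps listed above. Fields and items are
# illustrative, not a prescribed format.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ChecklistItem:
    requirement: str       # e.g., one of the Article 10 obligations
    status: str = "open"   # "open", "in_progress" or "done"
    evidence: str = ""     # link to the documented analysis or report

@dataclass
class DatasetComplianceRecord:
    dataset: str
    intended_application: str
    items: list = field(default_factory=lambda: [
        ChecklistItem("Assess gaps, shortcomings and biases"),
        ChecklistItem("Document assumptions and data-related decisions"),
        ChecklistItem("Verify relevance, representativeness and statistical properties"),
        ChecklistItem("Confirm suitability for the intended application"),
    ])

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

# Example usage:
# record = DatasetComplianceRecord("census_income_v2", "US adult income estimation")
# record.items[0].status = "done"
# record.items[0].evidence = "data_quality_audit.jsonl"
# print(record.to_json())
```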

Perhaps most importantly, data science teams can develop a culture of responsible data usage, continually improving data practices, and ensuring transparency and accountability throughout the ML model lifecycle. This will help ensure they are prepared when the EU AI Act’s provisions come into force.

A model catalog can allow you to centralize, organize, document and manage all your models as you prepare for compliance with the EU AI Act and other current and pending AI laws and regulations. Learn more.