Showcasing Dataset Versioning at Vectice

January 21, 2025

At Vectice, we empower teams developing and validating machine learning models to efficiently create documentation that ensures regulatory compliance, among other critical objectives.

A key innovation in this process is dataset versioning, a feature essential for crafting comprehensive Model Development Documents (MDDs).

In this blog post, we will explore the concept of dataset versioning, how Vectice implements it, and its significant impact on traceability, regulatory compliance, and streamlined modeling workflows.

What Is Dataset Versioning?

Dataset versioning is the process of systematically managing changes to datasets, ensuring every iteration is tracked and uniquely identifiable. This practice is integral to maintaining data quality and consistency, particularly in the context of model development and regulatory frameworks such as SR 11-7, SS 1/23, and the EU AI Act.

Why Dataset Versioning Matters

Dataset versioning is critical for:

  • Ensuring data integrity and reproducibility in machine learning models.
  • Facilitating collaboration across teams.
  • Meeting compliance requirements for traceability and auditability.
  • Supporting the creation of robust Model Development Documents (MDDs).

How Vectice Implements Dataset Versioning

At Vectice, we have built a robust system for dataset versioning that serves multiple purposes:

Building Lineage

Our versioning system tracks changes to datasets during the model development process, enabling the creation of dataset lineage across model development tools. This lineage captures transformations made through feature engineering and identifies train, test, and validation datasets used in model training operations. It is seamlessly integrated into MDDs, providing a comprehensive historical record of dataset usage. This can be particularly helpful for validation teams when they want to reproduce results from data science teams and ensure that the documented dataset version matches the submitted model. Unfortunately, out-of-date or inaccurate documentation remains a frequent occurrence—making this lineage capability even more critical.
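
To make lineage concrete, here is a hedged sketch using the Vectice Python SDK's modeling-dataset helper. The dataset name and file paths are placeholders, and the exact signatures should be checked against the SDK documentation:

```python
# Sketch only: declaring a modeling dataset with its train/test splits lets
# Vectice link the splits into the dataset lineage surfaced in the MDD.
# The name and paths below are hypothetical.
from vectice import Dataset, FileResource

modeling_dataset = Dataset.modeling(
    name="loan_model_inputs",
    train_resource=FileResource(paths="data/train.csv"),
    test_resource=FileResource(paths="data/test.csv"),
)
```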

Ensuring Uniqueness of Objects

Our platform guarantees the uniqueness of dataset versions in the database, preventing duplication. This unified approach ensures that every dataset referenced in an MDD is precise and consistent, enhancing traceability and compliance with regulatory standards. It is particularly helpful for model validation teams when they need to reproduce results produced by others.

Real-World Applications

Consider a straightforward example where you use a local file to train your model. Declaring the dataset is as simple as it gets:
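
The snippet below is a minimal sketch with the Vectice Python SDK; names like Dataset.origin and FileResource follow the public API, but the dataset name and file path are placeholders and exact signatures should be verified against the SDK documentation:

```python
# Declare a dataset backed by a local file. Vectice reads only metadata
# (path, size, timestamps), never the data itself.
from vectice import Dataset, FileResource

raw_dataset = Dataset.origin(
    name="loan_applications",                       # hypothetical name
    resource=FileResource(paths="data/train.csv"),  # hypothetical path
)
```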

Whenever you log an iteration, the versioning system automatically triggers our algorithm, which generates a new dataset version based on detailed, fine-grained information:
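
Here is a hedged sketch of what that looks like in code; the phase ID and token are placeholders, and the connect/iteration calls follow the SDK's documented flow but should be treated as illustrative:

```python
import vectice

# Connect and open an iteration in the phase you are working in.
phase = vectice.connect(api_token="YOUR_API_TOKEN").phase("PHA-1234")
iteration = phase.create_iteration()

# Logging the dataset triggers the versioning algorithm: a digest is derived
# from the file's metadata and a dataset version is registered automatically.
iteration.log(raw_dataset)
```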

When creating a new dataset, our platform will reuse the existing object if it is based on the same data. However, if a different file is used, a new object will be created:
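
Continuing the sketch above (same placeholder names), reuse versus creation looks like this:

```python
# Logging the same file again: the digest matches, so the existing
# dataset version is reused rather than duplicated.
iteration.log(Dataset.origin(
    name="loan_applications",
    resource=FileResource(paths="data/train.csv"),
))

# Logging a different file: the digest changes, so a new version is created.
iteration.log(Dataset.origin(
    name="loan_applications",
    resource=FileResource(paths="data/train_v2.csv"),
))
```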

Experience the Power of Dataset Versioning

Ready to streamline your model development process and ensure regulatory compliance?

Start your own trial or watch a demo

Data security and privacy

At Vectice, we prioritize data security and privacy, making them central to our development process and, by extension, to our versioning algorithm. Unlike other tools, we do not require access to your data to compute dataset versions; we only process metadata. For those who prefer even more control, we offer the ability to disable data collection entirely, ensuring your privacy is always respected.

How it works

To simplify the process, we generate a digest that uniquely identifies each dataset. This approach enables us to create new versions only when necessary and to maintain robust data lineage.

The challenge lies in generating this digest. Some platforms, such as S3, provide a built-in digest for each object, facilitating seamless integration and simplifying the workflow.

However, other platforms lack this capability. To address this, we have identified a set of metadata properties for datasets that allow us to reliably generate a unique digest, ensuring consistent and accurate identification.
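
For illustration only, here is one way such a metadata-based digest could be computed for a local file; this is a sketch of the idea, not Vectice's actual implementation:

```python
import hashlib
import os

def metadata_digest(path: str) -> str:
    """Derive a stable digest from metadata alone: absolute path, size,
    and last-modified time. The file's contents are never read."""
    stat = os.stat(path)
    fingerprint = f"{os.path.abspath(path)}|{stat.st_size}|{int(stat.st_mtime)}"
    return hashlib.sha256(fingerprint.encode()).hexdigest()
```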


Simplified versioning algorithm
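
The simplified algorithm amounts to a digest lookup: reuse the version whose digest matches, otherwise register a new one. In the sketch below, `registry` is a hypothetical stand-in for our version store:

```python
def get_or_create_version(registry: dict[str, str], digest: str) -> str:
    """Reuse the version that matches the digest, or create a new one."""
    if digest in registry:
        return registry[digest]            # same data: reuse existing version
    version_id = f"v{len(registry) + 1}"   # new data: register a new version
    registry[digest] = version_id
    return version_id
```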


Challenges and Lessons Learned

One of the biggest challenges was supporting the vast number of data platforms, such as Snowflake, BigQuery, and S3, each with its own nuances and requirements. Additionally, these platforms can store data either as files or as database tables, creating a multifaceted ecosystem that needs to be addressed. This complexity is further compounded when integrating with proprietary systems or hybrid setups used by organizations.

To tackle these challenges, we adopted an open architecture designed for flexibility and scalability, making it easier to integrate diverse providers. This approach not only ensures smooth integration with major platforms but also allows users to incorporate their own custom data platforms effortlessly. By focusing on adaptability, we built a system that supports both current needs and future expansion, ensuring our platform remains relevant and robust as the data landscape continues to evolve.

It enables the use of multiple data platforms, or the ability to switch between them, all while maintaining seamless compatibility with Vectice at minimal cost.
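
As a purely hypothetical sketch of the open-architecture idea (this mirrors the concept rather than Vectice's actual extension API), each platform integration only needs to satisfy a small interface that yields the metadata used for digest generation:

```python
from abc import ABC, abstractmethod

class DataResource(ABC):
    """Minimal contract a platform integration must satisfy."""

    @abstractmethod
    def metadata(self) -> dict:
        """Return the metadata properties used to build the version digest."""

class CustomWarehouseResource(DataResource):
    """Hypothetical integration for an in-house data warehouse."""

    def __init__(self, table: str, row_count: int, updated_at: str):
        self.table = table
        self.row_count = row_count
        self.updated_at = updated_at

    def metadata(self) -> dict:
        return {"table": self.table, "rows": self.row_count, "updated": self.updated_at}
```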

Conclusion

Dataset versioning is a foundational component of our approach at Vectice. It ensures data integrity, prevents inefficiencies, and empowers our users to document their modeling workflows with precision and confidence. With its critical role in supporting traceability and regulatory compliance, dataset versioning exemplifies our commitment to enabling robust and compliant model development practices.

Experience the Power of Dataset Versioning

Try Vectice Now