The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning.
—Claude Shannon (1948)
Data contracts are agreements between data producers and data consumers that act as an abstraction layer between the two.
They can help us prevent data quality issues by formalizing interactions between different systems and teams handling data.
Application and data teams are often separated by ELT/ETL infrastructure, and software developers have little visibility into how the data is being consumed. For the data producers, the data platform is a black box. Coining the term data contracts is a way to raise awareness of this issue, even though the implementation is very similar to what developers have already been doing with APIs.
Data contracts regulate:
- the structure/schema of the data (the semantic nature of entities, events, and attributes)
- the quality of the data
- the terms of use
As the business requirements evolve, changes can be made to both the data structure and the triggering logic.
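To make these three areas concrete, here is a minimal sketch of what such an agreement might capture, written as a Python dictionary. All field names (owner, freshness_sla_hours, and so on) are illustrative assumptions, not part of any particular standard.

```python
# A minimal, illustrative data contract covering the three areas above.
# Field names are hypothetical; real templates (e.g., PayPal's, discussed
# below) define their own vocabulary.
orders_contract = {
    "name": "orders",
    "version": "1.0.0",
    # Structure/schema: the semantic shape of the data.
    "schema": {
        "order_id": {"type": "string", "required": True},
        "amount": {"type": "decimal", "required": True},
        "currency": {"type": "string", "required": True},
    },
    # Quality: expectations the producer commits to.
    "quality": {
        "order_id": "unique, never null",
        "amount": ">= 0",
    },
    # Terms of use: who may consume the data, and how.
    "terms_of_use": {
        "owner": "checkout-team",
        "usage": "internal analytics only",
        "freshness_sla_hours": 24,
    },
}
```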
To make use of any data, you need to know what its structure is. This is typically defined in a schema. Schemas are already widely used, for example by relational databases. You can use different Interface Definition Languages (IDLs) to define schemas for serialized data, including Protocol Buffers, Apache Thrift, and Apache Avro. There’s also JSON Schema for defining a schema over JSON documents, upon which standards such as OpenAPI and AsyncAPI are built. Alternatively, you could define your schemas in some custom or abstract format, for example in YAML or JSON, or even in code.
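As a brief sketch of the schema side, the following uses JSON Schema (one of the options just mentioned) together with the Python jsonschema package to validate a record. The event shape is invented for illustration.

```python
from jsonschema import validate, ValidationError

# A JSON Schema describing a hypothetical "order created" event.
order_created_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["order_id", "amount", "currency"],
    "additionalProperties": False,
}

event = {"order_id": "o-123", "amount": 49.99, "currency": "USD"}

try:
    validate(instance=event, schema=order_created_schema)
    print("event conforms to the schema")
except ValidationError as err:
    print(f"schema violation: {err.message}")
```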
PayPal has released its data contract template under an Apache 2 license, in the form of a detailed YAML specification.
In the newest version (v2.1.1), PayPal’s data contract template covers seven sections:
- demographics
- dataset & schema
- data quality
- pricing
- security & stakeholders
- roles
- service-level agreement (SLA)
As you can see, it does not limit itself to a mere schema. More information can be found in the AIDA User Group’s open-source GitHub repository. The contract can be managed in a Git repository and form the basis for further automation. However, the template itself does not provide a process for creating and managing data contracts.
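Because the contract is plain YAML in version control, lightweight checks can run in CI. The sketch below uses PyYAML; the section names are simplified placeholders echoing the list above (the real template defines its own top-level keys), and the file path is hypothetical.

```python
import sys
import yaml  # PyYAML

# Simplified placeholder section names; the actual PayPal template
# defines its own top-level keys.
REQUIRED_SECTIONS = ["demographics", "schema", "quality", "pricing",
                     "stakeholders", "roles", "sla"]

def check_contract(path: str) -> list[str]:
    """Return the required sections missing from a contract file."""
    with open(path) as f:
        contract = yaml.safe_load(f) or {}
    return [s for s in REQUIRED_SECTIONS if s not in contract]

if __name__ == "__main__":
    missing = check_contract(sys.argv[1])
    if missing:
        print(f"contract is missing sections: {', '.join(missing)}")
        sys.exit(1)  # fail the CI job
    print("contract has all required sections")
```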
Data Mesh Manager is a tool for managing data contracts, developed by INNOQ.
Data Contract Studio is a free web tool for developing and sharing data contracts online, bringing data producers and data consumers together.
Data Contract CLI is a free command-line tool that helps you create, develop, and maintain your data contracts.
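As a sketch of how such a tool slots into automation, a CI step might shell out to it as below. This assumes the CLI is installed; the `lint` and `test` subcommand names are taken from the tool’s documentation as of this writing, so verify them against the release you install.

```python
import subprocess
import sys

# Run the Data Contract CLI against a contract file in the repository.
# Subcommand names are assumptions based on the tool's docs; confirm
# them for your installed version.
for subcommand in ("lint", "test"):
    result = subprocess.run(
        ["datacontract", subcommand, "datacontract.yaml"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
        sys.exit(result.returncode)  # surface the failure to CI
```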
Great Expectations is a platform-agnostic open source tool for eliminating pipeline debt through data testing, documentation, and profiling. You can develop your own custom data quality tool or use this one.
Great Expectations makes it possible for data teams to quickly deploy extensible, flexible data quality testing into their data stacks. Its human-readable documentation makes the results accessible to technical and nontechnical users. An overview of the key features is listed here.
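Here is a brief sketch of what that testing looks like in Python. Great Expectations’ API has changed across major versions, so treat this classic pandas-dataset style as illustrative rather than current best practice; the column names are invented for the example.

```python
import great_expectations as ge
import pandas as pd

# Wrap a pandas DataFrame so it gains expect_* assertion methods.
# (Classic Great Expectations style; newer versions use a different,
# context-based API.)
df = ge.from_pandas(pd.DataFrame({
    "order_id": ["o-1", "o-2", "o-3"],
    "amount": [10.0, 25.5, 7.99],
}))

# Expectations read like the quality clauses of a data contract.
results = [
    df.expect_column_values_to_not_be_null("order_id"),
    df.expect_column_values_to_be_unique("order_id"),
    df.expect_column_values_to_be_between("amount", min_value=0),
]

for r in results:
    print(r.success)
```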
Gable.ai, a new platform, allows data producers and data consumers to work together via data contracts. It helps software and data developers prevent breaking changes to critical data workflows within their existing data infrastructure. The platform features data asset recognition by connecting data sources; data contract creation to establish data asset owners and set meaningful constraints; and data contract enforcement via continuous integration/continuous deployment within GitHub.
Keep in mind that the adoption of data contracts will vary depending on the industry, organization size, and specific use cases. Staying informed about these developments and adapting data contract strategies accordingly will be essential for businesses to remain agile and competitive in the evolving data landscape.
Any feedback is welcome. Don't be shy 😉.