
3 Immutable Rules for Successful Data Sharing | by Louise de Leyritz | Jan, 2023



3 rules for successful data sharing — Image courtesy of Castor

In my previous article, I discussed the topic of data sharing, which is already a well-established concept. Data sharing refers to the practice of opening data access to all departments to empower each one to make data-driven decisions.

It is still too common for companies to jump into data-sharing initiatives without a proper plan, believing that simply granting increased access rights to business departments will be sufficient. This approach is misguided. In reality, data sharing is a complex undertaking that requires thoughtful planning and execution in order to be successful.

We propose three immutable guidelines to ensure the success of your data-sharing initiative:

  1. Thou shalt not compromise on data quality
  2. Thou shalt enrich the data with bountiful context
  3. Thou shalt provide the right interface for exploring the data

The first rule, about data quality, is the backbone of data sharing: it is a non-negotiable prerequisite. Data quality is the responsibility of data producers (software and data engineering teams). It is about putting good-quality data in the hands of the data team. Without quality data, the data team cannot do its job, let alone engage in data sharing with other departments. In fact, if the data team can’t use the data, why even bother sharing it with others?

The second and third rules in this article are focused on ensuring that high-quality data is effectively shared with the business teams. This involves not only providing accurate and reliable data, but also enriching it with relevant context and making it easily accessible through user-friendly interfaces. By doing so, even teams that are less technically proficient can easily make use of the data. You can find a visual representation below.

Three rules for effective data sharing — Image courtesy of Castor

Disregarding any of these rules will inevitably lead to failure, which we’d ideally like to avoid. Let’s delve deeper into each.

The foundation of successful data sharing is to maintain the quality of the data you share with business units.

Data sharing is about equipping business units with the ability to make data-driven decisions. For this to happen, you must provide them with top-notch data.

When you share flawed data, people obviously make poor decisions. This can lead to significant financial losses, missed opportunities, and damage to your company’s reputation. More importantly, this can erode trust in your data and lead to general disinterest in data. If the plan is not to share first-rate data, then don’t share data at all. Data sharing is an all-in or all-out effort. If not executed properly, it can be detrimental to your organization.

Data quality is the umbrella term encompassing all the factors influencing whether data can be relied upon for its intended use. There are several characteristics that define high-quality data, including but not limited to:

  • Accuracy: The degree to which data correctly describes the real-world phenomenon it represents.
  • Completeness: The data is complete and contains all the necessary information.
  • Consistency: The data is consistent across different sources and platforms.
  • Reliability: The data is up-to-date and relevant to the intended use case.
  • Usability: The ease with which data can be understood and used by intended audiences to make informed decisions.

You can find more data quality metrics in Kevin Hu’s article about the topic.

Data quality attributes and their associated metrics — Image courtesy of Castor
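
To make these attributes less abstract, here is a minimal sketch of how a data team might check a few of them programmatically before sharing a table. It uses pandas on a hypothetical orders extract; the file name, column names, and thresholds are illustrative assumptions, not a prescription.

    import pandas as pd

    # Hypothetical extract of an "orders" table shared with business teams.
    orders = pd.read_csv("orders.csv")
    orders["updated_at"] = pd.to_datetime(orders["updated_at"], utc=True)

    checks = {
        # Completeness: key fields should not be missing.
        "order_id_complete": orders["order_id"].notna().all(),
        # Accuracy / validity: amounts should fall within a plausible range.
        "amount_in_range": orders["amount"].between(0, 100_000).all(),
        # Consistency: order IDs should be unique across the extract.
        "order_id_unique": orders["order_id"].is_unique,
        # Freshness: the table should have been refreshed in the last 24 hours.
        "updated_recently": (pd.Timestamp.now(tz="UTC") - orders["updated_at"].max()) < pd.Timedelta(hours=24),
    }

    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise ValueError(f"Data quality checks failed: {failed}")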

When you share data with these attributes, you’re increasing the odds of improved decision-making and efficiency. But that’s not all there is to data quality.

A good way of making sure your data meets the right quality standards is to implement data contracts.

Data contracts are an important component of any data democratization initiative. The data community has a love-and-hate relationship with data contracts. But we think they are worth mentioning in a data-sharing conversation.

Data contracts are agreements between data producers and data consumers that outline the specific terms and conditions for sharing and using data. They can play an important role in ensuring data quality by setting clear expectations and guidelines for how the data should be handled.

A data contract might specify the format, constraints, and semantic meaning the data must respect before it is shared, or include clauses that require the data to be regularly audited for quality.

Data contracts might include information such as:

  • What data is being collected
  • How often and how the data is being ingested
  • Who owns and is responsible for the data (individual or team)
  • Who has access to the data and at what level
  • Security and governance measures, such as anonymization

For example, let’s consider the machine learning model that powers Uber Eats. The model’s performance depends on the accuracy of its training data, which is sourced from various tables within the company.

To ensure the model functions correctly, we expect the integrity of the data to be maintained at all times; this means the columns should never be removed, the values of each field should remain consistent, and all critical business logic should be upheld. If any of these conditions are not met, the model’s performance may be compromised.

To ensure that these expectations are met, they should be outlined in a data contract to hold data producers accountable for maintaining the integrity of the data.
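
As a rough illustration, such expectations can be written down and enforced in code. The sketch below uses a hypothetical trips table and a plain Python dictionary as the contract format; real setups often rely on schema registries, dbt tests, or dedicated contract tooling instead.

    import pandas as pd

    # Hypothetical contract for a table feeding the delivery-time model.
    contract = {
        "table": "trips",
        "owner": "delivery-data-team",
        "refresh": "hourly",
        "columns": {
            "trip_id": "int64",
            "pickup_ts": "datetime64[ns]",
            "delivery_minutes": "float64",
        },
        "constraints": {
            # Critical business logic the producers commit to upholding.
            "delivery_minutes_non_negative": lambda df: (df["delivery_minutes"] >= 0).all(),
        },
    }

    def enforce_contract(df: pd.DataFrame, contract: dict) -> None:
        # Columns must never be removed and must keep their declared types.
        for col, dtype in contract["columns"].items():
            assert col in df.columns, f"missing column: {col}"
            assert str(df[col].dtype) == dtype, f"type drift on {col}: {df[col].dtype}"
        # Critical business logic must hold for every shared batch.
        for name, check in contract["constraints"].items():
            assert check(df), f"constraint violated: {name}"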

Overall, data contracts provide a framework for ensuring data quality by setting clear guidelines and expectations for how data should be handled and maintained, and they help ensure that all parties involved are held accountable for maintaining the quality of the data. This way, data contracts can prevent flawed data from landing in the hands of operational teams.

Maintaining a high level of data quality is important, but it alone is not sufficient. The next step is to ensure that context is also provided.

Context is the second key to effectively implementing data sharing. Data without context is dangerous and worthless because it is left open to interpretation by various teams.

Let me tell you, this is not a safe bet. Different interpretations mean different conclusions, and ultimately mean incoherent reporting across departments. If you’re going to lead business teams in uncharted territory, give them a map. Context is the map.

People understand a dataset when they are aware of the needs this data will satisfy, its content, and its location. Finding the relevant dataset is only 10% of the job. They then need to go through a checklist of 10+ questions to make sure they understand what data they’re using. People understand the data only when they can answer the following questions:

  • Where does the data come from?
  • Where does it flow and which tables does it feed downstream?
  • Who owns it / who is responsible for it?
  • What is the meaning of a given field in my domain?
  • Why does it matter?
  • When was the last time this table was updated?
  • What are the upstream and downstream dependencies of this data?
  • Is this production-quality data?

Context starts with documentation. All the shared data assets need to be documented for stakeholders to understand them. In practice, this means curating your data assets with column definitions, tags, owners, etc. When you document your data properly, people know where to find it and how to use it without having to reach out to someone else in the company.
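
As a rough sketch, the documentation for a single table boils down to a structure like the one below. The mql_leads table and its fields are made up for illustration; in practice this metadata usually lives in a data catalog rather than in code.

    from dataclasses import dataclass, field

    @dataclass
    class TableDoc:
        name: str
        description: str
        owner: str                                  # individual or team responsible
        tags: list[str] = field(default_factory=list)
        columns: dict[str, str] = field(default_factory=dict)  # column -> definition

    mql_leads_doc = TableDoc(
        name="analytics.mql_leads",
        description="One row per Marketing Qualified Lead, refreshed daily.",
        owner="marketing-analytics",
        tags=["marketing", "production"],
        columns={
            "lead_id": "Unique identifier of the lead in the CRM.",
            "mql_date": "Date the lead crossed the MQL scoring threshold.",
        },
    )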

The second aspect of providing context is to have a robust data lineage capability. Data lineage is an extremely powerful transparency device. It enables people to understand how data assets are related. If something breaks upstream, data lineage allows everyone to understand what the consequences will be downstream, avoiding unpleasant surprises. Lineage can also assist stakeholders in identifying the source of data problems when they arise.

Data lineage: tracing the relationship between data assets — image courtesy of Castor
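
Under the hood, lineage is essentially a dependency graph between assets. The sketch below, with made-up table names, shows how such a graph can be traversed to list every downstream asset affected when an upstream table breaks:

    from collections import deque

    # Hypothetical lineage graph: each table maps to the tables it feeds downstream.
    lineage = {
        "raw.orders": ["staging.orders"],
        "staging.orders": ["analytics.revenue", "analytics.mql_leads"],
        "analytics.revenue": ["dashboards.finance"],
    }

    def downstream_impact(source: str) -> set:
        """Return every asset that depends, directly or indirectly, on `source`."""
        impacted, queue = set(), deque([source])
        while queue:
            node = queue.popleft()
            for child in lineage.get(node, []):
                if child not in impacted:
                    impacted.add(child)
                    queue.append(child)
        return impacted

    print(downstream_impact("raw.orders"))
    # {'staging.orders', 'analytics.revenue', 'analytics.mql_leads', 'dashboards.finance'}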

The third aspect of providing context is facilitating social discovery among stakeholders. This can be achieved by sharing information about how the data is being utilized.

When people can see how their peers are using and querying the data, they are able to start with a stronger foundation and can learn from the insights and strategies of their colleagues. Social discovery allows teams to build on one another’s knowledge and thus work more efficiently.

For instance, a marketing analyst who wants to perform an analysis on Marketing Qualified Leads (MQLs) can leverage social discovery to streamline the process. With social discovery, the analyst can quickly identify the most relevant tables and datasets being used by the rest of the marketing team. They can also access the queries that have already been run by the team, which can serve as a starting point for the analysis. This not only saves time but also allows the analyst to gain insights and learn from the work of colleagues.

If you’re going to share data with anyone, you have to do it through the right interface. Not all team members have the same level of technical expertise and not all teams have the same data needs. It is essential to provide the right interface for the right team in order to make data accessible to all.

If you are documenting your data in dbt, you cannot expect the marketing team to fetch the documentation there. Context should be made available in tools that are user-friendly for business teams. There are two ways to go about this:

One way to achieve this is by offering a tool that enables efficient search and navigation. The tool should be easy to use and understand, to ensure that non-technical team members are able to use it effectively. A data catalog is an example of such a tool that can be used to discover, understand and access data easily.

Another approach to providing the right interface is by making data easily accessible within the tools that business teams already use. This approach involves delivering the data to the tools that are already familiar to the teams. Reverse ETL tools can be used for this purpose.

By making the data findable within existing tools, teams can access the data they need without having to navigate new systems or learn new software. For example, once lead scoring has been calculated on top of the data warehouse, reverse ETL allows this metric to be synced into Salesforce, so that sales teams can access it directly within the tools they are familiar with.
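
Conceptually, a reverse ETL sync boils down to reading a modeled metric from the warehouse and writing it back into the operational tool. The sketch below is purely illustrative: query_warehouse and crm are hypothetical stand-ins for whatever warehouse connector and CRM or reverse ETL client a team actually uses.

    # Illustrative only: query_warehouse and crm are hypothetical stand-ins for
    # a real warehouse connector and a real CRM (or reverse ETL) client.

    def sync_lead_scores(query_warehouse, crm) -> int:
        rows = query_warehouse(
            "SELECT crm_lead_id, lead_score FROM analytics.lead_scores"
        )
        synced = 0
        for row in rows:
            # Push the warehouse-computed score onto the matching CRM record.
            crm.update_lead(row["crm_lead_id"], {"lead_score": row["lead_score"]})
            synced += 1
        return synced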

No matter your approach, keep in mind that if you want to make your data available to all, you must cater to the business team’s needs. Demanding they learn the technical team’s tools and processes will only hinder your efforts.

Providing the right interface is crucial for democratizing data and making it accessible to all team members. It is important to consider the technical expertise and data needs of different teams when deciding on the right interface. By providing an easy-to-use tool or shipping data to existing tools, teams can access the data they need to make informed decisions and drive results.

In conclusion, data sharing is a powerful tool for driving data-driven decisions and fostering collaboration across departments. But it is a complex undertaking that requires thoughtful planning and execution to be successful.

We propose three immutable rules to ensure the success of your data-sharing initiative: 1) Maintaining data quality, 2) Providing rich context around the data, and 3) Providing the right interface for exploring the data.

Of course, data sharing also raises privacy and security concerns that I have not addressed in this article. My next article will be entirely dedicated to this topic!

We write about all the processes involved in leveraging data assets: from the modern data stack to data team composition to data governance. Our blog covers both the technical and the less technical aspects of creating tangible value from data.

At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation.

Want to check it out? Reach out to us and we will show you a demo.

