
AI observability and data as a cybersecurity weakness | by Jeremie Harris | Sep, 2022

APPLE | GOOGLE | SPOTIFY | OTHERS

Editor’s note: The TDS Podcast is hosted by Jeremie Harris, who is the co-founder of Gladstone AI. Every week, Jeremie chats with researchers and business leaders at the forefront of the field to unpack the most pressing questions around data science, machine learning, and AI.

Imagine you’re a big hedge fund, and you want to go out and buy yourself some data. Data is really valuable for you — it’s literally going to shape your investment decisions and determine your outcomes.

But the moment you receive your data, a cold chill runs down your spine: how do you know your data supplier gave you the data they said they would? From your perspective, you’re staring down 100,000 rows in a spreadsheet, with no way to tell if half of them were made up — or maybe more for that matter.

This might seem like an obvious problem in hindsight, but it’s one most of us haven’t even thought of. We tend to assume that data is data, and that 100,000 rows in a spreadsheet is 100,000 legitimate samples.

The challenge of making sure you’re dealing with high-quality data, or at least that you have the data you think you do, is called data observability, and it’s surprisingly difficult to solve at scale. In fact, there are now entire companies that specialize in exactly that — one of which is Zectonal, whose co-founder Dave Hirko will be joining us for today’s episode of the podcast.

Dave has spent his career understanding how to evaluate and monitor data at massive scale. He did that first at AWS in the early days of cloud computing, and now through Zectonal, where he’s working on strategies that allow companies to detect issues with their data — whether they’re caused by intentional data poisoning, or unintentional data quality problems. Dave joined me to talk about data observability, data as a new vector for cyberattacks, and the future of enterprise data management on this episode of the TDS podcast.

Here are some of my favourite take-homes from the conversation:

  • Data suppliers can’t always be trusted to deliver the data they promised, when they promised it. Dave has found that in some cases, data suppliers will duplicate, or even outright fabricate, samples to add to a body of legitimate data, in order to fool their customers into thinking they’re receiving the full volume of data promised. By and large, companies lack a systematic way of assessing whether and when that’s happening.
  • Just as important as data quantity and quality is the timing of data arrival. Data goes stale, and if there’s a significant gap between the expected age of supplier data and its actual age, you can run into out-of-distribution sampling issues that make models behave poorly. Seen through that lens, data observability can become an AI safety issue. (A minimal sketch of these kinds of checks, covering volume, duplication, and freshness, follows this list.)
  • Synthetic data is a topic we’ve discussed before. It lets us generate new samples that may actually be more information-rich than “regular” data, by leveraging large models (think: GPT-3) with a lot of world knowledge embedded in them, knowledge they can use to generate samples that account for context the original data didn’t capture. But synthetic data can also be used by data suppliers to puff up a much smaller authentic dataset, deceiving customers into believing they’re getting more original data than they are. Dave sees this becoming a major issue in the future as synthetic data generation continues to improve.
  • Dave discussed an experiment Zectonal ran recently, which showed how data itself could be used as a vector for a new kind of cyberattack. Many cyberattacks involve sending commands to vulnerable APIs, causing those APIs to fail in ways that create openings hackers can exploit. As a result, an entire ecosystem of cyberdefences has evolved to prevent these sorts of frontal attacks from succeeding, or at least to mitigate their impact. But Zectonal’s study showed how malicious payloads can be inserted into data, which doesn’t receive nearly as much cybersecurity scrutiny (a rough sketch of what a pre-ingestion payload scan might look like also follows this list). Dave hypothesizes that data will become an important channel for cyberattackers in the future, particularly given how little attention has historically been paid to securing data.
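
To make the first two take-homes concrete, here is a minimal sketch, in Python with pandas, of the kinds of coarse checks a data buyer might run on a delivery: row count against the contracted volume, the share of byte-identical rows, and the age of the newest record against an agreed delivery lag. The function name, column name, and thresholds are illustrative assumptions on my part, not Zectonal’s implementation.

    from datetime import timedelta

    import pandas as pd


    def check_delivery(df: pd.DataFrame,
                       expected_rows: int,
                       timestamp_col: str,
                       max_age: timedelta,
                       max_duplicate_ratio: float = 0.01) -> dict:
        """Run a few coarse observability checks on a delivered dataset.
        Thresholds and column names are illustrative, not a vendor standard."""
        findings = {}

        # Volume: did we receive at least the number of rows we were promised?
        findings["row_count_ok"] = len(df) >= expected_rows

        # Duplication: a large share of byte-identical rows can signal padding.
        dup_ratio = float(df.duplicated().mean())
        findings["duplicate_ratio"] = dup_ratio
        findings["duplicates_ok"] = dup_ratio <= max_duplicate_ratio

        # Freshness: is the newest record older than the agreed delivery lag?
        newest = pd.to_datetime(df[timestamp_col], utc=True).max()
        age = pd.Timestamp.now(tz="UTC") - newest
        findings["data_age"] = age
        findings["freshness_ok"] = age <= max_age

        return findings


    # Hypothetical usage, assuming an "event_time" column and a 24-hour lag SLA:
    # report = check_delivery(df, expected_rows=100_000,
    #                         timestamp_col="event_time",
    #                         max_age=timedelta(hours=24))

None of this proves a delivery is legitimate; it only surfaces the most obvious gaps between what was promised and what arrived.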
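
On the last take-home, the summary above doesn’t spell out the attack itself, but one way to picture data as an attack vector is a delivery whose field values contain strings meant to be interpreted by downstream software (loggers, parsers, template engines) rather than stored as inert values. The pre-ingestion scan below is my own hedged illustration; the patterns are assumptions, not anything Zectonal has described.

    import pandas as pd

    # Illustrative patterns only: strings that downstream loggers or template
    # engines might interpret rather than store, such as a Log4Shell-style
    # lookup, a generic ${...} expansion, or embedded markup.
    SUSPICIOUS = r"(?:\$\{jndi:|\$\{.+?\}|<script\b)"


    def scan_for_payloads(df: pd.DataFrame) -> pd.DataFrame:
        """Return rows whose string fields contain a suspicious-looking payload."""
        text = df.select_dtypes(include="object").astype(str)
        if text.empty:
            return df.iloc[0:0]  # no string columns, nothing to flag
        hits = text.apply(lambda col: col.str.contains(SUSPICIOUS, case=False,
                                                       regex=True, na=False))
        return df[hits.any(axis=1)]


    # Hypothetical usage: quarantine flagged rows before they reach the pipeline.
    # flagged = scan_for_payloads(incoming_batch)
    # if not flagged.empty:
    #     quarantine(flagged)  # placeholder for whatever quarantine step fits

The specific patterns matter far less than the point from the episode: data crossing a trust boundary gets much less scrutiny than code or network traffic, and that gap is exploitable.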

Chapters:

  • 0:00 Intro
  • 3:00 What is data observability?
  • 10:45 “Funny business” with data providers
  • 12:50 Data supply chains
  • 16:50 Various cybersecurity implications
  • 20:30 Deep data inspection
  • 27:20 Observed direction of change
  • 34:00 Steps the average person can take
  • 41:15 Challenges with GDPR transitions
  • 48:45 Wrap-up

