Published On: September 3rd, 2020 | Categories: Data Guide | 3.5 min read

How bad data harms artificial intelligence and machine learning.

What is “bad data”?

“Bad data” can mean several things. Sometimes it means that the data is labeled incorrectly, is full of errors, has missing values or is otherwise poor in quality. When this data is used, the results will not be a true reflection of reality, and predictive models built on it become unreliable. For example, a model trained on incomplete data may predict that a failure will happen less often than it would if it had the full data set, and costly breakdowns of machinery might ensue.
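
The kinds of problems described above (missing values and incorrect labels) can be counted before any model is trained. The following is a minimal sketch, not a definitive implementation: the function name `audit_records`, the field names, and the sample readings are all hypothetical and invented for illustration.

```python
def audit_records(records, required_fields, allowed_labels, label_field="label"):
    """Count basic quality problems in a list of dict records."""
    report = {"total": len(records), "missing_values": 0, "invalid_label": 0}
    for record in records:
        # A value counts as missing if the field is absent, None, or an empty string.
        if any(record.get(field) in (None, "") for field in required_fields):
            report["missing_values"] += 1
        # A label counts as invalid if it falls outside the agreed set of labels.
        if record.get(label_field) not in allowed_labels:
            report["invalid_label"] += 1
    return report


# Hypothetical machine sensor readings, invented for illustration only.
readings = [
    {"machine_id": "A1", "temperature": 71.5, "label": "ok"},
    {"machine_id": "A2", "temperature": None, "label": "failure"},
    {"machine_id": "", "temperature": 69.0, "label": "okay"},
]

report = audit_records(readings, ["machine_id", "temperature"], {"ok", "failure"})
print(report)
```

A report like this only catches problems that can be checked mechanically. A label that is valid but wrong for its record would still pass, so a check of this kind is a starting point rather than proof of quality.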

Sometimes bad data might mean the data is biased or too complex for the algorithm to handle. Humans have unconscious biases that we apply to the world around us. These biases have a way of sneaking into how we design our data analytics systems. For example, in recent years it was widely reported that facial recognition systems are much less accurate at identifying Black or Asian faces than white faces. The machine learning algorithm isn’t inherently biased towards white faces, but the data it was trained on consisted mostly of white faces, so it became much more accurate at identifying them.
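
One way to spot this kind of imbalance is to look at how much of the data each group contributes and how accurate the model is for each group separately. The sketch below assumes each example is a `(group, actual, predicted)` tuple. The function name `accuracy_by_group`, the group names, and the sample results are hypothetical and do not come from any real system.

```python
from collections import defaultdict


def accuracy_by_group(examples):
    """Return {group: (share_of_data, accuracy)} for (group, actual, predicted) tuples."""
    counts = defaultdict(int)
    correct = defaultdict(int)
    for group, actual, predicted in examples:
        counts[group] += 1
        if actual == predicted:
            correct[group] += 1
    total = len(examples)
    return {
        group: (counts[group] / total, correct[group] / counts[group])
        for group in counts
    }


# Hypothetical results, invented for illustration: group_a dominates the data.
examples = (
    [("group_a", "match", "match")] * 7
    + [("group_a", "match", "no_match")]
    + [("group_b", "match", "match")]
    + [("group_b", "match", "no_match")]
)

for group, (share, accuracy) in accuracy_by_group(examples).items():
    print(f"{group}: {share:.0%} of data, {accuracy:.1%} accurate")
```

A single overall accuracy figure would hide the gap between the two groups, which is why the breakdown is reported per group.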

Complexity and unconscious biases also lead to “bad data”

Similarly, IBM designed a computer system called Watson[3], which was tasked with helping doctors with cancer diagnoses. The idea was that Watson would be able to look at any combination of symptoms and return a diagnosis with a confidence level, all in a matter of seconds. However, Watson proved much less impressive in practice.

There were several issues with Watson. Firstly, many doctors around the world complained that Watson was biased towards American methods of diagnosis and treatment. Secondly, Watson had great trouble reading doctors’ handwritten medical notes, so it missed a good share of the new data that arrived in this form.

Lastly, and crucially, Watson could only think in statistics. On the surface, this might seem like a good thing, and in the vast majority of scenarios it is. Watson could crunch thousands of medical journals and other data sets to make confident assertions about what type of cancer a person might have and how to treat it. But it couldn’t understand when something was significant in the same way a doctor or scientist can. For example, in 2018, the FDA approved a new cancer drug that is effective against all tumors that have a certain genetic mutation. The approval was based on one study of only 55 patients, solely because the results were so promising. Watson ignored this study because it was so small and did not deem it significant.

In business, the stakes might be less severe than cancer diagnoses, but these issues with complexity and bias still exist.

How to ensure “good” data

NodeGraph makes it possible to ensure good data by tracking and tracing the origins, movements, and touchpoints of data to provide insight into its quality and uncover its value.

Read the full guide

Access the full guide to learn why high-quality data matters and how poor-quality data harms AI and ML.

The guide covers:

An overview of the current data landscape
The relationship between AI and data quality
How bad data harms AI and ML
How to achieve high data quality
Data quality and ROI

Fill in the form to access the full guide.
