The Importance of Data Cleansing

by Pete Mazzocchi

Make sure your data is clean and accurate before it starts causing problems in your bottom line

Organizations often don’t didn’t recognize the need for data quality until it’s apparent that company initiatives are failing due to poor data quality. Even when organizations take additional time and effort resolve various data issues, they end up losing momentum, confidence is reduced, and reputations are impaired. Effective data quality programs are becoming a near-absolute requirement.

When dealing with multiple data sources, Data cleansing software is an absolute requirement. Data cleansing should ensure that the data values are timely. For example, the organization should ensure that they have current mailing addresses or phone numbers, not ones that were valid last year but have since changed. This appears to be rather simple at first, but it can get fairly complicated when dealing with data from multiple sources, and the data needs to be combined into a single database.

When data is collected from multiple sources, and then merged into a unified data set. It must then be cleansed to correct inconsistencies and errors. Verifying and correcting data as it is entered into the system is an important step that helps to minimize errors, but it may not catch inconsistencies from multiple sources and systems.

When comparing records to determine if they contain the same data values, you must have the data values in the same format. Records such as names, dates, telephone numbers, addresses, and ID numbers can all be better compared if the data values for a given field are in a common, standardized format. Many data integration tools can convert source data from disparate files to a standard format by reformatting telephone numbers to eliminate non-numeric characters such as parentheses, decimal points, or dashes. But their ability to further cleanse the data is limited.

For example, they can’t check whether the area code for a home phone is consistent with the ZIP Code for the home address. They also cannot parse addresses into standardized values (e.g., changing St to Street) or convert a free-form entry into individual fields (e.g., separating an unstructured name field into last name, first name, middle name, and title entries). Most data integration tools do not have sophisticated matching algorithms, so they would not recognize that “B. Jones at 1234A Cherry Lane” is probably the same person as “Jones, Barry at 1234 Cherry  Ln.”  And only a sophisticated data cleansing tool can recognize that in New York City, Avenue of the Americas is another name for 6th Avenue.

Data cleansing involves:

  • Converting data fields to a standard format
  • Filling in missing values
  • Identifying and correcting errors
  • Parsing data to convert unstructured fields into individual components
  • Matching records and eliminating duplicates

Data cleansing is an integral and important part of any data quality regime, but especially so when dealing with disparate sources.