Last active
August 29, 2015 14:13
-
-
Save paulghaddad/64f49e82ac27cd840879 to your computer and use it in GitHub Desktop.
LevelUp 6: Understands data quality, and how it has bitten us in the past
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
1. Define data quality. How do we know that we are capturing and delivering the right data? | |
Data quality is the degree of the reliability of the data we bring into our system and whether the data is complete (we aren't missing data). | |
Data of high quality is defined with the following traits, although they are not inclusive: | |
- Free from incomplete or corrupted records. This can lead to situations where expected information is not present in the record or can lead to attempts to divide by zero. | |
- Free from duplicate records. | |
- Free from typographical errors in text and numbers (such as too few or too many zeros) | |
- Has the correct context, such as the correct units. You don't want some records in standard units and others in metric. | |
We know that we are capturing and delivering the right data when we take steps to ensure the reliability of the data in our database: | |
- Eliminating duplicate data (reasons include slow performance; maintenance issues - it takes multiple reads and writes to handle duplicate data; and potential inconsistencies and anomalies). We can do this by removing items that occur repeatedly in a table and move it into its own table. For example, if customer information is placed in a customer purchases table, we can extract the customer data and put it in its own table. And then make a reference from the customer purchases table to the customer table. Now, if the customer information changes, it only needs to be modified in one place, reducing the possibility of bad records. | |
- Eliminating inconsistent data: We can assign unique identifiers to an entity, such as a customer, that is then referenced by other tables. This prevents the problem of simple typographical differences resulting in the database thinking the two entities are different entities. | |
- Not using multipart fields; instead break data down into its separate components. For example, instead of storing a complete address (address, city, state and zip code) in a cell of a database, we should separate each component of the address into separate attributes. It is much easier to work with. | |
- Preventing data conflicts. For example, if a product table has a price attribute, and each product is referenced in a separate purchases table, we can have a data conflict if a change in the price doesn't change the order amount in the second table. The solution to this is to not store the dependent amount, in this case the order amount, at all. Instead, let the computer calculate it on the fly when it is needed. This principle holds for a lot of data that is a sum, average, median or any other dependent quantity. | |
- Requiring complete information. Incomplete data results in incorrect or misleading information. To ensure complete information, you can specify that a field can't be null. | |
- Not collecting more data than is necessary. This puts us at risk of not being able to reasonably derive useful information from it. | |
2. Name some situations where data quality has bitten us, and the consequences of those situations. | |
- Dates in approvals exist outside of Enova's operational years: less than 2000 and greater than 3000. Consequently, inaccurate data appeared in reports. | |
- The dates in the People table seem to be too old in CNU_US. Consequently, this product inaccurate results in reports. | |
- Incomplete data: Since we can't go back and create the good data, we can't fix it. And this is a problem for our regulators as well, as they may want to go back 10 years to get a necessary information. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment