A Data Quality Riot Act January 14, 2010 12 Comments

As the son of a former Marine Chief Warrant Officer (CW2) I have not only seen the occasional riot act being doled out to my brother; I have been on the receiving end of a few myself.  I have discovered that in order for your data quality initiatives to be successful you need someone who is willing to play the role of drill sergeant on occasion. There are going to be occasions where someone needs to take control when a business process or system provides the latitude to introduce bad data into the environment. This person is going to need to motivated and passionate about rooting out bad data and the processes that allow it to be introduced.

Recently I received an email from a business analyst inquiring about a discrepancy in the between the operational system and the analytical application that his team had recently distributed across the enterprise. The discrepancy was that a name change in the operational system was not being reflected in the analytical application. At the surface, this seemed fairly benign. We took a look at the table responsible for this particular entity the table below depicts what was discovered.

Id First Name Last Name Employee Id User Id
10087632 Olivia Johnson 257303 johnsono
10108465 Olivia Matthews 0257303 matthews

Since this table behaves as a Type 4 slowly changing dimension, an update of the last name and user id would have simply move the original record to an audit table and introduced a new record with the correct last name. Instead, a new record was created not only introducing a new surrogate key for the same person but the unique constraint on the employee id was bypassed by inserting a leading zero into the field.  This had to have been done by a user who was not properly trained on handling employee name changes in the operational system. A well written email can prevent this from happening again. The user could even be instructed to go back in to the system, delete the new record, and update the existing record properly. But that was not the case.

The users responsible for managing user accounts in the operational system have been instructed to create a new user account when someone submits a name change. These instructions also explicitly state to insert a leading zero for the employee id to bypass the uniqueness constraint placed on the field. What happens when an employee gets married then divorced and no longer wishes to retain her now ex-husband’s last name or marries again? Will they add a second leading zero to the employee id?

It was upon hearing the details of the business process that I could feel the increase in my blood pressure. This operational system has been live enterprise wide for less than six months and the business processes that support it are already showing signs of failure. History did not provide the learning experience that you would have hoped; it will take someone to put on their drill instructor uniform and break the privates in their platoon of their bad habits. Only then can they be built back up with understanding that they are to protect the enterprises data like it was their homeland.

Themeword 2010: FOCUS December 31, 2009 No Comments

Focus

I have never been one for putting together a list of resolutions every year only to forget about them before the new year is a month old. However, there was a discussion and game of tag that emerged on Twitter among the SQL Server tribe recently about choosing a themeword instead of listing resolutions. After reading about the themewords that were chosen by Thomas Larock (@SQLRockstar), Colin Stasiuk (@BenchmarkIT), and doing a little reading about the concept over at HPC, I decided that I would jump into the ring and try it out this year.

My themeword for 2010: FOCUS

The one thing that I realized over the past year that has been missing in both my professional and personal life is focus. I need to focus more on my career development this year. The role of a Teradata DBA has been a great source of achievement and satisfaction for me over the past several years. I still enjoy the thrill I get when I hyperfocus on a performance tuning a query, writing a statistics maintenance routine, building an object catalog or developing a multivalue compression routine. But I also have the desire to learn more about data management, data quality, and data governance. While I blogged about data governance, data management and business intelligence this past year, many of the issues that I have dealt day in and day out with have centered on being a DBA and less on theses other areas.  This year I want to spend more time focusing on these areas because they are going to continue become increasingly more important in both the public and private sectors.

I have also made several great connections on Twitter. I would like to focus on strengthening theses connections over the course of the upcoming year. In doing so, I need to also make sure that I focus on the value that LinkedIn can provide through its discussions and even local meet-ups that occur on a semi-regular basis. This also includes branching out into new realms such as attending the local SQL Server PASS meetings on a regular basis.

However, just as important as focusing on career development there needs to be equal amounts of time spend focusing on developing as a father and husband. I have been very fortunate that I have been able to be an active participant in the activities that both my sons have been involved in over the past several years, from coaching Little League to bowling tournaments to roller skating and school parties. But I need to work harder as they grow older to serve as a role model and provide guidance for them as they face the transition from childhood to adolescence and adulthood. As a husband I need to focus on setting time aside for my wife and I to do things together. All too often we are so caught up in our children’s activities, day-to-day life, or my Playstation 3 gaming habits. It doesn’t need to be anything elaborate like cruises or vacations abroad (although I wouldn’t mind either!). Just something simple like catching dinner and a movie or having take-out dinner and watching a movie at home.

How to Pick Your #ThemeWord for 2010

  1. Think of a word that reflects your hopes and dreams for 2010.
  2. Share your ThemeWord with friends on Twitter, Facebook, or Your Blog.
  3. Be sure and use the hashtag #ThemeWord.

Tag Your IT!

I am going to tag a couple of the people that I have met on Twitter this year:

All Your URL are Belong to Us December 14, 2009 1 Comment

Today's Lesson

Today, Google announced the launch of goo.gl, its own URL shortening service. It is currently in a “limited” release for users of its Google products such as Feedburner and the Google Toolbar. This was done in concert with Feedburner’s new enhancement called Socialize, which will currently send your feed to the Twitter account you specify. In order for your feed to be socialized in near real-timte they suggest you leverage Feedburners “ping” feature whenever you publish a post. Overall, the Socialize feature of Feedburner is a nice enhancement and I hope that we will see additional services like Facebook, MySpace, and LinkedIn included in the future.

However, the most important feature that was launched today was the URL shortening service.  Why is this so important? Because several URL shortening services have come and gone over the past couple years. With the demise of these services comes the inability to redirect the user from the shortened URL to the target URL, also known as linkrot.  Brent Ozar wrote about his suspicions of using free web services to shorten URLs for this very reason. With Google the chance that your shortened URL will rot is virtually nil given Google’s appetite for data.

While Google has said it will protect users of its URL shortening from malware and phishing it would be nice to see a link preview feature enabled. The current version of TweetDeck (v0.32.1) simply opens your web browser with the resolved goo.gl link while for other services such as bit.ly a preview of the full URL is given in a pop-up window . This is either due to TweetDeck not understanding how to handle goo.gl’s URL preview or because goo.gl has not exposed a URL previewing feature. Google would be doing its users a disservice if a URL previewing feature is not exposed sooner than later.

Finally, how long before the current crop of URL shortening services fade away or are bought by Google. With the adoption of bit.ly as the default URL shortening service by Twitter in 2009 combined with Twitter’s explosive growth in 2009 the loss of bit.ly’s URL resolution over time could be troubling for many.

Will goo.gl become the dominating URL shortening service or do you think services such as bit.ly and ow.ly can survive?

Teradata Database Object Summary December 8, 2009 No Comments

In my previous blog post I showed how to build an Object Catalog of your Teradata data warehouse. Trouble is even on moderately sized systems your Object catalog could result in tens of thousands of rows. One of my object summaries is over thirty thousand rows. This can be overwhelming if you hand it off to management since they prefer a thirty thousand foot view.

So in order to provide a higher level overview of your Teradata system you can create a view on top of your object catalog that summarizes information at the database or user level providing information such perm space allocation/consumption, average table perm space consumption, average number of rows per table, number of tables with/without stats, and a count of each object type in the database.

The following view could be extended fairly easily to include additional metrics on such as the largest and smallest tables in both row count or perm space consumption. For my needs, I didn’t need to include those items.  The SQL can be found below:

Read the rest of this entry »

Building a Teradata Object Catalog December 7, 2009 4 Comments

Recently I was asked to provide an inventory of a production Teradata system that included a list of all user created objects and as much information about the objects that I could gather included perm space usage, row counts, and object type. Two of the three could easily be gathered using the data dictionary tables in DBC. However, the row count requirement made me pause for a moment and weigh the impact this could have. The production system that I was working on had hundreds of databases and thousands of tables.

I knew writing an SQL script that would build the SELECT queries that would perform a COUNT(*) against every table and capture that in a flat file or table on the system was trivial. However, when you have a system with thousands or tens of thousands of objects it would take quite some time to run all of those statements even if many of the tables are relatively small. It just didn’t make good sense to take that approach.

Then I realized that nearly every table had some level of statistics collected against it and maintained at least semi-regularly depending on the DBA that was responsible for the database. The trouble is that Teradata’s HELP STATISTICS command only shows you the number of unique values for each set of columns containing statistics. But then I remembered that the view I used as part of my stats maintenance routine did have a column for the number of rows. The views that I use as the basis for my stats maintenance routine were written by Dieter Noeth and is available for the general public to use courtesy of the TeradataForum. For the sake of brevity and maintaining a single source for the code I am not going to post the SQL for the views here but you can find them here in the Attachment Library over at the TeradataForum. (It should be noted that there are two different sets of views that support both the 32 bit and 64 bit versions of Teradata. Make sure you use the correct set of views for your installation.)

In order to keep things organized there are three views used to gather the information instead of creating a bunch of derived tables. The view Stats_Data is responsible for collapsing the number of fields into a comma delimited list for the first 16 columns and then extracting the pertinent binary data for the statistics collected on a given set of columns. The second view Stats_Basics_[32/64]bit extrapolates the binary data from the Stats_Data view into individual binary fields that require further refining. The final view Stats_Details composes the binary data into its final numeric and temporal values from which additional reporting and analysis can be done.

With these three views in place the Object_Catalog view could easily be constructed by simply bringing together a view additional data dictionary tables and the Stats_Details view. The Object_Catalog view displays the following attributes about the objects in each database on the system:

  • Database Name – self-explanatory
  • Object Name – self-explanatory
  • Column Count – This is a count of the number of fields for the object in TVFields.
  • Table Rows – The number of rows on the table when the stats were last collected. NULL if stats are not collected.
  • Table Size – The current perm space being consumed by the table.
  • Table Missing Stats – Indicator field for if stats exist on the table. NULL for non-table objects.
  • Table Kind – A single character field categorizing the object type. (e.g. T, V, M, etc.)
  • Table Kind Description – A description of the object type (e.g. Table, View, Macro, etc.)

The SQL for the Object_Catalog view can be found below: Read the rest of this entry »

Teradata Routine Maintenance: Perm Space History November 30, 2009 No Comments

A user over at TeradataQuestions recently asked a if there was a table in Teradata that allowed you to track the growth of tables or databases over time. The short answer to this question was no. The data dictionary tables in Teradata only provider the current information about tablesizes and the current and max sizes of databases. If you want to track this historically for chargeback or capacity planning you need to write your own maintenance routine.

The good news is this routine is fairly straightforward once you figure out what information you want to retain historically. The data dictionary objects that will be involved in this are DBC.Dbase, DBC.TVM, and DBC.DatabaseSpace. Originally, I was going to use the data dictionary view DBC.TableSize but opted to denormalize the table and retain the database level information here as well.

I am using a database called SysDBA in this example. This is where I prefer to keep all my DBA related objects that go above and beyond what Teradata provides in the Data Dictionary database (DBC), keeping my objects separate from their objects as a matter of personal preference. Below you will find the table definition that I am using. I opted to use the same data type that is found in DBC.DatabaseSpace for consistency sake, but that isn’t to say you couldn’t opt to retain the value in BIGINT or DECIMAL. Teradata is tracking the tables sizes in bytes but I have opted to also capture them in megabyte and gigabyte equivalents. I have also decided to retain both the current and max database size on this table as well by leveraging the ALL table. I have excluded PeakPerm from this, but it could easily be included if you feel there is value knowing what the PeakPerm was for a database when the data was capture in the table.

CREATE TABLE SysDBA.PermSpaceHist, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT
     (
      DatabaseName VARCHAR(30) CHARACTERSET LATIN NOT CASESPECFIC NOT NULL,
      TableName VARCHAR(30) CHARACTERSET LATIN NOT CASESPECIFIC NOT NULL,
      TableSize FLOAT FORMAT '----,---,---,---,--9' NOT NULL,
      TableSizeMB FLOAT FORMAT '----,---,---,---,--9' NOT NULL,
      TableSizeGB FLOAT FORMAT '----,---,---,---,--9' NOT NULL,
      DatabaseSize FLOAT FORMAT '----,---,---,---,--9' NOT NULL,
      DatabaseSizeMB FLOAT FORMAT '----,---,---,---,--9' NOT NULL,
      DatabaseSizeGB FLOAT FORMAT '----,---,---,---,--9' NOT NULL,
      MaxDatabaseSize FLOAT FORMAT '----,---,---,---,--9' NOT NULL,
      MaxDatabaseSizeMB FLOAT FORMAT '----,---,---,---,--9' NOT NULL,
      MaxDatabaseSizeGB FLOAT FORMAT '----,---,---,---,--9' NOT NULL,
      CollectTimestamp TIMESTAMP(0) NOT NULL
     )
PRIMARY INDEX PIDX (DatabaseName)
;

I prefer to use a BTEQ script to maintain this table. This allows me to schedule it to run using cron from our ETL server. If you have an enterprise scheduling system or ETL tool that allows you to run custom scripts those would be appropriate locations to run this. This data is currently being captured on a weekly basis as there is no need to track the growth at a daily basis currently.

Below is core SQL to maintain this table.

INSERT INTO SysDBA.TableSizeHist
SELECT D.DatabaseName
     , T.TVMName AS TableName
     , SUM(CASE WHEN TVMId <> '000000000000'XB
                THEN CurrentPermSpace
                ELSE NULL
           END ) AS CurrentTablePerm
     , CurrentTablePerm / 1024000.0000 AS CurrentTablePermMB
     , CurrentTablePerm / 1024000000.0000 AS CurrentTablePermGB
     , SUM(CASE WHEN TVMId = '000000000000'XB
                THEN CurrentPermSpace
                ELSE NULL
           END ) AS CurrentDBPerm
     , CurrentDBPerm / 1024000.0000 AS CurrentDBPermMB
     , CurrentDBPerm / 1024000000.0000 AS CurrentDBPermGB
     , SUM(CASE WHEN TVMId = '000000000000'XB
                THEN MaxPermSpace
                ELSE NULL
           END ) AS MaxDBPerm
     , MaxDBPerm / 1024000.0000 AS MaxDBPermMB
     , MaxDBPerm / 1024000000.0000 AS MaxDBPermGB
     , CURRENT_TIMESTAMP(0)
FROM DBC.DatabaseSpace S
INNER JOIN
     DBC.Dbase D
      ON D.DatabaseId = S.DatabaseId
INNER JOIN
     DBC.TVM T
      ON T.TVMId       = S.TableId
WHERE D.DatabaseId <> '000000000000'XB /* All Database */
ORDER BY 1,2
GROUP BY 1,2
;