SQL Server Tricks and Tips from a Self-Proclaimed Data Nerd

Using COLLATE to FIND differences in field value Case:

So....    How many times have you run into an issue where you have Customer-entered values in the database (or maybe even Staff-entered.  Shhh.  Don't tell anyone.) and for whatever reason, the values entered into the field in question have the wrong case?  

Or perhaps you're running into a slightly more critical issue where the system actually behaves differently if a value being looked for is capitalized versus not.

Regardless of what's happening or why, if data of this nature needs fixing in the database, the proximate question posed to the "fixer" is going to be, "How to do I actually FIND the values in need of fixing?"

One means of doing this (and please note, there are many, many possible roads to take here) is using the COLLATE function.

For example, a standard (and very, very basic) SQL query to look at your existing User Status values would look like this:

   SELECT DISTINCT Status
   FROM Users

The results of this query will look something like this:

   facstaff
   grad
   GRADUATE
   misc
   None
   STAFF
   test
   udg
   UNDERGRADUATE

But the problem here is that, to SQL Server, unless specifically looked for, case is ignored.  

So the value "staff" and the other value "STAFF" (and for that matter, even "Staff") will be seen as identical by the above query, and, in the case above, it displays in the query result set as "STAFF".

But are all of the Users table Staff values being returned there actually in all caps?  We have no way of knowing unless we query specifically for that case.  

We can do that using the COLLATE function.  So if we use this slightly more complicated query instead of the original:

   SELECT DISTINCT Status COLLATE sql_latin1_general_cp1_cs_as
   FROM Users

, we instead get results that look like this:

   Staff
   STAFF
   grad
   None
   facstaff
   misc
   test
   udg
   Graduate
   GRADUATE
   UNDERGRADUATE

Notice that "staff" now appears in the list twice, first as "Staff", and then again as "STAFF".  That tells us that we have two different case variations of that value in the table.  And please note:  Had "staff" or "staFF" also existed in there, those two case-specific values would also be listed in the results set.

We also see that "graduate" exists in the list twice, once as "Graduate" and then again as "GRADUATE". 

So, problems found.  We now know what might potentially be in need of fixing.  How do we then fix them?  

Well, to paraphrase a favorite cooking show host, that's another article, but hopefully this will give you an introduction of sorts to the COLLATE function. 

Until next time.....

- Kevin

 

"Apologies, Captain. I seem to have reached an odd functional impasse. I am, uh ... stuck."

--DATA, Star Trek: The Next Generation, "The Last Outpost"

 

ADDITIONAL RESOURCES:

https://docs.microsoft.com/en-us/sql/t-sql/statements/collations

https://docs.microsoft.com/en-us/sql/relational-databases/collations/set-or-change-the-database-collation

http://stackoverflow.com/questions/5039211/what-does-collate-sql-latin1-general-cp1-ci-as-do

2

Comments

1 comment
  • Thanks!  This is very helpful.  Will try it myself.  Our statuses are kind of a mess and I'd love to clean them up, but the only way to do so is manually.  Looking forward to your next post!

     

     

    0
    Comment actions Permalink

Please sign in to leave a comment.

Didn't find what you were looking for?

New post