Fun with SQL Joins (for us Data Nerds)
Howdy all. I keep looking for inspiration for these articles in my everyday job duties, and since I’ve been up to my eyeballs in SSRS reporting for the last several weeks, I’ve found something that might be a little interesting, and possibly even a bit fun, for the more technically inclined among you.
For those with no direct experience with these things, “SSRS” stands for SQL Server Reporting Services. It’s essentially the built-in reporting functionality for SQL Server. It’s generally installed with SQL Server itself, and can be used by anyone with the proper access and credentials for the database.
When creating SSRS reports, one of the keys to making your report work is the creation of your datasets, from which the report itself will pull and display the data you’re attempting to report. The SSRS report in question may only have one of these datasets, but in my experience at least, most of them have several.
Regardless of the number of datasets, however, each dataset in a given SSRS report is populated via a query, and most queries depend on something called JOINs to link multiple tables in the database together, in order to get data out that both makes sense and meets whatever criteria you’re actually trying to specify in the report.
For example, a very, very simple query with a join in it might look something like this:
SELECT *
FROM Transactions
JOIN Users ON Transactions.Username = Users.Username
WHERE <meets some condition>
Translated into English, what that’s saying is: “Show me all of the data from the Transactions and Users tables, linked together on the Username field, where the data meets the specified condition.”
All fields in the Transactions and Users tables will be returned in the result set for all the records that meet the conditions.
This is a lot of data.
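To see just how much, here’s a toy version of that query run against SQLite (via Python’s sqlite3 module). The table and column names mirror the example above, but the data itself is invented for the demo:

```python
import sqlite3

# Toy illustration of the SELECT * join above; names follow the
# article's example, data is made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Users (Username TEXT, Status TEXT)")
con.execute("CREATE TABLE Transactions (TransactionNumber INTEGER, Username TEXT)")
con.executemany("INSERT INTO Users VALUES (?, ?)",
                [("jdoe", "Faculty"), ("asmith", "Student")])
con.executemany("INSERT INTO Transactions VALUES (?, ?)",
                [(1, "jdoe"), (2, "asmith")])

# SELECT * pulls every column from BOTH tables for each matched row --
# including the join key twice, once from each side.
cur = con.execute(
    "SELECT * FROM Transactions JOIN Users "
    "ON Transactions.Username = Users.Username "
    "ORDER BY Transactions.TransactionNumber"
)
columns = [d[0] for d in cur.description]
rows = cur.fetchall()
print(columns)  # ['TransactionNumber', 'Username', 'Username', 'Status']
print(rows)
```

Even with two tiny tables, every column from both sides comes back for every matched record; scale that up to real tables and the result set gets heavy fast.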
What I started running into in the SSRS reports recently was a need to speed up report generation by being a bit more selective in my SQL Joins. And interestingly enough, it turns out that you don’t actually have to join TABLES. Any data object will do just fine, as long as the syntax is correct, and as long as you reference it properly.
So rather than referencing tables, you can instead reference named subqueries (sometimes called derived tables). A simple example of this would be to replace the Transactions and Users table references in the above query with named subqueries: instead of “Transactions”, use a more targeted Transactions query and name it “TransactionData”, and then do the same for “Users”, named “UsersData”, like this:
SELECT *
FROM
(SELECT Username, TransactionNumber
FROM Transactions
WHERE DocumentType = 'Dissertation')
AS TransactionData
JOIN
(SELECT Username, Status, Department
FROM Users
WHERE Status = 'Faculty')
AS UsersData
ON TransactionData.Username = UsersData.Username
WHERE <meets some condition>
This new query is essentially the same, but the data being returned is limited to ONLY the fields and rows we actually need, rather than both entire tables. Less data means less time and effort for the report to manipulate and display everything.
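Here’s the derived-table version as a runnable SQLite sketch (again with invented data), showing that only rows passing BOTH subquery filters survive the join:

```python
import sqlite3

# Sketch of the named-subquery (derived table) version: each side of
# the join returns only the columns and rows the report needs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Transactions "
            "(TransactionNumber INTEGER, Username TEXT, DocumentType TEXT)")
con.execute("CREATE TABLE Users (Username TEXT, Status TEXT, Department TEXT)")
con.executemany("INSERT INTO Transactions VALUES (?, ?, ?)",
                [(1, "jdoe", "Dissertation"),
                 (2, "jdoe", "Book"),
                 (3, "asmith", "Dissertation")])
con.executemany("INSERT INTO Users VALUES (?, ?, ?)",
                [("jdoe", "Faculty", "History"),
                 ("asmith", "Student", "Physics")])

rows = con.execute("""
    SELECT TransactionData.Username, TransactionData.TransactionNumber,
           UsersData.Department
    FROM (SELECT Username, TransactionNumber
          FROM Transactions
          WHERE DocumentType = 'Dissertation') AS TransactionData
    JOIN (SELECT Username, Status, Department
          FROM Users
          WHERE Status = 'Faculty') AS UsersData
      ON TransactionData.Username = UsersData.Username
""").fetchall()
print(rows)  # only jdoe's dissertation row passes both filters
```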
And of course, once you realize that you can drop in any object for that JOIN in place of a table, you can get as complicated as you need to get at that pesky data, like this:
SELECT DISTINCT
TransactionDataSet.Username AS [Users UserName]
,TransactionDataSet.Status
,TransactionDataSet.Department
,TransactionDataSet.Organization
,(CASE WHEN TransactionDataSet.AuthType = 'Aeon' THEN 'Local' ELSE TransactionDataSet.AuthType END) AS AuthType
,TransactionDataSet.State
,TransactionDataSet.Zip
,TransactionDataSet.Country
,ReadingRoomHistoryDataSet.Location
,ReadingRoomHistoryDataSet.ReadingRoom
FROM
(SELECT DISTINCT Users.Username, Users.[Status], Users.Department, Users.Organization, Users.AuthType, Users.[State], Users.Zip, Users.Country, Transactions.TransactionNumber
FROM
Transactions
INNER JOIN Users
ON Transactions.Username = Users.UserName
INNER JOIN Tracking
ON Tracking.TransactionNumber = Transactions.TransactionNumber
INNER JOIN Queues
ON Tracking.ChangedTo = Queues.ID
LEFT JOIN TransactionAccountLinks
ON Transactions.TransactionNumber = TransactionAccountLinks.TransactionNumber
LEFT JOIN Accounts
ON TransactionAccountLinks.AccountID = Accounts.ID
WHERE
Transactions.DocumentType = @DocumentType
AND Tracking.ChangedDate BETWEEN dbo.ufn_Aeon_ConvertLocalDateTimeToUTC (@BeginDate) AND dbo.ufn_Aeon_ConvertLocalDateTimeToUTC (DATEADD(day, 1, @EndDate))
) AS TransactionDataSet
INNER JOIN
(SELECT ReadingRoomHistory.Username, ReadingRoomHistory.ReadingRoom, ISNULL(ReadingRoomHistory.Location, 'N/A') AS Location
FROM ReadingRoomHistory
WHERE ReadingRoomHistory.TimeIn BETWEEN dbo.ufn_Aeon_ConvertLocalDateTimeToUTC (@BeginDate) AND dbo.ufn_Aeon_ConvertLocalDateTimeToUTC (DATEADD(day, 1, @EndDate))
AND ('N/A' IN (@ReadingRoom) OR ReadingRoomHistory.ReadingRoom IN (@ReadingRoom))
) as ReadingRoomHistoryDataSet
ON ReadingRoomHistoryDataSet.Username = TransactionDataSet.Username
See, I told you it was fun. 😊
- Kevin
Comments
I have started using SQL Server Reporting Services for our statistics and reporting needs because I was unable to get reliable use out of the connection we had with Microsoft Access. I have a couple of questions:
I am using Microsoft Report Builder 3.0 because I wasn’t able to make a connection using the newer edition of Report Builder. Did I do something wrong? Am I stuck with Microsoft Report Builder 3.0?
I was wondering if anyone has datasets or reports they are proud of, and would be willing to share them?
I am trying to answer the question: what is our processing time, past and present? Have you solved this question yet?
I use this report in Ares. It gets me some information about all the items, and the start date/time of each item. For the base of the report I have this query:
SELECT
Items.ItemID
,Courses.CourseID
,Courses.Semester
,Items.DocumentType
,Items.ProcessingLocation
,Items.PickupLocation
,Items.Callnumber
,Items.CurrentStatus
,Items.CurrentStatusDate
,COUNT(ItemHistory.Entry) AS Usage
,StartTime.TrackingDateTime AS StartTime
FROM
Items
INNER JOIN Courses
ON Items.CourseID = Courses.CourseID
INNER JOIN ItemHistory
ON Items.ItemID = ItemHistory.ItemID
INNER JOIN (SELECT
ItemTracking.ItemID
,ItemTracking.TrackingDateTime
,ItemTracking.Status
FROM
ItemTracking
WHERE
ItemTracking.Status IN ('Item Submitted', 'Item Submitted via Cloned Course', 'Item Submitted via Item Cloning')
) StartTime
ON Items.ItemID = StartTime.ItemID
GROUP BY
Items.ItemID
,Courses.CourseID
,Courses.Semester
,Items.DocumentType
,Items.ProcessingLocation
,Items.PickupLocation
,Items.Callnumber
,Items.CurrentStatus
,Items.CurrentStatusDate
,StartTime.TrackingDateTime
ORDER BY
Courses.CourseID
,Courses.Semester
,Items.ItemID
At least I think that is what it is doing; I haven't had this query checked at all.
When looking for an end time I considered a couple of problems. Many items are done processing once they are placed in the Pending Activation queue. The remaining items are either completed before the end of the semester or not. Of the items that are complete, many have duplicate Available statuses.
One scenario is that the item was made Available, then work was done on that item, and then it was made Available again. In that scenario, the max Date/Time is used.
Another scenario is when an item is made Available on the reserve shelves and then made Available on electronic reserves, or vice versa; in that scenario the max Date/Time is also used.
However, when an Available status is clicked more than once, the minimum Date/Time is used. And you can't just use the max Date/Time for everything except instances of scenario 3, because potentially all of these instances satisfy the scenario 3 parameters.
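Those scenarios can be sketched with a toy SQLite table (data invented): MIN and MAX per item are trivial to compute, but nothing in the rows themselves says which one is the right "end", which is exactly the problem.

```python
import sqlite3

# Item 10: made Available, reworked, made Available again -> want MAX.
# Item 20: the same Available status clicked twice in a row -> want MIN.
# The rows alone look identical in shape, so the rule can't be read
# off the data directly. All values here are invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ItemTracking "
            "(ItemID INTEGER, TrackingDateTime TEXT, Status TEXT)")
con.executemany("INSERT INTO ItemTracking VALUES (?, ?, ?)", [
    (10, "2018-01-05 09:00:00", "Item Available at Reserve Desk"),
    (10, "2018-01-09 16:00:00", "Item Available at Reserve Desk"),
    (20, "2018-01-06 10:00:00", "Item Available at Reserve Desk"),
    (20, "2018-01-06 10:00:05", "Item Available at Reserve Desk"),
])

summary = con.execute(
    "SELECT ItemID, MIN(TrackingDateTime), MAX(TrackingDateTime) "
    "FROM ItemTracking GROUP BY ItemID ORDER BY ItemID"
).fetchall()
print(summary)
```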
What have you used to get the end date time of processing an item?
Todd,
Apologies for the response delay here. I need to do some testing in order to answer the Report Builder Version questions, but hopefully this will help you a bit with the query portion of things.
Processing Turnaround Time:
Turnaround time can be one of the hardest sorts of reports to make work correctly, in that you have to reference multiple time values, for different events, within the same sort of "unique transaction" reporting structure.
This is especially true for something like Ares, where different sites will often have very different workflows, and they will thus require very different turnaround time queries to pull data specific to whatever their particular process might be. For that reason, there isn't yet a default Ares Turnaround time report.
It looks like you're on the right track though with your query:
SELECT
Items.ItemID
,Courses.CourseID
,Courses.Semester
,Items.DocumentType
,Items.ProcessingLocation
,Items.PickupLocation
,Items.Callnumber
,Items.CurrentStatus
,Items.CurrentStatusDate
,COUNT(ItemHistory.Entry) AS Usage
,StartTime.TrackingDateTime AS StartTime
FROM
Items
INNER JOIN Courses
ON Items.CourseID = Courses.CourseID
INNER JOIN ItemHistory
ON Items.ItemID = ItemHistory.ItemID
INNER JOIN (SELECT
ItemTracking.ItemID
,ItemTracking.TrackingDateTime
,ItemTracking.Status
FROM
ItemTracking
WHERE
ItemTracking.Status IN ('Item Submitted', 'Item Submitted via Cloned Course', 'Item Submitted via Item Cloning')
) StartTime
ON Items.ItemID = StartTime.ItemID
GROUP BY
Items.ItemID
,Courses.CourseID
,Courses.Semester
,Items.DocumentType
,Items.ProcessingLocation
,Items.PickupLocation
,Items.Callnumber
,Items.CurrentStatus
,Items.CurrentStatusDate
,StartTime.TrackingDateTime
ORDER BY
Courses.CourseID
,Courses.Semester
,Items.ItemID
You've got the StartTimes figured out (Nice subquery for StartTime, BTW), and you're pulling pretty much all you would need for the initial data set.
Note: I'm defining "Initial Data Set" in this case to mean the initial set of Item records to be looked at for this report/query and from which any results will be generated. That may include all items in the table (as in this case) or it might be refined further to be everything within a specific date range, everything that meets some set criteria (whatever that might be beyond dates), or both.
As an aside, rather than having to run this query in Report Builder (whatever version) every time, you can also set up some date (and other) parameter variables in the query itself by adding something like this to the top of the query:
DECLARE @BeginDate DateTime, @EndDate DateTime;
-- EDIT THESE VALUES BEFORE RUNNING SCRIPT --
SET @BeginDate = '2017-12-01 00:00:00.000';
SET @EndDate = '2018-03-31 00:00:00.000';
-- EDIT THESE VALUES BEFORE RUNNING SCRIPT --
Then just reference the variables (rather than the applicable set values) lower in the query. This essentially allows you to set up parameter values in Report Builder and not have to modify the query with specific values before running it by itself in SQL Management Studio every time you need to. Just be sure not to copy this parameter-setting portion of the query into Report Builder.
Report Builder will use the same variables you set up in the query (assuming you tell it to, of course), so you can run it faster (in either program) with a smaller data set for a specific time frame, rather than having to look at all your data all at once every time.
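The same parameterization idea carries over to just about any database driver: bind the date range once and reuse it, which is all the DECLARE/SET block is doing. Here's a small SQLite/Python sketch of the pattern, with an invented table and invented dates:

```python
import sqlite3

# Sketch of the "set the date range once" idea using sqlite3's
# named parameters; table contents are invented for the demo.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Tracking (TransactionNumber INTEGER, ChangedDate TEXT)")
con.executemany("INSERT INTO Tracking VALUES (?, ?)",
                [(1, "2017-11-15 10:00:00"),
                 (2, "2018-01-20 09:30:00"),
                 (3, "2018-05-02 12:00:00")])

# The range lives in one place; the query body never changes.
params = {"BeginDate": "2017-12-01 00:00:00",
          "EndDate": "2018-03-31 00:00:00"}
rows = con.execute(
    "SELECT TransactionNumber FROM Tracking "
    "WHERE ChangedDate BETWEEN :BeginDate AND :EndDate",
    params,
).fetchall()
print(rows)  # only the transaction changed inside the window
```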
Regardless though, the next step in designing your report is going to be to determine what your StopTime values are going to be. As you say, depending on how your staff is processing everything, there may be any number of different statuses that your operation defines as "End Statuses" for completed processing.
I ran a quick query against your ItemTracking data and weeded out the obvious non-finished statuses (Awaiting X, In X Processing, etc.), like this:
SELECT DISTINCT Status FROM ItemTracking
WHERE Status NOT LIKE 'Awaiting%'
AND Status NOT LIKE '%Processing%'
AND Status NOT LIKE '%Cancelled%'
AND Status NOT LIKE '%Missing%'
AND Status NOT LIKE '%Removed%'
AND Status NOT LIKE '%Submitted%'
AND Status NOT IN ('In Stacks Searching')
and I see the following POSSIBLE finished statuses:
Check for Proxy
Hardcopy Future Item Activation
In Supply by Instructor
Item Activation Pending
Item Available at Reserve Desk
Item Available on Electronic Reserves
Note: I'm just guessing here based on knowledge of other Ares systems, what specific statuses exist in the table, and how each one is named. Tweak things accordingly for whatever YOU think should be a finished processing status, obviously, to make your determination.
Once you have your list, you'll want to add in another subquery referencing ItemTracking data and named something like EndTime, referenced in the same way as StartTime. This new subquery will need to limit itself to just your chosen end statuses, return a single end Date/Time per item, and pick the right row when an item shows more than one possible finish.
Of those requirements, identifying the initial "finish" when multiple possible statuses are present is going to be the hardest task.
There are probably several ways to do that (both in the query and in Report Builder itself), but the first thing that comes to mind would be a row number/partition approach like this:
SELECT
ItemTracking.ItemID
,ROW_NUMBER() OVER(PARTITION BY ItemTracking.ItemID ORDER BY ItemTracking.TrackingDateTime) Position
,ItemTracking.TrackingDateTime
FROM
ItemTracking
WHERE
ItemTracking.Status IN ('Check for Proxy', 'Hardcopy Future Item Activation', 'In Supply by Instructor', 'Item Activation Pending', 'Item Available at Reserve Desk', 'Item Available on Electronic Reserves')
ORDER BY ItemID
Obviously you'll want to tweak that to use the status list you select, and for whatever other logical changes you might want to make, but that should give you a relatively easy hook for the rest of the query on that "Position" value, as it will return data that looks like this:
ItemID Position TrackingDateTime
73 1 2007-02-28 13:31:31.000
73 2 2007-02-28 14:33:43.000
74 1 2007-04-13 09:48:19.000
and so forth.
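For anyone who wants to poke at the partitioning idea outside of SQL Server, here's a runnable sketch using SQLite's window-function support (SQLite 3.25+), reusing the ItemIDs and timestamps from the sample output above; the statuses are invented:

```python
import sqlite3

# ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) numbers each
# item's candidate "finish" rows in time order, so Position = 1 is
# the item's FIRST possible finish. Data mirrors the sample output.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ItemTracking "
            "(ItemID INTEGER, TrackingDateTime TEXT, Status TEXT)")
con.executemany("INSERT INTO ItemTracking VALUES (?, ?, ?)", [
    (73, "2007-02-28 13:31:31", "Item Available at Reserve Desk"),
    (73, "2007-02-28 14:33:43", "Item Available on Electronic Reserves"),
    (74, "2007-04-13 09:48:19", "Item Available at Reserve Desk"),
])

rows = con.execute("""
    SELECT ItemID,
           ROW_NUMBER() OVER (PARTITION BY ItemID
                              ORDER BY TrackingDateTime) AS Position,
           TrackingDateTime
    FROM ItemTracking
    ORDER BY ItemID, Position
""").fetchall()
for row in rows:
    print(row)
```

Filtering on Position = 1 (or the max Position, per the scenarios discussed earlier) then gives you one row per item to hook the rest of the query on.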
And again, there MAY be better ways to accomplish this. This was just the first thing that occurred to me.
And lastly, I mentioned the option of using date ranges above rather than looking at the entire data set all the time.
If that's something that appeals to you, it's absolutely critical to determine exactly to what that date range applies, so there's no confusion down the road. For example, in the range I listed above:
SET @BeginDate = '2017-12-01 00:00:00.000';
SET @EndDate = '2018-03-31 00:00:00.000';
We need to know whether we're looking for Items that STARTED during that range, or Items that were FINISHED during that range.
Either is potentially relevant depending on what you're after here, so just be sure that it's specifically stated somewhere on the report (or even in notes within the query, or both) what exactly you're looking at. Just remember that the person running the report in the end may not be someone who participated in its design.
Hope that helps, and I'll post again regarding the Report Builder question once I've done a little more testing.
- Kevin
Todd,
There was apparently a new version of Report Builder released last month that's compatible up to SQL 2016, but I don't think it's backward compatible with your existing report server functionality.
Both your ILLiad and Ares systems appear to be running on SQL 2008 R2, and when I try to point to an existing report on that SQL Version, I get the following error:
The report definition is not valid or supported by this version of Reporting Services. This could be the result of publishing a report definition of a later version of Reporting Services, or that the report definition contains XML that is not well-formed or the XML is not valid based on the Report Definition schema. Details: The report definition has an invalid target namespace 'http://schemas.microsoft.com/sqlserver/reporting/2016/01/reportdefinition' which cannot be upgraded.
I'm thinking that the back-end Reporting Services (and likely SQL Server itself) probably needs to be updated for this Report Builder version to work for you, I'm afraid.
- Kevin