7 of 10 isn’t so bad.
In an earlier post we started generating prime candidates testing only numbers of the form 6k+1 or 6k-1 less than or equal to the square root of the next integer. The reason we were doing this is that all prime numbers greater than 3 are of the form 6k+1 or 6k+2 and we only have to modulo test primes. The reason we only have to test primes is that the Fundamental Theorem of Arithmetic indicates that all non-prime numbers greater than one is the product of primes.
What that means is that we only have to modulo test prime numbers. Of course in pursuing that particular optimization we actually neglected to test first whether the next integer we were testing was of the form 6k+1 or 6k-1. The point of this post is to highlight how in the course of developing SQL (or any other code for that matter), we can read our requirements in a number of ways and will need to step out of it and reconsider whether our first reading was complete or even correct.
Consider the following statements: “We only need to modulo test primes for a given integer.” and “We need to test whether or not their exists a prime factor for a given integer.” These two statement mean essentially the same thing. When combined with the form of primes being 6k+1 or 6k-1 led us to skip testing the integer itself for being of that form, but assume we have already filled a prime table to a certain number and are testing the prime candidates. Those two statements can lead us to two very different approaches to SQL.
The first statement: “We only need to modulo test primes for a given integer.” could lead us to write something like:
DECLARE @n INT
SELECT TOP 1 1 FROM prime_candidates pc INNER JOIN primes p ON pc.p % p.p = 0 WHERE pc.p=@n
Whereas the second statement: “We need to test whether or not their exists a prime factor for a given integer.” might lead us to write something like:
SELECT TOP 1 1 FROM primes p WHERE @n % p = 0
Since our requirement was to test for prime factors of a given integer and not to produce a set of prime factors performing we do not need the non-equijoin that produces a less efficient execution plan.
Seeks that lead to lookups are usually more efficient than scans. Of course unnecessary operations are inherently inefficient. In either case it is important to test that we do in fact get the results we expect.
It seems that everyone has an opinion on naming conventions. Microsoft even has a few recommendations. You would think with 256 (or 128 or 30) characters it would be simple to devise an apt system for describing database objects.
SELECT TYPE_NAME(system_type_id) [base_type], max_length FROM sys.types WHERE name = ‘SYSNAME’;
Of course, we might not want to use our screen real estate on object explorer. We also might not want to use that real estate up for a single column name (especially if that column was a bit flag). Fortunately, we have options. In addition to being able to use management studio to filter objects, we can us schemas to group objects and alias data types to categorize.
An example case for using schemas might be to create separate schemas for staging updates and deletes in a CDC based warehouse loads. An example for using alias data types might be differentiate between date types that are application data and date types that are row level metadata (e.g. inserted or updated date).
Earlier this year I posted a series of posts about prime numbers to demonstrate coding principles and techniques. As recursion is typically a very efficient approach and in my last post on the subject I stated that we could create a stored procedure to call our stored procedure recursion that pulls from our function recursion. Of course as a parent procedure that procedure would contribute to the maximum procedure nesting of 32. That approach would look something like this:
-- ============================================= -- Author: Thomas Marshall -- Create date: 4/23/2015 -- Description: -- ============================================= CREATE PROCEDURE iteratingParent3100PrimeTestCalls @max_iterations INT = 0, @demo_override INT = 2 AS BEGIN SET NOCOUNT ON; DECLARE @n INT = 1; WHILE @n <= @max_iterations BEGIN EXECUTE [dbo].[thirtyTwoHundredPrimalityTests] @demo_override --2 SET @n += 1; END END GO
With the default value for the max iterations parameter nothing will happen. If we pass the demo override parameter a value of 1 or less we will get an error because our procedure will nest more than 32 levels. As you may know iteration is dramatically less efficient than recursion. If you didn’t already guess from the title of this post we are going to try and get more out of the recursive part. What’s great is just how easy it is.
First we are going to make a variation on our recursive CTE function. You’ll notice the only difference from the hundred more prime tests function besides the name is that we restrict our CTE to 2 to the power of 15 rows. It’s important to note that the query inside the function will throw an error without OPTION (MAXRECURSION 23767) and the create statement will error with it. It’s tangential to note that (2^15)-1 is the largest signed 16 bit integer and the zip code for Paisley Florida.
CREATE FUNCTION [dbo].[sixteenBitsMorePrimeTests](@n int) RETURNS TABLE AS RETURN ( WITH recurseMore AS ( SELECT @n AS p, 1 AS n UNION ALL SELECT p + 1 , n + 1 FROM recurseMore WHERE n < 32768 ) SELECT p, n FROM recurseMore WHERE dbo.intIsPrime(p) = 1 ) GO
Fortunately for us we can specify the OPTION (MAXRECURSION 32767) in the call to the procedure so we can avoid erring on the recursion.
-- ============================================= -- Author: Thomas W Marshall -- Create date: -- Description: -- ============================================= CREATE PROCEDURE [dbo].[thirtyTwoTimeTwoToThePowerOfSixteenPrimalityTests] @c INT = 1 AS BEGIN SET NOCOUNT ON; DECLARE @n INT; SELECT @n = MAX(p) FROM dbo.primes; INSERT INTO dbo.primes SELECT p FROM dbo.[sixteenBitsMorePrimeTests](@n) OPTION (MAXRECURSION 32767); SET @n = @c + 1; IF @n < 32 EXEC thirtyTwoHundredPrimalityTests @n ; END GO
Now if we like we can have our new procedure called iteratively to keep inserting primes into our table. The hardest part should be naming the parent procedure.
If you are like me it’s not enough for you dynamic SQL to just be executed at design time, you have to make time to design the code you are dynamically building. What am I talking about? The first part is about not executing dynamic SQL as part of a data tier application wherever possible. That is to say dynamic SQL is not going to be able to reuse cached plans the way static and parameterized SQL can. The second part is about including formatting in your dynamic SQL, at least enough to make it readable.
Using dynamic SQL to write your code can be a huge time saver when you have a lot of code to write. It can also be a huge time suck when you have debug, optimize, or just get it all into one string ((N)VARCHAR field).
Dynamic SQL can make debugging challenging for a number of reasons. For starters Management Studio will be unable to detect a syntax error until its executed. Even after it executes you won’t be able to double click on the error message and jump to the line in the query editor. You can however use GO to separate your code into batches which double clicking the error will jump to beginning of. It is also difficult to divine from the error message (ERROR_LINE()) which line is actually causing the error.
Let’s not get sidetracked. If you want to noodle on those debugging consider this screenshot from SQL Server Management Studio 2014. If I were going to go off on a tangent it might be about the fact that the GO in the dynamic string does not behave the way it does outside of the string. That is to say the entire string fails in spite of the GO where only the batch containing the error does in the static SQL. And don’t get me started on how the execution plan calls a script a batch when determining cost relative to batch. Also worth noting is that the breakpoint encompasses both statements and the assignment where with static SQL a single statement is broken on.
When it comes to optimization of dynamic SQL I am going to simply link some reading material and comment. Jeremiah Peschka at Brent Ozar Unlimited has a great post about filtered indexes that kind of contradicts what I am saying about optimizing dynamic SQL. I say kind of because Jeremiah is using dynamic SQL to take advantage of a filtered index he created when a parameter might cause it to be skipped over. Andrew Kelly at SQL Server magazine mentions in this article that compiling is cpu intensive, he also goes on to show how sp_executesql allows us to cache plan. Derik Hammer explains and demonstrates the need to match the SQL text in order to get plan reuse. Turgay Sahtiyan echoes Derek, plus he is a Senior Premier Field Engineer (I was almost a PFE once). And last but certainly not least, Kimberly Tripp gave us some code to play with and a vigorous discussion with here post on multipurpose procedures. Of course, for all that good stuff we can use to avoid using dynamic SQL that is bad at execute time, our monitoring tools are probably only going to show us the generated SQL. As I tend to fall into the DBA roll, I make it a point to always attempt to optimize without changing the code that executes… at least when I don’t own it. But that is a topic for a whole other discussion perhaps on whether you can separate workloads into separate instances for ad hoc and prepared workloads.
The last obstacle is a matter of practical consideration. When we have multiple statements we might build and execute them separately. When we have nesting strings building dynamic SQL for us we might find it easier to have parts of our code in separate columns that copy as a row and paste as a single script. But what about when we are building dynamic strings that roll up some one to many values like columns and tables. Are we going to have to paste one long line and then format it? That would be another time suck to avoid. Enough teasing the title of the post said we were going to show an example of FOR XML and STUFF.
Well, that example makes a single line, and I don’t feel like formatting myself. Sure, I could use a tool to auto-format but I don’t have it installed and I really want to automate the generation of these scripts and output them directly to script files. What if we try to stick in carriage return and a line feed?
What the heck is this weird emoticon? Is it a winking fish or something? No, it’s proof enough that the output of the FOR XML PATH statement is typed as XML. That is to say its an XML escape character. Specifically it’s the XML break character for carriage return. What happened to the line feed? They don’t use it paired with carriage return in the UTF world so SQL Server was goodly enough to drop it. You can try using NCHAR instead of CHAR but the results are the same. What if we use the break character?
Darn, now the ampersand is being escaped. Ok, we’ll just have to replace it.
There you have it. FOR XML and STUFF you might not know.
This past week the hash tags were all chirping about how challenging this logic puzzle was. It seems that Singaporean middle school students are posed with a question that challenges their deductive reasoning as well as their ability to remain objective even when the posing of the question invites them to assume a subjective position within the problem. Technically, it is a contest question for the Singaporean and Asian School Math Olympiad.
You the solver are shown that one of two characters in the problem possesses enough information that they can deduce that the other does not in fact possess enough information to deduce the solution. This fact when announced provides the second character with enough information that he claims to now be able to deduce the answer. That proclamation of deducibility incites the first character to then declare that he too has deduced the answer.
The problem goes something like this:
Cheryl gives a list of dates that could be her birthday to Albert and Bernard. She tells Albert the month and Bernard the day. Albert declares that he does not know the date but that he does know that Bernard does not. Bernard declares that he didn’t know the date but now does. Albert then declare that he too knows. So what is Cheryl’s birthday? Since SQL is my hammer let’s see if this is a nail.
For starters, we have a short list of dates so let’s put them into a table variable.
Next, we can select those dates not in months with days that are unique to a month. We do this because if it were a day unique to the month it would be possible for Bernard to determine which. Albert has decided is impossible (we are assuming these characters are not wrong in their deductions).
After that we select those dates not sharing a day. This is because if it were one, it would be impossible to have deduced the answer knowing only the day and now that it is not in a month with a unique day. It was this second piece of information that enabled Bertrand to deduce the date already knowing the first.
What remains are three dates only one of which is unique to the month, which is a requirement to deduce it knowing only the month and that solution can be deduced knowing the day and that it is not in a month with a unique day. Gosh, Albert and Bernard are smart.
DECLARE @candidates TABLE (rowid INT IDENTITY(1,1), candiDATE DATE);
INSERT INTO @candidates VALUES
(‘May 15 2015’),(‘May 16 2015’),(‘May 19 2015’),
(‘June 17 2015’),(‘June 18 2015’),
(‘July 14 2015’),(‘July 16 2015’),
(‘August 14 2015’),(‘August 15 2015’),(‘August 17 2015’);
WITH COUNTS AS (
SELECT candiDATE, DATEPART(DAY, candiDATE) [Day], DATEPART(MONTH, candiDATE) [Month],
OVER (PARTITION BY DATEPART(DAY, candiDATE)) [IfItWereAMonthWithAUnqiueDayBernardCouldKnow]
) , NEXTCOUNT AS (SELECT candiDATE,
DATEPART(DAY, candiDATE) [Day],
DATEPART(MONTH, candiDATE) [Month],
OVER (PARTITION BY DATEPART(DAY, candiDATE)) [IfItWasNotAUniqueDayRemainingBernardWouldNotKnow]
FROM @candidates WHERE DATEPART(MONTH, candiDATE) NOT IN (SELECT [Month] FROM COUNTS WHERE [IfItWereAMonthWithAUnqiueDayBernardCouldKnow]=1)
SELECT candiDATE, COUNT(DATEPART(DAY, candiDATE))
OVER (PARTITION BY DATEPART(MONTH, candiDATE)) [IfBernardDidnotKnowItWouldnotBeUniqueTotheMonth]
FROM @candidates WHERE candiDATE IN
(SELECT candiDATE FROM NEXTCOUNT n WHERE n.[IfItWasNotAUniqueDayRemainingBernardWouldNotKnow]=1)
This morning while discussing things vaguely reminiscent of work with a colleague of mine the topic of Russell’s paradox was broached. Russell’s paradox relies the concept of a set of sets that do no contain themselves as members. It is a paradox because if that set does not contain itself it satisfies the requirement of membership and then must therefore contain itself, which would be a violation of the rules required for membership. Long story short humanity innovated and a guy named Ernst Zermelo contributed some rules that don’t allow such paradoxes. Zermelo actually discovered the paradox first but didn’t publish. That was confirmed a guy named David Hilbert who drew a really great looking space filling curve with a few simple steps repeated infinitely. The talk of Russell’s paradox prompted me to write the following SQL:
SELECT * INTO #temp FROM tempdb.sys.tables;
SELECT * FROM #temp;
I then posed the question: Will #temp appear as a member of the rows returned from #temp?
You can see an instance of #temp does in fact appear. The suffix appended to it uniquefies the name. This is what allows for multiple temp tables of the same name to be created. We can see this by running the above query in another tab (which opens another session).
The reason that we don’t see the table excluded from the results set is that in order for the temp table to hold the data from the sys.tables catalog view it must exist. And if it exists it will appear in the catalog. This helps to illustrate that SQL is not (always) a procedural language, or that INTO does not logically mean CREATE and INSERT.
SQL is usually typified of a declarative programming language. Being a declarative programming language means that we are not telling the database how to do what we want, only what we want. In this case we told it we want a table that matches the definition of sys.tables and we want it filled with the contents of sys.tables. Contrast this with the having asked SQL to take the contents of sys.tables and put it into a table that matches the definition of sys.tables. In the first case we have to first make the table which puts it into sys.tables, in the second we take the contents of sys.tables and then put it into our newly created table.
This might become clearer if we examine a query plan.
We can clearly see from the plan that sys.tables is a view that draws from the inaccessible mssqlsystemresource system database. It might look like all of those branches contribute to building the table to insert into, but that is just where the data comes from. A simpler select into plan can be seen if we select from a table valued function (TVF).
Here we can see that the Table Insert showplan operator occurs before the Select Into operator. That doesn’t actually create the table, though. The table is created by a Data Definition Language (DDL) operation. DDL operations don’t produce showplan operators. So there you have it, moderately conclusive proof that the table does exist before the read of data to insert occurs, and some vague syntactic hints at why.