Tuesday, August 19, 2008

Hidden Power of UNION

I used to think that UNION was useful only when you needed to combine the results of two queries from different sources (e.g., tables, views) into one result set. However, using UNION is sometimes the quickest way to select from just one table.

Suppose you need to retrieve all OrderIDs from the Northwind Orders table where the CustomerID is VICTE or the EmployeeID is 5:

SELECT OrderID FROM Orders WHERE CustomerID = 'VICTE' OR EmployeeID = 5

The Orders table has indexes on the CustomerID and EmployeeID columns, but SQL Server doesn't use them to execute the SELECT statement, as the following execution plan shows:

|--Clustered Index Scan(OBJECT: ([Northwind].[dbo].[Orders].[PK_Orders] ), WHERE:([Orders].[CustomerID]='VICTE' OR [Orders].[EmployeeID]=5))

Instead, SQL Server scans the clustered index PK_Orders. SQL Server uses this plan because the OR operator in the query's WHERE clause means a row can satisfy either condition, so no single index seek can cover the predicate and SQL Server must scan the table. Scanning a clustered index in this case is almost the same as scanning the entire table; the server goes through the table record by record and checks for the specified predicate.

To improve the query's efficiency, you could create a composite index on CustomerID and EmployeeID:

CREATE INDEX IdxOrders001 ON Orders (CustomerID, EmployeeID)

The Index Tuning Wizard advises you to create just such an index to improve performance of the SELECT statement. SQL Server can scan the new, narrower index instead of the clustered index, reading far fewer pages to return the required result. The estimated execution plan for the new query shows that SQL Server uses the composite index:

|--Index Scan(OBJECT:([Northwind].[dbo].[Orders].[IdxOrders001]),
   WHERE:([Orders].[CustomerID]='VICTE' OR [Orders].[EmployeeID]=5))

But an employee can make thousands of deals with thousands of customers. So even if you create a composite index, SQL Server will need to scan a range of records in the index and won't perform a seek operation, which uses indexes to retrieve records and is the fastest way for SQL Server to find information.

You need to make SQL Server use a seek operation instead of a scan, but SQL Server won't perform a seek when an OR operator is in the WHERE clause. You can solve this dilemma by using UNION to rewrite the SELECT statement:

SELECT OrderID FROM Orders WHERE CustomerID = 'VICTE' UNION SELECT OrderID FROM Orders WHERE EmployeeID = 5

SQL Server 2000's estimated execution plan for this statement is

|--Merge Join(Union)
   |--Index Seek(OBJECT:([Northwind].[dbo].[Orders].[CustomersOrders]),
      SEEK:([Orders].[CustomerID]='VICTE') ORDERED FORWARD)
   |--Index Seek(OBJECT:([Northwind].[dbo].[Orders].[EmployeesOrders]),
      SEEK:([Orders].[EmployeeID]=5) ORDERED FORWARD)

This execution plan looks longer than the original one, but both operators are index-seek operators. SQL Server doesn't use the composite index; instead, it uses two single-column indexes. You might think that two seek operations would cost more than one seek, but performance improves when you use this method. You can check performance by using SQL Trace to analyze the three versions of the SELECT statement. For the UNION query, my SQL Server 2000 system performed four reads to return the result. The original SELECT query required 23 reads, and the version that used the composite index required 11 reads.

This special use of UNION can help you avoid the OR operator's slow performance, but use it carefully. If you don't have the appropriate indexes (CustomersOrders and EmployeesOrders, in this example), you can double-scan the table.
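The plan above relies on Northwind's single-column indexes CustomersOrders (on CustomerID) and EmployeesOrders (on EmployeeID), which ship with the sample database. If your copy is missing them, a sketch of creating them (index names follow Northwind's defaults):

```sql
-- Assumes the standard Northwind Orders table; create these only
-- if the default single-column indexes are missing.
CREATE INDEX CustomersOrders ON Orders (CustomerID)
CREATE INDEX EmployeesOrders ON Orders (EmployeeID)
```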

Sharpen Your Basic SQL Server Skills: Database backup demystified

Q: What are the types of backup methods in SQL Server?
A: SQL Server has four backup methods: full, transaction log or incremental, differential, and file or file group. You should always perform a full backup, in which you make a complete copy of the database. The transaction log or incremental backup method copies all the modifications made to a database. A differential backup makes a copy of all the changes in a database since the last full backup. The file or file group backup method copies the actual database files on disk.

A full backup quickly restores a database to its original state and is the baseline for the other types of backups. A full backup contains all of a database's data, structures, and security objects, plus the portion of the transaction log needed to restore any changes made to the database while it's being backed up. A database restored from a full backup doesn't contain any uncommitted transactions (transactions that have begun but haven't yet been committed, either implicitly through autocommit or explicitly with a COMMIT statement). The following code sample shows a full backup:

BACKUP DATABASE AdventureWorks
TO DISK = 'C:\full_bk_AdventureWorks.bak'

The transaction log or incremental backup helps recover a database to its latest working condition. To use this method, you need to set the database to the full or bulk-logged recovery model. To restore a database to a point in time, you must have an unbroken chain of transaction log backups. You need to restore the database from a full backup before you can restore from a transaction log backup. Finally, you must restore each transaction log backup in the order in which it was taken. The following code sample shows a transaction log backup:

BACKUP LOG AdventureWorks
TO DISK = 'C:\log_bk_AdventureWorks.trn'
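To illustrate the restore order just described, here's a minimal sketch (file names match this article's samples; a real recovery may also involve a tail-log backup or a STOPAT clause):

```sql
-- Restore the full backup first, leaving the database in the
-- restoring state so log backups can be applied afterward.
RESTORE DATABASE AdventureWorks
FROM DISK = 'C:\full_bk_AdventureWorks.bak'
WITH NORECOVERY

-- Apply each transaction log backup in the order it was taken;
-- recover the database only with the final one.
RESTORE LOG AdventureWorks
FROM DISK = 'C:\log_bk_AdventureWorks.trn'
WITH RECOVERY
```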

Performing a differential backup saves time and media space compared with performing a full backup. If the database is set to full-recovery mode, restoring the latest differential backup restores the database state to the time the last differential backup was completed. A differential backup can’t be restored independently; it can be restored only after a full backup is restored. The following code sample shows a differential backup:

BACKUP DATABASE AdventureWorks
TO DISK = 'C:\diff_bk_AdventureWorks.dif'
WITH DIFFERENTIAL

The file or file group backup method is an alternative way to perform a full backup. This method lets you quickly restore a database to working condition. However, you can perform a file or file group backup only under certain conditions. To use this method, the database must have been created with multiple files or file groups. The following code sample shows a file backup:

BACKUP DATABASE AdventureWorks
FILE = 'AdventureWorks_data'
TO DISK = 'C:\file_bk_AdventureWorks.data'

Q: What are the recovery models in SQL Server?
A: SQL Server has three recovery models: full, bulk-logged, and simple.

In the full recovery model, all changes to the database are logged, so the database can be restored to the point of failure using full backups and log backups. Use the full recovery model for essential databases that are updated frequently. The following code sample sets the full recovery model:

ALTER DATABASE AdventureWorks
SET RECOVERY FULL

In the bulk-logged recovery model, all changes to the database are logged except high-speed bulk operations, which are only minimally logged. (Don't let the name confuse you; this model doesn't fully log bulk processes.) Database performance isn't compromised while bulk operations are in process. The following code sample sets the bulk-logged recovery model:

ALTER DATABASE AdventureWorks
SET RECOVERY BULK_LOGGED

The simple recovery model lets you restore a database only to the point of its last full backup. A database that isn't updated frequently is a candidate for the simple recovery model. The following code sample sets the simple recovery model:

ALTER DATABASE AdventureWorks
SET RECOVERY SIMPLE
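To confirm which model a database is currently using, one option (available in SQL Server 2005 and later) is to query the sys.databases catalog view:

```sql
-- Returns FULL, BULK_LOGGED, or SIMPLE for the named database
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'AdventureWorks'
```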

Q: Does the RESTORE HEADERONLY command change or restore the header of a BACKUP file?
A: This command doesn't change anything in the header. It simply returns a row of backup header information for each backup set on a particular backup device. The following code sample illustrates using RESTORE HEADERONLY:

RESTORE HEADERONLY
FROM DISK = 'C:\file_bk_AdventureWorks.data'
WITH NOUNLOAD;

Q: What’s a tail-log backup, and when do you use it?
A: The tail-log backup requirement first appears in SQL Server 2005 and applies to all subsequent versions. When you recover a database that uses the full or bulk-logged recovery model, you might find that some transaction log entries haven't been backed up yet. Backing up this final set of transaction log entries is known as a tail-log backup. SQL Server 2005 requires a tail-log backup before a database is restored, to ensure that no data is accidentally lost. The following code sample shows a tail-log backup:

BACKUP LOG AdventureWorks
TO DISK = 'C:\taillog_bk_AdventureWorks.trn'
WITH NORECOVERY
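Putting it together, a sketch of recovering to the point of failure after taking the tail-log backup above (file names follow this article's samples):

```sql
-- Restore the last full backup without recovering the database.
RESTORE DATABASE AdventureWorks
FROM DISK = 'C:\full_bk_AdventureWorks.bak'
WITH NORECOVERY

-- Apply any intermediate log backups here, in order, WITH NORECOVERY.

-- Finish with the tail-log backup and bring the database online.
RESTORE LOG AdventureWorks
FROM DISK = 'C:\taillog_bk_AdventureWorks.trn'
WITH RECOVERY
```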

For more SQL Server tips, check out the blog at sqlmag.com/go/SQLskills.

Friday, May 2, 2008

Missed Virtual Tech Days? Catch up on the recorded sessions
Windows Server 2008 | Visual Studio 2008 | SQL Server 2008




This event is now over.

Thank you for your overwhelming response that made the Microsoft Virtual TechDays a GRAND Success!

Held on April 9th & 10th, 2008, Microsoft Virtual TechDays provided an opportunity for the developer and IT Pro community to dive deep into the latest offerings from Microsoft - Windows Server 2008, Visual Studio 2008, and SQL Server 2008. During the 2-day event, Microsoft experts conducted 32 live training sessions on how to harness the power and possibilities of these products.
The recordings and presentations of these sessions are now available for download.


Note: To view the recorded sessions you need Windows Media Player 9.0 or above.

Day 1 - April 9, 2008

Tracks: Windows Server 2008 | SQL Server 2008 - Business Intelligence and Admin | SQL Server 2008 - RDBMS | Visual Studio 2008 Fundamentals

Recordings and PPTs are available for each session unless marked "PPT only".

Windows Server 2008 Technical Overview (Speaker: Sathya Narayana KP)
Analysis Services redefined with SQL Server 2008 (Speaker: Sivakumar Harinath)
Compression Discovered with SQL Server 2008 (Speaker: Sunil Agarwal)
What's new in VS 2008 for Developers? (Speaker: Sarang Datye)
Windows Server 2008 Building High Availability Infrastructures (Speaker: Ramnish Singh)
SSIS and RS improvements in SQL Server 2008 (Speakers: Kuldeep Chauhan and Nirmal Kohli; PPT only)
Upgrading to SQL Server 2008 - Things to keep track? (Speaker: Vinod Kumar)
.NET 3.5 - Language Improvements, LINQ (Speaker: Sarang Datye)
Web and Application Platform in Windows Server 2008 (Speaker: Parag Paithankar)
Manage SQL Server resources using Resource Governor (Speaker: Vinod Kumar; PPT only)
T-SQL Enhancements with SQL Server 2008 (Speaker: Praveen Srivatsav)
Visual Studio Team System 2008 - Make the Most of VSTS in Real-World Development (Speaker: Srikanth Ramakrishnan)
Windows Server 2008 Virtualization Technologies (Speaker: Guruprakash S)
Partitioning Enhancements with SQL Server 2008 (Speaker: Vinod Kumar Jagannathan, IDC)
Capturing changed data using SQL Server 2008 (Speaker: Srikant Jahagirdar; PPT only)
Visual Studio Team System 2008 Team Foundation Server - Manage enterprise-level application development (Speaker: Ajay Bhandari)


Day 2 - April 10, 2008

Tracks: Windows Server 2008 | SQL Server 2008 High Availability & Security | Visual Studio 2008 Web Applications | Visual Studio 2008 Smart Clients

Recordings and PPTs are available for each session unless marked "PPT only".

Windows Server and Vista Solid Enterprise (Speaker: Sudhir Rao)
Encryption enhancements with SQL Server 2008 (Speaker: Kuldeep Chauhan)
Intro to ASP.NET 3.5 using VS 2008 (Speaker: Chaitra Nagaraj)
WPF Development using Visual Studio 2008 (Speaker: Bijoy Singhal)
Centralised Application Deployment Using Terminal Services in Windows Server 2008 (Speakers: Shivesh Ranjan and Pushkar Chitnis)
Audit Logging SQL Server 2008 Databases (Speaker: Karthik Bharathy, IDC)
Developing Data Driven Applications using ASP.NET 3.5 Extensions (Speaker: Yasin Mukadam)
Connected Applications using WPF, WCF and WF (Speaker: Sarang Datye)
Windows Server 2008 Security and Compliance Technologies (Speaker: Ravi Sankar)
Availability Enhancements with SQL Server 2008 (Speaker: L Srividya)
Rich Interactive Applications using ASP.NET AJAX 3.5 & Silverlight (Speaker: Vasuki Parthasarathy)
Office Application Development (Speaker: Praveen Srivasta)
Windows Server 2008 Identity and Access Technologies (Speaker: Ravi Sankar)
SQL Server 2008 - Policy-Based Management (Speaker: Karthik Bharathy, IDC; PPT only)
ASP.NET MVC Framework - Exploring the pattern (Speaker: Sarang Datye)
Mobile Development using .NET CF 3.5 (Speaker: Abhinav Gujjar)




Thursday, February 28, 2008

[DRAFT]

Planning SQL Course



This list helps students plan their SQL self-training. An introduction and good external links are also available here for each topic.


"adv:" means advanced sub-topics



Database operations (create, alter, drop, truncate, select, insert, delete, ...)


Basic Topics

Database objects:
0) database basics

introduction to various database objects (Database, Tables, Views, Stored Procedures, Triggers, Functions, Constraints (PK, FK, UK, DF, NULL, ...), Indexes, Statistics, ...)

DDL DML

Variables (declaring, assigning value, scope)
Constants http://www.sqlservercentral.com/articles/T-SQL/61244/
Data types
1) Tables (create, alter, drop, insert, update, delete, truncate operations)
identity property basics
adv: identity property advanced topics (various related functions, the behaviour of identity column on delete, truncate, insert, update operations)
adv: identity_insert
) select statement
from clause
where clause
order by clause
Aliases
Group by clause
Having clause
Operators used in where clause:
AND
OR
In
Between
=, <, >, <>
IS, NOT, EXISTS,
IN
) Insert statement
with list of values
with results from a select statement
adv: complex insert having joins
) Update statement
simple update
updating multiple columns in a single query
adv: complex update having joins
) Delete

simple delete

adv: complex delete having joins
) Truncate
truncate vs delete
truncate vs delete vs drop


6) Constraints (PK, FK, UK, DF, NULL.......)
types of constraints

significance of constraints while operating on parent-child tables.

8) system functions and exercises on system functions (to test if the student is able to pick the right function for the right purpose)

String Functions

Date Functions

Aggregate Functions

Exercises on system functions

* print statements like the following (given DOB and Name as input)
My Birthday is on <date>

My Birthday falls on <week day>

My Birthday is on <day>th of <month>

This year my birthday is on <week day>

My age is <age>

My first name is (using sub string)

My last name is (using sub string)

My middle name is (using sub string)

Length of my name is <length>
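A sketch of a couple of the exercises above (@DOB and @Name are hypothetical inputs; DATEDIFF(yy, ...) counts year boundaries, so an exact age needs a small adjustment):

```sql
-- Hypothetical inputs for the exercises
DECLARE @DOB datetime, @Name varchar(100)
SELECT @DOB = '19850417', @Name = 'John Paul Smith'

PRINT 'My Birthday is on ' + CONVERT(varchar(10), @DOB, 120)
PRINT 'My Birthday falls on ' + DATENAME(weekday, @DOB)
PRINT 'My age is ' + CONVERT(varchar(3), DATEDIFF(yy, @DOB, GETDATE()))
PRINT 'Length of my name is ' + CONVERT(varchar(3), LEN(@Name))
```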


9) Sub-queries

exercises on Aggregate Functions: running total and a running count using sub-queries

10) JOINS (self, cross, inner, left outer, right outer)

Various types of joins and their applications

Alternative queries to various types of joins

JOIN vs IN vs EXISTS: http://mrdee.blogspot.com/2007/05/sql-server-join-vs-in-vs-exists-logical.html


2) Views
create view
adv: various operations on view and their effect on the underlying tables
adv: schema binding
adv: cascading property

adv: system views

Exercises on System Views:

1) write a query to list all tables in the database that have a column name starting with 'emp'
2) write a query to list all tables, and their columns, that have a column of type int

3) Write a stored procedure that generates the INSERT statements for a given table
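As a sketch of the first exercise above (using the standard INFORMATION_SCHEMA views rather than vendor-specific system tables):

```sql
-- List all tables that have a column whose name starts with 'emp'
SELECT DISTINCT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE 'emp%'
```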

3) Stored procedures

4) Triggers
trigger basics

types of events which can be triggered

virtual tables

Types of triggers

disabling and enabling a trigger

adv: trigger settings (db level and server level)

5) Functions
basic system functions (string functions, aggregate functions, ...)

functions basics

Write simple functions

* to add 2 values

* to find the remainder of a division

* to find the quotient of the division

Write moderate complex functions

* to return 'N' Perfect Numbers



Intermediate topics:

) Basic topics that are marked as "adv:"

) user defined datatypes
) Union, Union ALL
) dynamic SQL
) Select Into vs Insert into
) Database (create, drop ... operations)

Assignment: design a database for a telephone bill loading system (for example, use my Reliance mobile bill)

Question on Database Design: How can we keep track of inserts and updates of records in the database, i.e., who inserted the record, who updated it, when it was inserted, and when it was updated?
) cross db queries
) Correlated subqueries
) delete vs drop vs truncate

) DISTINCT vs GROUP BY

* www.sqlmag.com/Article/ArticleID/24282/sql_server_24282.html
* http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/

) Admin: Backup & Restore

) Error handling exercises:
modify the previously written code to include error handling:
generate Prime numbers
generate Perfect numbers
generate maths tables


Advanced topics:
1) Indexes

2) Performance tuning
3) Statistics
4) cascaded views and other advanced topics related to views
5) linked server

6) COLLATE


Monday, February 25, 2008

SQL Optimization Methods

January 3rd, 2007

SQL optimization is a transformation of SQL statements that produces exactly the same result as the original query. This process requires a lot of creativity and intuition; it is more an art than a science.

It is impossible to teach it in a small article, so I won't even try. Instead, I want to limit myself to classifying the types of optimization, so that when you are asked to optimize a database, you know what to ask and what to expect.

The result produced by the optimized code must be exactly the same as it was before: the second-worst nightmare of a DBA (the first one is losing data) is someone running to him crying, "You know, after your last optimization all customer balances became negative!" So I will focus on the potential danger at every level.

For that reason, all types of optimization are sorted in order of increasing intrusiveness, and I begin with non-intrusive optimization.

0. Non-intrusive optimization

This means upgrading the SQL Server hardware and changing SQL Server parameters. Adding more memory or processor power won't harm anything, and you don't need to test your product again.

When upgrading hardware, you can spend a lot without any effect: it is possible that your server is suffering from locks, so the memory you add is added in vain. There is an easy way to determine what, in general, is the most serious problem on the server: CPU power, I/O, or locks. To do it, use Lakeside Upgrade Advisor.

You can accelerate your I/O by putting the log and data on different devices and by separating indexes from data, but unfortunately, it is more and more difficult to apply these good old recommendations in the modern world. In many companies, 'a server' is built from the same specification no matter whether it is a file server or a database server. One big RAID-5 drive is almost standard for small servers. If you ask for separate drives for log and data, they will give you 2 separate LUNs on the same RAID-5 drive and on the same physical channel, which is absolutely useless.

For bigger servers, more and more companies use big disk arrays such as EMC SANs; these arrays are very good for OLAP systems but bad for OLTP.

Changing SQL Server parameters is also relatively safe, because you don't change the code. Unfortunately, there is no parameter 'TURBO=TRUE', so you can play with these parameters, but in most cases SQL Server's self-tuning is quite good, and the effects of tuning can be too small to measure.

For given hardware, by applying all the recommendations you can find on the Internet, you can gain 20-30%, even less if there is nothing really wrong with your server. So if your queries are running several times slower than expected, you have no choice but to change your T-SQL code.

Potential danger: None

1. Indexes

On this level of optimization, we change the database by creating and changing indexes, but we don’t touch the code.

Of course, any DBA is very conservative about changing the code, especially in production. In some cases it is possible to make modifications that aren't really considered 'changes' by conventional programming standards.

First of all, anything you do with indexes can affect the execution plan, but the result is guaranteed to be the same (unless there are bugs in SQL Server; but Microsoft spent much more time testing it than is spent on any query you work with, so below we can treat SQL Server as a 'bug-free' product, which is not literally true but is almost true in comparison with the quality of our code).

You can create and drop indexes, change index types (clustered/non-clustered), and include or exclude columns from indexes. On SQL Server 2005, check the new INCLUDE index option. Don't try to index everything; you will slow down updates. If SQL Server does not use an index, there may be a very good reason behind that decision.

Potential danger: Is it absolutely safe? For most databases, it is. SQL Server will automatically adjust plans based on new indexes. However, if your database uses not only DML but also DDL, generating tables and indexes on the fly, then it is possible that some

exec('drop index '+@indname)

would fail. So be careful.
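As a sketch of the INCLUDE option mentioned above (table and column names are hypothetical): adding non-key columns to a nonclustered index lets a query be answered from the index alone, without key lookups.

```sql
-- SQL Server 2005+: OrderDate is the search key; Amount is stored
-- only in the index leaf level, so the index covers queries that
-- filter on OrderDate and return Amount.
CREATE NONCLUSTERED INDEX IX_Orders_Date
ON Orders (OrderDate)
INCLUDE (Amount)
```

A query such as SELECT Amount FROM Orders WHERE OrderDate = '20070301' can then be satisfied by an index seek alone.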

2. Hints and options

To use a hint, you must modify existing SQL code. Hence, it is a modification, and theoretically you need to retest the whole system. But under the assumption that SQL Server is almost bug-free, a query with any hint produces exactly the same result as without the hint.

So check the execution plan and try the following:

• (INDEX=…) - but be careful: usually, when SQL Server ignores an index, there is a very good reason why.
• OPTION (HASH JOIN) - try the other join types as well.
• OPTION (FORCE ORDER)
• OPTION (FAST 1) - these three options are good candidates to try even without much analysis; sometimes one of them helps.
• (NOEXPAND) for indexed views
• WITH RECOMPILE for a stored procedure
• and some others…

But if that does not help, there is no other choice than to change the code.

Potential danger:
• Some modifications can have a catastrophic impact on performance.
• Some options can result in an error of the type

‘Query processor could not produce a query plan because of the hints defined in this query.’

• Some queries can fail on some data sets, check

OPTION (ROBUST PLAN)

for details
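For instance, here's a sketch of forcing an index and a join type on a hypothetical Northwind-style query (names are illustrative; always compare the plans and I/O statistics before and after):

```sql
-- Force the optimizer to use a specific index and a hash join.
SELECT o.OrderID, c.CompanyName
FROM Orders o WITH (INDEX(CustomersOrders))
JOIN Customers c ON c.CustomerID = o.CustomerID
OPTION (HASH JOIN)
```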

3. Syntax-level optimization

The most primitive type of optimization is based only on the text of a query. We don't even know the underlying database structure, so we are limited to the subset of syntax transformations that do not change the result in any case. For example:

select 'Low',A,B from T
union
select 'High',A1,B1 from T1

In this case we merge both result sets together and get rid of duplicates. However, there can be no duplicates, because the first column is different for the two subqueries. Hence we can rewrite it as:

select 'Low',A,B from T
union all
select 'High',A1,B1 from T1

Another example.

update T set Flag='Y'
update T set Ref=0

In both queries we affect all the records, so we can change both columns at the same time:

update T set Flag='Y',Ref=0

We can merge together updates in even more complicated cases:

update T set Flag='Y'
update T set Ref=0 where Comment like '%copied%'

Both updates use a full table scan, so let's do both updates during the same pass:

update T set Flag='Y',
Ref = case when Comment like '%copied%'
then 0
else Ref -- keep the same
end

A Trap:
This type of optimization can be dangerous. For example:

update T set Ref=0
update T set Amount=StoredValue

It is almost the same case as before, so we can rewrite it as:

update T set Ref=0,Amount=StoredValue

Right? Yes, this transformation is correct… in 99% of cases, but not always. The possible problem is that StoredValue can be a computed column that depends on Ref. For example, let's say it is defined as StoredValue = Ref + 1.

The original statements set Ref to 0 and Amount to 1. Our optimized query works differently, because UPDATE first evaluates all right-hand sides and only then assigns them to the columns. Therefore, it will set Ref to 0, but Amount will be set to the old value of Ref plus 1.
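A small repro sketch of the trap (table and column names follow the example; StoredValue is a computed column, as assumed above):

```sql
-- StoredValue is computed from Ref, so the combined UPDATE reads
-- the OLD value of Ref when evaluating StoredValue.
CREATE TABLE T (
    Ref int NOT NULL,
    Amount int NULL,
    StoredValue AS Ref + 1
)
INSERT INTO T (Ref, Amount) VALUES (5, 0)

-- Combined form: Amount becomes old Ref + 1 = 6, not 1.
UPDATE T SET Ref = 0, Amount = StoredValue

SELECT Ref, Amount FROM T   -- Ref = 0, Amount = 6
```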

Potential danger: As demonstrated above, we must test the code after this kind of modification. This level and all the levels that follow are intrusive and require verification.

4. When we know database schema

When we know the database schema and have a list of indexes, we can make more serious changes. For example, when we see:

update T set Flag='A' where Opt=1
update T set Flag='B' where Opt=2

We can check whether there is an index on the column Opt. Assuming that an index is created on a column with high selectivity, we can leave these statements as is. However, if there is no index on Opt, we can rewrite them as:

update T
set Flag=case when Opt=1 then 'A' else 'B' end
where Opt in (1,2)

to perform both updates during a single scan.

Potential danger: Testing is required

5. When we have access to the actual data and know its meaning

We can do this only when we have access to the actual data. In the example above, we could determine whether Opt is a column with high selectivity. To do it, we compare the results of 2 queries:

select count(*) from T
select count(distinct Opt) from T

What is more important, we can find values with irregular selectivity and handle them properly. Check this article to read more about irregular selectivity: http://www.lakesidesql.com/articles/?p=8

But sometimes table and column names help, so that we understand the situation even without looking at the actual data. In which case, do you think, does SQL Server use an index, and where does it use a full table scan?

update Patients set Flag='Y' where Gender='F'
update Sales set Flag='N' where Id=13848

In the first case SQL Server uses a table scan, because the Gender column (M/F) has extremely low selectivity (about 50%). We can even guess that Flag is a low-selectivity column in both cases, with values 'Y', 'N', and probably a few others. But syntactically, both queries are identical.

Another example. Assuming that Gender is NOT NULL, does this statement update all the records?

update Patients set Flag=0 where Gender in ('M','F')

Yes. Now let’s change the table name:

update Nouns set Flag=0 where Gender in ('M','F')

Here we update nouns, which have a grammatical category of gender that may be 'M', 'F', or neuter! As you see, the only difference is the table name, so we actually guessed the meaning.

We can base our optimization on many real-world constraints, such as:
• age is less than 255 and non-negative,
• mass is positive,
• InternalId (an int identity) correlates with the date of the document,
• SSN is unique.

Potential danger: Requires not only testing but also verification that the constraints are not violated now (such verification is possible by querying the existing data) and will not be violated in the future (not all data may have been inserted yet). You may find that:
• a special value of -1 is used for age as some weird flag,
• mass can be negative because sometimes it is the delta between the actual and expected values,
• InternalId correlates with the date of the document except for documents imported from another database a year ago,
• SSN is unique, except for the value 'N/A', used for all illegal immigrant workers.

6. Actual or expected workload

Some solutions have 2 sides: they accelerate data retrieval, for example, but slow down modification. A very good example is the technique of 'indexed' or 'materialized' views. Indexed views can accelerate selects, but they slow down any updates significantly, especially massive updates.

So before you make any modifications, you should know whether the data is relatively static, how many updates you expect per select, etc.

On an actual data load, the best way to make such an analysis is to use Lakeside Trace Analyzer.


When SQL Server Query Optimizer Is Wrong

http://www.lakesidesql.com/articles/?p=18

In most cases, SQL Server Optimizer generates optimal plans. It is impossible to compete with its internal knowledge of average disk access cost, record length or page fill ratio. But, there is one area where human expertise is always superior.

To follow my example, execute the following scripts on an empty database to create 2 tables (you can skip this step and look directly at the results). Each sale in Sales (the master table) is associated with rows in the details table, SaleItems.

create table Sales (
Id int primary key,
Amount money not null,
Date datetime not null,
Comment varchar(128)
)
GO
create table SaleItems (
SaleId int,
ItemId int not null,
Quantity int
)
GO

The Sales table (master table) will have 10,000 random records. The SaleItems table (details table) will have 0 to 9 items for each sale record in the Sales table, one for each value of @n modulo 10.

Execute the following scripts to populate both tables and create indexes:

set nocount on
GO
declare @n int, @m int
set @n=10000
while @n>0 begin
insert into Sales select @n,$10.0*(@n%100)+$100.,dateadd(hh,-@n,'20070507'),
'This is sale N '+convert(varchar,@n)
set @m=@n%10
while @m>0 begin
insert into SaleItems select @n,@m,@m+@n
set @m=@m-1
end
set @n=@n-1
end
GO
create index SaleItems_Id on SaleItems (SaleId)
GO
create index Sales_Dates on Sales (Date)
GO

Note that Sales.Date varies from 2006-03-16 08:00 to 2007-05-06 23:00.

Now, we make a first attempt to write a stored procedure that retrieves the maximum item quantity for sales in a specified period of time.

create procedure GetMaxQuantity
@p1 datetime, @p2 datetime
as
select max(Quantity) from SaleItems where SaleId in (
select Id from Sales where Date between @p1 and @p2)
GO

Before running the procedure, let’s enable IO Statistics by executing:

set statistics io on

It is also important that we always clear the buffer cache before executing the procedure. Otherwise, the number of physical reads will not be accurate:

dbcc dropcleanbuffers

Now, let’s execute the stored procedure:

exec GetMaxQuantity '20070301','20070302'

In my test environment, I get the following IO statistics:

Table 'SaleItems'. Scan count 25, logical reads 191, physical reads 1, read-ahead reads 4.
Table 'Sales'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0.

Let’s again execute the procedure with a different date range:

dbcc dropcleanbuffers

exec GetMaxQuantity '20060301','20070302'

This time, the following IO statistics are returned:

Table 'SaleItems'. Scan count 8417, logical reads 55564, physical reads 90, read-ahead reads 173.
Table 'Sales'. Scan count 1, logical reads 17, physical reads 1, read-ahead reads 16.

It is important that you execute the first query before the second: the plan compiled for the first, narrow date range is cached and reused for the second. It is no surprise that the second execution, for the whole year, requires many more logical and physical reads. However, 55,564 logical reads is too high. So what happened?

Let's examine the execution plan. SQL Server provides many counters for each operator in the plan, and while for most of these numbers we can only trust SQL Server, there is one value we can verify: the estimated row count. SQL Server estimates that selecting from Sales returns around 23 records, and that after joining to SaleItems, around 118 records remain. Fortunately, this can be verified:

select count(*) from Sales where Date between '20060301' and '20070302'

In fact, there are 8,417 records instead of the 23 expected by SQL Server's query optimizer, and

select count(*) from SaleItems where SaleId in (
select Id from Sales where Date between '20060301' and '20070302')

37,884 records instead of the expected 118! What a mistake. If only SQL Server knew this, it would have generated a totally different execution plan:

select max(Quantity) from SaleItems where SaleId in (
select Id from Sales where Date between '20060301' and '20070302')

As you can see, when constants are supplied, SQL Server can obtain information from the statistics tables. When it realizes there are too many records, it changes the execution plan and, instead of costly Lookups, uses a Hash Join.
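One common workaround (my addition, not part of the original article) is to force the plan to be compiled with the actual parameter values, for example by creating the procedure WITH RECOMPILE:

```sql
-- Recompiling on every call lets the optimizer evaluate the actual
-- @p1/@p2 values against the statistics, at the cost of one
-- compilation per execution.
CREATE PROCEDURE GetMaxQuantity2
    @p1 datetime, @p2 datetime
WITH RECOMPILE
AS
SELECT MAX(Quantity) FROM SaleItems WHERE SaleId IN (
    SELECT Id FROM Sales WHERE Date BETWEEN @p1 AND @p2)
GO
```

On SQL Server 2005 and later, OPTION (RECOMPILE) on the statement itself achieves a similar effect.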

The SQL Server query optimizer is a black box. However, we can sometimes make some interesting experiments.

Execute the following queries:

select * from Sales where Comment like '%A%'
select * from Sales where Comment like '%AA%'
select * from Sales where Comment like '%AAA%'
select * from Sales where Comment like '%AAAA%'
select * from Sales where Comment like '%AAAAA%'
select * from Sales where Comment like '%AAAAAA%'
select * from Sales where Comment like '%AAAAAAA%'
select * from Sales where Comment like '%AAAAAAAA%'

None of these queries returns any rows, so all the estimates are wrong by definition. Still, SQL Server's query optimizer produces an estimated number of returned rows for each query, shrinking as the pattern grows longer.

As you can see, it is nothing more than a pure guess based on a heuristic formula.

This is the most common scenario: SQL Server underestimates the number of rows and queries tables thousands or even millions of times. On the other hand, it can overestimate the number of rows and use a Hash Join, reading the whole table when it actually needs only a few rows.

SQL Server knows the physical layout of the data much better than we do. It even keeps some primitive per-column statistics. SQL Server 2005 can even trace the correlation of data in different tables. However, don't expect too much from it. It will never understand that Name LIKE '%Smith%' returns more rows than expected, or that joining data from the sales tables of 2 different departments would hardly give any matches; we did it just to confirm that fact.


All the notes to learn SQL Server, including my own learnings.