Problem:
How much of an improvement do statistics make in terms of query performance?
In my last few posts, I went into the details of how to create, update and check the statistics of a table or indexed view's columns.
DBCC SHOW_STATISTICS displays the current query optimization statistics for a table or indexed view. The query optimizer uses statistics to estimate the cardinality, or number of rows, in the query result, which enables it to create a high-quality query plan. For example, the query optimizer could use cardinality estimates to choose the index seek operator instead of the index scan operator, improving query performance by avoiding a resource-intensive index scan.
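As a quick recap from the earlier posts, the command can return the header, the density vector and the histogram separately (a small sketch, assuming the dbo.Employees table and the 'employeeid' statistics object used later in this post):
-- Header: rows, rows sampled, last updated
DBCC SHOW_STATISTICS ('dbo.Employees', 'employeeid') WITH STAT_HEADER;
-- Density per column prefix
DBCC SHOW_STATISTICS ('dbo.Employees', 'employeeid') WITH DENSITY_VECTOR;
-- Histogram, up to 200 steps
DBCC SHOW_STATISTICS ('dbo.Employees', 'employeeid') WITH HISTOGRAM;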
Let’s see what improves and what difference it makes after we update the statistics on a table or column.
Example 1:
In this example, I would like to show how we can make use of statistics to improve performance and what happens if we keep the AUTO_UPDATE_STATISTICS option set to OFF.
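For reference, here is a minimal sketch of the table the steps below assume; the original definition is not shown in this post, so the column types and index definition are assumptions:
-- Hypothetical definition of the test table used in the examples.
CREATE TABLE dbo.Employees
(
    EmployeeId  INT IDENTITY(1,1) PRIMARY KEY,
    [Name]      VARCHAR(50),
    PhoneNumber INT,
    Company_Id  INT
);
-- Example 2 also references an index named ix_name_phonenumber;
-- its key columns are assumed here.
CREATE INDEX ix_name_phonenumber ON dbo.Employees ([Name], PhoneNumber);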
Step 1: Turn off automatic statistics creation and updates on the database.
ALTER DATABASE DB SET AUTO_CREATE_STATISTICS OFF
ALTER DATABASE DB SET AUTO_UPDATE_STATISTICS OFF
ALTER DATABASE DB SET AUTO_UPDATE_STATISTICS_ASYNC OFF
If you go to the database properties now, we can see these options are set to False, confirming that automatic statistics updates are turned off.
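The same settings can also be checked with a query instead of the properties dialog (a small sketch against sys.databases; replace DB with your database name):
select name,
       is_auto_create_stats_on,
       is_auto_update_stats_on,
       is_auto_update_stats_async_on
from sys.databases
where name = 'DB';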
Step 2: Insert a large number of rows into the Employees table and check again whether the density and ranges in the statistics are still correct.
-- Each execution inserts one row with a random name, a phone number picked at random
-- from a fixed list, and a company id picked at random from master..spt_values.
-- GO with a count repeats the batch 100,000 times (one row per execution).
insert into dbo.Employees ([Name], [PhoneNumber], Company_Id)
select
    substring(cast(newid() as varchar(68)), 1, 6) as [Name]
  , (
        select top (1) cast(c1 as int)
        from (values ('1234564'), ('2456464'), ('464646'), ('4644949'), ('74997979')) as T1(c1)
        order by abs(checksum(newid()))
    ) as Ph
  , (
        select top (1) cast(c1 as int)
        from (select number from master..spt_values) as T1(c1)
        order by abs(checksum(newid()))
    ) as Company_Id
GO 100000
DBCC SHOW_STATISTICS(Employees,'employeeid')
This is how it looks after inserting the rows above: the new rows still don't show up in the statistics, because automatic statistics updates are switched off.
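One way to confirm this from T-SQL is sys.dm_db_stats_properties, which shows the row count the statistics were built on and how many modifications have happened since the last update (a sketch, assuming SQL Server 2008 R2 SP2 / 2012 SP1 or later):
select s.name as stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter   -- rows changed since the statistics were last updated
from sys.stats as s
cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) as sp
where s.object_id = object_id('dbo.Employees');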
Step 3: Update Statistics to refresh the index statistics
on the database.
exec sp_updatestats
OUTPUT:
Updating [dbo].[Employees]
[_WA_Sys_00000001_2057CCD0], update is not necessary...
[_WA_Sys_00000004_2057CCD0] has been updated...
[_WA_Sys_00000002_2057CCD0] has been updated...
[_WA_Sys_00000005_2057CCD0] has been updated...
[_WA_Sys_00000003_2057CCD0] has been updated...
4 index(es)/statistic(s) have been updated, 1 did not require update.
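Before rechecking the histogram, STATS_DATE can confirm when each statistics object on the table was last refreshed (a quick sketch on the same dbo.Employees table):
select name as stats_name,
       stats_date(object_id, stats_id) as last_updated
from sys.stats
where object_id = object_id('dbo.Employees');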
Step 4: Check to see if there are any improvements in the density and
ranges.
DBCC SHOW_STATISTICS(Employees,'employeeid')
Step 5: Time to test some query performance. I chose to run a query with a range predicate so that it falls within the histogram ranges and we can see how the rows are scanned.
select * from employees
where employeeid >= 100021 and employeeid <= 102380
I would like to compare the row counts in the query cost, i.e. estimated rows vs. the actual number of rows, as this makes a lot of difference in the query plans.
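One way to capture both numbers without the graphical plan is SET STATISTICS PROFILE, whose output includes the actual (Rows) and estimated (EstimateRows) row counts per operator (a sketch; run it before and after the statistics update and compare):
set statistics profile on;

select * from dbo.Employees
where employeeid >= 100021 and employeeid <= 102380;

set statistics profile off;
-- Compare the Rows (actual) and EstimateRows columns for the seek/scan operator.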
After comparing the execution plans of the same query before and after updating the statistics, the actual vs. estimated row counts inside the scan showed a huge difference. Updating the statistics at regular intervals on an index or table column that is used as a predicate in a number of queries keeps the execution plans healthy.
Example 2:
DBCC SHOW_STATISTICS (Employees, 'ix_name_phonenumber') WITH HISTOGRAM;
RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS  | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS
1981         | 26767.57   | 592.936  | 45                  | 594.835
1997         | 8255.367   | 587.2891 | 15                  | 550.3578
2042         | 26438.72   | 525.1719 | 44                  | 600.88
4096         | 8351.755   | 4833.841 | 10                  | 832.2198
8275         | 8680.609   | 1129.402 | 7                   | 1235.528
16389        | 8068.261   | 649.4062 | 8                   | 1004.882
17222        | 8374.435   | 1157.637 | 9                   | 927.1665
17985        | 8062.591   | 1174.578 | 7                   | 1147.564
18259        | 8436.804   | 1253.636 | 7                   | 1200.827
19523        | 8476.493   | 1112.461 | 7                   | 1206.476
20047        | 9423.365   | 1067.285 | 8                   | 1173.657
20816        | 8334.746   | 632.4651 | 8                   | 1038.073
21318        | 9661.501   | 1287.518 | 9                   | 1069.663
21581        | 7682.708   | 1129.402 | 7                   | 1093.495
22099        | 9814.588   | 1168.931 | 8                   | 1222.383
8388608      | 8674.939   | 1208.46  | 11                  | 785.8628
1073741824   | 3498.325   | 1293.165 | 5                   | 696.9646
1469283328   | 2.392718   | 643.7592 | 0                   | 741.0588
Here is a small visualization of what a histogram means in terms of the RANGES and rows.
Let’s compare querying a RANGE_HI_KEY where we have fewer distinct range rows against one where we have more, and see how the execution costs of the same query differ between the execution plans.
select * from employees where company_id = 1469283328
OPTION (RECOMPILE);
As we can see, the cost is 2.02; based on that RANGE_HI_KEY there are only 2.39 range rows, which keeps the cost of execution low.
select * from employees where company_id = 4096
OPTION (RECOMPILE);
As we can see, the cost is 5.4817; based on that RANGE_HI_KEY there are 8351 range rows, which drives up the cost of execution.
So depending on the range the predicate values fall into, the cost of a query becomes cheaper or more expensive. Statistics play a vital role in how the query optimizer estimates cost and builds the execution plan.
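A minimal sketch of such a regular refresh, assuming the predicate column is covered by the statistics behind the index from Example 2; in practice this would be wrapped in a SQL Agent job or a maintenance plan:
-- Refresh only the statistics behind one index.
update statistics dbo.Employees ix_name_phonenumber with fullscan;

-- Or refresh every statistics object on the table.
update statistics dbo.Employees with fullscan;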