SQL Server MS Blogs

How It Works: FileStream (RsFx) Garbage Collection–Part (2)

SQL Server SQL CSS - Fri, 04/04/2014 - 08:45

In a previous post I outlined the basics of FileStream garbage collection: http://blogs.msdn.com/b/psssql/archive/2011/06/23/how-it-works-filestream-rsfx-garbage-collection.aspx

This post continues the discussion, outlining specific details as to how the garbage collection progresses.

A Single GC Thread

The instance of SQL Server contains a single background worker performing the garbage collection activities.   The FSGC worker wakes up every ~5 seconds and loops over each database, managing the tombstone table entries in small batches.

foreach database
      Do File Stream Garbage Collection (batch size)

Fixed Batch Size

The batch size for the background file stream garbage collection (FSGC) worker is currently 20 rows (a hard-coded value).   This allows FSGC to remain an unobtrusive background process.

Note:  The type of tombstone row and its status determine the action FSGC takes.   For example, a delete of a row enters a single delete entry into the tombstone table, while a truncate table may enter only a single row in the tombstone table; FSGC understands the single-file-delete vs. truncate status and may take broader action.

sp_filestream_force_garbage_collection (http://technet.microsoft.com/en-us/library/gg492195.aspx)

If you are like me, the first thing you did was the math on 20 rows per batch every 5 seconds and determined there is a finite number of rows FSGC can process in a 24-hour period.  Enter sp_filestream_force_garbage_collection, a new feature in SQL Server 2012.

The procedure allows you to execute the FSGC as a foreground process without the batching limits.   This can be helpful if you have a large number of changes (inserts/updates/deletes) to file stream data, where the tombstone rows and associated disk files have grown to a large number you wish to clean up aggressively.
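As a rough sketch (the database name is a placeholder and optional parameters are omitted), a manual invocation looks like this:

-- Run file stream garbage collection as a foreground task for one database.
-- Requires SQL Server 2012 or later; N'MyFileStreamDb' is a placeholder name.
EXEC sp_filestream_force_garbage_collection @dbname = N'MyFileStreamDb';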

Delete and Status

When you update a file stream column or delete a row, an entry for the delete is placed in the tombstone table.

For example:

update tblDocuments set FSData = 'New Image' where PK = 1

Checkpoint and Log Backup: The 'Orig Image' can't be deleted until it is properly synchronized with the backup stream activities.   The first log backup after the update secures both the Orig and New Image.  This preserves the ability to restore to a point in time just before the update.

If you select from the tombstone table before FSGC and the proper checkpoint/log backup have executed, the status column for the entry will contain a value ending in 7 (####7), such as 17 or 217.
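The tombstone tables are internal, so the sketch below only locates them; it assumes they are listed in sys.internal_tables with a tombstone-style internal_type_desc, and selecting from them directly generally requires the dedicated admin connection.

-- Locate the internal FILESTREAM tombstone table(s) in the current database.
-- Assumption: internal_type_desc contains the word TOMBSTONE on your build.
SELECT name, internal_type_desc, object_id
FROM sys.internal_tables
WHERE internal_type_desc LIKE '%TOMBSTONE%';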

Prepare for delete (7): The status column is a combination of bits used by FSGC to determine the state of the row.   A value of 7 in the lower position indicates 'prepare for delete' to the FSGC.

Do physical file deletion (8):  FSGC must transition the entry from a status of 17 or 273 to 18 or 283, which requires that a second checkpoint and log backup take place before the physical file can be removed from disk storage.

Confused yet?  I was, so let me walk you through the activity (checkpoints, log backups and status changes).

  1. Update – Enters new entry with status #7 into tombstone
  2. FSGC runs – can’t do anything yet because no checkpoint/log backup has advanced the checkpoint and backup LSNs (you can see these with dbcc dbtable).  Entries stay at #7 status.
  3. Checkpoint/Log backup executes
  4. FSGC runs – finds the first batch (20 rows) by lowest LSN values and attempts the appropriate actions.  Finds the entry at status #7 with the proper checkpoint/log backup completed and updates the status to #8 so the next FSGC run, after another checkpoint/log backup, can do the physical removal of the file.  (*See below for update details.)
  5. FSGC runs – can’t do anything yet because no checkpoint/log backup has advanced the checkpoint and backup LSNs since the status change.  Entries stay at #8 status.
  6. Checkpoint/Log backup executes
  7. FSGC runs – finds the first batch (20 rows) by lowest LSN values and attempts the appropriate actions.  Finds the entry at status #8 with the proper checkpoint/log backup completed; removes the physical file and deletes the tombstone entry.

* The update of #7 to #8 status is not an in-place update.   The update is a delete (of the old tombstone row) and insert (of the updated row) pairing.    This matters to you because of the following.

FSGC looks at the top 20 rows during each batch.   When it transitions the 1st row (shown above) from #7 to #8 status, the row is added to the end of the tombstone table because it now has the newest LSN.

In this example, if you delete 10,000 rows from the base table you get 10,000 entries in the tombstone table.   The physical file deletion won’t occur until FSGC has first transitioned all 10,000 entries from #7 to #8 status.  Since FSGC runs every ~5 seconds in batches of 20, the math is really 10,000 * 2 = 20,000 tombstone operations (each entry goes through the #7 and then the #8 status change); 20,000 / 20 rows per batch = 1,000 FSGC batches; at ~5 seconds per batch that is ~5,000 seconds, or roughly 83 minutes (assuming proper checkpoints and log backups), to physically remove the original files.

In Summary

Understanding the transition states, batch sizes and requirements for log backups and checkpoints should allow you to better maintain your databases that contain file stream data.

  • Make sure you have regular log backups
  • If the database is inactive you may want to run manual checkpoint(s) – a minimal sequence is sketched after this list
  • Consider sp_filestream_force_garbage_collection on SQL Server 2012 and newer versions
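A minimal sequence to help FSGC advance, assuming a full or bulk-logged recovery model and placeholder database name and backup path:

-- Advance the checkpoint and backup LSNs so FSGC can move entries from #7 to #8
-- and then physically remove the files (run the pair twice, with FSGC running in between).
USE MyFileStreamDb;
CHECKPOINT;
BACKUP LOG MyFileStreamDb TO DISK = N'X:\Backups\MyFileStreamDb.trn';  -- placeholder path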

Bob Dorr - Principal SQL Server Escalation Engineer
     

Categories: SQL Server MS Blogs

I think I am getting duplicate query plan entries in SQL Server’s procedure cache

SQL Server SQL CSS - Thu, 04/03/2014 - 10:12

Before the post dives into the subject I need to point out that Keith did most of the work.  I just kept pestering him with various scenarios until he sent me the e-mail content I needed.   Thanks Keith –

Keith devised a set of steps that you can use to collect information about the plans and their key attributes.  Using these queries you can track down entries in the procedure cache with the same handles and determine which attribute is different, indicating why there appear to be duplicate entries for the same query.    It is often as simple as a SET statement difference.

From: Keith Elmore

Bob asked me to take a quick look and see if I could make some headway on understanding why there appear to be duplicate plans in cache for the same sql_handle and query_hash. In researching this, I found that if you call a procedure that references a temp table created outside of that scope, the cached plan has the session_id as part of the cache key for the plan. From http://technet.microsoft.com/en-us/library/ee343986(v=SQL.100).aspx:

If a stored procedure refers to a temporary table not created statically in the procedure, the spid (process ID) gets added to the cache key. This means that the plan for the stored procedure would only be reused when executed again by the same session. Temporary tables created statically within the stored procedure do not cause this behavior.
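For illustration only, the pattern that triggers this looks roughly like the following (the table, parameter and query text are hypothetical): the temp table is created by the caller, so each session running the same sp_executesql batch gets its own cache entry.

-- Temp table created OUTSIDE the sp_executesql scope...
CREATE TABLE #work (id int PRIMARY KEY, payload varchar(100));

-- ...then referenced inside it: the session id becomes part of the cache key,
-- so identical sql_handle/query_hash values can appear once per session.
EXEC sp_executesql
    N'SELECT COUNT(*) FROM #work WHERE id > @p;',
    N'@p int',
    @p = 10;

DROP TABLE #work;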

Because the customer is invoking this query via sp_executesql and the temp table is created outside of the sp_executesql scope, the above condition applies, and the theory is that this could be causing the larger number of entries even though the sql_handle and query_hash are the same. But in order to confirm this theory we need some additional data. If the customer wants to pursue this, the following queries are what I’d want to run:

1. A single execution of this query from SSMS.

-- Look and see if there is any hash bucket with a large number of entries (> 20)
-- which may cause slower lookup of entries
select p1.*
from sys.dm_exec_cached_plans p1
join (select bucketid, count(*) as cache_entries, count(distinct plan_handle) as distinct_plans
      from sys.dm_exec_cached_plans p
      group by bucketid
      having count(*) > 20) as p2
  on p1.bucketid = p2.bucketid

2. Run the following query from SSMS, which will save all of these "duplicate" queries into a permanent table that we’ll retrieve.

-- Save all of the "duplicate" plans for this specific query in a table in tempdb
select qs.sql_handle, qs.statement_start_offset, qs.statement_end_offset,
       qs.creation_time,
       qs.execution_count,
       qs.plan_generation_num,
       p.*
into tempdb..DuplicateCachePlans
from sys.dm_exec_query_stats qs
join sys.dm_exec_cached_plans p on qs.plan_handle = p.plan_handle
where qs.sql_handle = 0x0200000093281821F68C927A031EDA1B661FC831C10898D0
and qs.query_hash = 0x07BD94E2146FD875   -- binary literal (not a quoted string) so the comparison matches

3. From a command prompt, bcp out the data from the table above, as well as the plan_attributes data for each of these plans (add the appropriate server name with the -S parameter and optionally add a path to where you want the files written; the output filenames are the queryout arguments below)

bcp "select * from tempdb..DuplicateCachePlans" queryout cached_plans.out -n –T

bcp "select p.plan_handle, pa.* from tempdb..DuplicateCachePlans p cross apply sys.dm_exec_plan_attributes (p.plan_handle) as pa" queryout plan_attributes.out -n –T

4. Back in SSMS, you can drop the table created in step 2

drop table tempdb..DuplicateCachePlans

-Keith

 Bob Dorr - Principal SQL Server Escalation Engineer

Categories: SQL Server MS Blogs

SQL Server 2014’s new cardinality estimator (Part 1)

SQL Server SQL CSS - Tue, 04/01/2014 - 12:14

One of the performance improvements in SQL Server 2014 is the redesign of cardinality estimation. The component that does cardinality estimation (CE) is called the cardinality estimator. It is an essential component of the SQL query processor for query plan generation. Cardinality estimates are predictions of the final row count and the row counts of intermediate results (such as joins, filtering and aggregation). These estimates have a direct impact on plan choices such as join order and join type. Prior to SQL Server 2014, the cardinality estimator was largely based on the SQL Server 7.0 code base. SQL Server 2014 introduces a new design, and the new cardinality estimator is based on research into modern workloads and learning from past experience.

A whitepaper planned by the SQL Server product team will document specific scenarios where the new and old cardinality estimators differ. We will follow up with a later blog post when that paper is released. Additionally, Juergen Thomas has posted an overview of the feature on the "Running SAP on SQL Server" blog.

In this blog, we will provide a quick overview of how to control this SQL Server 2014 feature and guidelines for troubleshooting issues. We have plans to release more blog posts related to the SQL Server 2014 new cardinality estimator in the future.

One of the goals for this blog post is to make customers aware of this feature for upgrades and new deployments, as query plans may change. We encourage users to test sufficiently prior to upgrading to avoid performance surprises.

New deployments vs upgrade

SQL Server 2014 uses the database compatibility level to determine whether the new cardinality estimator will be used. If the database compatibility level is 120, the new cardinality estimator will be used. If you create a new database on SQL Server 2014, its compatibility level will be 120. When you upgrade or restore from a previous version to SQL Server 2014, the user database's compatibility level is not updated. In other words, you will continue to use the old cardinality estimator in upgrade and restore situations by default. This is to avoid plan change surprises on upgrade. You can manually change the compatibility level to 120 so that the new cardinality estimator is used, as shown below. Please refer to the online documentation on how to view and change the database compatibility level. Be aware that changing the database compatibility level will remove all existing query plans for the database from the plan cache.
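For example (the database name is a placeholder), switching a database to the new cardinality estimator and verifying the setting might look like this:

-- Switch to the new CE by raising the compatibility level, then confirm it.
ALTER DATABASE MyDb SET COMPATIBILITY_LEVEL = 120;   -- MyDb is a placeholder
SELECT name, compatibility_level FROM sys.databases WHERE name = N'MyDb';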

Please note the following:

  1. Which version of the cardinality estimator is used is based on the current database context where the query is compiled, even if the query references multiple databases (see the sketch after this list). Let's assume you have db1 with compatibility level 120 and db2 with compatibility level 110, and you have a query that references both databases. If the query is compiled under db1, the new cardinality estimator will be used. But if the query is compiled under db2, the old cardinality estimator will be used.
  2. Regarding system databases and upgrade, the compatibility levels of model, msdb and tempdb are changed to 120 following the upgrade to SQL Server 2014.  But the master system database retains the compatibility level it had before upgrade, per the online documentation.  Therefore, in upgrade scenarios, if your query is compiled under the context of master, the old cardinality estimator will be used, but the new cardinality estimator will be used for a query compiled under the context of model, msdb, or tempdb.

  3. If your query references temporary tables, the database context under which the query is compiled determines which version of the cardinality estimator is used. In other words, if your query is compiled under a user database, that user database's compatibility level (not tempdb's) determines which version of the cardinality estimator is used, even though the query references a temp table.
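A quick sketch of note 1, using hypothetical database and table names (db1 at level 120, db2 at level 110):

-- Same query text, different compile context, different cardinality estimator.
USE db1;
SELECT COUNT(*) FROM db2.dbo.SomeTable;   -- compiled under db1: new CE (model 120)
GO
USE db2;
SELECT COUNT(*) FROM db2.dbo.SomeTable;   -- compiled under db2: old CE (model 70)
GO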
How to tell if you are using new cardinality estimator

There are two ways you can tell if new cardinality estimator is used.

In the SQL Server 2014 XML plan, there is a new attribute on StmtSimple called CardinalityEstimationModelVersion. When the value is 120, it means the new cardinality estimator was used. If the value is 70, it means the old cardinality estimator was used. This new XML attribute is only available for SQL Server 2014 and above.
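If you want to scan the plan cache for this attribute rather than open plans one at a time, a rough query (assuming the standard showplan XML namespace) is:

-- List cached plans together with the CE model version recorded in their showplan XML.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT cp.plan_handle,
       qp.query_plan.value('(//StmtSimple/@CardinalityEstimationModelVersion)[1]', 'int')
           AS ce_model_version
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
WHERE qp.query_plan IS NOT NULL;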

If you start capturing the new SQL Server 2014 XEvent called query_optimizer_estimate_cardinality, this event will be produced during compilation if the new cardinality estimator is used. If the old cardinality estimator is used, this XEvent won't be produced even if you enable the capture. We will talk more about how to use this XEvent to help troubleshoot cardinality issues in future blogs.
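A minimal Extended Events session for this event might look like the following; the session name and file target are placeholders.

-- Capture the new CE event to a file target (SQL Server 2014 and later).
CREATE EVENT SESSION TrackNewCE ON SERVER
ADD EVENT sqlserver.query_optimizer_estimate_cardinality
ADD TARGET package0.event_file (SET filename = N'TrackNewCE.xel');
GO
ALTER EVENT SESSION TrackNewCE ON SERVER STATE = START;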

Additional ways to control new cardinality estimator

In addition to the database compatibility level, you can use trace flags 2312 and 9481 to control whether the new or old cardinality estimator is used. Trace flag 2312 forces the new cardinality estimator while 9481 forces the old cardinality estimator, regardless of the database compatibility level setting. If you enable both trace flags, neither flag determines which version of the cardinality estimator is used; instead, the database compatibility level determines it. When such a case occurs, a new XEvent, "query_optimizer_force_both_cardinality_estimation_behaviors", will be raised to warn the user (if you enable this XEvent).

You can enable these trace flags at the server, session or query level. To enable the trace flag at the query level, use the QUERYTRACEON hint documented in KB 2801413. Below is an example query:

select * from FactCurrencyRate where DateKey = 20101201 option (QUERYTRACEON 2312)
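At the session or server level, the same trace flags can be enabled with DBCC TRACEON; a sketch (the -1 argument makes a flag global, so use it with care):

DBCC TRACEON (9481);         -- force the old CE for the current session only
-- DBCC TRACEON (2312, -1);  -- or force the new CE server-wide
DBCC TRACESTATUS (9481);     -- confirm which trace flags are active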

Precedence

Since we have multiple ways to control the behavior, let's talk about the order of precedence. If the query has a QUERYTRACEON hint to disable or enable the new cardinality estimator, it will be respected regardless of the server/session or database level settings. If you have a trace flag enabled at the server or session level, it will be used regardless of the database compatibility level setting.

Guidelines on query performance troubleshooting with new cardinality estimator

When you run into issues with the new cardinality estimator, you have the choice to revert to the old behavior. But we encourage you to spend time troubleshooting the query and find out whether the new cardinality estimator even plays a role in your slow query performance. Basic query performance troubleshooting stays the same.

Statistics

Regardless of the version of the cardinality estimator, the optimizer still relies on statistics for cardinality estimates. Make sure you enable auto update and auto create statistics for the database (a minimal example follows). Additionally, if you have large tables, the auto update statistics threshold may be too high to trigger statistics updates frequently. You may need to schedule jobs to manually update statistics.
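As a small sketch (database and table names are placeholders), the typical settings and a manual refresh look like this:

-- Make sure the automatic statistics options are on, then refresh a large table manually.
ALTER DATABASE MyDb SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE MyDb SET AUTO_UPDATE_STATISTICS ON;
UPDATE STATISTICS dbo.FactCurrencyRate WITH FULLSCAN;   -- or schedule sp_updatestats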

Indexes

You may not have sufficient indexes on the tables involved in the slow query. Here are a few ways you can tune your indexes.

  1. The XML plan will display a missing index warning for a query.
  2. Missing index DMVs. SQL Server tracks potential indexes that can improve performance in DMVs (a sample query is sketched after this list). This blog has sample queries on how to use the DMVs. Additionally, SQL Nexus also has a report on missing indexes server wide.
  3. Database Tuning Advisor (DTA) can be used to help you tune a specific query. Not only can DTA recommend indexes, it can also recommend statistics needed for the query. The auto create statistics feature of SQL Server doesn't create multi-column statistics, but DTA can identify and recommend multi-column statistics as well.
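A commonly used shape for a missing index DMV query is sketched below; treat the output as suggestions to evaluate, not indexes to create blindly.

-- Rank missing index suggestions by a rough impact score.
SELECT TOP (25)
       mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks,
       migs.avg_user_impact
FROM sys.dm_db_missing_index_groups mig
JOIN sys.dm_db_missing_index_group_stats migs
  ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details mid
  ON mid.index_handle = mig.index_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;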
Constructs not significantly addressed by the new cardinality estimator

There are a few constructs that are known to have cardinality estimate issues but are not addressed by the new cardinality estimator. Below are a few common ones.

  1. Table variables. You will continue to get a low estimate (1 row) for table variables. This issue is documented in a previous blog.
  2. Multi-statement table valued functions (TVFs): A multi-statement TVF will now get an estimate of 100 rows instead of 1 in earlier versions. But this can still cause issues if your TVF returns many rows. See the blog for more details.
  3. The behaviors of table valued parameters (TVPs) and local variables are unchanged. The number of rows in the TVP at compile time is used for the cardinality estimate, regardless of whether the row count changes for future executions. Local variables will continue to be optimized for unknown.
References
  1. New cardinality estimator online documentation.
  2. Juergen Thomas's blog on New cardinality estimator and SAP applications

In future blogs, we will document more about how to use the new XEvent query_optimizer_estimate_cardinality to troubleshoot query plan issues and how plan guides may be used to control the new cardinality estimator behavior.

Many thanks to Yi Fang, a Senior Software Design Engineer on the SQL Server Query Processor team at Microsoft, for reviewing and providing technical details for this blog.

 

Jack Li - Senior Escalation Engineer and Bob Ward - Principal Escalation Engineer, Microsoft SQL Server Support

Categories: SQL Server MS Blogs

Don’t Rely On a Static IP Address for Your SQL Database

SQL Server SQL CSS - Thu, 03/27/2014 - 13:34

I’ve seen a number of customers open support incidents because they couldn’t connect to their SQL Database server which was ultimately due to the incorrect assumption that the server’s IP address is static. In fact, the IP address of your logical server is not static and is subject to change at any time. All connections should be made using the fully qualified DNS name (FQDN) rather than the IP address.

The following picture from the Windows Azure SQL Database Connection Management Technet article shows the network topology for a SQL Database cluster.

Your logical server (e.g., with an FQDN of xyz.database.windows.net) resides on a SQL Database cluster in one of the backend SQL Server nodes. Within a given region (e.g., North Central US, South Central US, North Europe, etc.) there are generally many SQL Database clusters, as required to meet the aggregate capacity of all customers.  All logical servers within a cluster are accessed through the network load balancer (the single blue block with the note saying “Load balancer forwards ‘sticky’ sessions…” in the diagram) via a virtual IP address.

If you do a reverse name lookup from your server’s IP address you will actually see the name of the cluster load balancer. For example, if I try to ping one of my servers (whose actual server name starts with ljvt in the screenshot below) you will see that the displayed name associated with the IP address is instead data.sn3-1.database.windows.net, where the sn3-1 portion of the name maps to the specific cluster in the region (South Central) hosting this server.

Microsoft may do an online migration of your logical server between clusters within a region, load balancing capacity across the clusters within the region. This move is a live operation and there is no loss of availability to your database during the operation. When the migration completes, existing connections to your logical server are terminated and upon reconnecting via fully qualified domain name your app will be directed to the new cluster.  However, if your application caches or connects by IP address instead of FQDN then your connection attempts will fail.

A migration moves all of your settings, including any SQL Database firewall rules that you have.  Consequently there are no Azure-specific changes that are required in order to connect.  However, if your on-premises network infrastructure blocks/filters outgoing TCP/IP traffic to port 1433—the port used for SQL connections—and you had it restricted to a fixed IP address then you may need to adjust your client firewall/router.  The IP address of your SQL Database server will always be a part of the address ranges listed in the Windows Azure Datacenter IP Ranges list.  You should allow outgoing traffic for port 1433 to these address ranges rather than a specific IP address.

Keith Elmore – Principal Escalation Engineer

Categories: SQL Server MS Blogs

The Data Science Laboratory Series is Complete

Bucky Woody - Mon, 03/24/2014 - 05:27

I wrote a series of articles on creating a Data Science Laboratory over on Simple-Talk - you can find the complete list of articles below. The series covers installing various software tools and packages on a Virtual Machine running the Windows operating system. I think there's no substitute for installing, configuring and experimenting with various systems to learn how they work, and when you should use them.

Just having tools isn't enough, of course - you should combine these tools with a solid understanding of data science, including statistics, programming, and analysis. With what you'll find in these articles, you should be able to implement that learning on your own system.

 

Categories: SQL Server MS Blogs

Cumulative Update #16 for SQL Server 2008 SP3

SQL Server Release Blog - Mon, 03/17/2014 - 16:46
Dear Customers, The 16th cumulative update release for SQL Server 2008 Service Pack 3 is now available for download at the Microsoft Support site. Cumulative Update 16 contains all the hotfixes released since the initial release of SQL Server 2008...(read more)
Categories: SQL Server MS Blogs

Cumulative Update #9 for SQL Server 2012 SP1

SQL Server Release Blog - Mon, 03/17/2014 - 16:45
Dear Customers, The 9th cumulative update release for SQL Server 2012 SP1 is now available for download at the Microsoft Support site. Cumulative Update 9 contains all the SQL Server 2012 SP1 hotfixes which have been available since the initial release...(read more)
Categories: SQL Server MS Blogs

SQL Nexus 4.0 Released to codeplex

SQL Server SQL CSS - Tue, 03/11/2014 - 13:07

 

We have just released SQL Nexus 4.0 (https://sqlnexus.codeplex.com/), which supports the latest SQL Server (2012) with enhanced reports.

In addition to reading the release notes, make sure you also read the top issues. Please report any issues on the Issues page at https://sqlnexus.codeplex.com/.

New Release Notes (4.0.0.64):

You must meet the following requirements:

Categories: SQL Server MS Blogs

SQL Server 2008 R2 SP2 Cumulative update #11

SQL Server Release Blog - Wed, 03/05/2014 - 15:46
Dear Customers, The 11th cumulative update release for SQL Server 2008 R2 SP2 is now available for download at the Microsoft Support site. Cumulative Update 11 contains all the SQL Server 2008 R2 SP2 hotfixes which have been available since the initial...(read more)
Categories: SQL Server MS Blogs

sp_reset_connection – Rate Usage (Don’t fight over the grapes)

SQL Server SQL CSS - Mon, 03/03/2014 - 09:53

Discussions surrounding sp_reset_connection behavior and usage come up all the time, and over the years I have discussed the subject hundreds of times with customers.  Blogs, API documentation and Books Online describe SQL Server pooled connection behavior.

Under the covers SQL Server uses the sp_reset_connection logic to ‘reset’ the connection state for the SQL Server, which is faster than establishing a completely new connection.   Older drivers send the procedure call as a separate TDS round trip to the SQL Server.   Newer client drivers add a flag bit along with the next command, avoiding the extra network round trip.

The discussions quickly turn to: “What is a reasonable rate of sp_reset_connections on my server?” As usual, the answer is: it depends.

However, a cursory inspection of the rate usually reveals either “I am not that worried” or “my stomach hurts.”  The simple, old smell test works pretty well for getting a high-level understanding of your system.  Then you can work with your application developers to tune the behavior accordingly.

The documentation refers to the concepts of Open Late and Close Early, meaning you open the connection right before you need it and you release the connection as soon as it is no longer needed.   This allows the connection pool to work optimally by sharing the connection whenever possible.

The problem I most often see is that this behavior is taken to the extreme.   The development team often cookie-cutters the functional logic.

MyFunction
{
   Get Connection
   Run Query
   Release Connection
}

The function uses Open Late / Close Early just like the documentation points out.  Now imagine you have dozens or hundreds of these functions in the application.  The problem is that a logical business activity seldom calls a single function.

LoadMyPage
{
    Call Func1
    Call Func2
    Call Func3
}

In this example the application drives the connection pool 3 times, resulting in 3 sp_reset_connection operations.   This is the worst-case scenario: a 1:1 ratio of commands to sp_reset_connection invocations.

Don’t fight over the grapes

You may have wondered why this was in the title.  It is because I was recently sitting in an airport and two little girls gave me an analogy for sp_reset_connection.  

They looked to be about 3 or 4 years old and were sharing a bag of grapes.   They started out very polite, each taking just one grape at a time.   However, they kept waiting on each other and as time went on they got a bit more combative until the mother finally told them not to fight over the grapes.   Instead of taking one at a time take a handful.  Then they would not be constrained all the time waiting to get access to the bag of grapes. 

If you will, she optimized their activity.

The grapes analogy was perfect.   If each command in the application acquires a connection, executes, and releases, you are placing pressure and resource constraints on the connection pool.  Each time sp_reset_connection executes it uses resources on the SQL Server and the client.

It is far better to write the application logic to avoid contention points, and align with logical units of work.   This maintains the concepts of Open Late and Close Early while reducing overhead and improving performance.

LoadMyPage
{
    conn = Get Connection
      Call Func1(conn)
      Call Func2(conn)
      Call Func3(conn)
    Release conn
}

In this example the connection spans the logic to load the page.  The connection to command ratio goes from 1:1 to 1:3 and in doing so removes 2 of the sp_reset_connection activities.

Simply put, you need to find a healthy sp_reset_connection-to-command ratio for your environment.   I can tell you that 1:1 is poor, and the applications that I see functioning well are usually in the 1:8, 1:10, 1:15 range.

There is not a hard and fast rule, but using performance monitor you can quickly compare the overall batch rate to the reset connection rate (a rough T-SQL sampling approach is sketched below).   When I see the rate start to climb above 15% it is a pretty good indication that the application may need to be revisited and tuned a bit.
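If you prefer T-SQL over the performance monitor UI, the following sketch assumes the 'Batch Requests/sec' and 'Connection Reset/sec' counter names exist on your build; both are cumulative counters, so two samples are compared.

-- Sample the cumulative counters twice, 30 seconds apart, and compute per-second rates.
DECLARE @first TABLE (counter_name nvarchar(128), cntr_value bigint);

INSERT @first (counter_name, cntr_value)
SELECT RTRIM(counter_name), cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec', N'Connection Reset/sec');

WAITFOR DELAY '00:00:30';

SELECT RTRIM(pc.counter_name) AS counter_name,
       (pc.cntr_value - f.cntr_value) / 30.0 AS per_second_rate
FROM sys.dm_os_performance_counters pc
JOIN @first f ON f.counter_name = RTRIM(pc.counter_name)
WHERE pc.counter_name IN (N'Batch Requests/sec', N'Connection Reset/sec');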

It is true that the connection reset may not be driving your CPU load (sp_reset_connection has been tuned and is lightweight in general) compared to the overall work done by the TSQL commands executing on the SQL Server.   You should think of this more as a gate than a CPU consumer.   Before the command that does the work you need can execute, the reset must complete.  While the delay is small in wall-clock time, the overall performance of the application could be better with strategic use of the reset behavior.

The load page examples above will simply run faster with the 1:3 ratio because it avoids the 2 trips through the pooled connection logic.

With all this said, you need to be careful that you don’t extend the ratio too far.   Keeping the connection when it is not needed will increase the overall number of connections using more client and SQL Server overhead.   You need to find the sweet spot that optimizes the client and SQL Server resources and maximizes the application performance capabilities.

You may also consider recent changes that reduce the overhead of sp_reset_connection at the SQL Server:

http://support.microsoft.com/kb/2926217 

The following are additional references pertaining to the sp_reset_connection subject:

http://blogs.msdn.com/b/psssql/archive/2010/12/29/how-it-works-error-18056-the-client-was-unable-to-reuse-a-session-part-2.aspx

http://blogs.msdn.com/b/psssql/archive/2013/02/13/breaking-down-18065.aspx

http://support.microsoft.com/kb/180775

Bob Dorr - Principal SQL Server Escalation Engineer

Categories: SQL Server MS Blogs

Cloud projects and cross-country moves

Bucky Woody - Thu, 02/27/2014 - 08:46

I tend to take real-life events and correlate them to computational scenarios. It’s a hazard of being a technical professional, I suppose.

In the United States, it’s kind of a law that you have to move your family across the country at least once in your life. I’ve done it a lot – when I was young, then when I was in the military (even cross-continent moves a few times) and recently I decided to move again. I loaded up a U-Haul trailer, hitched it to my pickup truck, put the two cats in pet carriers, opened the door for my wife and drove from Covington, Washington, to Safety Harbor, Florida – a distance of 3,723.2 miles.

The area surrounding my Washington home is near the Cascade mountain range, and our circumstances dictated that we move in the winter. We decided against going directly across the Snoqualmie Pass on Interstate 90, as that would take us across the plains states, which were (and are, as of this writing) undergoing one of the worst winters in some time. Instead, we drove south along Interstate 5 to California, then at Los Angeles took Interstate 10 through Arizona, New Mexico, Texas, Louisiana, Mississippi, Alabama, and then around the panhandle of Florida to Tampa.

The trip was fraught with peril, and we had many adventures along the way. Those adventures map quite well to software projects in general, and cloud computing software projects in specific.

Lesson 1: Planning is essential, plans are useless

We planned very carefully. I’m a bit of a “Type-A” personality, so I tend to over-plan…well, everything. I planned the routes. I planned the emergency routes. I planned what we would do if something took more time than we thought, and what to do if it took less time than we thought. We planned for everything from snacks to CD’s, from MiFi devices to in-car converters for power.

And of course almost nothing went to plan. The first three hours outside of Seattle, near Portland Oregon, we hit an ice-storm – the biggest they’d had in decades. The (highly overloaded) trailer hit a patch of ice, and took the truck across three lanes of traffic into a guardrail. I managed to patch up the truck (on the Interstate, no less) to a drivable state and then limped to the nearest hotel. They closed Interstate 5, because they had 600 wrecks between Portland and Salem. We got stuck for 3 days at that hotel until the roads cleared. Our plans and timelines evaporated.

Cloud projects can be like that. If you’re designing one for the first time, you’re headed in an unknown (for you) area. But that shouldn’t prevent you from planning. You see, planning – even when it doesn’t work out like you thought it would – leads to preparation, and preparation will help you pivot when things don’t go as you want. When you’re creating your specifications, do a “pre-mortem” – an exercise where you imagine the project failing, and what you’ll do if it does. It will help you identify processes and procedures you can fall back on.

Lesson 2: Prepare well – you’ll probably need the emergency stuff

As the truck careened across the highway, I was glad that we had packed the way we did. The cats were secure, we had our seatbelts on, we were moving pretty slowly. As wrecks go, this was a mild one. I had put two things in the (very tight) cab space: tools, and orange safety vests (we wear them when we ride our motorcycles). We needed both, three hours into the trip.

After we finally got out of Portland, we crossed many mountain ranges. The truck was overloaded as well as the trailer (did I mention that?) and we had to move VERY slowly up the mountains, even with a full-sized V8 engine. 18-wheel trucks were passing me. It was embarrassing.

The next day we got up in Sacramento, California, where the roads are essentially made from pot-holes so deep you need to turn your headlights on to get out of them, periodically connected with thin strands of asphalt. The trailer, being a bit overloaded, snapped the ball of the trailer hitch off, and since there was no shoulder (California roads are not designed for anyone to have any problems on them) I had to:

  • Position Marjorie with her orange vest on behind the truck to wave rush traffic over so as not to wreck the truck any further
  • Run across the highway
  • Climb a 6’ fence
  • Run to an “Ace Hardware” store, startling the staff who watched me vault a fence and run into their establishment
  • Purchase a replacement ball (they had one, and only one, but thank heavens they had one)
  • Run back, vault the fence
  • Jack up the trailer
  • Replace the ball on the hitch
  • Micro-back up the truck multiple times underneath the (fully laden, un-movable) trailer (Marjorie was a little busy and could not help me)
  • Lower the jack
  • Retrieve Marjorie (who was a really good sport about all this, considering. Well, at least we’re speaking again. Mostly.)
  • Re-enter rush-hour traffic
  • Continue on to Los Angeles, hitting it at rush hour instead of the planned noon

The correlation here to software projects here is that you should invest in, spend time on, and have ready the emergency things you need in case the project goes wrong. Not just the whole thing, but each part. It took up valuable space to have the tools in the cab, and orange vests are kind of…weird looking. But having them saved us, quite literally. You should think the steps through and put systems into place if things go wrong – retry logic, local caching in case the line is down, and anything else you can think of. Odds are you’ll need them.

Lesson 3: Everything takes longer and costs more than you thought

It goes without saying (but of course I’ll say it anyway) that our schedule was blown to bits. What I had planned to take 5 days took 8. I had to drive slowly (we were overloaded heavily) and of course those pesky ice-storms, bad roads, and many other factors made my average speed 43.5 miles per hour. Even with driving 14-16 hour days, it was really delayed. Interestingly, because of the very weather we were trying to avoid, trucking and other traffic had been re-routed to – wait for it – Interstate 10. Essentially it was a 5-day rush hour, all day, every day.

I didn’t know we would stay in the hotel for three days, need to buy chains for the truck and the trailer, fix a ball, get 10.7 miles per gallon (we were a bit overloaded so that trashed my planned gas mileage) or the other expenses we incurred. But because we had planned for things to go wrong, I had the money handy. I even had more cash than Marjorie normally allows me to carry in case we had to use that to get what we needed.

In software, cost and time estimation is really hard to get right. In the case of something new, make your best generous estimate, and then double it. Don’t let people pressure you into shortening it – you’ll regret it later. No, this isn’t to say cloud projects are costly or risky – this holds true for any project, and yet I see the same mistakes being made in this area over and over.

Lesson 4: Keep a positive, long-term view

So what have you learned from my tale? Not to move yourself across country in the winter with a heavily overloaded truck and trailer? Perhaps. But even with the extended time and cost, it was cheaper and faster for us to do it ourselves – we did the numbers, and with all the problems it was still 1/5th the cost of having someone else do it and us flying the cats with us home. And since Marjorie is talking to me again, we’re closer. (Actually, she was amazing the whole trip, and super-supportive of her risk-loving husband. We talked the whole time, mostly “LOOK OUT IS THAT AN ICE PATCH SLOW DOWN THAT WALL IS REALLY CLOSE YOU MISSED THE EXIT I WANT OFF”)

You might think you shouldn’t embark on a cloud project – it’s too new, too risky, and there will be too many problems. But you’ll miss out. And your company will miss out. So plan. Get the truck – maybe don’t put as many requirements on your project as you might – those roads can be plenty icy. But when you see the sun set over the water as you pull into your new driveway in 85-degree winter weather, you’ll be glad you did it.

Categories: SQL Server MS Blogs

Update on the Service Pack plans for SQL Server.

SQL Server Release Blog - Thu, 02/13/2014 - 18:38
Dear Customers, We would like to share some updates on our planning for SQL Server Service Packs. We have seen the feedback on Connect (feedback IDs 814658, 814656, 814600) and the number of votes they each have received. We will continue...(read more)
Categories: SQL Server MS Blogs

Can’t Connect to SQL when provisioning a Project App in SharePoint

SQL Server SQL CSS - Fri, 02/07/2014 - 08:25

The customer's issue was that they were trying to provision a Project site within the Project SharePoint Application. This was done via a PowerShell script that they ran on one of the SharePoint App Servers.

They had two SharePoint App Servers – AppServerA and AppServerB. They had indicated that the provisioning would fail on either App Server and it started failing around November of last year (4 months ago). The error that they would see when the failure occurred was the following from the SharePoint ULS Logs:

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Database        880i        High        System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Database        880j        High        SqlError: 'A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)' Source: '.Net SqlClient Data Provider' Number: 53 State: 0 Class: 20 Procedure: '' LineNumber: 0 Server: ''        f5009e1d-12cd-4a70-a0af-f0400acf99e6

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Database        tzkv        High        SqlCommand: 'CREATE PROCEDURE dbo.MSP_TimesheetQ_Acknowledge_Control_Message @serverUID UID , @ctrlMsgId int AS BEGIN IF @@TRANCOUNT > 0 BEGIN RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1) RETURN END DECLARE @lastError INT SELECT @lastError = 0 UPDATE dbo.MSP_QUEUE_TIMESHEET_HEALTH SET LAST_CONTROL_ID = @ctrlMsgId WHERE SERVER_UID = @serverUID SELECT @lastError = @@ERROR Exit1: RETURN @lastError END ' CommandType: Text CommandTimeout: 0        f5009e1d-12cd-4a70-a0af-f0400acf99e6

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Provisioning        6935        Critical        Error provisioning database. Script: C:\Program Files\Microsoft Office Servers\14.0\Sql\Project Server\Core\addqueue1timesheetsps12.sql, Line: 0, Error: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server), Line: CREATE PROCEDURE dbo.MSP_TimesheetQ_Acknowledge_Control_Message @serverUID UID , @ctrlMsgId int AS BEGIN IF @@TRANCOUNT > 0 BEGIN RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1) RETURN END DECLARE @lastError INT SELECT @lastError = 0 UPDATE dbo.MSP_QUEUE_TIMESHEET_HEALTH SET LAST_CONTROL_ID = @ctrlMsgId WHERE SERVER_UID = @serverUID SELECT @lastError = @@ERROR Exit1: RETURN @lastError END .        f5009e1d-12cd-4a70-a0af-f0400acf99e6

02/05/2014 10:14:32.89        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Provisioning        6971        Critical        Failed to provision site /CMS with error: Microsoft.Office.Project.Server.Administration.ProvisionException: Failed to provision databases. ---> Microsoft.Office.Project.Server.Administration.ProvisionException: CREATE PROCEDURE dbo.MSP_TimesheetQ_Acknowledge_Control_Message @serverUID UID , @ctrlMsgId int AS BEGIN IF @@TRANCOUNT > 0 BEGIN RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1) RETURN END DECLARE @lastError INT SELECT @lastError = 0 UPDATE dbo.MSP_QUEUE_TIMESHEET_HEALTH SET LAST_CONTROL_ID = @ctrlMsgId WHERE SERVER_UID = @serverUID SELECT @lastError = @@ERROR Exit1: RETURN @lastError END ---> System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) at

One thing they had mentioned was that if they increased the Connection Timeout to 60 seconds, it would sometimes work. My thought process on this was that if increasing the connection timeout sometimes allowed it to work, we may have had a timeout when actually connecting to SQL Server – but that wasn't the error.

Looking at the actual error we can draw some conclusions.

provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server

By default, we should be using TCP. If there is a serious error with that, we will fall back to Named Pipes. The error Named Pipes got back was that we couldn't open the connection. Not a timeout. Think of this as "SQL Server does not exist or access denied". SQL Server in this case was also a default instance cluster, not a named instance, so SQL Browser was not coming into the picture. This is a straight shot to port 1433 via TCP.

Which machine was getting the error?

For troubleshooting, we need to consider which machines are involved. One thing that we noticed over the course of troubleshooting was that the error always occurred on AppServerB and we were always starting the script from AppServerA. If you think about how SharePoint works with its App Servers, when a service is running, you can have it started on individual App Servers and control the load.  The fact that we were always seeing the error on AppServerB led me to believe that the Project Application Server Service was only started on AppServerB and not AppServerA.  Looking in Central Admin, this was correct.  So, we want to concentrate data collection from AppServerB.

Network Traces

The first thing we looked at was a network trace. We collected network traces from AppServerB and the SQL Server.  If we go back to the error that was happening, we recall that TCP was not working as expected and then Named Pipes was failing.  Named Pipes uses the SMB protocol to talk.  This will first reach out to TCP port 445.  We didn't see any traffic in the network trace going to that port.  We also didn't see any SMB traffic that was relevant to the error.  We only saw browser announcements, which had nothing to do with us.  This tells me that we never hit the wire.  So, the network traces weren't going to be helpful.

BIDTrace

Enter BIDTrace.  BIDTrace is really just diagnostic logging within our client providers and the server SNI stack.  Think Event Tracing for Windows (ETW).  I'm not going to dive into how to set this up as it would take its own blog post.  You can read more about it on the following MSDN page:

Data Access Tracing in SQL Server 2012
http://msdn.microsoft.com/en-us/library/hh880086.aspx

Typically I won’t go this route unless I know what I’m looking to get out of it.  It is actually pretty rare that I’ll jump to this.  In this particular case, it was an excellent fit.  We have some evidence that we are not getting far enough to hit the wire, and we know we are getting an error when trying to make a connection to SQL.  So, what I’m looking for here is whether there is some Windows error that we are getting that wasn’t surfaced in the actual exception.

Here is the Logman command that I used to start the capture after getting the BIDTrace items configured.

Logman start MyTrace -pf ctrl.guid -ct perf -o Out%d.etl -mode NewFile -max 150 -ets

A few things I’ll point out with this command.  The output file has a %d in it.  This is a format string because we will end up with multiple files.  -mode is used to tell it to create a new file after hitting the max size that is listed.  We then set -max to 150, which means that we want to cap the size of each file at 150MB.  I did this because when we first went for it with a single file, the ETL file was 300MB and when I went to convert it to text it was over 1GB in size.  That’s a lot to look through.  I also had trouble opening it.  So, I decided to break it up.  Of note, it took about 4-5 minutes to reproduce the issue.  That’s a long time to capture a BIDTrace.  When you go to capture a BIDTrace, it is better to capture a small window if you can.  These files fill up fast.

Here is the ctrl.guid that I used to capture.  This is effectively the event providers that I wanted to capture:

{8B98D3F2-3CC6-0B9C-6651-9649CCE5C752}  0x630ff  0   MSDADIAG.ETW
{914ABDE2-171E-C600-3348-C514171DE148}  0x630ff  0   System.Data.1
{C9996FA5-C06F-F20C-8A20-69B3BA392315}  0x630ff  0   System.Data.SNI.1

The capture will produce ETL files which are binary files.  You need to convert them after you are done.  I use TraceRPT to do this.  It is part of Windows.  Here is the command I used to output it to a CSV file to look at.

TraceRPT out5.etl -of CSV

In our case, it had generated 5 ETL files – remember the %d?  So, we grabbed the last file that was produced, out5.etl, and converted it.  Although at first I didn’t know it was out5.etl; I actually started with out4.etl.  One problem, though, was that I didn’t have timestamps within the CSV output.  I had CPU clock time, which is hard to visualize compared to an actual timestamp.

Enter Message Analyzer!  Message Analyzer is a replacement for Network Monitor, but it has another awesome ability in that it can open ETL files.  One other thing I had was the timestamp of the error from the SharePoint ULS Log for the attempt that we made when we captured the BIDTrace.

02/06/2014 13:14:20.55     OWSTIMER.EXE (0x2024)                       0x1C4C    Project Server                    Database                          880i    High        System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)     at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)     at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)     at System.Data.SqlClient.TdsParser.Connect(ServerInfo serverInfo, SqlInternalConnectionTds connHandler, Boolean ignoreSniOpenTimeout, Int64 timerExpire, Boolean encrypt, Boolean trustServerCert, Boolean integratedSec...    bc7aaa60-93fc-4873-8f75-416d802aa55b

02/06/2014 13:14:20.55     OWSTIMER.EXE (0x2024)                       0x1C4C    Project Server                    Provisioning                      6993    Critical    Provisioning '/Test3': Failed to provision databases. An exception occurred: CREATE PROCEDURE dbo.MSP_TimesheetQ_Get_Job_Count_Simple   @correlationID UID ,    @groupState int ,    @msgType int  AS BEGIN    IF @@TRANCOUNT > 0    BEGIN              RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1)       RETURN    END     SELECT COUNT(*) FROM dbo.MSP_QUEUE_TIMESHEET_GROUP        WHERE CORRELATION_UID = @correlationID       AND   GRP_QUEUE_STATE = @groupState       AND   GRP_QUEUE_MESSAGE_TYPE = @msgType END .    bc7aaa60-93fc-4873-8f75-416d802aa55b

Our issue occurred at 1:14:20.55 Server Time. We can also see the statement it was going to try and run.  If we open the ETL file within Message Analyzer, we can see the timestamps that are covered within the file. 

We can see that this went up to 12:14:04 local time.  We were looking for 12:14:20.55.  So, out4.etl was not the file I was looking for, which left out5.etl.  Technically you can read the data within Message Analyzer, as you can see from the lower right of the screenshot.  It’s Unicode data, and we see l.e.a.v.e.  I still prefer the output from TraceRPT when going to CSV, as I can get the readable text from that.  It is just a little easier to work with.

So, I have the CSV output from out5.etl, but what do we look for?  Well, we know the statement that it was trying to run, so let's look for that - MSP_TimesheetQ_Get_Job_Count_Simple. We get a hit, and it looks like this:

System.Data,      TextW,            0,          0,          0,          0,         18,          0, 0x0000000000000000, 0x00002024, 0x00001C4C,                    0,             ,                     ,   {00000000-0000-0000-0000-000000000000},                                         ,   130361840460213871,       7080,      21510,        2, "<sc.SqlCommand.set_CommandText|API> 4187832#, '"
System.Data,      TextW,            0,          0,          0,          0,         18,          0, 0x0000000000000000, 0x00002024, 0x00001C4C,                    0,             ,                     ,   {00000000-0000-0000-0000-000000000000},                                         ,   130361840460213910,       7080,      21510,        2, "CREATE PROCEDURE dbo.MSP_TimesheetQ_Get_Job_Count_Simple   @correlationID UID ,    @groupState int ,    @msgType int  AS BEGIN    IF @@TRANCOUNT > 0    BEGIN              RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1)       RETURN    END     SELECT COUNT(*) FROM dbo.MSP_QUEUE_TIMESHEET_GROUP        WHERE CORRELATION_UID = @correlationID       AND   GRP_QUEUE_STATE = @groupState       AND   GRP_QUEUE_MESSAGE_TYPE = @msgType END "
System.Data,      TextW,            0,          0,          0,          0,         18,          0, 0x0000000000000000, 0x00002024, 0x00001C4C,                    0,             ,                     ,   {00000000-0000-0000-0000-000000000000},                                         ,   130361840460213935,       7080,      21510,        2, "' "

Not the prettiest, but when looking in notepad or some other text reader, we can just go over to the right to get a better view.

The first time you look at this it can be a little overwhelming, especially if you aren’t familiar with how SNI/TDS works.  If we go through the results, we’ll see a few interesting things.

<prov.DbConnectionHelper.ConnectionString_Set|API> 4184523#, 'Data Source=<server>;Initial Catalog=<database>;Integrated Security=True;Pooling=False;Asynchronous Processing=False;Connect Timeout=15;Application Name="Microsoft Project Server"' "

<GetProtocolEnum|API|SNI>

<Tcp::FInit|API|SNI>

<Tcp::SocketOpenSync|API|SNI>

<Tcp::SocketOpenSync|RET|SNI> 10055{WINERR}

<Tcp::Open|ERR|SNI> ProviderNum: 7{ProviderNum}, SNIError: 0{SNIError}, NativeError: 10055{WINERR} <-- 10055 = WSAENOBUFS

<Np::FInit|RET|SNI> 0{WINERR}

<Np::OpenPipe|API|SNI> 212439#, szPipeName: '\\<server>\PIPE\sql\query', dwTimeout: 5000

<Np::OpenPipe|ERR|SNI> ProviderNum: 1{ProviderNum}, SNIError: 40{SNIError}, NativeError: 53{WINERR} <-- ERROR_BAD_NETPATH = network path was not found

We can get the connection string, which was also available in the SharePoint ULS Log.  We will also see some entries around protocol enumeration.  This is where we look at the client registry items to see which protocols we will go through and in what order (TCP, NP, LPC, etc.).  Then we see TCP trying to connect.  You’ll recall I mentioned that we try TCP first by default.  We then see that this received a Windows error of 10055 (WSAENOBUFS).  We then see Named Pipes fail with error 53, which is ERROR_BAD_NETPATH.  We got what we were looking for out of the BIDTrace.

WSAENOBUFS is the key here.  It is a WinSock error which we actually have a KB article on:

When you try to connect from TCP ports greater than 5000 you receive the error 'WSAENOBUFS (10055)'
http://support.microsoft.com/kb/196271

There is a registry key called MaxUserPort which can increase the number of dynamic ports that are available.  In Windows 2003, the default upper bound was under 5000.  Starting in Windows 2008, this was increased because we used to see a lot of problems here, especially when connection pooling was not being used.  Here is the port range on my Windows 8.1 machine.

And for a Windows 2008 R2 Server, which the customer was using:

I have 64510 ports available.  On the customer's machine, they had mentioned that for a previous issue an engineer had asked them to add this registry key, and they set the value to 4999.  By setting it to 4999, we are effectively limiting the number of ports that would have otherwise been available.  If you look back at the connection string, you can see that Pooling was set to False.  This means we are turning off connection pooling, and every time we go to connect, we will establish a new hard connection.  This eats up a port.  You can look at NETSTAT to see what it looks like.  We did that when running the provisioning scripts and we saw it get up to around 3000 or so before it was done.  You will also see a lot of ports in a TIME_WAIT status.  When you disconnect and the port is released, it will go into a TIME_WAIT state for a set amount of time, the default of which is around 4 minutes.  That's 4 minutes you can't use that port.  If you are opening and closing connections a lot, you will run out of ports because a lot will be in the TIME_WAIT state.  That's typically when we would bump up the number of ports using the MaxUserPort registry key.  However, this is never really a fix; you are just putting a band-aid on without understanding the problem.
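From the SQL Server side, you can watch the non-pooled connections pile up from a given client with a quick DMV query; this is only a sketch, and it shows established connections (it will not show client-side ports sitting in TIME_WAIT).

-- Count current connections per client machine and application.
SELECT c.client_net_address,
       s.host_name,
       s.program_name,
       COUNT(*) AS connection_count
FROM sys.dm_exec_connections c
JOIN sys.dm_exec_sessions s ON s.session_id = c.session_id
GROUP BY c.client_net_address, s.host_name, s.program_name
ORDER BY connection_count DESC;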

End result…

In our case, Project Server was turning off connection pooling.  I don't know why it does that, but that, in conjunction with MaxUserPort being set to 4999, was causing this issue.  We removed the MaxUserPort registry key and rebooted AppServerB, and it started working after that.  Of note, we also started the Project Application Server on AppServerA and cleaned up the TCP registry keys on that machine as well, so that they could effectively balance their load across the SharePoint App Servers.

 

Adam W. Saxton | Microsoft SQL Server Escalation Services
http://twitter.com/awsaxton

Categories: SQL Server MS Blogs

As The World Turns: SQL Server NUMA Memory Node and the Operating System Proximity

SQL Server SQL CSS - Tue, 02/04/2014 - 13:56

It felt a bit like ‘As The World Turns’ as I unraveled how the following worked, so the title is a bit of a tribute to my grandmother.  She could not miss her ‘stories’ in the afternoon.

Proximity

Before I dive into the details I would like to talk about NUMA node proximity.  Proximity describes how close one NUMA node is to another, based on the memory layout of the system.

For example, Node 00 is 0 steps from itself, but it is 3 steps from Node 03.  Node 01 is 2 steps from Node 03, and so forth, as the proximity matrix below shows.

Node   00   01   02   03
 00    00   01   02   03
 01    01   00   01   02
 02    02   01   00   01
 03    03   02   01   00

K-Group (Processor Group)

Windows attempts to gather the nodes, based on proximity, into the same K-Group.
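Windows’ actual grouping logic is more involved than this (a K-Group is also capped at 64 logical processors), but as a toy illustration, here is a small Python sketch that greedily pairs up the closest nodes from the 4-node proximity matrix above; the group size of 2 is an arbitrary assumption for the example.

# Proximity (distance) matrix from the 4-node example above:
# PROXIMITY[i][j] is how many "steps" node i is from node j.
PROXIMITY = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]

def group_by_proximity(matrix, group_size=2):
    """Toy greedy grouping: repeatedly pull the closest remaining nodes together."""
    remaining = list(range(len(matrix)))
    groups = []
    while remaining:
        seed = remaining.pop(0)
        # Sort the rest by distance from the seed node and keep the nearest ones.
        nearest = sorted(remaining, key=lambda n: matrix[seed][n])[:group_size - 1]
        for n in nearest:
            remaining.remove(n)
        groups.append([seed] + nearest)
    return groups

print(group_by_proximity(PROXIMITY))  # [[0, 1], [2, 3]]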

Examples

Here is a snippet from the SQL Server error log of a 128-CPU, 8-node system.

Notice nodes 0,1,2 and 3 are aligned to K-Group = 0.

Notice the change in node to group alignment on this same system.
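You do not have to dig through the error log to see the current alignment; sys.dm_os_nodes exposes the memory node and processor group for each SQLOS node.  A quick sketch of querying it from Python (pyodbc, the driver name, and the placeholder connection string are assumptions; the same SELECT works fine from SSMS):

import pyodbc  # assumes the Microsoft ODBC Driver for SQL Server is installed

# Placeholder connection string; point it at your instance.
CONN_STR = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<server>;Database=master;Trusted_Connection=yes;"
)

QUERY = """
SELECT node_id, memory_node_id, processor_group,
       online_scheduler_count, node_state_desc
FROM sys.dm_os_nodes
ORDER BY node_id;
"""

with pyodbc.connect(CONN_STR) as conn:
    for row in conn.execute(QUERY):
        print(f"node {row.node_id:>3}  memory node {row.memory_node_id:>3}  "
              f"group {row.processor_group:>2}  "
              f"schedulers {row.online_scheduler_count:>3}  {row.node_state_desc}")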

So What Does This Mean?

For this specific system the second configuration, aligning nodes 0, 1, 4 and 5 in the same group, is not optimal.   This system’s true (SRAT) proximity is the layout from the first example.   Nodes 4, 5, 6 and 7 are in a separate blade, using a glue-based NUMA architecture.   Memory accesses between the hardware blades can be slower than local blade access.

The SRAT (Static Resource Affinity Table) provided by the hardware BIOS contains the proximity information.   Prior to Windows Server 2012, the Windows memory manager performs memory access tests, attempting to determine the optimal proximity responsiveness, and can override the SRAT information.

The second example is after the machine had a hardware failure on nodes 2 and 3.   The nodes were taken offline and Windows adjusted.  However, once the problem was corrected the system maintained the adjusted proximity layout.

The hardware manufacturers are aware of this behavior and have specific tuning instructions per system, CPU type, memory layout, etc… that establish the appropriate override of the Windows Server behavior using the Group Affinity registry key: http://support.microsoft.com/kb/2506384

These advanced configurations can involve additional processor groups to further sub-divide the current SRAT layout in order to obtain optimal performance.

For SQL Server installations running on NUMA hardware, I recommend you contact your hardware manufacturer and obtain the optimal proximity settings.

Reference: CoreInfo.exe  http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx 

Bob Dorr - Principal SQL Server Escalation Engineer

Categories: SQL Server MS Blogs

Cumulative Update #8 for SQL Server 2012 SP1

SQL Server Release Blog - Mon, 01/20/2014 - 18:45
Dear Customers, The 8th cumulative update release for SQL Server 2012 SP1 is now available for download at the Microsoft Support site. Cumulative Update 8 contains all the SQL Server 2012 SP1 hotfixes which have been available since the initial release...(read more)
Categories: SQL Server MS Blogs

Cumulative Update #15 for SQL Server 2008 SP3

SQL Server Release Blog - Mon, 01/20/2014 - 18:45
Dear Customers, The 15th cumulative update release for SQL Server 2008 Service Pack 3 is now available for download at the Microsoft Support site. Cumulative Update 15 contains all the hotfixes released since the initial release of SQL Server 2008...(read more)
Categories: SQL Server MS Blogs

Using Outlook Tasks for a Performance Review

Bucky Woody - Tue, 01/07/2014 - 09:24

At Microsoft, we have a performance review twice a year to ensure we're on track towards meeting the goals that we set up at the beginning of the year with our manager. I've been a manager multiple times in my career, so I've been on both sides of the table during this review.

The results of a review shouldn't come as a surprise to either party. If, that is, you've been communicating to your boss on a regular basis about the tactical things you're doing that lead to the strategy of fulfilling the goals. But how do you track what you're up to, and how do you communicate it to your boss?

If you're like me, you have several "project-level" assignments, meaning that you have lots of things to track.

(Not everything is a project or requires this amount of tracking, so this isn't for every work item, obviously)

I work with hundreds of people and dozens of teams here at Microsoft, and so my day can get very randomized if I'm not careful. To help keep me on track, and my boss informed, and to provide a paper-trail of my activities for review time, I use Microsoft Outlook Tasks. Here's why I do that:

  • Common tool I use every day
  • Has status fields, and rich-text fields for lots of data
  • Has the ability to send status reports - some of which I can send to my boss
  • Has multiple views
  • With Exchange Online, I can get to these even on the web from any device that has a browser or Outlook app
  • Even when closed, they are retained for historical purposes at review time

Here's my process:

  1. Whenever I get a project to work on, I make a task
  2. I enter all data in that task as I work on it, setting the "Start Date" and "End Date" to today.
  3. Each morning, I run down my tasks - if there is anything I need to work on for that project, I do that work. When I'm done with that work or if there is nothing due, I set the Start and End dates to tomorrow.
  4. Rinse, lather, repeat
  5. Each Monday (or if something significant changes in the task) I hit the "Status Report" button and mail it to all concerned parties. If it's something my boss needs to know about (like the task being assigned, or finishing), I include him/her. I put in the amount of time, the percentage complete and so on. I do this each Monday even if nothing has changed, with a status of "Checking status - anything needed from me?" to make sure nothing falls through the cracks.
  6. Using the Status field in the Task, I set it to "Completed" when I'm done, and send out a Project Post-Mortem e-mail to all parties.
  7. At 1:1 time, I pull up the “Completed Tasks” view in Outlook and walk down the tree with my boss, making sure it leads to hitting our goals.

Using this process, there are no surprises at review time. We've both been kept in the loop, even if my boss is remote or I work remotely. And everyone on the teams knows about the project and where my part of it stands. I can't tell you how many times this has been useful - but it's been so popular I was asked to write this post. :)

Here's a sample task layout (click for detail)

 

And here's a fake status update (click for detail):

Categories: SQL Server MS Blogs

Have a mentor, be a mentor

Bucky Woody - Wed, 01/01/2014 - 12:59

 The word "Mentor" actually comes from Greek literature - it's a character from "The Odyssey". He advised Odysseus and Telemachus, and the word seems to derive from a term meaning "intent, purpose, spirit, passion". We often think of a mentor in the business sense - but some folks I've talked with aren't sure what a mentor is or what they do. Most think it's someone that tells you what to do to get ahead in your career.

Part of that is true - a mentor can indeed help your career. But rather than telling you what to do, they are actually more effective when they help you discover what to do. They are a guide - someone who has an outside view of your career, isn't your boss, and in the best of cases has been down the path you want to go. There is a formal process to being a mentor, and in fact there are concrete steps you have to take to be mentored. It's a partnership, with a beginning and an end. I'm a mentor for a couple of folks, one of whom works at Microsoft. I've also got a mentor, and have had them in the past as well, starting back in my days working at the NASA facility in Florida. That was my introduction to the process as a young man, and it helped my professional career immensely.

On January 8th 2014 in Redmond Washington I'll be giving a presentation on how to get and work with a mentor, and how and why to become one yourself. You're welcome to attend, and I'll probably give this session again in locations around the world - we've opened this meeting up to anyone who wants to come, and you can find the details on that here: http://pnwsql.sqlpass.org/

There's homework (of course there is, I'm a college teacher after all) in the form of a reading list and sites you can visit. Here are some of the sources I used for this session:

Books (this list is partially drawn from an internal Microsoft site for our formal mentoring process that I belong to):
  • Coaching: Evoking Excellence in Others, 3rd Edition by James Flaherty
  • 101 Tough Conversations to Have with Employees: A Manager's Guide to Addressing Performance, Conduct, and Discipline Challenges by Paul Falcone
  • The Truth About Personal Performance (Collection) by James O'Rourke
  • 5 Business Skills Every Professional Must Master (Collection) by Terry J. Fadem
  • Essential Rules from Richard Templar (Collection) by Richard Templar
  • The New Mentors and Protégés: How to Succeed with the New Mentoring Partnerships by Linda Phillips-Jones (2001 update)
  • The Mentee's Guide: Making Mentoring Work for You by Lois Zachary (2009)
  • Coaching, Counseling and Mentoring: How to Choose and Use the Right Technique to Boost Employee Performance by Florence Stone (2007)
  • Teach What You Know: A Practical Leader's Guide to Knowledge Transfer Using Peer Mentoring by Steve Trautman (2007)
  • Monday Morning Mentoring: Ten Lessons to Guide You up the Ladder by David Cottrell (2006)
  • Power Mentoring: How Successful Mentors and Protégés Get the Most Out of Their Relationships by Ellen Ensher and Susan Murphy (2005)
  • Mentoring 101: What Every Leader Needs to Know by John Maxwell (2008)
  • The Elements of Mentoring by W. Brad Johnson (2008)
  • The Art of Mentoring: Lead, Follow and Get Out of the Way by Shirley Peddy (2001)
  • Managers as Mentors: Building Partnerships for Learning by Chip R. Bell (2002)
Categories: SQL Server MS Blogs

SQL Server 2008 R2 SP2 Cumulative update #10

SQL Server Release Blog - Mon, 12/16/2013 - 15:47
Dear Customers, The 10th cumulative update release for SQL Server 2008 R2 SP2 is now available for download at the Microsoft Support site. Cumulative Update 10 contains all the SQL Server 2008 R2 SP2 hotfixes which have been available since the initial...(read more)
Categories: SQL Server MS Blogs

SQL Server 2012 RTM Cumulative Update 11

SQL Server Release Blog - Mon, 12/16/2013 - 15:44
Dear Customers, The 11th cumulative update release for SQL Server 2012 is now available for download at the Microsoft Support site. Cumulative Update 11 contains all the SQL Server 2012 hotfixes which have been available since the initial release...(read more)
Categories: SQL Server MS Blogs