SQL Server: Find Unused Data Files

Dec 6, 2018

You know that old SQL Server you’ve left running for the last 5 years, the one that’s had numerous databases dropped and restored on it? Have any databases been detached, or restores failed part way through, leaving data files just sitting there unused?

Depending on how many databases you have, it can be a bit of a pain to go through the files in each one and compare them to the files in your data/log directories. I recently wrote a slightly hacky script to do just this and list any MDF/LDF files that are not attached to a database on a given instance. It is a bit of a hack and requires xp_cmdshell to be enabled, which I’ll leave you to read about separately; I would definitely not advise enabling it just for this purpose. But if you already have it enabled, the script may be of some use to you. If you have chosen to enable xp_cmdshell then the following code will turn it on…

EXEC sp_configure 'show advanced options',1
GO
RECONFIGURE 
GO
EXEC sp_configure 'xp_cmdshell', 1
GO
RECONFIGURE
GO

To get a list of unused data files, set the path in the script below to the location of your data/log directories, or a directory above them (it searches subdirectories)…

DECLARE @Path NVARCHAR(200) = 'D:\Databases\'
DECLARE @CmdFileList NVARCHAR(500) = 
    'cmd /c ' + /* Run via the Windows command shell */
    '"cd /d ' + @Path + ' && ' + /* Change to the target directory */
    'dir /b /s *.mdf *.ldf"' /* Bare filenames only (/b), including subdirectories (/s), for mdf and ldf */

CREATE TABLE #SqlFiles ([filename] VARCHAR(1024))
INSERT INTO #SqlFiles 
   EXEC xp_cmdshell @CmdFileList

DELETE FROM #SqlFiles WHERE [Filename] IS NULL OR [Filename] = 'File Not Found'
SELECT [filename] FROM #SqlFiles
EXCEPT
SELECT physical_name FROM sys.master_files

Any filenames returned by the above are not attached to any database on the instance you are connected to.
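If you only enabled xp_cmdshell for this check then once you’re done you can turn it back off (and hide the advanced options again) the same way…

EXEC sp_configure 'xp_cmdshell', 0
GO
RECONFIGURE
GO
EXEC sp_configure 'show advanced options', 0
GO
RECONFIGURE
GO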

Indexing In Memory OLTP Tables

Dec 4, 2018

Indexing on In Memory OLTP tables is a little different from your traditional on-disk rowstore tables…

In Memory Differences…

  • There is no clustered index
  • The nonclustered index still exists, but its structure is quite different
  • There is a new hash index, ideal for unique single-record lookups

Below I’m going to go over these points with demos. The demos are all run on a data dump of the Stack Overflow database from 2010, which can be downloaded from BrentOzar.com. They all revolve around creating the following in memory table and copying the relevant fields out of the Stack Overflow Users table to test out different indexing strategies…

CREATE TABLE People
(
   Id INT,
   Name NVARCHAR(200),
   Location NVARCHAR(200),
   CONSTRAINT pk_people PRIMARY KEY NONCLUSTERED (id)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY=SCHEMA_AND_DATA) 
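To load it, something like the following works, assuming you’ve created the table inside the Stack Overflow database itself (a transaction that touches a memory-optimized table can’t span user databases, so a cross-database load would fail)…

INSERT INTO People (Id, Name, Location)
SELECT Id, DisplayName, Location /* The relevant fields from the dump's Users table */
FROM dbo.Users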

No Clustered Index

In Memory OLTP table indexes are just pointers to the in memory data structure for the row and all its columns. This means that from any item in an index you have full access to all the columns in the row with no lookups needed, which effectively makes a nonclustered index on an in memory table act like a clustered index: it is a covering index for any query that uses it.

NonClustered Index

These are quite different on in memory tables to on disk tables.

If you query an on-disk table index, the query will, via a seek or a scan, end up at a leaf node containing the fields in the index and an identifier (RID or clustered index key) pointing back to the underlying table for lookups of any additional fields the query needs.

In Memory indexes have eliminated the need for locking due to the way they work. Traditional indexes use a B-Tree, whereas In Memory indexes use something the SQL Server team coined the Bw-Tree; you can read more about Bw-Trees in this research paper.

The end result is much the same as the indexes you know and love from classic on-disk nonclustered indexes. One thing to note is that indexes on an in memory table are declared inline as part of the CREATE TABLE statement (prior to SQL Server 2016 the schema couldn’t be altered at all once created; newer versions do support adding indexes with ALTER TABLE). Let’s create our in memory People table with nonclustered indexes on Id (the primary key) and Location…

CREATE TABLE People
(
   Id INT,
   Name NVARCHAR(200),
   Location NVARCHAR(200),
   CONSTRAINT pk_people PRIMARY KEY NONCLUSTERED (id),
   INDEX ix_people_location NONCLUSTERED(Location)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY=SCHEMA_AND_DATA) 

Let’s then test out a couple of queries…

SELECT * FROM People WHERE id = 1

[Execution plan: Index Seek]

Notice anything odd about that plan? Yup, no key lookup. It’s used the index pk_people, which only has Id in it and no include fields, yet we’ve managed to do a SELECT * with no key lookups! Result!

Let’s look at another query…

SELECT * FROM People WHERE id < 100

[Execution plan: Index Seek]

Again the exact same plan, but this time we’ve returned a range of records with no key lookups. Probably not surprising given the previous query, but I just wanted to highlight it anyway.

Lastly, let’s look at a query with multiple matches on a single predicate…

SELECT * FROM People WHERE Location = 'Brighton'

[Execution plan: Index Seek]

Basically the same plan again, the only difference being we’re now seeking on the Location index.

Now let’s look at something else that does behave quite differently…

SELECT TOP 100 * FROM People ORDER BY Location

[Execution plan: ordered scan using the index, no sort]

Looks OK, right? It’s returned the ordered data using our Location index with no sort needed. Well, what happens if we want it descending?

SELECT TOP 100 * FROM People ORDER BY Location DESC

[Execution plan: manual sort]

Hmmm, not good. It’s performed a manual sort even though we have an index on the field we want sorted; an on disk nonclustered index would have no trouble with this. That’s because on disk indexes use a doubly linked list between leaf nodes (allowing 2-way traversal) whereas in memory indexes do not. If you need to return data ordered in both directions then you need to define an index per sort direction.
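As a sketch, the table could be declared with one index per sort direction (the second index name here is my own)…

CREATE TABLE People
(
   Id INT,
   Name NVARCHAR(200),
   Location NVARCHAR(200),
   CONSTRAINT pk_people PRIMARY KEY NONCLUSTERED (Id),
   INDEX ix_people_location NONCLUSTERED (Location ASC),
   INDEX ix_people_location_desc NONCLUSTERED (Location DESC)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY=SCHEMA_AND_DATA)

With both in place the optimizer can pick whichever index matches the ORDER BY direction and skip the sort.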

New Hash Index

The nonclustered index works great for finding ranges of records, but while the SQL Server team were changing so much with In Memory OLTP they also designed a new type of index to speed up seeks to single unique records, e.g. find me the record with id x. It’s actually just what it sounds like: SQL Server will hash the fields in the index and store that hash along with a pointer to the in memory location of the row(s).

Hash indexes perform great when the index is unique or very close to it. If, however, lots of records in the index start hashing to the same value, inserts, seeks and updates can slow down massively and you should think about using a nonclustered index instead.

When you define a hash index you have to tell it how many hash buckets you want; this should be a figure 1–2× the number of unique values you plan to have in the index. The bucket count defines how many possible hash values SQL Server will calculate when hashing your index. For example, a bucket count of 10 will only ever give 10 possible hashes even if you store a million different values; in that case each bucket ends up with something like 100K rows, which all need to be scanned when you are seeking a record. Ideally you want a single record per hash bucket, so you can just hash the predicate in your query and go directly to the record.

Something to watch out for here is that the values within a hash bucket are stored in order (this makes scans within a bucket faster), so if you under-size your bucket count and end up with many rows in a hash bucket, inserts/updates can take a big hit: once they’ve found the correct hash bucket they then need to scan through it to find the correct place to insert/move the row.
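If you want to sanity-check a bucket count on an existing hash index, the hash index DMV shows how full the buckets are and how long the chains have grown; run this in the database that holds the in memory table…

SELECT
   OBJECT_NAME(hs.object_id) AS table_name,
   i.name AS index_name,
   hs.total_bucket_count,
   hs.empty_bucket_count,
   hs.avg_chain_length,
   hs.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats hs
INNER JOIN sys.indexes i ON i.object_id = hs.object_id AND i.index_id = hs.index_id

An average chain length much above 1 suggests the bucket count is too low (or the key isn’t very unique), while a sea of empty buckets suggests it’s oversized.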

Because hash indexes are really designed for retrieving single rows they do not bother sorting the data, so if you’re doing anything that requires a range of data to be returned, e.g. WHERE x > y, a hash index is not a good fit.

With the above in mind let’s try a single record lookup. First, recreate the People table with a hash index on the Id field…

CREATE TABLE People
(
   Id INT,
   Name NVARCHAR(200),
   Location NVARCHAR(200),
   CONSTRAINT pk_people PRIMARY KEY 
      NONCLUSTERED HASH(Id) WITH (BUCKET_COUNT=1000000)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY=SCHEMA_AND_DATA) 
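Side note: SQL Server rounds the bucket count up to the next power of two, so the million we asked for actually becomes 1048576. The DMV query from earlier will confirm this…

SELECT i.name, hs.total_bucket_count
FROM sys.dm_db_xtp_hash_index_stats hs
INNER JOIN sys.indexes i ON i.object_id = hs.object_id AND i.index_id = hs.index_id
WHERE i.name = 'pk_people'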

Now let’s run our single record lookup from before…

SELECT * FROM People WHERE id = 1

[Execution plan: Index Seek]

Pretty much the same query plan as the nonclustered version, however if we compare the costs it’s gone from 0.0004596 to 0.0000175, so as expected the hash lookup for a single record on a properly sized bucket count far outperforms the nonclustered index.

The End

With the above information, you should be armed with everything you need to create optimized indexes on your in memory tables.

Turbo Charged Staging/Caching Tables With In Memory OLTP

Dec 3, 2018

Since SQL Server 2012 some really awesome new technologies have been introduced into the engine that are massively underused. Everyone is familiar with the traditional rowstore tables SQL Server uses, but people either don’t know about, or are scared off by, Column Store and In Memory OLTP. In this post I’m going to demo a particular use case suited to In Memory OLTP and show the benefits versus the traditional on disk rowstore.

The demo below is all done using SQL Server 2019 and the WideWorldImporters sample database, but you should be able to follow along on older versions and different databases with minor adjustments.

Let’s imagine we’re loading a throwaway staging table as an intermediate step in our ETL warehousing process. In Memory OLTP tables allow us to set their durability; if we set this to SCHEMA_ONLY then no data is ever persisted to disk, meaning all data in these tables is lost whenever you restart your server. That may be perfectly fine for staging tables that are part of an ETL process.

Typically your ETL process will be pulling data in from all sorts of places, which is hard to demo in a short blog post, so in this example I’m just going to take 3 tables from the WideWorldImporters database and separately insert/update them into a staging table. The reason I’m not just joining them upfront is that I’m trying to simulate data coming from different places that cannot be joined, e.g. an SSIS workflow pulling from a CSV file, a SQL table and an Oracle linked server.

If your database isn’t already set up for In Memory OLTP tables then you’ll first need to create a memory-optimized filegroup and add a container to it…

ALTER DATABASE MyDatabase 
ADD FILEGROUP FG_MyDatabaseMemoryOptimized CONTAINS MEMORY_OPTIMIZED_DATA

ALTER DATABASE MyDatabase
ADD FILE (name = MyDatabaseDemoMemoryOptimized, 
          filename = 'C:\MyDbs\MyDatabaseDemoMemoryOptimized') /* For memory-optimized filegroups the "filename" is actually a folder (a filestream container), not a file */
TO FILEGROUP FG_MyDatabaseMemoryOptimized
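If you’re not sure whether a database already has one of these, memory-optimized filegroups show up in sys.filegroups with a type of 'FX'…

SELECT name, type_desc
FROM sys.filegroups
WHERE type = 'FX'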

Once this is done you’re free to create in memory tables. To set us up for this demo let’s create 2 staging tables, one traditional and one In Memory with SCHEMA_ONLY durability…

CREATE TABLE StagingTransactions
(	
   Id INT IDENTITY PRIMARY KEY NONCLUSTERED,
   Date DATETIME,
   StockItemId INT,
   StockItem NVARCHAR(300),
   Quantity DECIMAL(18,3),
   CustomerId INT,
   Customer NVARCHAR(100),
   PhoneNumber NVARCHAR(100),
   Website NVARCHAR(300)
)

CREATE TABLE StagingTransactionsInMemory
(	
   Id INT IDENTITY PRIMARY KEY NONCLUSTERED,
   Date DATETIME,
   StockItemId INT,
   StockItem NVARCHAR(300),
   Quantity DECIMAL(18,3),
   CustomerId INT,
   Customer NVARCHAR(100),
   PhoneNumber NVARCHAR(100),
   Website NVARCHAR(300)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY)

I’ve deliberately made both primary keys nonclustered because In Memory tables can’t have a clustered index and I wanted to keep the two tables as close to the same as possible.

Let’s now run the following batch of 3 queries to bring in transactions, then customers and lastly stock item names, simulating them all coming from different sources…

INSERT INTO StagingTransactions(Date,Quantity,StockItemId,CustomerId)
SELECT 
   TransactionOccurredWhen,
   Quantity,
   StockItemId,
   CustomerId
FROM 
   Warehouse.StockItemTransactions

UPDATE t  
SET 
   t.StockItem = s.StockItemName
FROM
   StagingTransactions t
   INNER JOIN Warehouse.StockItems s ON s.StockItemId = t.StockItemId

UPDATE t  
SET 
   t.Customer = c.CustomerName,
   t.PhoneNumber = c.PhoneNumber,
   t.Website = c.WebsiteURL
FROM
   StagingTransactions t
   INNER JOIN Sales.Customers c ON c.CustomerId = t.CustomerId

Then run it again but swap the table names for your in memory table.
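To see the IO figures quoted below for yourself, turn the statistics output on before running each batch; the numbers then appear in the SSMS Messages tab…

SET STATISTICS IO, TIME ON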

On my machine this runs in about 20 seconds for the disk table and 3 seconds for the in memory table (this could be massively optimized with indexes, but that’s not what I’m trying to show here). Let’s compare the IO statistics for the disk table vs in memory…

Table ‘StagingTransactions’. Scan count 0, logical reads 949566, physical reads 0, read-ahead reads 468, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘StagingTransactions’. Scan count 1, logical reads 1250360, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘StagingTransactions’. Scan count 1, logical reads 1472818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

You can see we’re doing quite a large amount of IO with all those reads. Now let’s look at the in memory version…

Table ‘StockItemTransactions’. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 127, lob physical reads 0, lob read-ahead reads 0.

Table ‘StockItemTransactions’. Segment reads 1, segment skipped 0.

That’s basically removed all of our IO! Magic, right?

This speed difference comes from a few places…

  • Memory is WAY faster than disk
  • We’ve sacrificed persistence
  • In Memory OLTP uses a completely new storage structure (There be no pages here)
  • It’s completely lockless (optimistic concurrency, even for write/write scenarios)

It’s not a tool for every job as it comes with a number of limitations, but staging tables are a place where it can really shine (provided you have enough memory).

Because we’ve used SCHEMA_ONLY, another benefit is that nothing that happens in this table touches the transaction log, and as a result it won’t hold up your HA/DR solution because nothing in this table will be shipped to the DR site. For example, say we insert 2 million records into our on disk staging table, transform them, move them to our warehouse and then truncate the staging table: the insertion and deletion of all those records will be replayed on the DR site and could massively increase your RTO for bringing a DR site online.
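If you want to see this for yourself, one rough approach (using the undocumented fn_dblog function, so dev/test use only) is to count the log records generated by each version of the load…

CHECKPOINT /* In SIMPLE recovery this truncates the active log so the count starts small */
GO
SELECT COUNT(*) AS log_records FROM fn_dblog(NULL, NULL)
GO
/* ...run the load into one of the staging tables, then count again... */
SELECT COUNT(*) AS log_records FROM fn_dblog(NULL, NULL)
GO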

I’ll try to cover some of the other places where in memory tables can be a good fit in future posts.

