7月18日
Sampling SQL Data with TABLESAMPLE
Sampling Data with TABLESAMPLE
Sampling, in demography and statistics, is the process of selecting units (e.g., people, organizations) from a population of interest so that by studying the sample we may fairly generalize our results back to the population from which they were chosen.
SQL Server 2005 introduce TABLESAMPLE function to sample data from a table. With the advent of very large database, such those of the datawarehouse, sampling maybe of great use to developers and data professional alike who may need only a random sample of the data.
Syntax
TABLESAMPLE [SYSTEM] (sample_number [ PERCENT | ROWS ] )
[ REPEATABLE (repeat_seed) ]
Sample TSQL codes:
USE AdventureWorks ;
GO
SELECT *
FROM Sales.Customer TABLESAMPLE SYSTEM (10 PERCENT) ;
Syntax explanation
TABLESAMPLE SYSTEM returns an approximate percentage of rows. It generates a random value for each physical 8 KB page in the table. Based on the random value for a page and the percentage specified in the query, a page is included in the sample or excluded. Each page that is included returns all rows in the sample result set. For example, when specifying TABLESAMPLE SYSTEM 10 PERCENT, SQL Server returns all the rows on approximately 10 percent of the specified table's data pages. If the rows are evenly distributed on the pages of the table, and if there is a sufficient number of pages in the table, the number of rows returned should approximate the sample size requested. However, as the random value generated for each page is independent of the values generated for any other page, it is possible that a larger, or smaller, percentage of pages than requested are returned. The TOP(n) operator can be used to limit the number of rows to a given maximum
REPEATABLE indicates that a selected sample should be returned again. When specified with the same repeat_seed value, SQL Server returns the same subset of rows as long as no changes have been made to the table. When specified with a different repeat_seed value, SQL Server will likely return a different sample of the rows in the table. The following actions to the table are considered changes: insert, update, delete, index rebuild, index defragmentation, database restore, and database attach.
NOTES ON TABLESAMPLE:
1. TABLESAMPLE cannot be used with views.
2. cannot be used with derived tables, tables from linked servers, and tables derived from table-valued functions, rowset functions, or OPENXML
3. compatibility level of 9.0 is a must