microsoft.public.sqlserver.programming

Indexing strategy for a very large table?

MikeJohnson

3/27/2007 7:07:00 PM

Hi,

We have a very large SQL Server database of around 2-3 Billion
records.
Each table holds hundreds of millions of records.

Each table has the same simple schema:

emailsender: varchar(255),
emailreceiver: varchar(255),
date: datetime,
id: bigint (identity key - unused by us)

The tables won't change much once populated.

Search performance is paramount.

Queries will be like: "Were there emails from Joe to Mary in 2003?"

Could you please recommend the best indexing strategy?
We are happy to use more storage in return for faster searches.

Would you have any other hardware and/or software recommendations?

The hardware is a 4-CPU dual-core Windows Server 2003 machine.
Because of the amount of data, storage is on a fiber-attached SAN.

Thanks very much for any advice or recommendations!

Sincerely,
Mike Johnson

12 Answers

Rafael Lenartowicz

3/27/2007 9:12:00 PM


I might be off base, but without knowing the data (the uniqueness of the
values in each column) it's hard to guess...
The best way would be to build a workload file of your most likely queries
and run it through the Database Engine Tuning Advisor - after all, why
reinvent the wheel ;-)
Only a true workload file can show for sure whether you're better off with a
composite index or separate indexes on your three key columns.
r
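
For example, the workload can simply be a .sql file of representative
queries, which you then feed to the Tuning Advisor's dta.exe command-line
tool (SQL Server 2005). A sketch - the table, column, server, and database
names here are assumptions for illustration:

```sql
-- workload.sql: a sketch of a representative workload file
-- (table/column names assumed from the original post)
SELECT COUNT(*)
FROM emails2003
WHERE emailsender = 'joe@example.com'
  AND emailreceiver = 'mary@example.com';

SELECT emailreceiver, [date]
FROM emails2003
WHERE emailsender = 'joe@example.com'
ORDER BY [date];
```

You would then run something like
`dta -S YourServer -E -D YourDatabase -if workload.sql -s Session1`
and review the index recommendations it produces.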

"MikeJohnson" <mikejohnson@volcanomail.com> wrote in message
news:1175022390.586577.35490@o5g2000hsb.googlegroups.com...
> [snip]


David Portas

3/27/2007 9:17:00 PM


On 27 Mar, 20:06, "MikeJohnson" <mikejohn...@volcanomail.com> wrote:
> [snip]

Can't you do something about the design? Replace the email addresses
with surrogate keys. Drop the ID if it isn't used. Then consider
creating a partition scheme based on date.

HTH

--
David Portas, SQL Server MVP

Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.

SQL Server Books Online:
http://msdn2.microsoft.com/librar...(en-US,SQL.90).aspx
--



MikeJohnson

3/27/2007 9:48:00 PM


Thanks for your reply Rafael.
The data in each (non-identity-key) column isn't unique: email
addresses can occur more than once.

Could you please point me to the documentation on creating a workload
file and using the Tuning Advisor?

I am relatively new to SQL Server.

Thanks, Mike

MikeJohnson

3/27/2007 9:52:00 PM


Hi David,

Thanks very much for your reply.
I will try dropping the ID column as you suggested.
Being new to SQL Server,
could you please point me at some information about "surrogate keys"?

When you say "partition scheme", is this literal technical terminology?
We currently have one table for each year's worth of data.

My thanks,
Mike

Mr Tea

3/27/2007 9:59:00 PM


Hi Mike

If search performance is paramount, storage space is unimportant, and there
will be minimal inserts/updates on the tables, then I would be tempted to
blanket-index the table; examples of queries that would use each index are
supplied below. Indexing in this fashion will reduce the need for sorting
and bookmark lookups (possibly reverting to table scans), which would be the
most likely performance-hindering operations on your large, simple tables.
You are, however, trading off storage space and write performance for these
gains.

e.g.

PK (clustered): id
(a clustered-unique index will minimise the space required by subsequent
indexes)

IX_1: emailsender, emailreceiver, date
(existence/counts of records by sender to named receivers, optionally by
date)
IX_2: emailsender, date, emailreceiver
(lists/counts of records by sender by date, or ordered by date)

IX_3: emailreceiver, emailsender, date
(existence/counts of records by receiver from named senders, optionally by
date)
IX_4: emailreceiver, date, emailsender
(lists/counts of records by receiver by date, or ordered by date)

IX_5: date, emailsender, emailreceiver
(lists/counts of records by date)
IX_6: the symmetric ordering (date, emailreceiver, emailsender) is not
required, given the granularity of the date field

If you can afford the drive space for covering indexes in this fashion,
then I would hope that your queries would be blisteringly fast on almost
any hardware. Also make sure you are making good use of the memory capacity
you have available.

Regards
Lee

"MikeJohnson" <mikejohnson@volcanomail.com> wrote in message
news:1175022390.586577.35490@o5g2000hsb.googlegroups.com...
> [snip]


Gert-Jan Strik

3/27/2007 10:06:00 PM


I agree with David. You should normalize the table. In this case, that
would mean you should probably have a table with e-mail addresses, and a
mailtransactions table. Your structures could look like this. The
declared constraints will cause the preferred indexes to be created.

CREATE TABLE email_addresses
(email_id int NOT NULL IDENTITY PRIMARY KEY CLUSTERED
,email_address varchar(255) NOT NULL UNIQUE
)

CREATE TABLE emails
(sender_id int NOT NULL REFERENCES email_addresses(email_id)
,receiver_id int NOT NULL REFERENCES email_addresses(email_id)
,"date" datetime NOT NULL
,CONSTRAINT PK_emails PRIMARY KEY CLUSTERED (sender_id, receiver_id,
"date")
)
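
Migrating an existing table into this structure could be sketched as
follows (the old table name my_table is an assumption):

```sql
-- Populate the address table with every distinct address;
-- UNION removes duplicates across both columns
INSERT INTO email_addresses (email_address)
SELECT emailsender FROM my_table
UNION
SELECT emailreceiver FROM my_table;

-- Populate the transactions table by joining back to the new
-- surrogate keys
INSERT INTO emails (sender_id, receiver_id, "date")
SELECT s.email_id, r.email_id, t."date"
FROM my_table AS t
JOIN email_addresses AS s ON s.email_address = t.emailsender
JOIN email_addresses AS r ON r.email_address = t.emailreceiver;
```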

HTH,
Gert-Jan


MikeJohnson wrote:
> [snip]

David Portas

3/27/2007 10:14:00 PM


On 27 Mar, 22:51, "MikeJohnson" <mikejohn...@volcanomail.com> wrote:
> [snip]

Can I suggest that you consider hiring someone with expertise in SQL
Server database design and scalability? When dealing with billions of
rows, any mistakes you make could turn out to be very expensive ones. You
will surely benefit from getting the right assistance.

Partition schemes
http://msdn2.microsoft.com/en-us/library/ms1...

Surrogate keys will be explained in any good book on database
concepts.
"Introduction to Database Systems" by Chris Date is a good place to
start.

HTH





MikeJohnson

3/27/2007 11:06:00 PM


My thanks to David and Gert-Jan for your replies.

I think I will at least initially follow Lee "Mr. Tea's" advice
and attempt to blanket the tables with indexes (indices?),
as this is least disruptive to our existing querying code.

Would you recommend I keep the id identity primary key column (even though
it is unused in our queries) in order to save space in the non-clustered
indexes, or should I just drop this column?

My thanks,
Mike



Alejandro Mesa

3/28/2007 1:56:00 AM


Mike,

A good start could be partitioning that data.

SQL Server 2000

Using Partitioned Views
http://msdn2.microsoft.com/en-us/librar...(SQL.80).aspx

SQL Server 2005

Partitioned Tables and Indexes in SQL Server 2005
http://msdn2.microsoft.com/en-us/library/ms3...
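
On SQL Server 2005, a date-based partitioned table for this schema could
be sketched like this (the boundary values, object names, and single
filegroup are assumptions for illustration):

```sql
-- One partition per year; extend the boundary list as needed
CREATE PARTITION FUNCTION pf_email_date (datetime)
AS RANGE RIGHT FOR VALUES ('20030101', '20040101', '20050101');

CREATE PARTITION SCHEME ps_email_date
AS PARTITION pf_email_date ALL TO ([PRIMARY]);

CREATE TABLE emails_partitioned
(emailsender varchar(255) NOT NULL
,emailreceiver varchar(255) NOT NULL
,"date" datetime NOT NULL
) ON ps_email_date ("date");
```

Queries with a range predicate on "date" can then touch only the relevant
partitions (partition elimination), instead of one table per year being
managed by hand.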


AMB

"MikeJohnson" wrote:

> [snip]

Gert-Jan Strik

3/28/2007 4:26:00 PM


MikeJohnson wrote: [snip]
> I think I will at least initially follow Lee "Mr. Tea's" advice
> and attempt to blanket the tables with indexes (indices?),
> as this is least disruptive to our existing querying code. [snip]

True. On the other hand, if there are many combinations of the same two
e-mail addresses, then the normalized structure will probably be faster,
because determining the result would require less I/O. There is no need
to change the existing queries if you define a view with the name of the
original table.

Let's say your original table's name was "my_table". Then, after you have
created the tables email_addresses and emails (see my previous posting)
and loaded them with data, all you need to add is:

CREATE VIEW my_table AS
SELECT sender.email_address AS emailsender
, receiver.email_address AS emailreceiver
, "date"
, 0 AS id -- could be left out
FROM emails
INNER JOIN email_addresses AS sender
ON sender.email_id = emails.sender_id
INNER JOIN email_addresses AS receiver
ON receiver.email_id = emails.receiver_id

After that, you can run all the original queries, for example:

IF EXISTS (
    SELECT *
    FROM my_table
    WHERE emailsender = 'Joe'
      AND emailreceiver = 'Mary'
      AND "date" >= '20030101'
      AND "date" < '20040101'
)
    PRINT 'There were emails from Joe to Mary in 2003'
ELSE
    PRINT 'There were no emails from Joe to Mary in 2003'

Gert-Jan