I am building a data store for the content of multiple blog sites which have been scraped. Each of these sites is going to have an entry in a Blog table
BlogId Url Version
int, PK, identity varchar(2500), not null int, not null
----------------- ----------------------- -------------
1 'http://site1.com/ 1
2 'http://site2.com/ 1
3 'http://site2.com/ 2
Then there will be a Post table:
BlogId Url PostId
int, FK, not null varchar(2500) not null ?
For most blogs that I'm looking at, I can easily discern a PostId value that is unique for them.
For all blogs, the Url will be unique for a BlogId value. However, the Url is frequently subject to change.
However, there are a few blogs that I can't discern a good PostId value when scraping the HTML. In those cases, I can get the datetimeoffset that a post was published on. The accuracy of the publishing frequency is on the minute level.
The question is, I need to build a composite natural key (I won't use it as a primary key because I'll need to relate to other tables and don't want to carry this along to those tables) using the BlogId and the PostId but I'm not sure what I should type the PostId as. I've been thinking of two options:
sql_variant- I could use anintfor most of my sites, and that would be fine and then usedatetimeoffsetfor the ones where I can't easily discern a unique numericPostIdbigint- ThePostIdwould fit in here and I can encode thedatetimeoffsetas abigint(2011-11-11 13:59would become201111111359)- Or some other approach I haven't thought about.
Some other key facts:
- The
Posttable will only have a few thousand records in it initially (if there are more than five thousand, I'd be surprized) - There are related tables for tags, categories, etc, which will be joined with this table (maybe a few thousand records at most)
- The growth of any of these tables won't be astronomical. Maybe a another five thousand records a year (that's being conservative on my part but you never know)
- This key will only be used to create an index/constraint, currently, I don't foresee the use of it in queries (there are other attributes that I am linking to which are of more concern).