
I manage a few tables with more than 1 billion records. Their ETLs run daily, reading data from flat files and inserting a few thousand records into each of them.

Now I need to keep these same tables on another DB as well, and of course they need to contain the same data. The data will be transferred over a linked server.

The easiest way would be to truncate the destination tables, select from the source tables inserting into the destination, and then rebuild the indexes. But I don't like the idea of copying the same data every day.
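
For illustration, a minimal sketch of that full reload, run from the destination server (the linked server name [LINKEDSRV] and the table dbo.FactSales are placeholders):

```sql
-- Full reload: wipe the destination copy, pull everything back over
-- the linked server, then rebuild the indexes.
TRUNCATE TABLE dbo.FactSales;

INSERT INTO dbo.FactSales WITH (TABLOCK)  -- TABLOCK can allow minimal logging
SELECT *                                   -- under simple/bulk-logged recovery
FROM [LINKEDSRV].SourceDB.dbo.FactSales;

ALTER INDEX ALL ON dbo.FactSales REBUILD;
```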

Another solution I see is to query both the source and destination tables for their distinct dates (via the date dimension), and then copy only the records from dates that are missing on the destination. I'd then create a monitor that checks whether the number of records per day is equal on both sides and highlights the days that don't match.
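
Sketched with the same placeholder names, the incremental copy and the monitor would look roughly like this:

```sql
-- Copy only the days that don't exist yet on the destination
-- (assumes identical column layout on both sides).
INSERT INTO dbo.FactSales
SELECT s.*
FROM [LINKEDSRV].SourceDB.dbo.FactSales AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.FactSales AS d
                  WHERE d.DateKey = s.DateKey);

-- Monitor: flag any day whose row counts differ between the two servers.
SELECT COALESCE(src.DateKey, dst.DateKey) AS DateKey,
       src.row_cnt AS source_rows,
       dst.row_cnt AS dest_rows
FROM (SELECT DateKey, COUNT_BIG(*) AS row_cnt
      FROM [LINKEDSRV].SourceDB.dbo.FactSales
      GROUP BY DateKey) AS src
FULL OUTER JOIN
     (SELECT DateKey, COUNT_BIG(*) AS row_cnt
      FROM dbo.FactSales
      GROUP BY DateKey) AS dst
  ON dst.DateKey = src.DateKey
WHERE ISNULL(src.row_cnt, 0) <> ISNULL(dst.row_cnt, 0);
```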

Is this second solution good enough, or is there a best practice for this situation?

  • Well, a solution you haven't considered (not sure why): run your identical ETL process on both copies of the database. Another: just keep the copy maintained using log shipping, mirroring, Availability Groups, etc. – Aaron Bertrand May 31 '16 at 17:42
  • lol yes, I didn't even consider it. I develop my ETL based on Kimball principles, so I don't want to run the ETL process twice; I want one place with the official data after the ETL runs, and then deliver this data anywhere else it's needed. That way, with this delivery/sync step, I can ensure the data is always the same. The question here is what's the best way of doing this sync step. – Hikari May 31 '16 at 17:48
  • What do you mean by log shipping, mirroring, and Availability Groups? Sorry, I've never heard of them. You could add an answer with some links explaining them. – Hikari May 31 '16 at 17:49
  • @Hikari check my answer here to see if it helps, especially the part about using SSIS for incremental loading of your data. – Kin Shah May 31 '16 at 20:29

1 Answer


If I had to manage this, I'd look at what, if anything, modifies the data in the tables outside of the ETL process.

If nothing else besides your ETL modifies the data, I would simply update the ETL process to insert the finished data at both locations (and likewise carry out whatever index maintenance you're doing in both places).

If something else updates this data, but only on one server, then transactional replication is probably the most lightweight way to get the data to the secondary server. Even if the data isn't being modified outside of the ETL, this wouldn't be a terrible alternative to modifying the ETL process to update two targets, since it sounds like a relatively small percentage of the data is being inserted daily.
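
If you go the replication route, the setup is usually done through the wizard, but the core steps map to a few system stored procedures. A rough sketch with placeholder names, assuming the distributor is already configured (agent job scheduling is omitted):

```sql
-- Run in the source (publisher) database.
EXEC sp_replicationdboption
     @dbname = N'SourceDB', @optname = N'publish', @value = N'true';

EXEC sp_addpublication
     @publication = N'FactSalesPub', @status = N'active';

EXEC sp_addarticle
     @publication = N'FactSalesPub', @article = N'FactSales',
     @source_owner = N'dbo', @source_object = N'FactSales';

-- Push the changes to the secondary server.
EXEC sp_addsubscription
     @publication = N'FactSalesPub', @subscriber = N'DESTSERVER',
     @destination_db = N'DestDB', @subscription_type = N'Push';
```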

If the data is being modified on both servers, then you'll probably want to consider merge replication. How simple that is will largely depend on whether the tables have any identity columns.

db2
  • 9,658
  • 3
  • 34
  • 58
  • Thanks, I'm gonna read about transactional replication. As I said, I don't want the ETL to feed both tables separately. I want to run the ETL only once, and in its last step, delivery, copy the data. – Hikari Jun 01 '16 at 16:47