What is the best way to archive data. When and how should we archive?

DebugHorror · August 17, 2023, 4:32am

This more is a discussion topic. The main focus is:

What is a archival process?

What type of data needs to be archived?

When we need to run the archival process?

What should be done for setting up archival process?

@Vikas_Dhillon @naveen.gupta @shivang.garg @Mayank @tushar.jain

DebugHorror · August 23, 2023, 3:10pm

@naveen.gupta @Vikas_Dhillon any suggestions?

shivang.garg · August 25, 2023, 7:56am

What is a archival process?
Archival process means the process of moving your rare use data into any backup storage.

What type of data needs to be archived?
It is always suggested that the transactional and log data of any system is something which can make system slow due to its high rate of generation. So the data which is older than few months and which is either no longer required or required but very very rare instance that should be archived.

When we need to run the archival process?
Lets take an example to understand this point. lets assume that our system generate or capture data at a very high rate but that data is required only for last 6 months. the data which is older than 6 months is no longer required or if required it will be just for audit purpose means no changes will be made on that data after 6 months. So, it is recommended that on 7th month from the month that record is created it should be archived. But the decision of archival job time should be taken very consciously with consideration of few points

What should be the archival interval?
Archival interval means the duration of the data which needs to be archive in a single job. the basic values it can have is
- Yearly
- Quarterly
- Monthly
- Weekly
- Daily
There are many other options but these are one most commonly used intervals
In which part of the interval the system will have a very high load?
This question will help us to understand the job execution time because whenever the system have load then near to that time the archival job should not run because it can impact system performance at that time.
For example:
If we have a monthly archival interval and at month end my system will always have a very high load then it doesn’t mean we should run the job every month end but we need to run our job in any either the first week or second week of the next month.
What should be the archival mechanism we should use?
Archival mechanism means what should be the archival process we should take. So, In market usually it is always taken care that archival process should not include any database DML operation on main tables it should always be DDL operations on main tables because DML operation can lock the main table which can lead to query timeout of any actual user retrieving any information from the database table.

Other point that needs to be taken care while performing archival process is that archival policy should be made in such a way that it should always be independent of business logic. this means that all records should be archive with same logic instead of different different logic. This is must because if we archive records based on different business logics then at the time of unarchiving it would be difficult to identify the actual month in which it was archived.

Now question comes in mind that how can we read or delete any information from main table without performing any DML operation. So, answer to that that point is using Partitions. We should partitioned our main table based on our archival intervals. The main advantage of doing that is we can attach and detach partition with main table and that will be a DDL operation instead of DML. Then we can read and archive data from detach partition using select queries which will not impact our main table after detachment. Also for deletion we can directly drop the partition which is again a DDL operation.

What should be done for setting up archival process?

Enable partition in each table which needs to be archive
Logs table should be created which should be updated against each partition name about the status of partition archival for easier debug if anything goes wrong within archival process
If you are using any managed Database service like RDS then you should have always take a snapshot of database before any archival process.

@Note: If you are using any managed database service then almost in all of the managed database services there are tools available which will help you in setting up the complete archival process. Like in AWS RDS there is a toll named partman which will help you in setting up and running this archival process based on your requirement as configuration. It is recommended to use those kind of tools instead of setting up or managing everything from scratch on your own.

DebugHorror · August 25, 2023, 4:41pm

@shivang.garg archival could be any type of data

Images
Documents
RDBMS
Logs
S3

Your solution is good db but what about other type of data?