We generate large data sets containing 100,000+ files and up to 1 TB in total size that are deployed to Windows file servers. We have approximately 10 of these different data sets that we would like to manage through Octopus Deploy. Does anyone have advice on how to version control and manage the deployment for large data sets?
Wow, that is a lot of data to be deploying. A few questions:
- What generates the data sets? An application? Users? Developers? How are changes being tracked currently?
- Are data sets unique per environment? Meaning, a data set is only valid for development while another data set is valid for production?
- How are you deploying those data sets currently?
- Are these files text-based (JSON, XML, TXT) or some other format like images or database backups?
- In any given deployment, of the 100,000+ files, how many of them are changing?
Thank you for your reply and any help is appreciated. My responses are outlined below.
The data sets are catalogs of weather events such as hurricanes or floods. We generally finalize most of the data before the initial release; as new events occur over time, we append them to the catalogs. Currently,
we keep a ZIP of each change set in a cold-storage location, then manually deploy the new data and manually update the associated metadata files.
The data sets are not unique to an environment. Once a change set is finalized, it will be deployed to all environments.
Currently we manually copy files to internal environments and use S3 buckets or secure FTP for external environment deployments. All of the metadata files also have to be updated manually, which is error-prone.
A majority of the data is in a binary format, and most files are around 1 MB. The metadata files are plain text or JSON. I can see an immediate improvement to our deployment process using variable replacement in Octopus Deploy.
For incremental deployments, we typically deploy under 1,000 files and most commonly under 100 files.
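To illustrate, the kind of metadata substitution we do by hand today looks roughly like the sketch below. The file contents, keys, and variable names are made up for illustration; the point is that these are exactly the values Octopus variables could fill in for us at deploy time.

```python
# Sketch of the manual metadata update we do for each change set.
# Keys and values are hypothetical examples, not our real schema.
import json
from string import Template

metadata_template = """
{
    "catalog": "hurricane-events",
    "version": "$CatalogVersion",
    "destination": "$TargetBucket"
}
"""

# Values that would come from per-environment Octopus variables.
variables = {"CatalogVersion": "2024.06", "TargetBucket": "s3://example-bucket/catalogs"}

rendered = Template(metadata_template).substitute(variables)
metadata = json.loads(rendered)
```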
Thank you for following up and providing that context. I talked to my team about this and we landed on two options.
Option 1: You package up just the deltas. I wouldn't package up the entire data set, as uploading and extracting 1 TB is going to take ages. Use the package reference feature in a script step in your deployment process, which will upload the package to a worker and extract it; you can then push those files to your destination of choice (S3, FTP, file share).
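One way to figure out which files belong in a delta package is to keep a manifest of file hashes from the last release and only include files that are new or changed. Here's a rough sketch of that idea; the manifest format and paths are my own assumptions, not an Octopus feature, and Octopus would then simply ship the resulting package.

```python
# Build a list of new/changed files by comparing SHA-256 hashes against a
# manifest saved from the previous release. Manifest layout is hypothetical.
import hashlib
import json
from pathlib import Path

def file_hash(path: Path) -> str:
    """Hash a file in 1 MB chunks so large binaries don't blow up memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_delta(dataset_dir: Path, manifest_path: Path) -> list[Path]:
    """Return files that are new or changed since the recorded manifest,
    and rewrite the manifest to reflect the current state."""
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    delta, new_manifest = [], {}
    for path in sorted(dataset_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(dataset_dir).as_posix()
        digest = file_hash(path)
        new_manifest[rel] = digest
        if old.get(rel) != digest:
            delta.append(path)
    manifest_path.write_text(json.dumps(new_manifest, indent=2))
    return delta
```

Only the files returned by `build_delta` would need to go into the package, which keeps each incremental deployment down to the ~100-1,000 changed files you mentioned.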
Option 2: You keep a local copy on a file share. Your deployment process doesn't deploy any packages (you don't have to deploy anything in your deployment process; you can just run scripts). Instead, the process runs an S3 sync command, or calculates the delta for your FTP destination. I do something very similar to back up files from my local file share to Azure File Storage. That share has 30,000+ files and 150+ GB of data on it, and after the initial upload, the delta process typically takes about five minutes.
I hope that provides a bit more context for you! If you have questions please let me know!
Thank you Bob. I will forward to my team and we will give it a shot!
Happy to help! Please reach out if you have any clarification questions.