Hi,
I work with a company that occasionally receives large CSV files (90M records) from their clients. When I uploaded one of these CSVs using the FTP extractor (the file was on an FTP server), it took 1.5 hours.
What would you recommend to shorten the 1.5-hour process? Use a different method to store the file? Use preprocessors? etc.?
Please note that this process needs to be automated, and I can instruct the company to adopt a new process for accepting large CSV files from their clients.
Is the file compressed? If not, try compressing it before uploading it to the FTP server. The FTP server's network bandwidth can also be a limiting factor. What is the actual size of the file?
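For illustration, a minimal sketch of gzipping the CSV before pushing it to the FTP server (hostname, credentials, and file paths are made up, and you'd want to confirm the extractor on the other end can handle a .gz file):

```python
import gzip
import shutil
from ftplib import FTP

SRC = "clients_export.csv"      # hypothetical local CSV
DST = "clients_export.csv.gz"   # compressed copy to upload

# Compress the CSV; text data like CSV usually shrinks considerably with gzip.
with open(SRC, "rb") as f_in, gzip.open(DST, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# Upload the compressed file over FTP (host and credentials are placeholders).
with FTP("ftp.example.com") as ftp:
    ftp.login(user="username", passwd="password")
    with open(DST, "rb") as f:
        ftp.storbinary(f"STOR {DST}", f)
```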
It isn't a problem per se, but I'm looking to make the process faster.
What percentage of the load time is due to file transfer (from the file location to Keboola) and what is due to loading into the table? It sounds like the load is where the bulk of the time is being spent. If so, are you saying it will be the same regardless of the method (file, DB, etc.)?
I would encourage you to take a look at the job details of the execution. You should see when the file gets picked up versus when it starts to import; that would be the file-transfer time, I guess. Then you can look at the time between the start of the import and when it actually finishes (highlighted green in the job details), and that would be the import time. If you click on the import event (green highlight), you can check the performance of the import, listed as "import duration".
This is helpful, thanks Marcus!
Looks like adding the file name and row number to the file takes 30 min for each of those processors. Creating the table and adding the metadata takes up another 30 min.
So a CSV upload of 90M records should take about 30 min if you keep the default settings.
Yeah, the processors run as separate containers, so it makes sense that they add a lot of extra time. They seem unnecessary if it's just one big file... those processors are normally used when you get a new daily file, so it's easier to track where rows come from. If there is just one file (or a few), it's more obvious where to find the row in file storage.
Hi Leonard,
If it's just an "Occasional", what is the issue with just one time upload of 1.5 hours? 1.5 hours is quite reasonable IMHO for 90M records. What is the real issue with 1.5 hour extractor? I've seen many examples where extractors take longer.
S3 might be faster, though. Your company can use cloudgates.net, which essentially sets up FTP credentials on top of an S3 bucket. The data lands in S3, but to anyone else it can still be accessed via FTP.
I don't know whether it will actually be faster, but it's conceivable. Can't know without testing.
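If you do go the S3 route and the client is willing to change their side, here is a rough sketch of pushing the file straight into the bucket with boto3, bypassing the FTP layer entirely (bucket name, key, and the file name are all placeholders, and I haven't benchmarked this against FTP):

```python
import boto3

# Assumes AWS credentials are already configured
# (environment variables, ~/.aws/credentials, an instance role, etc.).
s3 = boto3.client("s3")

# Bucket and key are placeholders for whatever your cloudgates.net / S3 setup provides.
s3.upload_file(
    Filename="clients_export.csv.gz",
    Bucket="example-client-dropzone",
    Key="incoming/clients_export.csv.gz",
)
```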
Regards,