Humans have generated more data in the past two years than the whole previous human history combined. It is estimated that there are 2.5 quintillion bytes of data created each day and it is increasing. Data is commonly shared using large files from source to target applications. These large data files need to be processed quickly and efficiently without errors. Failures in transferring and processing of large files have caused significant information loss to the business, causing outages and delays to mission critical applications.
Large file processing is a common use case across verticals. While processing large files, a number of factors need to be considered.
With all these considerations in mind, we have developed an efficient solution to process large files in Mule 4. Mule 4 allows streaming and processing large files with out of the box capabilities and it is easier than ever!
Consider this use case: A large file (1GB+) with customer data needs to be picked from a file location and sent over to Salesforce Marketing Cloud (SFMC). If one tries to read this file, perform basic transformations and attempt to send it to SFMC without any processing strategy, the Mule application will most definitely run out of memory and crash. In order to achieve this, we must use streaming and design the process in a way to process the data in “chunks” rather than the full file.
For the solution, I have used a simple Scheduler to trigger the flow. I initially had an “On New or Updated File” to trigger the flow, but once I deployed the application to CloudHub, I noticed the trigger would intermittently throw errors. In order to solve this, I added in a Fixed Frequency Scheduler, then listed the available files in the directory I was polling as the first step in the flow. This strategy worked as the flow would only continue if there were files to process.
Once the files were ready to process, I read the files from the directory using a SFTP Read and chose the Repeatable file stream store option:
I have found depending on the use case, both the default Repeatable file store stream, which will stream the file and only keep a customizable chunk in memory and allows you to read the payload multiple times through the processing, and the Non repeatable stream, which will simply stream the file only once for you to do the processing, are the best options. Which option you choose will depend on individual cases. If you have logic prior to processing the file, you may want to use a non repeatable stream to trigger the event, save the file path and read the file when you are ready to do processing on the file data.
Once you are ready to process the file, your file is at this point in a stream. Once you trigger a transformation, the data is loaded in its entirety which can cause major issues trying to transform a large amount of data. There are several options to get around this however, if you must transform the data in its entirety, there is an option to stream the transformation as well, using @StreamCapeable in the transform:
This can bring a difference of being able to process 1GB of data in about a minute. We will discuss optimization a little later.
To then process this transformed data using an HTTP Request connector, depending on what the target system or System API will accept, we may not be able to send the data all at once. If you using the HTTP RQ object, there is an option to stream the data:
In addition, we must process the data in chunks by either using a “For Each” with an optimized collection size OR a Batch Process. I have found both to be useful, however, it depends on the use case. The magic of how fast and how efficiently the data is processed lies in the batch size or collection size.
If you choose to use a Batch Process, I highly recommend putting the transformation inside the batch process if the use case allows it:
At the end, each case is different, depending on the use case, however, streaming in Mule 4 and processing large files has become much easier with out of the box capabilities in most of the connectors provided.