Chunk-based S3 file processing

Vikas Singh
3 min read · Jan 31, 2022

Serverless has been the buzzword for a long time now. The term originated from the idea that the infrastructure used to run your backend code does not need to be provisioned and managed by you and your team.

In this blog, I will walk through how Step Functions and Lambda can be used to process large files with minimal resources and without timing out.

Lambda is the favorite choice for processing, but it is limited to 10 GB of memory and a maximum run time of 15 minutes.

If we have not designed our system properly, this limitation could be a roadblock. We could also choose to run such processes on Glue Python, which would solve the timeout issue but not the memory issue, and it may be a bit expensive in some cases.

Situation:

Recently I was given a requirement where the source would be sending deltas every 15 minutes. 95–98% of the time these files would be a few MBs, but occasionally they could run into GBs (yes, it’s a crazy source system 😆).

Solution:

We could write the processing in Lambda, but for large files it would fail. We could give the Lambda more memory, but that would be poor resource utilization for the smaller files.

To handle this, I decided to chunk the files into smaller pieces and process them. The Boto3 API documentation is not very helpful in explaining how the get API can be used to read a block of an S3 file: we can pass the offset and the number of bytes to read in the Range parameter.

It follows the HTTP specification. https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
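For example, something along these lines (the bucket and key are placeholders) reads just the first 50 MB of an object:

```python
import boto3

s3 = boto3.client("s3")

resp = s3.get_object(
    Bucket="my-bucket",
    Key="incoming/delta.csv",
    Range="bytes=0-52428799",    # 0 .. 50*1024*1024 - 1, both ends inclusive
)
chunk = resp["Body"].read()      # resp["Body"] is a botocore StreamingBody
print(len(chunk))                # at most 50 MB, less if the object is smaller
```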

Below is the proposed solution flow.

[Image: Proposed Solution]

This API returns StreamingBody, which can be used to stream the chunks.

Below is the high-level approach to accomplish it.

  1. Start the Step Function.
  2. Trigger the Lambda with offset 0 and the other parameters (a Python sketch of the chunk-reading logic in steps 2.1–2.6 follows this list).

2.1. Invoke the Boto3 S3 get API with the Range parameter set to cover bytes 0 through 50*1024*1024 (the chunk size is read from env config). For more detail on Range, refer to the examples in https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.

2.2. Get the StreamingBody object and read data.

2.3. Set has_next to False if fewer bytes than requested were returned (there is nothing further to read).

2.4. Split the chunk into an array of byte strings using the splitlines API.
2.5. Iterate over lines[:-1], skipping the last line as it may have been read only partially.

2.5.1. Decode each line as UTF-8 and execute the business logic.
2.5.2. Update the offset for the next invocation.

2.6. Return the offset and has_next.

3. If has_next is True, trigger the Lambda again with the new offset.

4. Using the Map state of Step Functions, execute the business logic in parallel on all the generated files.
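A rough Python sketch of steps 2.1–2.6, assuming newline-delimited UTF-8 records; the bucket/key arguments, CHUNK_SIZE default and process_record hook are illustrative placeholders:

```python
import os
import boto3

# Chunk size would normally come from env config (step 2.1).
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE_BYTES", 50 * 1024 * 1024))

s3 = boto3.client("s3")

def process_chunk(bucket: str, key: str, offset: int) -> dict:
    # 2.1: read one byte range of the object (both ends of the range are inclusive).
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + CHUNK_SIZE - 1}",
    )
    # 2.2: read the StreamingBody.
    data = resp["Body"].read()

    # 2.3: fewer bytes than requested means the end of the object was reached.
    # (An object whose size is an exact multiple of CHUNK_SIZE would need an extra guard.)
    has_next = len(data) == CHUNK_SIZE

    # 2.4: split the chunk into lines.
    lines = data.splitlines()

    # 2.5: the last line may be cut mid-record, so leave it for the next chunk.
    partial_len = len(lines[-1]) if has_next and lines else 0
    if partial_len:
        lines = lines[:-1]

    for line in lines:
        record = line.decode("utf-8")  # 2.5.1: decode, then run the business logic.
        # process_record(record)       # placeholder for the business logic

    # 2.5.2 / 2.6: the next invocation starts where the partial line began.
    return {"offset": offset + len(data) - partial_len, "has_next": has_next}
```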

Below is the Step Function flow.

[Image: Large file processor]

Let’s jump into the Python implementation. To manage the opening and closing of streams, the S3ChunkReader class will use a Python context manager.
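Roughly, the class only needs to open the ranged stream on __enter__ and close it on __exit__; a minimal sketch (the exact internals may differ from the original gist) looks like this:

```python
import boto3

class S3ChunkReader:
    """Reads one byte range of an S3 object and closes the stream on exit."""

    def __init__(self, bucket: str, key: str, offset: int, chunk_size: int):
        self.bucket = bucket
        self.key = key
        self.offset = offset
        self.chunk_size = chunk_size
        self._body = None

    def __enter__(self):
        s3 = boto3.client("s3")
        response = s3.get_object(
            Bucket=self.bucket,
            Key=self.key,
            Range=f"bytes={self.offset}-{self.offset + self.chunk_size - 1}",
        )
        self._body = response["Body"]  # botocore StreamingBody
        return self._body

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the underlying HTTP stream whether or not an error occurred.
        if self._body is not None:
            self._body.close()
        return False  # do not suppress exceptions
```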

The Lambda will use this class via a with statement. Below is the input JSON for the Lambda.
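Roughly, the event needs to carry the object location, the current offset, the chunk size, and the list of part files generated so far. An illustrative shape, expressed here as a Python dict (the field names are assumptions):

```python
# Illustrative input event; field names and values are placeholders.
event = {
    "bucket": "my-bucket",
    "key": "incoming/delta.csv",
    "offset": 0,
    "chunk_size_bytes": 52428800,   # 50 MB
    "has_next": True,
    "generated_file_list": [],
}
```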

Below is the Lambda. It uses S3ChunkReader and writes each processed chunk as a new file in S3, appending the new file’s key to the generated file list. These files will be used in the next step for parallel processing.
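A sketch of what this splitter Lambda could look like, assuming the S3ChunkReader class and the illustrative event fields above (the part-file naming is also an assumption):

```python
import boto3

# Assumes the S3ChunkReader class sketched earlier is in scope (or importable).
s3 = boto3.client("s3")

def lambda_handler(event, context):
    bucket, key = event["bucket"], event["key"]
    offset, chunk_size = event["offset"], event["chunk_size_bytes"]

    with S3ChunkReader(bucket, key, offset, chunk_size) as body:
        data = body.read()

    has_next = len(data) == chunk_size
    lines = data.splitlines()
    partial_len = len(lines[-1]) if has_next and lines else 0
    if partial_len:
        # Drop the possibly truncated last line; it is re-read in the next chunk.
        lines = lines[:-1]

    # Write this chunk's complete lines out as a separate part file for the Map state.
    part_key = f"{key}.part-{offset}"
    s3.put_object(Bucket=bucket, Key=part_key, Body=b"\n".join(lines))

    event["offset"] = offset + len(data) - partial_len
    event["has_next"] = has_next
    event.setdefault("generated_file_list", []).append(part_key)
    return event
```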

After the file-splitter part of the Step Function completes, the output would look like the following.
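An illustrative final state after the splitter loop finishes (the keys mirror the event above and the values are made up):

```python
splitter_output = {
    "bucket": "my-bucket",
    "key": "incoming/delta.csv",
    "offset": 157286400,
    "has_next": False,
    "generated_file_list": [
        "incoming/delta.csv.part-0",
        "incoming/delta.csv.part-52428800",
        "incoming/delta.csv.part-104857600",
    ],
}
```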

The generated_file_list attribute can be used as the input for the record-processor Map state. The Map state can execute a maximum of 40 concurrent instances of the Lambda (or of whichever state is configured).

Below is the Step Function definition.
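A minimal sketch of such a state machine, written here as a Python dict that mirrors the Amazon States Language JSON (state names, ARNs and paths are placeholders): the splitter Task loops through a Choice state on $.has_next, then a Map state fans out over $.generated_file_list.

```python
import json

definition = {
    "Comment": "Chunk-based large file processor (illustrative sketch)",
    "StartAt": "SplitChunk",
    "States": {
        "SplitChunk": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:function:file-splitter",
            "Next": "MoreChunks",
        },
        "MoreChunks": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.has_next", "BooleanEquals": True, "Next": "SplitChunk"}
            ],
            "Default": "ProcessRecords",
        },
        "ProcessRecords": {
            "Type": "Map",
            "ItemsPath": "$.generated_file_list",
            "MaxConcurrency": 40,
            "Iterator": {
                "StartAt": "ProcessFile",
                "States": {
                    "ProcessFile": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:111111111111:function:record-processor",
                        "End": True,
                    }
                },
            },
            "End": True,
        },
    },
}

# The JSON to paste into the Step Functions console or an IaC template:
print(json.dumps(definition, indent=2))
```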

The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn and relearn — Alvin Toffler

Please share your thoughts on this. Happy learning!
