System Communication for Transactional events on AWS

Vikas Singh
4 min read · Aug 14, 2022


An enterprise or otherwise complex system is built from multiple subsystems.

The two most common ways for systems to interact with each other are polling and event/push-based communication. Both can be implemented in a variety of ways, including file sharing, APIs, and events/messaging.

This blog covers event-based interactions. If the systems are hosted in the same cloud, the pub-sub approach is the preferred option, and every cloud provider offers several services for it. On AWS, the options include SNS-SQS, Kinesis, and Kafka hosted on ECS.

Depending on the system deployment structure, such as multicloud or cloud and on-premises, API calls or Webhooks may be viable options.

While working on a system design recently, I came across a scenario in which one system was deployed on Azure and the other on AWS.

Brief system introduction: We were implementing a frictionless store to improve the customer experience (very similar to Amazon Go).

The system’s tailored HLD is shown below.

Tailored version of the enterprise design

Brief on the source system (ML Engine): The source system records shopper behaviour using a variety of data sources, such as video feeds and weight sensors. The final event is generated after the ML engine correlates all of this data.

These events are received and processed by the Retail Cloud system. They are shopping events and must be handled in the order in which they occurred.

Ex:

1 — The Customer picked 5 Coca-Colas

2 — The Customer returned 3 Coca-Colas
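To see why the order matters, here is a minimal Python sketch. The `apply_events` function and the `("pick", n)` / `("return", n)` tuples are hypothetical stand-ins for the real events, not the actual Retail Cloud logic.

```python
def apply_events(events):
    """Apply (action, quantity) shopping events in sequence and return the cart count."""
    in_cart = 0
    for action, quantity in events:
        if action == "pick":
            in_cart += quantity
        elif action == "return":
            # A return that arrives before its pick cannot be applied.
            if quantity > in_cart:
                raise ValueError("cannot return more items than are in the cart")
            in_cart -= quantity
        else:
            raise ValueError(f"unknown action: {action}")
    return in_cart


in_order = [("pick", 5), ("return", 3)]
out_of_order = [("return", 3), ("pick", 5)]

print(apply_events(in_order))  # 2 Coca-Colas left in the cart

try:
    apply_events(out_of_order)
except ValueError as err:
    print("out-of-order events rejected:", err)
```

Processed in order, the two events leave two Coca-Colas in the cart. Processed in reverse, the return arrives before there is anything to return.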

Below is the proposed design for integrating these two systems (this flow does not cover the unhappy path).

The source system pushes the data through the AWS HTTP API, which forwards it to a FIFO SQS queue. The payload looks similar to the one shown below.

{
"eventSequenceID":"", # Unique number generated by the source system
"storeId":"",
"shopperNumber":"", # The source system assigns a unique shopperNumber to each customer
.....
.....
}

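eventSequenceID is mapped to the SQS message deduplication ID and storeId to the message group ID. SQS FIFO uses the deduplication ID to discard duplicate sends within its five-minute deduplication interval and the group ID to preserve order within a group; together, these two values give exactly-once, in-order processing. Events for the same store can't be processed concurrently because they are related to customer shopping (i.e. they are transactional).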
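As a rough illustration of this mapping, the sketch below builds the arguments in the shape that boto3's `send_message` call accepts. In the actual design the HTTP API integration does this mapping, not Python code. The `build_send_message_params` helper, the queue URL, and the sample values are assumptions made for the example.

```python
import json


def build_send_message_params(queue_url, event):
    """Map the payload fields onto SQS FIFO send parameters."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(event),
        # Duplicate sends with the same ID are dropped by SQS FIFO.
        "MessageDeduplicationId": str(event["eventSequenceID"]),
        # Messages sharing a group ID are delivered in order.
        "MessageGroupId": str(event["storeId"]),
    }


# Hypothetical queue URL and payload values, for illustration only.
sample_event = {"eventSequenceID": "1001", "storeId": "store-42", "shopperNumber": "7"}
params = build_send_message_params(
    "https://sqs.eu-west-1.amazonaws.com/123456789012/shopping-events.fifo",
    sample_event,
)
print(params["MessageDeduplicationId"], params["MessageGroupId"])
```

With boto3 available, the resulting dictionary could be passed as `boto3.client("sqs").send_message(**params)`.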

One could argue that the groupId should be customerId or shopperNumber instead, since that would allow more parallelism than grouping by store. 🤨

The consumer system identifies customers by the email address they registered with in the app, and two people can use the same app to enter the store at the same time (the app generates QR codes that must be scanned at the entry gate). The source system, however, uses the cameras to generate a unique shopperNumber for each customer.

As a result, the source system may see two distinct customers who are the same registered user in the consumer system.

How these customer IDs are linked/stitched across the two systems is a story for another blog. 😅

Event Flow:

  1. The ML Engine generates an event and sends it to the Retail Cloud HTTP API.
  2. The HTTP API passes the event to the FIFO SQS queue and returns a success response.
  3. SQS delivers the data to Lambda in FIFO order.
  4. Lambda parses the data, adds message attributes, and pushes it to a FIFO SNS topic.
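Steps 3 and 4 could look roughly like the Python handler below. It is a sketch under assumptions, not the actual Lambda from this project. The topic ARN is hypothetical, `storeId` is an assumed choice of message attribute, and the `publish` argument is added only so the handler can run without AWS access. The `Records[].body` event shape and the `publish` parameter names follow the standard SQS trigger and boto3 SNS client.

```python
import json

# Hypothetical FIFO topic ARN, for illustration only.
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:shopping-events.fifo"


def build_publish_params(record_body, topic_arn=TOPIC_ARN):
    """Parse one SQS record body and build the SNS FIFO publish arguments."""
    event = json.loads(record_body)
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(event),
        # Assumed attribute: lets subscribers filter by store.
        "MessageAttributes": {
            "storeId": {"DataType": "String", "StringValue": str(event["storeId"])},
        },
        # Keep the same ordering scope and deduplication key as on the queue.
        "MessageGroupId": str(event["storeId"]),
        "MessageDeduplicationId": str(event["eventSequenceID"]),
    }


def handler(event, context, publish=None):
    """Lambda entry point: forward each SQS record to SNS in the order received."""
    if publish is None:
        import boto3  # imported lazily so the sketch runs without boto3 installed

        publish = boto3.client("sns").publish
    for record in event["Records"]:
        publish(**build_publish_params(record["body"]))
    return {"published": len(event["Records"])}
```

Passing a fake `publish` callable, for example one that appends its keyword arguments to a list, makes it possible to check the mapping locally before deploying.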

This flow was deployed using a SAM template.

Scalability of System:

  • API Gateway: The AWS HTTP API gateway is a serverless, highly scalable service, so it is not a point of concern.
  • FIFO SQS: According to the AWS documentation, SQS FIFO supports up to 3,000 messages per second per API action (SendMessageBatch, ReceiveMessage, or DeleteMessageBatch) with batching, or 300 per second without it. With batching, that is more than enough for the system's needs. How? These are extremely sophisticated stores that require significant investment, so as an upper estimate assume that 25% of the stores in each country are modernized, which gives 500 stores per country. (The blog goes into why the count is per country rather than global later on.)
    The maximum number of concurrent customers in these modest retail establishments would be ten (if they all got lucky at the same time :D).
    Each customer generates an event (product pickup or putback) every five seconds on average.

Putting all the numbers together: (500 * 10) / 5 = 1,000 events per second per country.

  • Lambda: This is a scalable service, but the account's concurrency limit should be set appropriately. It can be raised from the initial 500 once 80% of it is in use.
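The throughput estimate for the FIFO queue above can be checked with a few lines of Python. The inputs are the article's own assumptions (500 stores, ten concurrent customers, one event every five seconds), and the limit is the 3,000 messages per second figure quoted above. The variable names are chosen for this example.

```python
stores_per_country = 500
concurrent_customers_per_store = 10
seconds_between_events = 5           # one event per customer every five seconds
fifo_limit_per_second = 3000         # SQS FIFO figure quoted in the article

events_per_second = (
    stores_per_country * concurrent_customers_per_store / seconds_between_events
)
headroom = fifo_limit_per_second / events_per_second

print(events_per_second)  # 1000.0
print(headroom)           # 3.0
```

Under these assumptions the peak load per country is about a third of the quoted limit.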

Limitation: Currently, the HTTP API gateway integration with SQS does not support passing path parameters. Below is the organization's standard URL format.

ex: HTTPS://{host}/{platform}/{service}/{version}/{country}/{resource}
The country value is therefore not passed to SQS or, as a result, to the downstream microservices.
To address this, each route was assigned its own SQS queue.
A naming convention can identify the country: the country name is appended to the end of the queue name.
The alternative is to create a queue for each country and keep the queue-to-country mappings in a lookup DB or a parameter store.
That looks cleaner, but the lookup adds milliseconds to the Lambda execution time. To save time, the convention-based approach can be chosen over the configuration-based one.
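A minimal sketch of the convention-based approach, assuming queue names such as `shopping-events-in.fifo`, where the country code is the last hyphen-separated part before the mandatory `.fifo` suffix. The queue name, the hyphen separator, and the `country_from_queue_arn` helper are illustrative assumptions. The queue ARN itself is available on each record of an SQS-triggered Lambda as `eventSourceARN`.

```python
def country_from_queue_arn(queue_arn):
    """Derive the country from the queue name at the end of an SQS ARN."""
    # The queue name is the last colon-separated part of the ARN.
    queue_name = queue_arn.rsplit(":", 1)[-1]
    # FIFO queue names always end in ".fifo"; strip it before parsing.
    if queue_name.endswith(".fifo"):
        queue_name = queue_name[: -len(".fifo")]
    # Assumed convention: the country is the last hyphen-separated part.
    return queue_name.rsplit("-", 1)[-1]


# Hypothetical ARN, for illustration only.
arn = "arn:aws:sqs:eu-west-1:123456789012:shopping-events-in.fifo"
print(country_from_queue_arn(arn))  # in
```

Because the country is parsed from a string the Lambda already receives, no extra network call is needed, which is the time saving over a lookup-based configuration.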

That's it for today. Thanks for reading.
