Common Data Sync Strategies for Application Integration
Although data integration has been a common term in the industry for a long time, people still use it interchangeably with application integration. Data integration is a technique for synchronizing information silos, while application integration is a broader term that covers many different middleware techniques. The common belief that data can be integrated easily without prior knowledge, or that integration techniques can be generalized freely, often proves wrong over time. Data carries meaning, and ignoring that meaning can result in a poor or malfunctioning system.
In this article, we will cover how data synchronization techniques help citizen integrators ensure that business applications exchange information correctly and that data written to an application keeps the same meaning when it is read again. There are many considerations when developing an integration between two or more applications. I will go through the pros and cons of each approach to help you decide which is suitable for your case.
In most cases of application integration you might opt for a one-way sync strategy only, but it is still best to understand which approach to take before you create the integration. Applications differ in architecture, APIs, and data formats. Depending on the APIs available, you choose the approach that suits your business case.
For one-way sync we consider three scenarios:
- Record “Flag” and Validate
- Remember “Last Modified Date”
- Change Data Capture
1. Record “Flag” and Validate
In this approach, records are extracted from the source application based on a “flag” value. Once a record has synced successfully to the other application, we update the flag on the source application so that the next synchronization does not capture the same data again. A flag field can take many forms: some integrations use a simple true/false bit field, while some applications already provide a status field. We update this field right after the record has been synced successfully. We may want the default value for a data object to be “Not synced”, or we may define an explicit initialization value. We also need to reset the flag whenever the source data changes again. Hence, if you follow this approach, you might need to develop some logic around the source application to make sure the flag behaves as expected.
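The flag workflow described above can be sketched roughly as follows. This is a minimal in-memory illustration, not a real connector: the `Record` class, the `synced` field, and the `push` callable are all hypothetical names standing in for the source application's record, its flag field, and the call that writes to the destination.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Record:
    id: int
    payload: str
    synced: bool = False  # the "flag"; new records default to "not synced"


def sync_flagged(source: list, push: Callable[[Record], None]) -> list:
    """Push every unsynced record, then set its flag. Returns the synced ids."""
    synced_ids = []
    for record in source:
        if record.synced:
            continue  # already transferred, skip it
        try:
            push(record)  # hypothetical write to the destination application
        except Exception:
            continue  # leave the flag unset so the next run retries this record
        record.synced = True  # flag is updated only after a successful push
        synced_ids.append(record.id)
    return synced_ids


def edit_record(record: Record, payload: str) -> None:
    """A later change to the source data resets the flag."""
    record.payload = payload
    record.synced = False
```

Because the flag is set per record and only after a successful push, a failure on one record does not block the others, and the integration itself keeps no state between runs.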
Pros and Cons
- This approach lets records sync without much dependency on one another. If one record in a group syncs and others fail, the integration can still partially work.
- If the flag field is exposed to the user through the user interface, the user can also trigger sync just by updating the record.
- The integration remains stateless.
- It requires the source application to be updated, and hence you cannot run the same process in parallel.
- If the source application does not support field-level customization or does not expose the field through its API layer, this approach might not be an option.
2. Remember “Last Modified Date”
If you cannot create or alter a field in the application, consider using a timestamp to capture data changes. In this approach, the delta is captured using the last update date. The process records the timestamp of the most recent record it retrieved and stores it in persistent storage, so that the next run filters on the saved time. Generally, this approach is best suited when the API provides a filter criterion for record retrieval. Storing the time from the retrieved data, rather than from a local clock, eliminates any calculation for server time differences.
This approach relies on a timestamp (most probably the last updated time) so that the integration works without major duplicates. You could use the current system time, your server time, or the current time in GMT, but if you can take the time from the retrieved record, do so, because then no data is missed. The most critical part of this approach is handling records that failed and must be resynced.
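A minimal sketch of this timestamp approach might look like the following. The record shape (a dict with a `modified` key), the `push` callable, and the function name are assumptions for illustration. The returned value is what the caller would write to persistent storage between runs.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def sync_since(
    records: list,
    watermark: Optional[datetime],
    push: Callable[[dict], None],
) -> Optional[datetime]:
    """Push records modified after `watermark` and return the new watermark."""
    changed = sorted(
        (r for r in records if watermark is None or r["modified"] > watermark),
        key=lambda r: r["modified"],
    )
    for record in changed:
        try:
            push(record)  # hypothetical write to the destination application
        except Exception:
            break  # stop here so the failed record is picked up again next run
        # Advance using the record's own timestamp, not the local clock.
        watermark = record["modified"]
    return watermark
```

Processing in ascending timestamp order and stopping at the first failure is one simple way to handle errored records: the saved time never moves past a record that was not delivered. Note that the strict `>` comparison would skip a record whose timestamp equals the saved one, which is why timestamp precision matters for this approach.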
Pros and Cons
- The integration design is much simpler because there is no extra step to update anything on the source. However, this approach makes the integration “stateful”, which is sometimes troublesome when you want to load balance integration agents.
- No application-side customization is needed. But because persistent storage is involved, the data persisted within the integration platform needs to be backed up and restored during disaster recovery.
- This approach does not work well if the application cannot filter on date and time down to the millisecond, or when bulk data imports on the application produce the same DateTime on multiple records.
- The process sometimes gets complex when there are multiple parameters involved, like pagination, sort criteria, etc.
3. Change Data Capture
If your API does not expose a DateTime and you cannot update a flag field on each record, the only option left is to record a checksum or hash for each record on the integration platform, so that unchanged records can be identified even though the same data is retrieved on every run. In addition to the hash, it is important to store the record ID, so that modifications to an existing record can be identified.
Generally, this is a rare scenario in modern applications, but you might use it for legacy applications with file system or FTP based integration, or for stock or product catalog updates. This approach is mostly suitable for master data and should not be used for transactional records.
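The hash-per-record idea can be sketched as below. This is only an illustration under stated assumptions: records are plain dicts with an `id` key, SHA-256 over a canonical JSON form is one possible choice of hash, and the `seen` mapping stands in for whatever store the integration platform persists between runs.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hash a record in a key-order-independent way."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(records: list, seen: dict) -> list:
    """Return new or modified records and update `seen` (record id -> hash)."""
    changed = []
    for record in records:
        digest = record_hash(record)
        key = str(record["id"])
        if seen.get(key) != digest:  # unknown id, or same id with new content
            changed.append(record)
            seen[key] = digest
    return changed
```

In a real integration you would probably update the stored hash only after the record has been delivered successfully. The sketch updates it immediately to keep the example short.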
Pros and Cons
- The most important advantage of this approach is that it needs no extra integration step or source-side customization. The integration platform can provide a generalized option to handle this type of integration.
- This approach is highly stateful and might sometimes require the entire record to be stored in the integration platform.
- With growing data size, the checking of each record can be cumbersome and time-consuming.
If you need to define an integration where both applications are updated in parallel, you need approaches suited to two-way synchronization. In practice, two-way sync is most often required for entities like customers, contacts, and vendors, rather than invoices, quotes, and orders. Bidirectional updates are sometimes critical and may require manual conflict management, which in most cases is not recommended.
As two-way sync requires the same recordset to be updated multiple times, there are some challenges that need to be kept in mind before doing the actual integration.
- Conflict Management Technique: As both applications involved in the sync can update the same record, conflicts can arise between them. When the same record or the same data has been updated in both places before the sync starts, it is unclear which version to take and which to reject. In such cases, we mark one application as the Master and the other as the Slave. The integration tries to merge non-conflicting values together, while it automatically rejects the Slave updates wherever they conflict with the Master. When two-way sync is performed, the conflicting data in the Slave is rejected and overwritten with the data from the Master.
- Circular Update Management: When the integration always finds a new update for the same record, a circular update can occur. Suppose an update to a record generates a new modified time for that record. The integration looks for new changes and finds the record again. It then pushes the update to the other application, which changes the modified date there as well. This process continues indefinitely.
To solve these challenges, we can devise two methods:
- Distinctly identify updates from User vs Integration
- Automatically merge data with manual conflict management
Distinctly identify updates from User vs Integration platform
This method enhances the flag-based approach by storing additional data that is set only when a record is added or modified through the integration platform. The platform therefore knows the context of each change, and it ensures the flag is unset only when the data operation is performed from the user interface.
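One way to picture this is the small simulation below. Everything in it is hypothetical: the `App` class, the `needs_sync` flag, and the `INTEGRATION_USER` account name are invented for illustration. The point is only that a write made by the integration does not mark the record as pending, so the update does not bounce back.

```python
INTEGRATION_USER = "integration"  # hypothetical service account name


class App:
    """Toy application that tracks which records still need syncing."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records = {}  # record id -> {"value": ..., "needs_sync": bool}

    def write(self, record_id: int, value: str, actor: str) -> None:
        # Only user edits mark the record as pending; integration writes do not.
        self.records[record_id] = {
            "value": value,
            "needs_sync": actor != INTEGRATION_USER,
        }

    def pending(self) -> list:
        return [rid for rid, rec in self.records.items() if rec["needs_sync"]]


def sync(source: App, target: App) -> list:
    """Copy pending records from source to target, acting as the integration."""
    moved = []
    for rid in source.pending():
        target.write(rid, source.records[rid]["value"], actor=INTEGRATION_USER)
        source.records[rid]["needs_sync"] = False
        moved.append(rid)
    return moved
```

With two `App` instances, a user edit on one side is copied once by `sync(a, b)`, and the following `sync(b, a)` finds nothing to send back, which is what breaks the circular update.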
Pros and Cons
- This approach is best suited for two-way sync and addresses both challenges. It enhances the flag-based approach so that it identifies the user context.
- This approach makes the integration a little more complex, as customizations are needed on both sides.
- It does not incur additional cost, but not all applications support it.
Automatically merge data with Conflict management
Sometimes it is important for the integration platform to decide which data to update and which not. When merging fields, the integration platform chooses the right value based on the record update or on the Master-Slave relationship mentioned earlier. It is important to note that this approach can still produce conflicts, and those conflicts are put into a data bucket for a manual fix.
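A field-level merge of this kind could be sketched as follows. The sketch assumes the platform keeps a `base` snapshot of the record as it was at the last sync, which the text does not state explicitly, and it assumes that a true conflict is resolved in favour of the Master while also being logged to a conflict list for manual review. The function and parameter names are illustrative.

```python
def merge_records(base: dict, master: dict, slave: dict) -> tuple:
    """Merge two edited copies of a record against the last synced version.

    Returns (merged_record, conflicts). `conflicts` plays the role of the
    data bucket that someone reviews manually.
    """
    merged = {}
    conflicts = []
    for field in sorted(set(base) | set(master) | set(slave)):
        b, m, s = base.get(field), master.get(field), slave.get(field)
        if m == s or s == b:
            merged[field] = m  # both sides agree, or only the Master changed
        elif m == b:
            merged[field] = s  # only the Slave changed, so keep its value
        else:
            merged[field] = m  # both changed differently: Master wins
            conflicts.append({"field": field, "master": m, "slave": s})
    return merged, conflicts
```

For example, if the Master changed `phone`, the Slave changed `name`, and both changed `city` to different values, the merged record takes the new phone and the new name, keeps the Master's city, and reports one conflict on `city`.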
Pros and Cons
- Unlike an approach where a user or the integration platform decides each update, this approach can be applied to a broader spectrum of applications and is scalable.
- This approach requires manual intervention in some cases, which might delay the record update of certain conflicting recordsets.
- As conflicts are managed automatically by merging data together, there might be some cases where the merged data is wrong.
In the case of real-time sync, the data is never pulled from the application. Instead, the application pushes it automatically using a webhook component. The request is made directly to the integration platform, which performs the actual integration. The process is called real-time because the data is pushed as soon as it is put into the application. To protect against data loss on failure, many integration providers offer a proxy component that stores the messages in a persistent store before pushing them to the integration platform.
Here the requester is the application that pushes data to the integration platform using the request channel. The message is transformed and a reply is created to indicate whether the process executed successfully.
There are two patterns that one can follow while developing real-time sync:
- Synchronous Pattern
- Asynchronous Pattern: Fire and Forget
The synchronous pattern is based on the Request/Reply model. When an end user makes a change in the application, the data is pushed to a URL. The integration then performs the transformation and sends the data to the destination application. Once the transaction is complete, the integration creates a reply and responds to the requesting channel.
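Stripped of the HTTP layer, the request/reply flow amounts to something like the sketch below. The function name, the `transform` and `deliver` callables, and the shape of the reply are assumptions made for illustration, not any particular platform's API.

```python
from typing import Callable


def handle_sync(
    payload: dict,
    transform: Callable[[dict], dict],
    deliver: Callable[[dict], None],
) -> dict:
    """Process a pushed change inline and reply with the outcome."""
    try:
        deliver(transform(payload))  # the caller waits while this runs
    except Exception as exc:
        return {"status": "failed", "error": str(exc)}
    return {"status": "ok"}
```

The reply is produced only after delivery has finished or failed, which is what lets the source application know the outcome, and also why it has to wait.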
Pros and Cons
- The synchronous pattern is safe and the source application can identify whether the process has completed successfully or not.
- The synchronous pattern responds slowly, so the source application might need to wait longer.
Asynchronous Pattern: Fire and forget
In the asynchronous pattern for real-time processing, a response is sent back immediately along with a newly created correlation ID. With fire and forget, the source application sends a request to the integration provider whenever there is an update. The integration receives the request, validates the data, and responds with Accepted and the generated correlation ID. The data is then processed, and the actual output or response for that integration point is recorded against the correlation ID. The source needs to call a separate API to retrieve the actual response for that request.
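The accept, process, and look-up steps can be sketched in memory as follows. The class and method names are hypothetical, the validation rule (a dict with an `id` key) is a placeholder, and `run_pending` stands in for a background worker that a real platform would run separately.

```python
import queue
import uuid
from typing import Callable


class AsyncIntegration:
    """Toy fire-and-forget endpoint backed by a FIFO queue."""

    def __init__(self, process: Callable[[dict], object]) -> None:
        self._process = process
        self._queue = queue.Queue()  # FIFO, so messages keep arrival order
        self._results = {}  # correlation id -> status record

    def submit(self, payload) -> dict:
        """Validate, enqueue, and reply immediately with a correlation id."""
        if not isinstance(payload, dict) or "id" not in payload:
            return {"status": "rejected"}
        correlation_id = str(uuid.uuid4())
        self._results[correlation_id] = {"status": "pending"}
        self._queue.put((correlation_id, payload))
        return {"status": "accepted", "correlation_id": correlation_id}

    def run_pending(self) -> None:
        """Drain the queue; stands in for a background worker."""
        while not self._queue.empty():
            correlation_id, payload = self._queue.get()
            try:
                result = self._process(payload)
                self._results[correlation_id] = {"status": "done", "result": result}
            except Exception as exc:
                self._results[correlation_id] = {"status": "failed", "error": str(exc)}

    def status(self, correlation_id: str) -> dict:
        """The separate call the source makes to fetch the real outcome."""
        return self._results.get(correlation_id, {"status": "unknown"})
```

The source application calls `submit`, keeps the returned correlation ID, and later calls `status` with it to learn whether processing succeeded.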
Pros and Cons
- The fire and forget pattern is ideal when the source application is under heavy load and does not need to know whether the data was actually processed.
- The fire and forget pattern has the overhead of calling another API after execution to find out whether the data was processed successfully.
- Fire and forget needs a backing queue in the integration platform to ensure the messages are processed in the correct order.