This document describes a metadata-driven data loading framework that aims to simplify and optimize the onboarding of data applications at Walmart. The key points are:
1) The framework provides a centralized platform with plug-and-play onboarding capabilities to abstract away the complexities of integrating various data sources, sinks, and processors.
2) It utilizes metadata to configure applications and optimize resource allocation and scheduling based on priority. Connectors provide ready-to-use integrations and custom SQL UDFs allow flexible querying.
3) An orchestrator builds optimized execution plans and schedules application runs, while a scheduler optimizer prioritizes high-priority applications by dequeuing lower-priority jobs if needed.
4. Personalization @Walmart
• Our customers are becoming increasingly omnichannel
• ~220M customers and members visit ~10,500 stores and clubs under 46 banners in 24 countries, plus eCommerce websites, every week
• Billions of product impressions are served every week, generating petabytes of events
• The FE team runs thousands of data applications to generate the features that power personalized recommendations for our customers
[Figure: Walmart General Merchandise + Walmart Grocery, Store Pickup & Delivery + Walmart Stores]
5. Personalization | Data Landscape
[Diagram layers: User Experience & Access Control; Security, Logging, Alerting, Telemetry; Data Engineers, Data Scientists, Data Analysts; Data Apps | Data Loader Platform; Multi-DC and Public Cloud; Streaming | In Memory | NoSQL | Analytical]
6. Challenges
• Data application onboarding requires a lot of manual hand coding, and developers need time to develop, integrate, and test code to solve the underlying complexities
• Building a functionality-rich application requires integration with various big data technologies and a wide array of data sources, sinks, and data processors
• Resource allocation and usage are difficult to control and to review retrospectively
• Competing high- and low-priority applications introduce latency to the serving layers
7. Challenges | New App Onboarding | Cumbersome & Fragile
[Diagram: each data app (Data App 1 through Data App N) separately repeats the same steps: integrate the source system, integrate the target system, develop the processor, implement security, enable telemetry, allocate resources, then test and deploy.]
8. Data Loader Simplifies the onboarding
[Diagram: each data app (Data App 1 through Data App N) is only configured, then tested and deployed. The Data Loader Platform, an abstract layer equipped with standard parsers and connectors, covers the source system, target system, processor, security, telemetry, and resources for all apps.]
9. Solution Approach
• A centralized, metadata-driven data loading platform with plug-and-play onboarding capability
• An abstraction layer for building workflow orchestration, which simplifies complex service integrations and shortens time to deployment
• A compelling UI that increases developer productivity by providing ready-to-use connectors for configuring the business logic
• An intelligent system that provides optimized recommendations based on previous runs
• A smart run schedule pool that enqueues and dequeues run instances based on priority
12. Connectors
• The framework can parse and handle data formats such as JSON, Avro, Parquet, and CSV
• Users can pick existing connectors that support different source and target systems, such as Kafka, Cassandra, and BQ
• Metadata stores the system- and application-specific resource configuration used to optimize resource allocation
• The abstraction layer is bundled with custom UDFs that let users query systems such as Kafka and Cassandra with SQL
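To make the metadata-driven idea concrete, here is a minimal sketch of how an application might be described as configuration and checked against the available connectors and parsers. The format and system names come from the slide; the `AppConfig` fields, the `validate` helper, the priority convention, and the sample app are hypothetical and are not the framework's actual schema.

```python
from dataclasses import dataclass, field

# Formats and systems named on the slide; the registry layout itself is hypothetical.
SUPPORTED_FORMATS = {"json", "avro", "parquet", "csv"}
SUPPORTED_SYSTEMS = {"kafka", "cassandra", "bq"}


@dataclass
class AppConfig:
    """Hypothetical metadata record describing one data application."""
    name: str
    source: str
    target: str
    data_format: str
    priority: int = 5                              # assumed: lower number = more critical
    resources: dict = field(default_factory=dict)  # e.g. executor counts


def validate(config):
    """Return a list of problems; an empty list means the app can be onboarded."""
    problems = []
    if config.source not in SUPPORTED_SYSTEMS:
        problems.append(f"no source connector for {config.source!r}")
    if config.target not in SUPPORTED_SYSTEMS:
        problems.append(f"no target connector for {config.target!r}")
    if config.data_format not in SUPPORTED_FORMATS:
        problems.append(f"no parser for format {config.data_format!r}")
    return problems


# Hypothetical application: onboarding is a config entry rather than new code.
app = AppConfig(name="clickstream_features", source="kafka", target="cassandra",
                data_format="avro", priority=1, resources={"executors": 4})
print(validate(app))
```

In this sketch, onboarding a new application means adding one more `AppConfig` record; a config that names an unsupported system or format is rejected before any run is scheduled.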
13. Sample Domain API call in SQL UDF
• Accessing a new domain API normally requires a lot of engineering effort to integrate it into a data application
• Wrapping domain APIs in UDFs makes them callable from a parallel compute engine such as Spark, which accepts UDFs in SQL
spark.sql("select getAccountStatus('cust_id:xxxxxxxxx') as is_active from table limit 1").show(false)
+------------------------------+
|is_active                     |
+------------------------------+
|Y                             |
+------------------------------+
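The same pattern, a domain lookup exposed to SQL as a function, can be tried without a Spark cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark, so it illustrates the idea rather than the framework's own UDF registration. The `get_account_status` function, its in-memory lookup table, and the `customers` table are hypothetical; a real UDF would call the domain API instead.

```python
import sqlite3

# Hypothetical stand-in for a domain API client; a real one would call a remote service.
_ACCOUNT_STATUS = {"cust_id:1001": "Y", "cust_id:1002": "N"}


def get_account_status(customer_key):
    """Return 'Y' or 'N' for a known customer key, or None if the key is unknown."""
    return _ACCOUNT_STATUS.get(customer_key)


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it by name with one argument.
conn.create_function("getAccountStatus", 1, get_account_status)

conn.execute("CREATE TABLE customers (customer_key TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("cust_id:1001",), ("cust_id:1002",)])

rows = conn.execute(
    "SELECT customer_key, getAccountStatus(customer_key) AS is_active "
    "FROM customers ORDER BY customer_key"
).fetchall()
print(rows)
```

Once the function is registered, anyone who can write SQL can use the domain lookup without knowing how the underlying call is made.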
14. Orchestrator
• Builds the optimized execution plan based on the application configs from the metadata store
• Generates the run instances based on app priority and source systems
• Executors pick up the optimized execution plan at execution time
[Diagram components: Metadata Store; Orchestrator (Read App Config, Job Optimizer, Generate Run Instance, Run Scheduler); Executors]
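The orchestrator flow described on this slide (read app configs, build a plan, generate run instances by priority and source system) can be sketched as follows. Everything here is illustrative: the contents of `METADATA_STORE`, the plan steps, and the ordering rule (lower priority number first, then source system name) are assumptions, not the framework's actual logic.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical metadata store contents: app name -> config.
METADATA_STORE = {
    "reco_features": {"priority": 1, "source": "kafka", "executors": 8},
    "weekly_report": {"priority": 3, "source": "bq", "executors": 2},
    "cart_features": {"priority": 1, "source": "cassandra", "executors": 4},
}


@dataclass(frozen=True)
class RunInstance:
    run_id: int
    app: str
    priority: int
    source: str
    plan: tuple  # ordered step names an executor would follow


def build_plan(config):
    """Toy 'job optimizer': derive an ordered list of steps from the app config."""
    return (f"read:{config['source']}", f"process:x{config['executors']}", "write")


def generate_run_instances(store):
    """Create one run instance per app, ordered by priority, then source system."""
    ids = count(1)
    ordered = sorted(store.items(),
                     key=lambda item: (item[1]["priority"], item[1]["source"]))
    return [RunInstance(next(ids), app, cfg["priority"], cfg["source"], build_plan(cfg))
            for app, cfg in ordered]


runs = generate_run_instances(METADATA_STORE)
for run in runs:
    print(run.run_id, run.app, run.plan)
```

The executors in this sketch would only consume the `plan` tuple, so optimization decisions stay in one place instead of being repeated in every application.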
15. Schedule Optimizer
• Smart priority groups are assigned to each loader for all applications, based on criticality
• Top-priority jobs take precedence over already scheduled lower-priority jobs by dequeuing them
• Lower-priority jobs resume automatically once all top-priority and SLA-bound jobs are complete
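The dequeue-and-resume behavior above can be illustrated with a small priority pool. This is a simplified sketch under stated assumptions: the pool has a fixed number of slots, a lower priority number means a more critical job, and a parked job resumes as soon as a slot frees up and no more critical job is waiting. The `SchedulePool` class and the job names are hypothetical.

```python
import heapq


class SchedulePool:
    """Toy run pool with fixed slots; lower priority number = more critical (assumed)."""

    def __init__(self, slots):
        self.slots = slots
        self.running = {}   # job name -> priority
        self.waiting = []   # min-heap of (priority, sequence, job name)
        self._seq = 0       # tie-breaker that keeps submission order

    def _enqueue(self, name, priority):
        heapq.heappush(self.waiting, (priority, self._seq, name))
        self._seq += 1

    def submit(self, name, priority):
        if len(self.running) < self.slots:
            self.running[name] = priority
            return
        # Pool is full: find the least critical running job.
        victim = max(self.running, key=self.running.get)
        if self.running[victim] > priority:
            # Dequeue the lower-priority job and park it for later resumption.
            self._enqueue(victim, self.running.pop(victim))
            self.running[name] = priority
        else:
            self._enqueue(name, priority)

    def complete(self, name):
        del self.running[name]
        if self.waiting:
            # Resume the most critical waiting job in the freed slot.
            priority, _, resumed = heapq.heappop(self.waiting)
            self.running[resumed] = priority


pool = SchedulePool(slots=1)
pool.submit("nightly_backfill", priority=5)
pool.submit("sla_features", priority=1)   # takes the slot from nightly_backfill
print(sorted(pool.running))
pool.complete("sla_features")             # nightly_backfill resumes
print(sorted(pool.running))
```

A heap keeps the waiting jobs ordered by priority, so resumption always picks the most critical parked job first.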
17. Telemetry
• Real-time dashboards that provide run-time statistics for each application
• Drill-down views for exploring various metrics in depth
• An alerting and notification mechanism that informs app owners of errors and faults
• A consolidated view of all applications with their success/failure ratios
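As a rough illustration of the consolidated success/failure view, the sketch below aggregates a run log per application and flags apps whose success rate falls below a threshold. The run log, the `summarize` and `apps_to_alert` helpers, and the 0.5 threshold are all hypothetical; the slides do not describe how the real telemetry is computed.

```python
from collections import Counter, defaultdict

# Hypothetical run log: (application name, run status).
RUN_LOG = [
    ("reco_features", "success"),
    ("reco_features", "success"),
    ("reco_features", "failure"),
    ("weekly_report", "success"),
    ("cart_features", "failure"),
]


def summarize(run_log):
    """Return {app: {'success': n, 'failure': n, 'success_rate': float}}."""
    counts = defaultdict(Counter)
    for app, status in run_log:
        counts[app][status] += 1
    summary = {}
    for app, c in counts.items():
        total = c["success"] + c["failure"]
        summary[app] = {
            "success": c["success"],
            "failure": c["failure"],
            "success_rate": c["success"] / total if total else 0.0,
        }
    return summary


def apps_to_alert(summary, min_success_rate=0.5):
    """Names of apps whose success rate is below the (assumed) alert threshold."""
    return sorted(app for app, s in summary.items()
                  if s["success_rate"] < min_success_rate)


summary = summarize(RUN_LOG)
print(summary["reco_features"])
print(apps_to_alert(summary))
```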
18. Putting the pieces together
[Diagram labels: Self Service; Metadata Store; Multiple Execution Engines; E2E App Life Cycle Management; Multiple Source & Target Systems; Telemetry; Version Control & CI/CD; Cloud Native; Plug & Play; Low or No Code]
19. Outcome
• Turnaround time reduced from a few months to weeks
• Developer productivity expected to increase severalfold
• Non-engineering teams can also use the framework to build functional applications with basic knowledge of SQL
• Intelligent app execution based on app priority, favoring SLA-bound applications over non-SLA applications