Automating Federal Aviation Administration’s (FAA) System Wide Information Management (SWIM) Data Ingestion and Analysis

Dr. Mehdi Hashemipour, Data Scientist, Bureau of Transportation Statistics
Marcelo Zambrana, Cloud Solutions Architect, Microsoft
Sheila Stewart, Solutions Architect, Databricks

Agenda
▪ SWIM Overview: Mehdi Hashemipour, PhD
▪ Automating Infrastructure Architecture: Marcelo Zambrana
▪ SWIM Data Processing: Sheila Stewart

Objectives and Benefits
Objectives:
Use FAA flight data to build a Commercial Flight Database that
validates airline data and supports the BTS mandate to measure and
report aviation system performance.
Potential Benefits:
▪ Enable timely estimates of enplanements and on-time
performance
▪ Provide a point of validation for airline-submitted data
▪ Expand BTS’s analytical capabilities and breadth of reporting
▪ Support special aviation studies
▪ Provide a source of data for aviation dashboards and other
statistical products
▪ Serve as the aviation component of the Transportation Disruption
and Disaster System

The System Wide Information Management (SWIM)
The SWIM service provides a single point of access to multiple data
services, including airport, flight, aeronautical, and weather data.
STDD Stream:
Access to data from over 200 airports.
Data from over 400 individual systems.

Potential BTS Use Cases for SWIM Data
Airport/Airline Performance
• Airport Time Delays
• Ground Stop History, Status and Impact
• On-time Performance Estimate by Cause
• System Passenger Loading
• Airline Data Quality Assurance Check
• OAG Replacement
Air Cargo Traffic
• Freight Aircraft Location On Ground
• Air Cargo Patterns and Seasonality
• Multi-modal Cargo Movement
Economic Impact of Delays/Diversions/Cancellations
• Planned vs. Actual Flight Path Analysis
• Actual flight path deviations from the “norm”
• Fuel Cost and Ticket Price Correlation
• Financial Impact of Delays
Operational Impact of Delays/Diversions/Cancellations
• Gate availability
• Flight pattern interruption
• Late Arriving flight pattern
• Morning Flight Delay Impact
Passenger Impact of Delays/Diversions/Cancellations
• Re-direct diverted passengers
• Passenger Impact of Cancellations

BTS Conceptual SWIM Architecture
[Diagram: FAA SWIM Data Service feeds (TFM Flight, TFM Flow, TBFM, FDPS, ITWS, …) pass through the FAA NW Gateway to a DOT Virtual Machine in the DOT Cloud Computing Environment at the Bureau of Transportation Statistics (BTS). Stages shown: XML Message Handling (SWIM message service, Kafka, temporary raw XML file storage); Data Transformation and Storage (XML message processing, Data Lake); Data Analytics (Performance, Air Cargo Traffic, Economic Impact, Weather Impact, Ground Movement, Mapping & Animation, …), used by the Data Analyst.]

Initial Goals
▪ Automate as much as possible
  ▪ Infrastructure.
  ▪ Server Configuration.
  ▪ Databricks resources.
▪ Multiple Ingestion Sources
  ▪ SWIM offers multiple types of sources.
  ▪ On-prem data sources.
▪ Security and Scalability
  ▪ Internal traffic only.
  ▪ Multiple environments.

Automating the Environment
Infrastructure
▪ Initial Networking
▪ Security
▪ VMs
▪ Storage
▪ Databricks Workspace
Kafka Cluster
▪ Software Requirements
▪ Solace Connector
▪ SWIM Access configuration
▪ Kafka Configuration
Databricks Cluster
▪ Cluster Creation
▪ Libraries Configuration
▪ Notebooks
▪ Secrets

Terraform
▪ Infrastructure as Code
  ▪ Helps automate infrastructure management.
  ▪ Shows infrastructure changes before they are applied.
  ▪ Lets you build, change, and version infrastructure.
▪ Multi-cloud
  ▪ Common language for different providers.
▪ Feature rich
  ▪ Module Registry.
  ▪ Providers.
  ▪ Workspaces.
  ▪ Variables.
# Project Structure
├── LICENSE
├── README.md
├── main.tf
├── networking.tf
├── outputs.tf
├── security.tf
├── storage.tf
├── variables.tf
├── vm.tf
└── workspace.tf
# Common Commands
terraform fmt
terraform init
terraform validate
terraform plan
terraform apply
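
The common commands above are typically run in that order, with each step gating the next. The following is a minimal Python sketch of that idea, not the authors' tooling. The extra flags (`-check`, `-input=false`, `-out=tfplan`) are assumptions added for non-interactive use, and `run_workflow` and `run_shell` are hypothetical helper names.

```python
import subprocess
from typing import Callable, List, Sequence

# Command order taken from the slide. The extra flags are assumptions for
# non-interactive runs and are not part of the original deck.
TERRAFORM_STEPS: List[List[str]] = [
    ["terraform", "fmt", "-check"],
    ["terraform", "init", "-input=false"],
    ["terraform", "validate"],
    ["terraform", "plan", "-input=false", "-out=tfplan"],
    ["terraform", "apply", "-input=false", "tfplan"],
]


def run_shell(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def run_workflow(runner: Callable[[Sequence[str]], int] = run_shell) -> List[str]:
    """Run each step in order and stop at the first failure.

    Returns the subcommands (fmt, init, ...) that completed successfully.
    """
    completed: List[str] = []
    for argv in TERRAFORM_STEPS:
        if runner(argv) != 0:
            break
        completed.append(argv[1])
    return completed
```

Passing a different `runner` makes the sequence easy to dry-run: for example, `run_workflow(lambda argv: 0)` walks all five steps without invoking Terraform.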

Configuration Management: Ansible/Chef
▪ Consistency
  ▪ No more snowflake servers.
  ▪ Version control of all configurations.
  ▪ Replicated environments.
▪ Scalability
  ▪ Add more SWIM source configurations.
  ▪ Easy to deploy new environments.
▪ Documentation
  ▪ Builds up knowledge.
  ▪ Change history.

Databricks CLI
▪ Easy Interface to the Databricks Platform
  ▪ Open source.
  ▪ Built on top of the Databricks REST API.
  ▪ Allows you to interact with: workspace, clusters, fs,
    groups, jobs, runs, libraries, and secrets.
  ▪ Supports multiple profiles.
▪ Experimental
  ▪ Still under active development.
# Create Databricks Cluster
databricks clusters create --json-file config/cluster.json
# Install Libraries (replace CLUSTER_ID with the target cluster's ID)
databricks libraries install --cluster-id CLUSTER_ID \
  --maven-coordinates com.databricks:spark-xml_2.11:0.9.0
# Import Notebooks (replace USER with the workspace user)
databricks workspace import -l PYTHON -f DBC Notebooks/TFMS.dbc \
  /Users/USER/tfms
# Secret Management
## Create secret scope (same scope name as used by the put commands below)
databricks secrets create-scope --scope bts-swim --initial-manage-principal users
## Create new secret (three alternatives)
### From an inline string value
databricks secrets put --scope bts-swim --key bts-swim-sp --string-value my-value
### From a file
databricks secrets put --scope bts-swim --key bts-swim-sp --binary-file config/SP.txt
### Interactively (opens an editor to enter the value)
databricks secrets put --scope bts-swim --key bts-swim-sp
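
The `clusters create` command above reads its cluster definition from `config/cluster.json`, which the slide does not show. The following is a hedged sketch of how such a file might be generated. The field names follow the Databricks Clusters API, but every value (runtime version, VM size, worker count, auto-termination) is an assumption for illustration, and `build_cluster_config` and `render_cluster_json` are hypothetical helpers.

```python
import json


def build_cluster_config(name: str, workers: int) -> dict:
    """Return a minimal cluster definition for `databricks clusters create`."""
    return {
        "cluster_name": name,
        # Assumed runtime: a Scala 2.11 build, to match spark-xml_2.11.
        "spark_version": "6.4.x-scala2.11",
        # Assumed Azure VM size; choose one available in your region.
        "node_type_id": "Standard_DS3_v2",
        "num_workers": workers,
        # Assumed idle timeout, in minutes, to limit cost.
        "autotermination_minutes": 60,
    }


def render_cluster_json(config: dict) -> str:
    """Serialize the definition as the JSON text expected by --json-file."""
    return json.dumps(config, indent=2) + "\n"
```

Writing the output of `render_cluster_json(build_cluster_config("swim-tfms", 2))` to `config/cluster.json` would produce a file the command above can consume; keeping the definition in code makes it easy to version alongside the Terraform files.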

GitHub – GitHub Actions
▪ Automate from code to Cloud
  ▪ Workflow automation.
  ▪ Any OS, any language, and any cloud.

Lessons Learned
▪ Infrastructure and Configuration as Code
  ▪ Initial setup takes time.
  ▪ Test, fail, and improve faster.
  ▪ Learning curve.
▪ Version Control
  ▪ Easy to review changes.
  ▪ Helps onboard new developers.
▪ Security
  ▪ Internal network only.
  ▪ Limited access.

https://github.com/Chambras/SparkSummit2020

Future State Architecture
SWIM Data Lake Architecture with Streaming SWIM into Databricks
[Diagram: Ingress data sources (streaming SWIM-TFMS and other SWIM topics; batch data from Oracle and Sybase data stores) land in Azure Data Lake Storage. Spark ETL on Databricks Runtime moves data through the SWIM Data Lake layers: Bronze (raw XML data, staging batch data), Silver (parsed XML data, schema validation with spark-xml), Gold (joined and aggregated data), and an optional Summary/Platinum layer (potential further aggregations). The layers feed ad hoc and graph analysis, enrichment operations, predictive analysis and advanced analytics, and Tableau dashboards and apps. Pipeline stages: Ingress Data; ETL, Stream, and Store Data; Build JIT Data Warehouse; Analytics and BI.]
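
The Bronze, Silver, and Gold layers in this architecture can be illustrated with a small plain-Python sketch. This is not the pipeline from the talk, which uses Spark and spark-xml on Databricks. The XML message shape (`flight`, `airport`, `delayMinutes`) is a hypothetical stand-in for real SWIM-TFMS messages, and `to_silver` and `to_gold` are illustrative names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# Bronze: raw XML strings stored exactly as received (hypothetical format).
BRONZE: List[str] = [
    '<flight id="F1"><airport>JFK</airport><delayMinutes>12</delayMinutes></flight>',
    '<flight id="F2"><airport>JFK</airport><delayMinutes>30</delayMinutes></flight>',
    '<flight id="F3"><airport>ORD</airport><delayMinutes>0</delayMinutes></flight>',
    '<flight id="F4"><airport>ORD</airport></flight>',  # required field missing
    '<flight id="F5"><airport>',  # malformed XML
]


def to_silver(raw_messages: List[str]) -> List[dict]:
    """Parse raw XML and keep only records that pass a basic field check."""
    silver = []
    for raw in raw_messages:
        try:
            root = ET.fromstring(raw)
        except ET.ParseError:
            continue  # malformed messages stay in Bronze only
        airport = root.findtext("airport")
        delay = root.findtext("delayMinutes")
        if airport is None or delay is None:
            continue  # incomplete messages are not promoted
        silver.append(
            {"flight_id": root.get("id"), "airport": airport, "delay_minutes": int(delay)}
        )
    return silver


def to_gold(silver: List[dict]) -> Dict[str, float]:
    """Aggregate parsed records into the average delay per airport."""
    delays = defaultdict(list)
    for row in silver:
        delays[row["airport"]].append(row["delay_minutes"])
    return {airport: sum(values) / len(values) for airport, values in delays.items()}
```

Keeping the raw messages untouched in Bronze means a parsing or validation fix can be replayed over the original data without re-ingesting from SWIM.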

Lessons Learned
▪ spark-xml is improving
  ▪ Need to investigate new features for handling
    complex nested XML schemas.
▪ XML Schema Validation
  ▪ Copying the schema to the executors mitigates file I/O
    latency by using memory for fast validation.
▪ XML Schema Inference
  ▪ Batch processing of XML data at an hourly or daily
    periodicity, based on SLAs, allows for more
    accurate inference.
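
The point about schema inference can be illustrated without Spark: a schema inferred from data can only contain the elements that actually appear in the sample, so a larger batch tends to reveal optional elements that a single message omits. The sketch below uses Python's standard library and treats a schema as the set of element and attribute paths. It is an illustration of the idea, not spark-xml's inference algorithm, and the message shapes are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Set


def element_paths(xml_text: str) -> Set[str]:
    """Return every element and attribute path found in one XML message."""
    paths: Set[str] = set()

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}"
        paths.add(path)
        for attr in node.attrib:
            paths.add(f"{path}/@{attr}")
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def infer_schema(batch: Iterable[str]) -> Set[str]:
    """Union the paths seen across a whole batch of messages."""
    schema: Set[str] = set()
    for message in batch:
        schema |= element_paths(message)
    return schema


# Hypothetical messages: the second carries an optional nested element.
MSG_SHORT = '<msg><flight id="F1"><dep>JFK</dep></flight></msg>'
MSG_FULL = '<msg><flight id="F2"><dep>ORD</dep><route><fix>ABC</fix></route></flight></msg>'
```

Here `infer_schema([MSG_SHORT])` has no `/msg/flight/route` path, while `infer_schema([MSG_SHORT, MSG_FULL])` does, which mirrors why inference over an hourly or daily batch is more complete than inference over a handful of streamed messages.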

Next Steps
▪ Validate SWIM data against
data provided by airlines
▪ Deeper dive into predictive
modeling to gather insights on
flight delays and passengers
affected
▪ Open up data pipeline to more
SWIM data feeds

Contact Us
marcelo.zambrana@microsoft.com
@ch4mbr4s
sheila.stewart@databricks.com
m.hashemipour@dot.gov

Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
