Building a scalable analytics platform for
personal financial planning
May 23, 2013 - Open Analytics
Cameron Sim - Roundarch Isobar (www.isobar.com)
Wednesday, May 22, 13
Agenda
About LearnVest
Architecture
Data Capture
Packaging
Data Warehousing
Metrics
Finishing up
LearnVest Inc.
www.learnvest.com
Company
Founded in 2008 by Alexa von Tobel, CEO
50+ people and growing rapidly
Based in NYC
Platforms
Web & iPhone
Mission Statement
“Aiming to make financial planning as accessible as having a gym membership”
Key Products
Account Aggregation and Management
(Bank, Credit, Loan, Investment, Mortgage)
Original and Syndicated Newsletter Content
Financial Planning
(tiered product offering)
Stack
Operational
WordPress, Backbone.js, Node.js
Java Spring 3, Redis, Memcached,
MongoDB, ActiveMQ, Nginx, MySQL 5.x
Analytics
MongoDB 2.2.0, Hadoop, Pig, Java 6, Spring 3
pyMongo
Django 1.4
LearnVest.com
Web
LearnVest.com
iPhone
Conversion Funnels
Web, iOS, Tele-Sale (scheduled call)
Account Creation
Free Assessment
Paid Product
Component Architecture
Analytics | Production
High Level Architecture
[Diagram] Analytics: Services & Event Capture; Aggregation & Indexed Search; Tools & Dashboards
[Diagram] Production: Production Services
Flow: Event Capture, Update User, Run Aggregations, Reports, Stats & Data Science
Philosophy For Data Collection
Capture Everything
• User-Driven events over web and mobile
• System-level exceptions
• Everything else
Temporary Data
• Be ‘ok’ with approximate data
• Operational Databases are the system of record
Aggregate events as they come in
• Remove the overhead of basic metrics (counts, sums) on core events
• Group by user unique id and increment counts per event, over time dimensions (day, week-ending, month, year)
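The "aggregate events as they come in" idea can be sketched in a few lines of Python. This is only an illustration, not the LearnVest implementation: the function names, the in-memory `Counter`, and the choice of Sunday as the week-ending day are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def time_buckets(ts: datetime) -> dict:
    """Return the day, week-ending, month and year buckets for a timestamp."""
    day = ts.date()
    # Assumption: a week ends on Sunday (weekday() == 6).
    week_ending = day + timedelta(days=6 - day.weekday())
    return {
        "day": day.isoformat(),
        "week_ending": week_ending.isoformat(),
        "month": day.strftime("%Y-%m"),
        "year": str(day.year),
    }


def record_event(counts: Counter, user_id: str, event_code: str, ts: datetime) -> None:
    """Increment one counter per time dimension for this user and event."""
    for dimension, bucket in time_buckets(ts).items():
        counts[(user_id, event_code, dimension, bucket)] += 1


counts = Counter()
record_event(counts, "u1", "user.login", datetime(2013, 5, 22, 9, 30))
record_event(counts, "u1", "user.login", datetime(2013, 5, 23, 18, 0))
```

After the two calls, the day counters hold 1 each while the week-ending, month and year counters hold 2, so basic counts never need a scan over the raw events.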
Philosophy For Data Collection
Logical Separation
Events
• Core use cases (forms, conversion paths)
• UI Actions (button clicks, swipes, views, forms)
• HttpRequest-level analysis (user-agent, iOS version upgrades, etc.)
User
• Has a status/rating (Account Creation, Linked Bank Account, Paid Products)
• Source and Conversion Path (how was the user acquired)
• Quantified Actions (User completed x, y, z conversion actions when & how?)
• Social Interactions (Facebook, Twitter)
• Email Interactions (stats & emails for support@learnvest.com)
Data Capture
iOS (Objective-C)
- (void)sendAnalyticEventType:(NSString *)eventType
                       object:(NSString *)object
                         name:(NSString *)name
                         page:(NSString *)page
                       source:(NSString *)source
{
    // Top-level request parameters
    NSMutableDictionary *params = [NSMutableDictionary dictionary];
    // Event-specific fields, nested under "eventData"
    NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
    if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
    if (object != nil) [eventData setObject:object forKey:@"object"];
    if (name != nil) [eventData setObject:name forKey:@"name"];
    if (page != nil) [eventData setObject:page forKey:@"page"];
    if (source != nil) [eventData setObject:source forKey:@"source"];
    if (eventData != nil) [params setObject:eventData forKey:@"eventData"];
    [[LVNetworkEngine sharedManager] analytics_send:params];
}
Data Capture
Web (JavaScript)
function internalTrackPageView() {
    var cookie = {
        userContext: jQuery.cookie('UserContextCookie')
    };
    var trackEvent = {
        eventType: "pageView",
        eventData: {
            page: window.location.pathname + window.location.search
        }
    };
    // AJAX
    jQuery.ajax({
        url: "/api/track",
        type: "POST",
        dataType: "json",
        data: JSON.stringify(trackEvent),
        // Set Request Headers
        beforeSend: function (xhr, settings) {
            xhr.setRequestHeader('Accept', 'application/json');
            xhr.setRequestHeader('User-Context', cookie.userContext);
            if (settings.type === 'PUT' || settings.type === 'POST') {
                xhr.setRequestHeader('Content-Type', 'application/json');
            }
        }
    });
}
Bus Event Packaging
1. Spring 3 RESTful service layer; controller methods define the eventCode via the @Tracking annotation
2. Custom Interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher
3. EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST Service
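The three steps above can be sketched language-neutrally in Python. This is a simplified, synchronous stand-in for the Spring pieces, not the production code: `EventBus`, `tracking`, and `dispatch` are hypothetical names, and the real system publishes asynchronously to a message queue rather than calling subscribers in-process.

```python
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]


class EventBus:
    """Minimal in-process bus: every subscriber receives every payload."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Payload], None]] = []

    def subscribe(self, handler: Callable[[Payload], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, payload: Payload) -> None:
        for handler in self._subscribers:
            handler(payload)


def tracking(event_code: str) -> Callable:
    """Tag a handler with an event code, in the spirit of @Tracking."""
    def decorate(func: Callable) -> Callable:
        func.tracking_code = event_code
        return func
    return decorate


def dispatch(bus: EventBus, handler: Callable, request: Payload) -> Payload:
    """Run the handler, then publish a tracking event if it is tagged."""
    response = handler(request)
    code = getattr(handler, "tracking_code", None)
    if code is not None:
        try:
            bus.publish({"eventCode": code, "eventData": response})
        except Exception as exc:  # tracking must never break the request
            print(f"Error tracking event {code!r}: {exc}")
    return response


@tracking("user.login")
def user_login(event: Payload) -> Payload:
    return event


bus = EventBus()
forwarded: List[Payload] = []
bus.subscribe(forwarded.append)  # stands in for the analytics forwarder
result = dispatch(bus, user_login, {"eventType": "login"})
```

The point of the design is that controller code only declares an event code; packaging and forwarding live in one place, and a tracking failure is logged rather than surfaced to the caller.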
Bus Event Packaging
1) Spring RestController Methods
Interface
@RequestMapping(value = "/user/login", method = RequestMethod.POST,
        headers = "Accept=application/json")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
        HttpServletRequest request);
Concrete/Impl Class
@Override
@Tracking("user.login")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
        HttpServletRequest request) {
    // Implementation
    return event;
}
Bus Event Packaging
2) Custom Interceptor class extends HandlerInterceptorAdapter
protected void handleTracking(String trackingCode, Map<String, Object> modelMap,
        HttpServletRequest request) {
    Map<String, Object> responseModel = new HashMap<String, Object>();
    // remove non-serializables & copy over data from modelMap
    try {
        this.eventPublisher.publish(trackingCode, responseModel, request);
    } catch (Exception e) {
        // A tracking failure is logged and never propagated to the caller
        log.error("Error tracking event '" + trackingCode + "' : "
                + ExceptionUtils.getStackTrace(e));
    }
}
Bus Event Packaging
3) EventPublisher normalizes the payload and sends it on
public void publish(String eventCode, Map<String, Object> eventData,
        HttpServletRequest request) {
    Map<String, Object> payload = new HashMap<String, Object>();
    String eventId = UUID.randomUUID().toString();
    Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);
    // Normalize message: copy each field from its own key
    payload.put("eventType", eventData.get("eventType"));
    payload.put("eventData", eventData.get("eventData"));
    payload.put("version", eventData.get("version"));
    payload.put("eventId", eventId);
    payload.put("eventTime", new Date());
    payload.put("request", requestMap);
    .
    .
    .
    // Send to the Analytics Service for MongoDB persistence
}
public void sendPost(EventPayload payload) {
    HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
    Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
}
Bus Event Packaging
The Serialized JSON (User Action)
{
    "eventCode" : "user.login",
    "eventType" : "login",
    "version" : "1.0",
    "eventTime" : "1358603157746",
    "eventData" : {
        "" : "",
        "" : "",
        "" : ""
    },
    "request" : {
        "call-source" : "WEB",
        "user-context" : "00002b4f1150249206ac2b692e48ddb3",
        "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
        "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
        "content-length" : "204",
        "accept-encoding" : "gzip,deflate,sdch"
    }
}
Bus Event Packaging
The Serialized JSON (Generic Event)
{
    "eventCode" : "generic.ui",
    "eventType" : "pageView",
    "version" : "1.0",
    "eventTime" : "1358603157746",
    "eventData" : {
        "page" : "/learnvest/moneycenter/inbox",
        "section" : "transactions",
        "name" : "view transactions",
        "object" : "page"
    },
    "request" : {
        "call-source" : "WEB",
        "user-context" : "00002b4f1150249206ac2b692e48ddb3",
        "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
        "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
        "content-length" : "204",
        "accept-encoding" : "gzip,deflate,sdch"
    }
}
Event Data Warehousing
MongoDB Information
• v2.2.0
• 3-node replica-set
• 1 Large (primary), 2x Medium (secondary) AWS Amazon-Linux machines
• Each with single 500GB EBS volumes mounted to /opt/data
MongoDB Config File
dbpath = /opt/data/mongodb/data
rest = true
replSet = voyager
Volumes
~1M events daily on web, ~600K on mobile
2-3 GB per day at start, slowed to ~1GB per day
Currently at 78GB (collecting since August 2012)
Future Scaling Strategy
• Set up a 2nd replica set in a new AWS region
• Not intending to shard; data is archived after 12 months instead
Event Data Warehousing
Approach
1. Persist all events, bucketed by source:
   WEB
   MOBILE
2. Persist all events, bucketed by source, event code and time:
   WEB/MOBILE
   user.login
   time (day, week-ending, month, year)
3. Insert into collection e_web / e_mobile
4. Also insert into daily, weekly and monthly collections for the main payload and the HTTP request payload
   • e_web_05232013
   • e_web_request_05232013
5. Predictable model for scaling and measuring business growth
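A short Python sketch of how one event could be routed to its bucket collections. Only the `e_web` base name and the daily `MMDDYYYY` suffix come from the slide; the weekly and monthly suffix formats, the Sunday week-ending rule, and the function name are assumptions for illustration.

```python
from datetime import date, timedelta


def bucket_collections(source: str, day: date) -> dict:
    """Return the collection names one event would be written to."""
    base = f"e_{source.lower()}"
    # Assumption: weeks end on Sunday; weekly/monthly suffixes are illustrative.
    week_ending = day + timedelta(days=6 - day.weekday())
    return {
        "all": base,
        "daily": f"{base}_{day:%m%d%Y}",
        "daily_request": f"{base}_request_{day:%m%d%Y}",
        "weekly": f"{base}_w_{week_ending:%m%d%Y}",
        "monthly": f"{base}_m_{day:%m%Y}",
    }


names = bucket_collections("WEB", date(2013, 5, 23))
```

Because the names are a pure function of source and date, old buckets can be archived or dropped as whole collections, which is what makes growth predictable.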
Event Data Warehousing
Persist all events
> db.e_web.findOne()
{
    "_id" : ObjectId("50e4a1ab0364f55ed07c2662"),
    "created_datetime" : ISODate("2013-01-02T21:07:55.656Z"),
    "created_date" : ISODate("2013-01-02T00:00:00.000Z"),
    "request" : {
        "content-type" : "application/json",
        "connection" : "keep-alive",
        "accept-language" : "en-US,en;q=0.8",
        "host" : "localhost:8080",
        "call-source" : "WEB",
        "accept" : "*/*",
        "user-context" : "c4ca4238a0b923820dcc509a6f75849b",
        "origin" : "chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm",
        "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
        "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
        "content-length" : "255",
        "accept-encoding" : "gzip,deflate,sdch"
    },
    "eventType" : "flick",
    "eventData" : {
        "object" : "button",
        "name" : "split transaction button",
        "page" : "#inbox/79876/",
        "section" :
Event Data Warehousing
Access Pattern
• No reads off the primary node; insert only
• Indexes on core collections (e_web and e_mobile) come in under 3GB, against 7.5GB of memory on the Large instance and 3.75GB on the Medium instances
• Split datetime into two fields and build compound indexes on the date with other fields such as eventType and user unique id (user-context)
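The two-field date approach can be sketched as follows. The field names `created_datetime` and `created_date` match the stored documents in this deck; the helper name is hypothetical, and the index specs are written in the (field, direction) list form that PyMongo accepts, without actually connecting to a database.

```python
from datetime import datetime


def with_date_fields(event: dict, now: datetime) -> dict:
    """Return a copy of the event with both timestamp fields added."""
    doc = dict(event)
    doc["created_datetime"] = now
    # Midnight-truncated copy, so a whole day can be matched by equality.
    doc["created_date"] = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return doc


# Compound index key specs; 1 means ascending.
INDEX_SPECS = [
    [("request.user-context", 1), ("created_date", 1)],
    [("eventData.name", 1), ("created_date", 1)],
]

doc = with_date_fields({"eventType": "pageView"}, datetime(2013, 1, 2, 21, 7, 55, 656000))
```

Keeping a truncated date alongside the full timestamp lets "all events for user X on day D" be an exact-match lookup on the compound index rather than a range scan over timestamps.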
	 	 	 	
Event Data Warehousing
Indexing Strategy
> db.e_web.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "request.user-context" : 1,
            "created_date" : 1
        },
        "ns" : "moneycenter.e_web",
        "name" : "request.user-context_1_created_date_1"
    },
    {
        "v" : 1,
        "key" : {
            "eventData.name" : 1,
            "created_date" : 1
        },
        "ns" : "moneycenter.e_web",
        "name" : "eventData.name_1_created_date_1"
    }
]
User Data Warehousing
Elasticsearch (http://www.elasticsearch.org/)
• Open-source Lucene cluster
• Mature query language, accessed via REST API
• Unstructured schema and feature-rich
• Strong API support
Configuration
• Single instance for user
• Deployed over 3 EC2 Medium AML instances
• Updated by a Java process checking a Redis cache for uuids
• Accessed by multiple applications for canonical user objects
User Data Warehousing
Building the User Object
For each userid in the Redis cache, retrieve the following information:
• ODS Slave (Learnvest data)
• Jotform.com (eform submissions)
• FullSlate.com (calendar appointments)
• Stripe.com (payments)
• Desk.com (emails)
Build a canonical JSON Object and save in the elasticsearch cluster
Map<String, String> source = new HashMap<String, String>();
source.put(...);
client.execute(new Index.Builder(source).index("Users").build());
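As a rough Python sketch of the "canonical user object" step: each external system contributes one record, and the records are merged under per-source keys before indexing. The fetcher callables and the key names (`ods`, `stripe`, `desk`) are hypothetical stand-ins; the real lookups and document layout are not shown in the deck.

```python
import json


def build_user_object(user_id, sources):
    """Merge per-source records into one canonical, JSON-serializable dict.

    `sources` maps a source name to a callable that takes a user id and
    returns a dict (or None when the source has nothing for that user).
    """
    user = {"userId": user_id}
    for name, fetch in sources.items():
        record = fetch(user_id)
        if record:
            user[name] = record  # namespace each source under its own key
    return user


# Hypothetical stand-ins for the real ODS, Stripe and Desk lookups.
sources = {
    "ods": lambda uid: {"status": "Account Creation"},
    "stripe": lambda uid: {"payments": 0},
    "desk": lambda uid: None,
}
user = build_user_object("c4ca4238a0b923820dcc509a6f75849b", sources)
document = json.dumps(user, sort_keys=True)
```

Namespacing by source keeps field collisions out of the merged object and makes it clear which upstream system to re-query when one part goes stale.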
Metrics
Objective
• Show historic and intraday stats on core use cases (logins, conversions)
• Show user funnel rates on conversion pages
• Show general usability - how do users really use the Web and iOS platforms?
Non-Functionals
• Intraday doesn’t need to be “real-time”, polling is good enough for now
• Overnight batch job for historic must scale horizontally
General Implementation Strategy
• Do all heavy lifting & object manipulation server-side; the UI should just display a graph or table
• Modularize the service to be able to regenerate any graphs/tables without a full load
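A funnel rate in the sense of the objectives above is just the share of users who reach each stage relative to the previous one. The sketch below uses the three conversion stages named earlier in the deck; the function name and the user counts are made up for illustration.

```python
def funnel_rates(stage_counts):
    """Return (stage, users, step conversion %) for each funnel stage.

    stage_counts is an ordered list of (stage name, distinct users) pairs.
    """
    rows = []
    previous = None
    for stage, users in stage_counts:
        if previous is None:
            rate = 100.0  # the first stage is the baseline
        elif previous == 0:
            rate = 0.0  # avoid dividing by zero on an empty stage
        else:
            rate = round(100.0 * users / previous, 2)
        rows.append((stage, users, rate))
        previous = users
    return rows


# Illustrative numbers only, not LearnVest data.
rows = funnel_rates([
    ("Account Creation", 1000),
    ("Free Assessment", 250),
    ("Paid Product", 50),
])
```

Computing the rows in the batch service and storing them ready-made is what lets the UI stay a thin display layer.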
Metrics
Java Batch Service
Java Mongo library to query key collections and return user counts and sum of events
DBCursor webUserLogins = c.find(
        new BasicDBObject("date", sdf.format(new Date())));
private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
    HashMap<String, Object> m = new HashMap<String, Object>();
    int sum = 0;    // total events across all users
    int count = 0;  // number of user rows
    while (cursor.hasNext()) {
        DBObject obj = cursor.next();
        count++;
        sum += (Integer) obj.get("count");
    }
    m.put("sum", sum);
    m.put("count", count);
    // Average events per user to two decimals; 0 when there are no rows
    double average = count == 0 ? 0.0 : (double) sum / count;
    m.put("average", String.format("%.2f", average));
    return m;
}
Metrics
Java Batch Service
Use the Aggregation Framework where required on core collections (e_web) and external data
// create aggregation objects
// fields: the day/month expressions to project (defined elsewhere)
DBObject project = new BasicDBObject("$project",
        new BasicDBObject("day_value", fields));
DBObject day_value = new BasicDBObject("day_value", "$day_value");
// group by day_value
DBObject groupFields = new BasicDBObject("_id", day_value);
// count the documents in each group into "number"
groupFields.put("number", new BasicDBObject("$sum", 1));
// create the group
DBObject group = new BasicDBObject("$group", groupFields);
// execute
AggregationOutput output = mycollection.aggregate(project, group);
for (DBObject obj : output.results()) {
    .
    .
}
Metrics
Java Batch Service
MongoDB command-line example of aggregation over a time period, e.g. a month
> db.e_web.aggregate(
    [
        { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
        { $project : {
            day_value : { "day" : { $dayOfMonth : "$created_date" },
                          "month" : { $month : "$created_date" } }
        } },
        { $group : {
            _id : { day_value : "$day_value" },
            number : { $sum : 1 }
        } },
        { $sort : { "_id.day_value.month" : -1, "_id.day_value.day" : -1 } }
    ]
)
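For readers less familiar with the pipeline stages, the same match, project, group and sort steps can be mimicked in plain Python over a list of dicts. This is only an in-memory illustration of the logic, with a hypothetical function name and sample events; it does not talk to MongoDB.

```python
from collections import Counter
from datetime import datetime


def count_by_day(events, since):
    """Count events per (month, day) for events created after `since`."""
    counts = Counter()
    for event in events:
        created = event["created_date"]
        if created > since:  # $match
            counts[(created.month, created.day)] += 1  # $project + $group
    # $sort: latest (month, day) first
    return sorted(counts.items(), reverse=True)


events = [
    {"created_date": datetime(2012, 10, 24)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 11, 1)},
]
result = count_by_day(events, datetime(2012, 10, 25))
```

Like the pipeline, this groups on month and day only, so a window longer than a year would merge the same calendar day from different years.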
Metrics
Java Batch Service
Persisting events into graph and table collections
>db.homeGraphs.find()
{ "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54,
"accounts_total" : 54, "date" : ISODate("2011-02-06T05:00:00Z"), "linked_rate" :
"12.96", "premium_rate" : "0", "str_date" : "2011,01,06", "upgrade_rate" : "0",
"users_avg_linked" : "3.43", "users_linked" : 7 }
{ "_id" : ObjectId("50f57b5c1d4e714b581674e3"), "accounts_natural" : 144,
"accounts_total" : 144, "date" : ISODate("2011-02-07T05:00:00Z"), "linked_rate" :
"11.11", "premium_rate" : "0", "str_date" : "2011,01,07", "upgrade_rate" : "0",
"users_avg_linked" : "4", "users_linked" : 16 }
{ "_id" : ObjectId("50f57b5c1d4e714b581674e4"), "accounts_natural" : 119,
"accounts_total" : 119, "date" : ISODate("2011-02-08T05:00:00Z"), "linked_rate" :
Metrics
Django and HighCharts
Extract data (pyMongo)
import pymongo

def getHomeChart(dt_from, dt_to):
    """Called by home method to get latest 30 day numbers"""
    try:
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn['lvanalytics']
        cursor = db.accountmetrics.find(
            {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
        return buildMetricsDict(cursor)
    except Exception as e:
        logger.error(e)
Return the graph object (as a list or a dict of lists) to the view that called the method
from datetime import datetime, timedelta

# Inside the view: query the latest 30 days
dt_to = datetime.utcnow()
dt_from = dt_to - timedelta(days=30)
pagedata = {}
pagedata['accountsGraph'] = mongodb_home.getHomeChart(dt_from, dt_to)
return render_to_response('home.html', {'pagedata': pagedata},
                          context_instance=RequestContext(request))
Metrics
Django and HighCharts
Populate the series... (JavaScript with Django templating)
seriesOptions[0] = {
    id: 'naturalAccounts',
    name: "Natural Accounts",
    data: [
        {% for a in pagedata.metrics.accounts_natural %}
            {% if not forloop.first %}, {% endif %}
            [Date.UTC({{a.0}}),{{a.1}}]
        {% endfor %}
    ],
    tooltip: {
        valueDecimals: 2
    }
};
Metrics
Django and HighCharts
And Create the Charts and Tables...
Data Science Tools
IPython Notebook
• Deployed on an EC2 Large AML Medium Instance
• Configured for Python 2.7.3
• Loaded with Matplotlib, PyLab, SciPy, NumPy, pyMongo, MySQL-python
Insights
• Write wrapper methods to access user data
• Accessible to anyone through a browser
• Very effective way to scale quickly with little overhead
Applications
• Decision tree analysis over website and iOS - showed common paths
• Session-level analysis on iOS devices
• Multi-page form conversion retention rates
• Quickly conduct segment analysis via a programming API
Data Science Tools
Pig
• Executed using Ruby scripts
• Pulled data from MongoDB
• Forwarded to AWS EMR cluster for analysis
• MR functions written in Python and occasionally Java
Insights
• Used for ad hoc analysis involving large datasets
Applications
• Daily, weekly, monthly conversion metrics on page views and forms
• Identified trends in spending over 1M rows
• Used lightly at LearnVest, growing in capability
Things that didn’t work
MongoDB Upserts
Quickly become read-heavy and slow down the db
MongoDB Aggregation Framework
Fine for ad hoc analysis, but you might be better off establishing a repeatable framework to run MapReduce algorithms
Django-nonrel
Unstable; use Django and configure MongoDB as a datastore only
Lessons Learned
• Date time managed as two fields, datetime and date
• Real-time Map-Reduce in pyMongo is too slow; don’t do this
• Memcached on Django is good enough (at the moment); use django-celery with RabbitMQ to pre-cache all data after data loading
• HighCharts is buggy; considering D3 & other libraries
• Don’t need to retrieve data directly from MongoDB to Django; perhaps provide all data via a service layer (at the expense of ever-additional features in pyMongo)
• Make better use of EMR upfront if resources are limited and data is vast
Thanks!...Questions?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Open analytics | Cameron Sim

  • 1. Building a scalable analytics platform for personal financial planning May 23, 2013 - Open Analytics Cameron Sim - RoundArchIsobar (www.isobar.com) Wednesday, May 22, 13
  • 2. Agenda: About LearnVest, Architecture, Data Capture, Packaging, Data Warehousing, Metrics, Finishing up
  • 3. LearnVest Inc. (www.learnvest.com)
      Company: Founded in 2008 by Alexa von Tobel, CEO; 50+ people and growing rapidly; based in NYC
      Platforms: Web & iPhone
      Mission Statement: “Aiming to make financial planning as accessible as having a gym membership”
      Key Products: Account Aggregation and Management (Bank, Credit, Loan, Investment, Mortgage); Original and Syndicated Newsletter Content; Financial Planning (tiered product offering)
      Operational Stack: WordPress, Backbone.js, Node.js, Java Spring 3, Redis, Memcached, MongoDB, ActiveMQ, Nginx, MySQL 5.x
      Analytics Stack: MongoDB 2.2.0, Hadoop, Pig, Java 6, Spring 3, pyMongo, Django 1.4
  • 6. Conversion Funnels (diagram): channels are Web, iOS, and Tele-Sale (scheduled call); stages are Account Creation, Free Assessment, Paid Product
  • 8. High Level Architecture (diagram): Production layer (Production Services); Analytics layer (Services & Event Capture; Aggregation & Indexed Search; Tools & Dashboards). Steps: Event Capture, Update User, Run Aggregations, Reports, Stats & Data Science
  • 13. Philosophy For Data Collection
      Capture Everything
        • User-driven events over web and mobile
        • System-level exceptions
        • Everything else
      Temporary Data
        • Be ‘ok’ with approximate data
        • Operational databases are the system of record
      Aggregate events as they come in
        • Remove the overhead of basic metrics (counts, sums) on core events
        • Group by user unique id and increment counts per event, over time dimensions (day, week-ending, month, year)
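The "aggregate events as they come in" idea on slide 13 can be sketched in a few lines. This is a minimal in-memory illustration, not LearnVest's implementation: the names `time_buckets` and `record_event`, the counter key layout, and the choice of Sunday as the week-ending day are all assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def time_buckets(day):
    """Return the four time-dimension keys for a calendar date.

    Assumption: a week ends on Sunday; the slides do not say which day is used.
    """
    week_ending = day + timedelta(days=6 - day.weekday())  # Monday=0 ... Sunday=6
    return {
        "day": day.isoformat(),
        "week_ending": week_ending.isoformat(),
        "month": day.strftime("%Y-%m"),
        "year": day.strftime("%Y"),
    }


def record_event(counters, user_id, event_code, day):
    """Increment one counter per time dimension for this user and event."""
    for dimension, bucket in time_buckets(day).items():
        counters[(user_id, event_code, dimension, bucket)] += 1


counters = Counter()
record_event(counters, "u1", "user.login", date(2013, 5, 22))
record_event(counters, "u1", "user.login", date(2013, 5, 23))
```

With counters kept this way, a basic metric such as "logins this month for user u1" is a single lookup rather than a scan over raw events.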
  • 14. Philosophy For Data Collection: Logical Separation
      Events
        • Core use cases (forms, conversion paths)
        • UI actions (button clicks, swipes, views, forms)
        • HttpRequest-level analysis (user-agent, iOS version upgrades, etc.)
      User
        • Has a status/rating (Account Creation, Linked Bank Account, Paid Products)
        • Source and conversion path (how was the user acquired?)
        • Quantified actions (user completed x, y, z conversion actions; when and how?)
        • Social interactions (Facebook, Twitter)
        • Email interactions (stats & emails for support@learnvest.com)
  • 15. Data Capture: iOS
      - (void)sendAnalyticEventType:(NSString *)eventType
                             object:(NSString *)object
                               name:(NSString *)name
                               page:(NSString *)page
                             source:(NSString *)source
      {
          // Assumption: params is a local dictionary; its declaration is not shown
          NSMutableDictionary *params = [NSMutableDictionary dictionary];
          // Optional event attributes, nested under "eventData"
          NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
          if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
          if (object != nil) [eventData setObject:object forKey:@"object"];
          if (name != nil) [eventData setObject:name forKey:@"name"];
          if (page != nil) [eventData setObject:page forKey:@"page"];
          if (source != nil) [eventData setObject:source forKey:@"source"];
          if (eventData != nil) [params setObject:eventData forKey:@"eventData"];
          [[LVNetworkEngine sharedManager] analytics_send:params];
      }
  • 16. Data Capture: Web (JavaScript)
      function internalTrackPageView() {
          var cookie = {
              userContext: jQuery.cookie('UserContextCookie')
          };
          var trackEvent = {
              eventType: "pageView",
              eventData: {
                  page: window.location.pathname + window.location.search
              }
          };
          // AJAX
          jQuery.ajax({
              url: "/api/track",
              type: "POST",
              dataType: "json",
              data: JSON.stringify(trackEvent),
              // Set request headers
              beforeSend: function (xhr, settings) {
                  xhr.setRequestHeader('Accept', 'application/json');
                  xhr.setRequestHeader('User-Context', cookie.userContext);
                  if (settings.type === 'PUT' || settings.type === 'POST') {
                      xhr.setRequestHeader('Content-Type', 'application/json');
                  }
              }
          });
      }
  • 17. Bus Event Packaging
      1. Spring 3 RESTful service layer; controller methods define the eventCode via the @Tracking annotation
      2. A custom interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher
      3. The EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST service
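The three packaging steps on slide 17 can be mimicked in Python to show the shape of the flow. This is only a sketch under stated assumptions: the `tracking` decorator stands in for both the tracking annotation and the interceptor, `subscribers` is a plain list standing in for the event bus queue, and the call is synchronous, whereas the deck describes an asynchronous hand-off. None of these names come from the original code.

```python
import uuid
from datetime import datetime, timezone

subscribers = []  # hypothetical stand-in for the event bus queue's subscribers


def publish(event_code, event_data, request_headers):
    """Normalize an event into one payload and hand it to every subscriber."""
    payload = {
        "eventCode": event_code,
        "eventData": event_data,
        "eventId": str(uuid.uuid4()),
        "eventTime": datetime.now(timezone.utc),
        "request": dict(request_headers),
    }
    for subscriber in subscribers:
        subscriber(payload)


def tracking(event_code):
    """Decorator playing the role of the tracking annotation plus interceptor."""
    def decorate(handler):
        def wrapper(event, request_headers):
            result = handler(event, request_headers)
            try:
                publish(event_code, result, request_headers)
            except Exception as exc:  # a tracking failure must not break the request
                print("Error tracking event %r: %s" % (event_code, exc))
            return result
        return wrapper
    return decorate


@tracking("user.login")
def user_login(event, request_headers):
    # Business logic would go here; the event is echoed back as the response.
    return event


received = []
subscribers.append(received.append)
response = user_login({"eventType": "login"}, {"call-source": "WEB"})
```

The handler itself stays free of analytics code; only the decorator line ties it to an event code, which mirrors the separation the slide describes.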
  • 18. Bus Event Packaging
      1) Spring RestController methods
      Interface
      @RequestMapping(value = "/user/login", method = RequestMethod.POST,
                      headers = "Accept=application/json")
      public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
                                           HttpServletRequest request);

      Concrete/Impl class
      @Override
      @Tracking("user.login")
      public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
                                           HttpServletRequest request) {
          // Implementation
          return event;
      }
  • 19. Bus Event Packaging
      2) Custom interceptor class extends HandlerInterceptorAdapter
      protected void handleTracking(String trackingCode, Map<String, Object> modelMap,
                                    HttpServletRequest request) {
          Map<String, Object> responseModel = new HashMap<String, Object>();
          // remove non-serializables & copy over data from modelMap
          try {
              this.eventPublisher.publish(trackingCode, responseModel, request);
          } catch (Exception e) {
              // A tracking failure is logged and never propagated to the caller
              log.error("Error tracking event '" + trackingCode + "' : "
                      + ExceptionUtils.getStackTrace(e));
          }
      }
  • 20. Bus Event Packaging
      3) EventPublisher normalizes the event and forwards it to the Analytics REST service
      public void publish(String eventCode, Map<String, Object> eventData,
                          HttpServletRequest request) {
          Map<String, Object> payload = new HashMap<String, Object>();
          String eventId = UUID.randomUUID().toString();
          Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);

          // Normalize message
          // Assumption: each payload field is copied from the matching key of eventData
          payload.put("eventType", eventData.get("eventType"));
          payload.put("eventData", eventData.get("eventData"));
          payload.put("version", eventData.get("version"));
          payload.put("eventId", eventId);
          payload.put("eventTime", new Date());
          payload.put("request", requestMap);
          . . .
          // Send to the Analytics Service for MongoDB persistence
      }

      public void sendPost(EventPayload payload) {
          HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
          Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
      }
  • 21. Bus Event Packaging: The Serialized JSON (User Action)
      {
        "eventCode" : "user.login",
        "eventType" : "login",
        "version" : "1.0",
        "eventTime" : "1358603157746",
        "eventData" : { "" : "", "" : "", "" : "" },
        "request" : {
          "call-source" : "WEB",
          "user-context" : "00002b4f1150249206ac2b692e48ddb3",
          "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
          "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
          "content-length" : "204",
          "accept-encoding" : "gzip,deflate,sdch"
        }
      }
  • 22. Bus Event Packaging: The Serialized JSON (Generic Event)
      {
        "eventCode" : "generic.ui",
        "eventType" : "pageView",
        "version" : "1.0",
        "eventTime" : "1358603157746",
        "eventData" : {
          "page" : "/learnvest/moneycenter/inbox",
          "section" : "transactions",
          "name" : "view transactions",
          "object" : "page"
        },
        "request" : {
          "call-source" : "WEB",
          "user-context" : "00002b4f1150249206ac2b692e48ddb3",
          "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
          "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
          "content-length" : "204",
          "accept-encoding" : "gzip,deflate,sdch"
        }
      }
  • 24. Event Data Warehousing
      MongoDB Information
        • v2.2.0
        • 3-node replica set
        • 1 Large (primary), 2x Medium (secondary) AWS Amazon Linux machines
        • Each with a single 500GB EBS volume mounted to /opt/data
      MongoDB Config File
          dbpath = /opt/data/mongodb/data
          rest = true
          replSet = voyager
      Volumes
        • ~1M events daily on web, ~600K on mobile
        • 2-3 GB per day at start, slowed to ~1GB per day
        • Currently at 78GB (collecting since August 2012)
      Future Scaling Strategy
        • Set up a 2nd replica set in a new AWS region
        • Not intending to shard; data is archived after 12 months instead
  • 25. Event Data Warehousing: Approach
      1. Persist all events, bucketed by source: WEB, MOBILE
      2. Persist all events, bucketed by source, event code and time: WEB/MOBILE, user.login, time (day, week-ending, month, year)
      3. Insert into collection e_web / e_mobile
      4. Also insert into daily, weekly and monthly collections for the main payload and the HTTP request payload
        • e_web_05232013
        • e_web_request_05232013
      5. Predictable model for scaling and measuring business growth
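The daily collection names on slide 25 (e_web_05232013, e_web_request_05232013) follow a source-plus-MMDDYYYY pattern, which a small helper can generate. The function name `daily_collections` is hypothetical, and only the daily pattern is sketched because the deck does not show how weekly and monthly collections are named.

```python
from datetime import datetime


def daily_collections(source, when):
    """Return the daily payload and request collection names for an event."""
    suffix = when.strftime("%m%d%Y")  # e.g. 05232013 for May 23, 2013
    base = "e_" + source.lower()
    return base + "_" + suffix, base + "_request_" + suffix


names = daily_collections("WEB", datetime(2013, 5, 23, 14, 30))
```

Because the name is derived purely from the event's source and date, a writer never needs to look anything up before inserting, and old days can be archived by dropping whole collections.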
  • 26. Event Data Warehousing: Persist all events
      > db.e_web.findOne()
      {
        "_id" : ObjectId("50e4a1ab0364f55ed07c2662"),
        "created_datetime" : ISODate("2013-01-02T21:07:55.656Z"),
        "created_date" : ISODate("2013-01-02T00:00:00.000Z"),
        "request" : {
          "content-type" : "application/json",
          "connection" : "keep-alive",
          "accept-language" : "en-US,en;q=0.8",
          "host" : "localhost:8080",
          "call-source" : "WEB",
          "accept" : "*/*",
          "user-context" : "c4ca4238a0b923820dcc509a6f75849b",
          "origin" : "chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm",
          "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
          "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
          "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
          "content-length" : "255",
          "accept-encoding" : "gzip,deflate,sdch"
        },
        "eventType" : "flick",
        "eventData" : {
          "object" : "button",
          "name" : "split transaction button",
          "page" : "#inbox/79876/",
          "section" :
  • 27. Event Data Warehousing: Access Pattern
        • No reads off the primary node; it is insert only
        • Indexes on core collections (e_web and e_mobile) come in under 3GB, which fits in memory on the 7.5GB Large instance and the 3.75GB Medium instances
        • Split datetime into two fields and build compound indexes on the date with other fields such as eventType and user unique id (user-context)
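Splitting the timestamp into a full datetime and a midnight-truncated date, as slide 27 describes, might look like the sketch below. The helper name `with_date_fields` is an assumption; the field names and the index keys are the ones shown in the deck's own findOne() and getIndexes() output. The index is written in the list-of-(field, direction) form that pymongo's index helpers accept, but no database call is made here.

```python
from datetime import datetime


def with_date_fields(event, now):
    """Return a copy of the event carrying both the timestamp and its date."""
    doc = dict(event)
    doc["created_datetime"] = now
    # Midnight of the same day, so an equality match selects a whole day.
    doc["created_date"] = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return doc


# Compound index keys: user id first, then the day the event was created.
USER_BY_DAY_INDEX = [("request.user-context", 1), ("created_date", 1)]

doc = with_date_fields({"eventType": "flick"}, datetime(2013, 1, 2, 21, 7, 55, 656000))
```

Keeping a separate date field means "all events for user X on day D" is an exact-match query on two indexed fields rather than a range scan over timestamps.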
  • 28. Event Data Warehousing: Indexing Strategy
      > db.e_web.getIndexes()
      [
        {
          "v" : 1,
          "key" : { "request.user-context" : 1, "created_date" : 1 },
          "ns" : "moneycenter.e_web",
          "name" : "request.user-context_1_created_date_1"
        },
        {
          "v" : 1,
          "key" : { "eventData.name" : 1, "created_date" : 1 },
          "ns" : "moneycenter.e_web",
          "name" : "eventData.name_1_created_date_1"
        }
      ]
  • 29. User Data Warehousing
      Elastic Search (http://www.elasticsearch.org/)
        • Open-source, Lucene-based search cluster
        • Mature query language, accessed via a REST API
        • Unstructured schema and feature rich
        • Strong API support
      Configuration
        • Single instance for user
        • Deployed over 3 EC2 Medium AML instances
        • Updated by a Java process checking a Redis cache for uuids
        • Accessed by multiple applications for canonical user objects
  • 30. User Data Warehousing: Building the User Object
      For each userid in the Redis cache, retrieve the following information:
        • ODS Slave (LearnVest data)
        • Jotform.com (eform submissions)
        • FullSlate.com (calendar appointments)
        • Stripe.com (payments)
        • Desk.com (emails)
      Build a canonical JSON object and save it in the Elasticsearch cluster
      Map<String, String> source = new HashMap<String, String>();
      source.put(...);
      client.execute(new Index.Builder(source).index("Users").build());
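The "build a canonical user object" step on slide 30 amounts to merging one record per upstream system under a single user id. The sketch below is illustrative only: `build_user`, the source keys, and the lambda fetchers are hypothetical stand-ins, and no real Jotform, FullSlate, Stripe, or Desk API is called.

```python
def build_user(user_id, sources):
    """Merge per-source records for one user into a single JSON-ready dict."""
    user = {"userId": user_id}
    for name, fetch in sources.items():
        record = fetch(user_id)
        if record is not None:  # skip sources that know nothing about this user
            user[name] = record
    return user


# Hypothetical fetchers; real ones would query each upstream system.
sources = {
    "payments": lambda uid: {"plan": "example"},
    "appointments": lambda uid: None,
}
user = build_user("u1", sources)
```

Keeping each source under its own key makes it straightforward to refresh one section of the document when only that upstream system changes.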
  • 31. Metrics
      Objective
        • Show historic and intraday stats on core use cases (logins, conversions)
        • Show user funnel rates on conversion pages
        • Show general usability: how do users really use the Web and iOS platforms?
      Non-Functionals
        • Intraday doesn’t need to be “real-time”; polling is good enough for now
        • Overnight batch job for historic stats must scale horizontally
      General Implementation Strategy
        • Do all heavy lifting & object manipulation in the service; the UI should just display a graph or table
        • Modularize the service to be able to regenerate any graphs/tables without a full load
  • 32. Metrics: Java Batch Service
      Java Mongo library to query key collections and return user counts and sum of events
      DBCursor webUserLogins = c.find(
          new BasicDBObject("date", sdf.format(new Date())));

      private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
          HashMap<String, Object> m = new HashMap<String, Object>();
          int sum = 0;
          int count = 0;
          while (cursor.hasNext()) {
              DBObject obj = cursor.next();
              count++;
              sum += (Integer) obj.get("count");
          }
          m.put("sum", sum);
          m.put("count", count);
          // Assumption: the average is rendered with two decimals, and as 0.00 for an
          // empty cursor; sdf is a date formatter and cannot format a number
          m.put("average", count == 0 ? "0.00" : String.format("%.2f", (float) sum / count));
          return m;
      }
  • 33. Metrics: Java Batch Service
      Use the Aggregation Framework where required on core collections (e_web) and external data
      // create aggregation objects; "fields" is assumed to be defined elsewhere
      DBObject project = new BasicDBObject("$project",
          new BasicDBObject("day_value", fields));
      DBObject day_value = new BasicDBObject("day_value", "$day_value");
      DBObject groupFields = new BasicDBObject("_id", day_value);
      // add the accumulator: "number" counts the documents in each group
      groupFields.put("number", new BasicDBObject("$sum", 1));
      // create the group stage
      DBObject group = new BasicDBObject("$group", groupFields);
      // execute
      AggregationOutput output = mycollection.aggregate(project, group);
      for (DBObject obj : output.results()) {
          . .
      }
  • 34. Metrics: Java Batch Service
      MongoDB command-line example of aggregation over a time period, e.g. a month
      > db.e_web.aggregate([
          { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
          { $project : { day_value : { "day" : { $dayOfMonth : "$created_date" },
                                       "month" : { $month : "$created_date" } } } },
          { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
          // after $group the key lives under _id, so the sort must address it there
          { $sort : { "_id.day_value" : -1 } }
        ])
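For readers without a MongoDB instance at hand, the match, project, group, and sort stages of the slide 34 pipeline can be traced in plain Python over in-memory documents. This is an illustrative equivalent, not the deck's code: `events_per_day` is a hypothetical name, and results are ordered by (month, day), whereas the shell pipeline orders by its embedded {day, month} document.

```python
from collections import Counter
from datetime import datetime


def events_per_day(events, since):
    """Count events per (month, day) created after `since`, latest first."""
    counts = Counter(
        (e["created_date"].month, e["created_date"].day)  # $project
        for e in events
        if e["created_date"] > since                      # $match with $gt
    )                                                     # $group with $sum: 1
    return sorted(counts.items(), reverse=True)           # $sort descending


events = [
    {"created_date": datetime(2012, 10, 24)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 11, 1)},
]
result = events_per_day(events, datetime(2012, 10, 25))
```

Like the shell example, the grouping key ignores the year, so a window longer than twelve months would merge the same calendar day from different years.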
  • 35. Metrics: Java Batch Service
      Persisting events into graph and table collections
      > db.homeGraphs.find()
      { "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54, "accounts_total" : 54, "date" : ISODate("2011-02-06T05:00:00Z"), "linked_rate" : "12.96", "premium_rate" : "0", "str_date" : "2011,01,06", "upgrade_rate" : "0", "users_avg_linked" : "3.43", "users_linked" : 7 }
      { "_id" : ObjectId("50f57b5c1d4e714b581674e3"), "accounts_natural" : 144, "accounts_total" : 144, "date" : ISODate("2011-02-07T05:00:00Z"), "linked_rate" : "11.11", "premium_rate" : "0", "str_date" : "2011,01,07", "upgrade_rate" : "0", "users_avg_linked" : "4", "users_linked" : 16 }
      { "_id" : ObjectId("50f57b5c1d4e714b581674e4"), "accounts_natural" : 119, "accounts_total" : 119, "date" : ISODate("2011-02-08T05:00:00Z"), "linked_rate" :
  • 36. Metrics: Django and HighCharts
      Extract data (pyMongo)
      def getHomeChart(dt_from, dt_to):
          """Called by home method to get latest 30 day numbers"""
          try:
              conn = pymongo.Connection('localhost', 27017)
              db = conn['lvanalytics']
              cursor = db.accountmetrics.find(
                  {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
              return buildMetricsDict(cursor)
          except Exception as e:
              logger.error(e.message)

      Return the graph object (as a list or a dict of lists) to the view that called the method
      # Assumption: the view requests the latest 30 days
      dt_to = datetime.datetime.utcnow()
      dt_from = dt_to - datetime.timedelta(days=30)
      pagedata = {}
      pagedata['accountsGraph'] = mongodb_home.getHomeChart(dt_from, dt_to)
      return render_to_response('home.html', {'pagedata': pagedata},
                                context_instance=RequestContext(request))
  • 37. Metrics: Django and HighCharts
      Populate the series (JavaScript with Django templating)
      seriesOptions[0] = {
          id: 'naturalAccounts',
          name: "Natural Accounts",
          data: [
              {% for a in pagedata.metrics.accounts_natural %}
                  {% if not forloop.first %}, {% endif %}
                  [Date.UTC({{a.0}}),{{a.1}}]
              {% endfor %}
          ],
          tooltip: {
              valueDecimals: 2
          }
      };
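The template passes `{{a.0}}` straight into JavaScript's `Date.UTC`, which counts months from zero. That would explain why the stored `str_date` on slide 35 reads "2011,01,06" for a document dated 2011-02-06. A helper that produces such strings might look like this; the name `date_utc_args` is hypothetical, and the zero-padded layout is copied from the stored values rather than from any code in the deck.

```python
from datetime import date


def date_utc_args(day):
    """Format a date as the argument list for JavaScript's Date.UTC.

    Date.UTC counts months from 0, so February is rendered as 01.
    """
    return "%04d,%02d,%02d" % (day.year, day.month - 1, day.day)


# One [x, y] pair as consumed by the series template: date arguments, then value.
point = [date_utc_args(date(2011, 2, 6)), 54]
```

Doing the month shift once, in the batch job that writes the chart collection, keeps the template free of date arithmetic.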
  • 38. Metrics: Django and HighCharts. And create the charts and tables...
  • 40. Data Science Tools: IPython Notebook
        • Deployed on an EC2 Large AML Medium Instance
        • Configured for Python 2.7.3
        • Loaded with Matplotlib, PyLab, SciPy, NumPy, pyMongo, MySQL-python
      Insights
        • Write wrapper methods to access user data
        • Accessible to anyone through a browser
        • Very effective way to scale quickly with little overhead
      Applications
        • Decision tree analysis over website and iOS, which showed common paths
        • Session-level analysis on iOS devices
        • Multi-page form conversion retention rates
        • Quickly conduct segment analysis via a programming API
  • 41. Data Science Tools: Pig
        • Executed using Ruby scripts
        • Pulled data from MongoDB
        • Forwarded to an AWS EMR cluster for analysis
        • MR functions written in Python and occasionally Java
      Insights
        • Used for ad-hoc analysis involving large datasets
      Applications
        • Daily, weekly, monthly conversion metrics on page views and forms
        • Identified trends in spending over 1M rows
        • Used lightly at LearnVest, growing in capability
  • 42. Things that didn’t work
      MongoDB Upserts
        • Quickly become read-heavy and slow down the db
      MongoDB Aggregation Framework
        • Fine for ad-hoc analysis, but you might be better off establishing a repeatable framework to run MR algorithms
      Django-nonrel
        • Unstable; use Django and configure MongoDB as a datastore only
  • 43. Lessons Learned
        • Date-time is managed as two fields, Datetime and Date
        • Real-time Map-Reduce in pyMongo is too slow; don’t do this
        • Memcached on Django is good enough (at the moment); use django-celery with RabbitMQ to pre-cache all data after data loading
        • HighCharts is buggy; considering D3 & other libraries
        • Don’t need to retrieve data directly from MongoDB to Django; perhaps provide all data via a service layer (at the expense of ever-additional features in pyMongo)
        • Make better use of EMR upfront if resources are limited and data is vast