Open analytics | Cameron Sim
1. Building a scalable analytics platform for
personal financial planning
May 23, 2013 - Open Analytics
Cameron Sim - RoundArchIsobar (www.isobar.com)
Wednesday, May 22, 13
3. LearnVest Inc.
www.learnvest.com
Company
Founded in 2008 by Alexa von Tobel, CEO
50+ People and Growing rapidly
Based in NYC
Platforms
Web & iPhone
Mission Statement
“Aiming to make financial planning as accessible as having a gym membership”
Key Products
Account Aggregation and Management
(Bank, Credit, Loan, Investment, Mortgage)
Original and Syndicated Newsletter Content
Financial Planning
(tiered product offering)
Stack
Operational
Wordpress, Backbone.js, Node.js
Java Spring 3, Redis, Memcached,
MongoDB, ActiveMQ, Nginx, MySQL 5.x
Analytics
MongoDB 2.2.0, Hadoop, Pig, Java 6, Spring 3
pyMongo
Django 1.4
8. High Level Architecture
[Architecture diagram]
Analytics: Services & Event Capture; Aggregation & Indexed Search; Tools & Dashboards
Production: Production Services
Stages: Event Capture; Update User; Run Aggregations; Reports, Stats & Data Science
13. Philosophy For Data Collection
Capture Everything
• User-Driven events over web and mobile
• System-level exceptions
• Everything else
Temporary Data
• Be ‘ok’ with approximate data
• Operational Databases are the system of record
Aggregate events as they come in
• Remove the overhead of basic metrics (counts, sums) on core events
• Group by user unique id and increment counts per event, over time-dimensions (day, week-ending, month, year)
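The incremental counting described above can be sketched as follows. This is a minimal in-memory illustration, not the LearnVest implementation: the class and function names, the key layout, and the choice of Sunday as the week-ending day are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def time_buckets(ts):
    """Return the day, week-ending, month and year keys for a timestamp."""
    day = ts.date()
    # Assumption: a week ends on Sunday (weekday() == 6).
    week_ending = day + timedelta(days=6 - day.weekday())
    return {
        "day": day.isoformat(),
        "week_ending": week_ending.isoformat(),
        "month": day.strftime("%Y-%m"),
        "year": str(day.year),
    }


class EventCounters(object):
    """Hypothetical in-memory stand-in for the aggregate collections."""

    def __init__(self):
        self.counts = Counter()

    def record(self, user_id, event_type, ts):
        # One increment per time dimension, keyed by user and event type.
        for dimension, bucket in time_buckets(ts).items():
            self.counts[(user_id, event_type, dimension, bucket)] += 1


counters = EventCounters()
counters.record("u-123", "user.login", datetime(2013, 5, 22, 9, 30))
counters.record("u-123", "user.login", datetime(2013, 5, 23, 18, 5))
```

Because every event updates all four buckets on arrival, basic counts per user can be read back directly without scanning the raw events.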
14. Philosophy For Data Collection
Logical Separation
Events
• Core use cases (forms, conversion paths)
• UI Actions (button clicks, swipes, views, forms)
• HttpRequest level analysis (user-agent, iOS version upgrades, etc.)
User
• Has a status/rating (Account Creation, Linked Bank Account, Paid Products)
• Source and Conversion Path (how was the user acquired)
• Quantified Actions (User completed x, y, z conversion actions when & how?)
• Social Interactions (Facebook, Twitter)
• Email Interactions (stats & emails for support@learnvest.com)
15. Data Capture
iOS
- (void)sendAnalyticEventType:(NSString *)eventType
                       object:(NSString *)object
                         name:(NSString *)name
                         page:(NSString *)page
                       source:(NSString *)source
{
    // params is the top-level request body; eventData holds the optional detail fields
    NSMutableDictionary *params = [NSMutableDictionary dictionary];
    NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
    if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
    if (object != nil) [eventData setObject:object forKey:@"object"];
    if (name != nil) [eventData setObject:name forKey:@"name"];
    if (page != nil) [eventData setObject:page forKey:@"page"];
    if (source != nil) [eventData setObject:source forKey:@"source"];
    [params setObject:eventData forKey:@"eventData"];
    [[LVNetworkEngine sharedManager] analytics_send:params];
}
16. Data Capture
WEB (JavaScript)
function internalTrackPageView() {
var cookie = {
userContext: jQuery.cookie('UserContextCookie')
};
var trackEvent = {
eventType: "pageView",
eventData: {
page: window.location.pathname + window.location.search
}
};
// AJAX
jQuery.ajax({
url: "/api/track",
type: "POST",
dataType: "json",
data: JSON.stringify(trackEvent),
// Set Request Headers
beforeSend: function (xhr, settings) {
xhr.setRequestHeader('Accept', 'application/json');
xhr.setRequestHeader('User-Context', cookie.userContext);
if(settings.type === 'PUT' || settings.type === 'POST') {
xhr.setRequestHeader('Content-Type', 'application/json');
}
}
});
}
17. Bus Event Packaging
1. Spring 3 RESTful service layer: controller methods define the eventCode via the @Tracking annotation
2. Custom Interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher
3. EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST Service
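The publish/subscribe step above can be sketched in a few lines. This is a synchronous, in-process stand-in for the real queue, written in Python for brevity; EventBus, make_analytics_subscriber and the send callback are hypothetical names, and the payload keys simply mirror those used elsewhere in the deck.

```python
import uuid
from datetime import datetime, timezone


class EventBus(object):
    """Hypothetical bus: fans each published event out to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event_code, event_data):
        for handler in self._subscribers:
            handler(event_code, event_data)


def make_analytics_subscriber(send):
    """Build the subscriber that packages a payload and forwards it via send()."""

    def handle(event_code, event_data):
        payload = {
            "eventId": str(uuid.uuid4()),
            "eventType": event_code,
            "eventData": dict(event_data),
            "eventTime": datetime.now(timezone.utc).isoformat(),
        }
        send(payload)

    return handle


forwarded = []   # stands in for the POST to the analytics service
audit_log = []   # a second, unrelated subscriber
bus = EventBus()
bus.subscribe(make_analytics_subscriber(forwarded.append))
bus.subscribe(lambda code, data: audit_log.append(code))
bus.publish("user.login", {"page": "/login"})
```

Keeping the analytics packaging in its own subscriber means other consumers of the same bus are unaffected if the analytics path changes or fails.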
18. Bus Event Packaging
1) Spring RestController Methods
Interface
@RequestMapping(value = "/user/login", method = RequestMethod.POST,
headers="Accept=application/json")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
HttpServletRequest request);
Concrete/Impl Class
@Override
@Tracking("user.login")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
HttpServletRequest request){
//Implementation
return event;
}
19. Bus Event Packaging
2) Custom Interceptor class extends HandlerInterceptorAdapter
protected void handleTracking(String trackingCode, Map<String, Object> modelMap,
HttpServletRequest request) {
Map<String, Object> responseModel = new HashMap<String, Object>();
// remove non-serializables & copy over data from modelMap
try {
this.eventPublisher.publish(trackingCode, responseModel, request);
} catch (Exception e) {
log.error("Error tracking event '" + trackingCode + "' : "
+ ExceptionUtils.getStackTrace(e));
}
}
20. Bus Event Packaging
3) EventPublisher
public void publish (String eventCode, Map<String,Object> eventData,
HttpServletRequest request) {
Map<String,Object> payload = new HashMap<String,Object>();
String eventId=UUID.randomUUID().toString();
Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);
//Normalize message
payload.put("eventType", eventData.get("eventType"));
payload.put("eventData", eventData.get("eventData"));
payload.put("version", eventData.get("version"));
payload.put("eventId", eventId);
payload.put("eventTime", new Date());
payload.put("request", requestMap);
.
.
.
//Send to the Analytics Service for MongoDB persistence
}
public void sendPost(EventPayload payload){
HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
}
24. Event Data Warehousing
MongoDB Information
• v2.2.0
• 3-node replica-set
• 1 Large (primary), 2x Medium (secondary) AWS Amazon-Linux machines
• Each with single 500GB EBS volumes mounted to /opt/data
MongoDB Config File
dbpath = /opt/data/mongodb/data
rest = true
replSet = voyager
Volumes
~1M events daily on web, ~600K on mobile
2-3 GB per day at start, slowed to ~1GB per day
Currently at 78GB (collecting since August 2012)
Future Scaling Strategy
• Set up a 2nd Replica-Set in a new AWS region
• Not intending to shard - data older than 12 months is archived instead
25. Event Data Warehousing
Approach
1. Persist all events, bucketed by source:-
WEB
MOBILE
2. Persist all events, bucketed by source, event code and time:-
WEB/MOBILE
user.login
time (day, week-ending, month, year)
3. Insert into collection e_web / e_mobile
4. Also insert into daily, weekly and monthly collections for the main payload and the HTTP request payload
• e_web_05232013
• e_web_request_05232013
5. Predictable model for scaling and measuring business growth
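The daily naming scheme shown above (e_web_05232013, e_web_request_05232013) can be derived from the event source and time as in this sketch. The function name and the returned keys are hypothetical; the slides do not show how the weekly and monthly collections are named, so only the core and daily names are covered.

```python
from datetime import datetime


def target_collections(source, event_time):
    """Return the core and daily collection names for one event."""
    base = "e_" + source.lower()
    # MMDDYYYY suffix, matching the e_web_05232013 example
    suffix = event_time.strftime("%m%d%Y")
    return {
        "core": base,
        "daily_payload": "%s_%s" % (base, suffix),
        "daily_request": "%s_request_%s" % (base, suffix),
    }


names = target_collections("WEB", datetime(2013, 5, 23, 14, 0))
```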
27. Event Data Warehousing
Access Pattern
• No reads off primary node, insert only
• Indexes on core collections (e_web and e_mobile) come in under 3GB, so they fit in memory on the 7.5GB Large instance and the 3.75GB Medium instances
• Split datetime in two fields and compound index on date with other fields like eventType and user unique id (user-context)
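A minimal sketch of the split-date idea, assuming field names "datetime", "date", "eventType" and "userContext" and an ISO day string for the date field (the slides do not state the exact names or format). The index is written as a pymongo-style list of (field, direction) pairs.

```python
from datetime import datetime


def with_split_dates(event, event_time):
    """Copy an event and store its timestamp twice: full precision and day only."""
    doc = dict(event)
    doc["datetime"] = event_time                      # exact time, for ordering
    doc["date"] = event_time.strftime("%Y-%m-%d")     # day bucket, for equality lookups
    return doc


# Compound index spec: day first, then the fields most often filtered with it.
# 1 means ascending.
COMPOUND_INDEX = [("date", 1), ("eventType", 1), ("userContext", 1)]

doc = with_split_dates(
    {"eventType": "user.login", "userContext": "u-123"},
    datetime(2013, 5, 22, 9, 30, 15),
)
```

With pymongo, a spec in this shape can be passed to create_index on the collection. Queries for "all logins on a given day" then become equality matches on the date field rather than range scans on the timestamp.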
29. User Data Warehousing
Elastic Search (http://www.elasticsearch.org/)
• Open-source Lucene-based cluster
• Mature query language, accessed via REST API
• Unstructured schema and feature rich
• Strong API support
Configuration
• Single instance for user
• Deployed over 3 EC2 Medium AML instances
• Updated by a Java process checking a Redis cache for uuids
• Accessed by multiple applications for canonical user objects
30. User Data Warehousing
Building the User Object
For each userid in the Redis cache, retrieve the following information:
• ODS Slave (Learnvest data)
• Jotform.com (eform submissions)
• FullSlate.com (calendar appointments)
• Stripe.com (payments)
• Desk.com (emails)
Build a canonical JSON Object and save in the elasticsearch cluster
Map<String, String> source = new HashMap<String, String>();
source.put(...);
// Elasticsearch index names must be lowercase
client.execute(new Index.Builder(source).index("users").build());
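The merge step above can be sketched as one function that calls a fetcher per source and assembles a single JSON-ready document. Everything here is illustrative: the fetcher names, the stub return values and the "errors" field are assumptions, and the real sources (ODS, Jotform, FullSlate, Stripe, Desk) would each need their own client code.

```python
import json


def build_user_document(user_id, fetchers):
    """Merge per-source data for one user into a single canonical dict."""
    doc = {"userId": user_id}
    for name, fetch in fetchers.items():
        try:
            doc[name] = fetch(user_id)
        except Exception as exc:
            # One failing source should not block the rest of the document.
            doc[name] = None
            doc.setdefault("errors", {})[name] = str(exc)
    return doc


def failing_fetch(user_id):
    raise RuntimeError("service unavailable")


# Stub fetchers standing in for the real source systems.
fetchers = {
    "ods": lambda uid: {"status": "Linked Bank Account"},
    "payments": lambda uid: [{"amount": 1900}],
    "emails": failing_fetch,
}

user_doc = build_user_document("u-123", fetchers)
user_json = json.dumps(user_doc, sort_keys=True)
```

The resulting JSON string is what would be indexed into the search cluster, keyed by the user id.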
31. Metrics
Objective
• Show historic and intraday stats on core use cases (logins, conversions)
• Show user funnel rates on conversion pages
• Show general usability - how do users really use the Web and iOS platforms?
Non-Functionals
• Intraday doesn’t need to be “real-time”, polling is good enough for now
• Overnight batch job for historic must scale horizontally
General Implementation Strategy
• Do all heavy lifting & object manipulation in the service; the UI should just display a graph or table
• Modularize the service to be able to regenerate any graphs/tables without a full load
32. Metrics
Java Batch Service
Java Mongo library to query key collections and return user counts and sum of events
DBCursor webUserLogins = c.find(
        new BasicDBObject("date", sdf.format(new Date())));

// Walks the cursor once, totalling the per-document "count" field
private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
    HashMap<String, Object> m = new HashMap<String, Object>();
    int sum = 0;
    int count = 0;
    while (cursor.hasNext()) {
        DBObject obj = cursor.next();
        count++;
        sum += ((Number) obj.get("count")).intValue();
    }
    m.put("sum", sum);
    m.put("count", count);
    // Guard against an empty cursor; the date formatter (sdf) cannot format a number
    m.put("average", count == 0 ? 0.0 : (double) sum / count);
    return m;
}
33. Metrics
Java Batch Service
Use Aggregation Framework where required on core collections (e_web) and external data
//create aggregation objects
DBObject project = new BasicDBObject("$project",
new BasicDBObject("day_value", fields) );
DBObject day_value = new BasicDBObject( "day_value", "$day_value");
DBObject groupFields = new BasicDBObject( "_id", day_value);
//count the documents in each group and store the total as "number"
groupFields.put("number", new BasicDBObject( "$sum", 1));
//create the group
DBObject group = new BasicDBObject("$group", groupFields);
//execute
AggregationOutput output = mycollection.aggregate( project, group );
for(DBObject obj : output.results()){
.
.
}
34. Metrics
Java Batch Service
MongoDB Command Line example on aggregation over a time period, e.g. month
> db.e_web.aggregate(
[
{ $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00")}}},
{ $project : {
day_value : {"day" : { $dayOfMonth : "$created_date" },
"month":{ $month : "$created_date" }}
}},
{ $group : {
_id : {day_value:"$day_value"} ,
number : { $sum : 1 }
} },
{ $sort : { "_id.day_value.month" : -1, "_id.day_value.day" : -1 } }
]
)
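To make the result shape concrete, the same match, group and count logic can be reproduced in plain Python. This sketch is not a replacement for the pipeline: the function name and the four sample timestamps are made up for illustration, and it runs in memory rather than on the server.

```python
from collections import Counter
from datetime import datetime


def count_by_day(created_dates, since):
    """Count events per (month, day) for timestamps strictly after `since`."""
    counts = Counter((d.month, d.day) for d in created_dates if d > since)
    # Newest (month, day) first, mirroring a descending sort on the group key
    return sorted(counts.items(), reverse=True)


sample = [
    datetime(2012, 10, 26, 8, 0),
    datetime(2012, 10, 26, 17, 0),
    datetime(2012, 11, 1, 9, 0),
    datetime(2012, 10, 24, 12, 0),   # before the cutoff, so excluded
]
daily = count_by_day(sample, datetime(2012, 10, 25))
```

Each entry pairs a (month, day) key with the number of events on that day, which is the same information the $group stage emits as _id and number.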
36. Metrics
Django and HighCharts
Extract data (pyMongo)
def getHomeChart(dt_from, dt_to):
    """Called by home method to get latest 30 day numbers"""
    try:
        # MongoClient replaces the deprecated pymongo.Connection
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn['lvanalytics']
        cursor = db.accountmetrics.find(
            {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
        return buildMetricsDict(cursor)
    except Exception as e:
        logger.error(e)
Return the graph object (as a list or a dict of lists) to the view that called the method
    # dt_from and dt_to are the bounds of the 30-day window computed by the view
    pagedata = {}
    pagedata['accountsGraph'] = mongodb_home.getHomeChart(dt_from, dt_to)
    return render_to_response('home.html', {'pagedata': pagedata},
                              context_instance=RequestContext(request))
37. Metrics
Django and HighCharts
Populate the series (JavaScript with Django templating)
seriesOptions[0] = {
id: 'naturalAccounts',
name: "Natural Accounts",
data: [
{% for a in pagedata.metrics.accounts_natural %}
{% if not forloop.first %}, {% endif %}
[Date.UTC({{a.0}}),{{a.1}}]
{% endfor %}
],
tooltip: {
valueDecimals: 2
}
};
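The template above expects each item to carry the Date.UTC arguments in position 0 and the value in position 1. A possible server-side helper for producing such pairs is sketched below; the function name is hypothetical, and the one detail worth noting is that JavaScript months are zero-based, so the Python month is reduced by one.

```python
from datetime import date


def to_highcharts_point(day, value):
    """Return (Date.UTC argument string, value) for one data point."""
    # JavaScript's Date.UTC counts months from 0, Python's date from 1.
    utc_args = "%d, %d, %d" % (day.year, day.month - 1, day.day)
    return (utc_args, value)


point = to_highcharts_point(date(2013, 5, 23), 42)
```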
40. Data Science Tools
IPython Notebook
• Deployed on an EC2 Large AML Medium Instance
• Configured for Python 2.7.3
• Loaded with Matplotlib, PyLab, SciPy, NumPy, pyMongo, MySQL-python
Insights
• Write wrapper methods to access user data
• Accessible to anyone through a browser
• Very effective way to scale quickly with little overhead
Applications
• Decision tree analysis over website and iOS - showed common paths
• Session level analysis on iOS devices
• Multi-page form conversion retention rates
• Quickly conduct segment analysis via a programming API
41. Data Science Tools
PIG
• Executed using ruby scripts
• Pulled data from MongoDB
• Forwarded to AWS EMR cluster for analysis
• MR functions written in Python and occasionally Java
Insights
• Used for ad-hoc analysis involving large datasets
Applications
• Daily, weekly, monthly conversion metrics on page views and forms
• Identified trends in spending over 1M rows
• Used lightly at LearnVest, growing in capability
42. Things that didn’t work
MongoDB Upserts
Quickly becomes read-heavy and slows down the db
MongoDB Aggregation Framework
Fine for ad hoc analysis, but you might be better off establishing a repeatable framework to run MapReduce algorithms
Django-nonrel
Unstable; use Django and configure MongoDB as a datastore only
43. Lessons Learned
• Date Time managed as two fields, Datetime and Date
• Real-time Map-Reduce in pyMongo - too slow, don’t do this.
• Memcached on Django is good enough (at the moment) - use django-celery with RabbitMQ to pre-cache all data after data loading
• HighCharts is buggy - considering D3 & other libraries
• Don’t need to retrieve data directly from MongoDB to Django; perhaps provide all data via a service layer (at the expense of ever-additional features in pyMongo)
• Make better use of EMR upfront if resources are limited and data is vast.