1. Google Cloud for Data Crunchers
Patrick Chanezon, Developer Advocate, Cloud
@chanezon, chanezon@google.com
Rajdeep Dua, Developer Advocate, Cloud and Android
@rajdeepdua, rajdeep@google.com
Google Developer Day 2010
2. Agenda
• Google App Engine
• Google Storage for Developers
• Prediction API
• BigQuery
• Google SQL Service
• Google Fusion Tables
• Google Refine
5. Cloud Computing Defined
SaaS
PaaS
IaaS
Source: Gartner AADI Summit Dec 2009
10. Google's Cloud Offerings
SaaS
1. Google Apps
2. Third party Apps: Google Apps Marketplace
3. Your Apps
PaaS
• Google App Engine
IaaS
• Google Storage
• Prediction API
• BigQuery
11. Google App Engine
- Easy to build
- Easy to maintain
- Easy to scale
12. Cloud development in a box
• SDK & “The Cloud”
• Hardware
• Networking
• Operating system
• Application runtime
o Java, Python
• Static file serving
• Services
• Fault tolerance
• Load balancing
13. App Engine Services
Memcache Datastore URL Fetch
Mail XMPP Task Queue
Images Blobstore User Service
14. Always free to get started
~5M pageviews/month
• 6.5 CPU hrs/day
• 1 GB storage
• 650K URL Fetch calls/day
• 2,000 recipients emailed
• 1 GB/day bandwidth
• 100,000 tasks enqueued
• 650K XMPP messages/day
16. Google App Engine for Business
Same scalable cloud hosting platform. Designed for the enterprise.
• Enterprise application management
– Centralized domain console
• Enterprise reliability and support
– 99.9% Service Level Agreement
– Premium Developer Support
• Hosted SQL
– Managed relational SQL database in the cloud
• SSL on your domain
– Including "naked" domain support
• Secure by default
– Integrated Single Sign On (SSO)
• Pricing that makes sense
– Pay only for what you use
* Hosted SQL and SSL on your domain available later this year
17. App Engine for Data Crunchers
• High Performance Image Serving
• OpenID/OAuth integration
• Increased quotas
• > 1k entities per query
• 10’’ task queues
• Async UrlFetch
• Mapper API (Reduce coming soon)
• Channel API
• Matcher API
18. Mapper API
• First component of App Engine’s MapReduce toolkit
• Large scale data manipulation
• Examples include:
• Report generation
• Computing statistics and metrics …
• Python Example:
• http://blog.notdot.net/2010/05/Exploring-the-new-mapper-API
• Java Example:
• http://ikaisays.com/2010/07/09/using-the-java-mapper-framework-for-app-engine/
19. Channel API
• Allows for Server Push (Comet) to browser
• Blog post announcement:
• http://googleappengine.blogspot.com/2010/05/app-engine-at-google-io-2010.html
• External coverage:
• Sneak Peak from an early trusted tester
• http://bitshaq.com/2010/09/01/sneak-peak-gae-channel-api/
• Demo code for Dance Dance Robot available here:
• http://code.google.com/p/dance-dance-robot/
• Also see: https://groups.google.com/group/google-appengine-java/browse_thread/thread/6fa09953ffae2cd3/c1db7de5fdb82b65?pli=1#
20. Matcher API
• Allows an app to register a set of queries to match against a stream of documents
• Trusted Testers, Python only
• Group post announcement:
• http://groups.google.com/group/google-appengine/msg/40021537e2e58962
• Docs:
• http://code.google.com/p/google-app-engine-samples/wiki/AppEngineMatcherService
• Demo code:
• http://code.google.com/p/google-app-engine-samples/source/browse/#svn/trunk/matcher-sample
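The idea of registering queries and matching them against incoming documents can be illustrated with a toy in-memory sketch. This is not the App Engine Matcher API: the class name, method names, and the "all words present" matching rule are assumptions made only for illustration.

```python
class QueryMatcher:
    """Toy in-memory matcher (hypothetical; not the App Engine Matcher API)."""

    def __init__(self):
        # Maps a subscription id to the set of words a document must contain.
        self._subscriptions = {}

    def subscribe(self, subscription_id, required_words):
        """Register a query: a document matches if it contains every word."""
        self._subscriptions[subscription_id] = {w.lower() for w in required_words}

    def match(self, document):
        """Return the ids of all registered queries that the document satisfies."""
        words = set(document.lower().split())
        return sorted(
            sid for sid, required in self._subscriptions.items()
            if required <= words
        )


matcher = QueryMatcher()
matcher.subscribe("storage-news", ["google", "storage"])
matcher.subscribe("ml-news", ["prediction"])
print(matcher.match("Google Storage for Developers"))
```

The point of the pattern is that queries are registered once, up front, and each new document in the stream is tested against all of them.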
21. Google Storage for Developers
Store your data in Google's cloud
22. What Is Google Storage?
• Store your data in Google's cloud
o any format, any amount, any time
• You control access to your data
o private, shared, or public
• Access via Google APIs or 3rd party tools/libraries
23. Google Storage Technical Details
RESTful API
• Verbs: GET, PUT, POST, HEAD, DELETE
• Resources: identified by URI, like:
http://commondatastorage.googleapis.com/bucket/object
• Compatible with S3
Buckets
• Flat containers (no bucket hierarchy)
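As a minimal sketch of the URI scheme above, the helper below assembles an object URI from a bucket and an object name. The host comes from the slide; the function name and the choice to percent-encode the object name while keeping "/" are assumptions.

```python
from urllib.parse import quote

# Host taken from the slide; everything else here is illustrative.
STORAGE_HOST = "commondatastorage.googleapis.com"


def object_uri(bucket, object_name):
    """Build the URI of an object stored in a (flat) bucket."""
    # Percent-encode the object name as UTF-8. "/" is left alone because
    # object names may contain it even though buckets cannot nest.
    encoded_name = quote(object_name, safe="/")
    return "http://%s/%s/%s" % (STORAGE_HOST, bucket, encoded_name)


print(object_uri("bucket1", "tables/babynames/2009.csv"))
```

A GET on such a URI would download the object and a PUT would upload it, subject to the access controls on the bucket and object.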
24. Google Storage : Concepts
Buckets: Basic containers that hold your data; buckets cannot nest
Objects: Individual pieces of data: object data and object metadata
Namespace: Single namespace across Google Storage
Hierarchy: Flat hierarchy
Object Names: Any combination of Unicode characters (UTF-8 encoded), less than 1024 bytes in length
Bucket Names: More restrictive than object names; unique; conform to DNS naming conventions
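The object-name rule is about encoded bytes, not characters, which a short check makes concrete. This is a sketch: the function name is hypothetical, and rejecting empty names is an assumption that the slide does not state.

```python
def is_valid_object_name(name):
    """Return True if the name is under 1024 bytes once UTF-8 encoded."""
    encoded = name.encode("utf-8")
    # Assumption: an empty name is rejected; the slide only gives the upper bound.
    return 0 < len(encoded) < 1024


print(is_valid_object_name("tables/babynames/2009.csv"))
# 600 two-byte characters encode to 1200 bytes, which is over the limit.
print(is_valid_object_name("\u00e9" * 600))
```

Bucket names follow stricter, DNS-style rules and are not covered by this sketch.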
25. Google Storage Use Cases
Use Case                               HTTP Verb
Create a Bucket                        PUT
Change ACLs of a Bucket                PUT
Upload an Object                       PUT
Change ACLs of an Object               PUT
List contents of a bucket or ACLs      GET
Download an object or its ACLs         GET
Delete an Object                       DELETE
Delete an empty Bucket                 DELETE
Upload an Object using HTML form       POST
List the metadata of an Object         HEAD
26. Performance and Scalability
Object types and size
• Objects of any type and 100GB+ / Object
• Unlimited numbers of objects, 1000s of buckets
• Range-get support for data retrieval
Replication
• All data replicated to multiple US data centers
• Leveraging Google's worldwide network for data delivery
Consistency
• “Read-your-writes” data consistency
27. Security and Privacy Features
Authenticated downloads from a web browser
• Sharing with individuals
• Group sharing via Google Groups
• Sharing with Google Apps domains
Permissions set on Buckets or Objects
• READ (an object, or list a bucket’s contents)
• WRITE (applicable to buckets, allows upload/delete/etc)
• FULL_CONTROL (read/write ACLs on objects or buckets)
28. Tools
Google Storage Manager
gsutil
29. Google Storage Benefits
High Performance and Scalability
Backed by Google infrastructure
Strong Security and Privacy
Control access to your data
Easy to Use
Get started fast with Google & 3rd party tools
31. Google Storage usage within Google
• Google BigQuery
• Google Prediction API
• Haiti Relief Imagery
• USPTO data
• Partner Reporting
32. Google Storage - Availability
Limited preview in US* currently
• 100GB free storage and network per account
• Sign up for wait list at
• http://code.google.com/apis/storage/
* Non-US preview available on case-by-case basis
34. Introducing the Google Prediction API
• Google's sophisticated machine learning technology
• Available as an on-demand RESTful HTTP web service
35. A virtually endless number of applications...
• Customer Sentiment
• Transaction Risk
• Species Identification
• Message Routing
• Diagnostics
• Churn Prediction
• Legal Docket Classification
• Suspicious Activity
• Work Roster Assignment
• Inappropriate Content
• Recommend Products
• Political Bias
• Uplift Marketing
• Email Filtering
• Career Counseling
... and many more ...
36. How does it work?
1. TRAIN
The Prediction API finds relevant features in the sample data during training.
"english" The quick brown fox jumped over the lazy dog.
"english" To err is human, but to really foul things up you need a computer.
"spanish" No hay mal que por bien no venga.
"spanish" La tercera es la vencida.
2. PREDICT
The Prediction API later searches for those features during prediction.
? To be or not to be, that is the question.
? La fe mueve montañas.
38. A Prediction API Example
Automatically determine application recommendations
• Goal: Increase relevancy on the Apps Marketplace via
recommendations
• Customers: Businesses of various sizes and industries
using Google Apps around the world
• Data: Sampling of previous installs of applications
• Outcome: Predict applications which would be
appropriate for a new customer visiting the site
39. Using the Prediction API
A simple three-step process...
1. Upload: Upload your training data to Google Storage
2. Train: Build a model from your data
3. Predict: Make new predictions
40. Step 1: Upload
Upload your training data to Google Storage
• Training data: outputs and input features
• Data format: comma separated value format (CSV), result in first column
"SlideRocket","EDUCATION","us","en","10","5"
"MailChimp","BUSINESS","us","en","7","0"
"MailChimp","STANDARD","se","sv","1","0"
"Smartsheet","BUSINESS","us","en","13","4"
Upload to Google Storage
gsutil cp installs gs://appdata/
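A training file in the format described on this slide could be produced with Python's standard csv module. This is a sketch: the function name is hypothetical, and quoting every cell simply mirrors the sample rows shown above.

```python
import csv
import io


def training_csv(rows):
    """Serialize (label, features) pairs as CSV with the label in the first column."""
    buffer = io.StringIO()
    # QUOTE_ALL reproduces the fully quoted style of the sample rows.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for label, features in rows:
        writer.writerow([label] + list(features))
    return buffer.getvalue()


rows = [
    ("SlideRocket", ["EDUCATION", "us", "en", "10", "5"]),
    ("MailChimp", ["BUSINESS", "us", "en", "7", "0"]),
]
print(training_csv(rows), end="")
```

The resulting text would be saved to a file (named installs in the slide) and then copied to the bucket with gsutil.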
41. Step 2: Train
Create a new model by training on data
To train a model:
POST prediction/v1.1/training?data=appdata%2Finstalls
Training runs asynchronously. To see if it has finished:
GET prediction/v1.1/training/appdata%2Finstalls
{"data":{
"data":"appdata/installs",
"modelinfo":"estimated accuracy: 0.xx"}}
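Note that the bucket/object pair is percent-encoded in both training requests, so the "/" becomes %2F. A small sketch of how those two request paths could be built; the function name is hypothetical, and the host and authentication are left out because the slide does not show them.

```python
from urllib.parse import quote


def training_paths(bucket, obj):
    """Return the (POST, GET) request paths for training on bucket/obj."""
    # safe="" forces "/" to be encoded as %2F, as in the slide.
    data = quote("%s/%s" % (bucket, obj), safe="")
    start_training = "prediction/v1.1/training?data=" + data
    check_status = "prediction/v1.1/training/" + data
    return start_training, check_status


post_path, get_path = training_paths("appdata", "installs")
print("POST", post_path)
print("GET", get_path)
```

Because training is asynchronous, a client would repeat the GET until the reply reports model information.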
42. Step 3: Predict
Apply the trained model to make predictions on new data
POST prediction/v1.1/query/appdata%2Finstalls/predict
{ "data":{
"input": { "mixture" : [
"EDUCATION","us","en","10","0" ]}}}
{ "data" : {
"kind" : "prediction#output",
"outputLabel":"Manymoon",
"outputMulti" :[
{"label":"OffiSync", "score": x.xx},
{"label":"Zoho CRM", "score": x.xx},
{"label":"MailChimp", "score": x.xx}]}}
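The request body and the reply are both JSON, so a client mostly needs to serialize the input features and read back the predicted label. The sketch below assumes the shapes shown on this slide; the function names are hypothetical, and the sample reply is trimmed to the fields the code reads.

```python
import json
from urllib.parse import quote


def predict_request(bucket, obj, features):
    """Return the (path, body) of a prediction request for the given features."""
    model = quote("%s/%s" % (bucket, obj), safe="")
    path = "prediction/v1.1/query/%s/predict" % model
    body = json.dumps({"data": {"input": {"mixture": list(features)}}})
    return path, body


def output_label(reply_text):
    """Extract the most likely label from a prediction reply."""
    return json.loads(reply_text)["data"]["outputLabel"]


path, body = predict_request("appdata", "installs",
                             ["EDUCATION", "us", "en", "10", "0"])
print(path)
print(body)

# Trimmed-down reply for illustration only.
sample_reply = '{"data": {"kind": "prediction#output", "outputLabel": "Manymoon"}}'
print(output_label(sample_reply))
```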
46. Demo Screenshots
Predicting apps for a small business
48. Prediction API Capabilities
Data
• Input Features: numeric or unstructured text
• Output: up to hundreds of discrete categories, or
continuous values
Training
• Many machine learning techniques
• Automatically selected
• Performed asynchronously
Access from many platforms:
• Web app from Google App Engine
• Apps Script (e.g. from Google Spreadsheet)
• Desktop app
49. Prediction API - Pricing
Free Quota in trial/development
• 100 predictions/day, 5MB trained/day
• Available for 6 months
Paid Usage
• $10/month per project includes 10,000 predictions
• Additional predictions are $0.50 per 1,000
• Absolute limit of 60,000 predictions per day
• $0.002 per MB trained (max size per dataset is 100MB)
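The paid-usage figures above translate into a simple monthly estimate. This sketch uses only the prices on this slide; the function name is hypothetical, and billing the overage pro rata per prediction (rather than per started block of 1,000) is an assumption.

```python
BASE_FEE = 10.00               # dollars per month per project
INCLUDED_PREDICTIONS = 10000   # predictions covered by the base fee
PRICE_PER_1000_EXTRA = 0.50    # dollars per 1,000 additional predictions
PRICE_PER_MB_TRAINED = 0.002   # dollars per MB of training data


def monthly_cost(predictions, mb_trained):
    """Estimate one project's monthly bill in dollars."""
    extra = max(0, predictions - INCLUDED_PREDICTIONS)
    # Assumption: overage is charged pro rata, not rounded up to 1,000.
    cost = (BASE_FEE
            + extra * PRICE_PER_1000_EXTRA / 1000
            + mb_trained * PRICE_PER_MB_TRAINED)
    return round(cost, 2)


# 25,000 predictions and 100 MB trained: 10 + 7.50 + 0.20
print(monthly_cost(25000, 100))
```

The daily cap of 60,000 predictions and the 100MB dataset limit are not enforced here.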
50. Prediction API - Availability
Limited preview in US* currently
• Sign up for wait list at
• http://code.google.com/apis/predict/
* Non-US preview available on case-by-case basis
52. Introducing Google BigQuery
• Google's technology for ad hoc analysis of large data sets
• Analyze massive amounts of data in seconds
• Simple SQL-like query language
• Flexible access
• REST APIs, JSON-RPC, Google Apps Script
54. Many Use Cases ...
• Interactive Tools
• Spam
• Trends Detection
• Web Dashboards
• Network Optimization
55. Key Capabilities of BigQuery
• Scalable: Billions of rows
• Fast: Response in seconds
• Simple: Queries in SQL
• Web Service
o REST
o JSON-RPC
o Google Apps Script
56. Components of BigQuery
• bq tool; client libraries (Java, Python, PHP)
• REST, JSON-RPC
• BigQuery Service
• Big Storage
57. Using BigQuery
Another simple three-step process...
1. Upload: Upload your raw data to Google Storage
2. Import: Import raw data into BigQuery table
3. Query: Perform SQL queries on table
58. Big Query : Create Data File
• Data file is in the CSV format
• Format: CSV [http://tools.ietf.org/html/rfc4180]
• Encoding: UTF-8
• No header row allowed
• Newlines not supported in quoted strings
• Max row size: 64K
• Max cell size: 64K
• Max file size: 1GB.
• Supported cell data formats:
◦ string – UTF-8 encoded string up to 64K of data (as opposed to 64K characters).
◦ integer – IEEE 64-bit signed integers (-2^63 to 2^63-1)
Sample data:
Isabella,F,22067
Emma,F,17716
Olivia,F,17246
Sophia,F,16743
Ava,F,15730
Emily,F,15204
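Several of these limits can be checked locally before uploading. The sketch below is only a pre-flight check with hypothetical names: it reads "64K" as 64 KiB (an assumption), treats an odd number of quote characters on a line as a sign of a newline inside a quoted string, and cannot tell a header row from data.

```python
import csv

MAX_ROW_BYTES = 64 * 1024  # "64K" from the slide, read here as 64 KiB


def check_rows(text, expected_columns):
    """Yield (line_number, problem) for lines that break the stated limits."""
    for number, line in enumerate(text.split("\n"), start=1):
        if not line:
            continue  # ignore blank lines such as the one after a final newline
        if len(line.encode("utf-8")) > MAX_ROW_BYTES:
            yield number, "row exceeds 64K"
            continue
        if line.count('"') % 2:
            # A quoted string that spans lines leaves an odd number of quotes.
            yield number, "unbalanced quote (newline inside quoted string?)"
            continue
        cells = next(csv.reader([line]))
        if len(cells) != expected_columns:
            yield number, "expected %d cells, found %d" % (expected_columns, len(cells))


sample = "Isabella,F,22067\nEmma,F,17716\n"
print(list(check_rows(sample, 3)))
```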
59. Big Query : Upload your Data
$ ./gsutil cp yob2009.txt gs://bucket1/tables/babynames/2009.csv
• gsutil: tool compatible with Google Storage
• yob2009.txt: file containing data to be uploaded
• gs://bucket1/tables/babynames/2009.csv: destination bucket for the data
• Data to be uploaded into a single/multiple Big Storage Bucket/s
• Use REST endpoints directly or the tool shipped with Google Storage
60. Big Query : Table Creation
$ cat baby_schema
[
{ "id": "name", "type": "string", "mode": "REQUIRED" },
{ "id": "gender", "type": "string", "mode": "REQUIRED" },
{ "id": "count", "type": "integer", "mode": "REQUIRED" }
]
• Define a schema
• id : The string name of the field. Field names are any combination of uppercase
and/or lowercase letters (A-Z, a-z), digits (0-9) and underscores. The first
character must be a letter.
• type : The data type of this field. Supported values: string, integer, float, or
boolean
• mode : Optional property, specifying whether the cell can be null or not.
Supported values: NULLABLE or REQUIRED. Default value is NULLABLE.
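The schema rules above are mechanical enough to check before creating a table. A minimal sketch, assuming the schema file is a JSON list as shown; the function name and error messages are hypothetical.

```python
import json
import re

# First character a letter, then letters, digits, or underscores.
FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
TYPES = {"string", "integer", "float", "boolean"}
MODES = {"NULLABLE", "REQUIRED"}


def schema_errors(schema_text):
    """Return a list of problems found in a JSON schema definition."""
    errors = []
    for field in json.loads(schema_text):
        name = field.get("id", "")
        if not FIELD_NAME.fullmatch(name):
            errors.append("bad field name: %r" % name)
        if field.get("type") not in TYPES:
            errors.append("bad type for %s: %r" % (name, field.get("type")))
        # mode is optional and defaults to NULLABLE.
        if field.get("mode", "NULLABLE") not in MODES:
            errors.append("bad mode for %s: %r" % (name, field.get("mode")))
    return errors


baby_schema = """[
  {"id": "name", "type": "string", "mode": "REQUIRED"},
  {"id": "gender", "type": "string", "mode": "REQUIRED"},
  {"id": "count", "type": "integer", "mode": "REQUIRED"}
]"""
print(schema_errors(baby_schema))
```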
61. Big Query : Table Creation
Create the table in BigQuery (arguments: table name, schema file)
$ bq create bucket1/tables/babynames/tblNames baby_schema
{
"kind": "bigquery#table",
"name": "bucket1/tables/babynames/tblNames"
}
Import the data into the table (arguments: table name, data)
$ bq import bucket1/tables/babynames/tblNames bucket1/tables/babynames/2009.csv
{
"table": "bucket1/tables/babynames/tblNames",
"kind": "bigquery#import_id",
"import": "d0cf328ed7d9bb46"
}
62. Big Query : Query the table
Query:
$ bq query "SELECT name,count FROM [bucket1/tables/babynames/tblNames] WHERE gender = 'F' ORDER BY count DESC LIMIT 5";
Result:
--------------
name COUNT
-------- -----
Isabella 22067
Emma 17716
Olivia 17246
Sophia 16743
Ava 15730
--------------
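To try the shape of this query without BigQuery access, the same filter, sort, and limit can be run locally. In this sketch Python's built-in sqlite3 stands in for BigQuery, the table is called names instead of the bracketed BigQuery table name, and the six rows are the sample rows from the data file slide.

```python
import sqlite3

ROWS = [
    ("Isabella", "F", 22067),
    ("Emma", "F", 17716),
    ("Olivia", "F", 17246),
    ("Sophia", "F", 16743),
    ("Ava", "F", 15730),
    ("Emily", "F", 15204),
]


def top_names(rows, gender, limit):
    """Return the most frequent (name, count) pairs for one gender."""
    connection = sqlite3.connect(":memory:")
    # "count" is quoted so it is read as a column name, not the function.
    connection.execute('CREATE TABLE names (name TEXT, gender TEXT, "count" INTEGER)')
    connection.executemany("INSERT INTO names VALUES (?, ?, ?)", rows)
    cursor = connection.execute(
        'SELECT name, "count" FROM names WHERE gender = ? '
        'ORDER BY "count" DESC LIMIT ?',
        (gender, limit),
    )
    result = cursor.fetchall()
    connection.close()
    return result


for name, count in top_names(ROWS, "F", 5):
    print(name, count)
```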
63. Writing Queries
Compact subset of SQL
o SELECT ... FROM ...
WHERE ...
GROUP BY ... ORDER BY ...
LIMIT ...;
Common functions
o Math, String, Time, ...
Additional statistical approximations
o TOP
o COUNT DISTINCT
64. BigQuery via REST
GET /bigquery/v1/tables/{table name}
GET /bigquery/v1/query?q={query}
Sample JSON Reply:
{
"results": {
"fields": { [
{"id":"COUNT(*)","type":"uint64"}, ... ]
},
"rows": [
{"f":[{"v":"2949"}, ...]},
{"f":[{"v":"5387"}, ...]}, ... ]
}
}
Also supports JSON-RPC
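A client of the REST interface would URL-encode the query and then pull the cell values out of the reply. The sketch below assumes a cleaned-up reply in which "fields" is a plain list and nothing is elided; the function names are hypothetical and no network call is made.

```python
import json
from urllib.parse import urlencode


def query_path(sql):
    """Return the request path for running a query over REST."""
    return "/bigquery/v1/query?" + urlencode({"q": sql})


def rows_from_reply(reply_text):
    """Flatten the rows of a reply into lists of cell values."""
    results = json.loads(reply_text)["results"]
    return [[cell["v"] for cell in row["f"]] for row in results["rows"]]


# Assumed reply shape, modelled on the sample above with the elisions removed.
SAMPLE_REPLY = """
{"results": {
  "fields": [{"id": "COUNT(*)", "type": "uint64"}],
  "rows": [{"f": [{"v": "2949"}]}, {"f": [{"v": "5387"}]}]
}}
"""

print(query_path("SELECT 1"))
print(rows_from_reply(SAMPLE_REPLY))
```

Values arrive as strings in the sample, so a caller would convert them using the type listed for each field.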
65. Security and Privacy
Standard Google Authentication
• Client Login
• OAuth
• AuthSub
HTTPS support
• protects your credentials
• protects your data
Relies on Google Storage to manage access
66. Large Data Analysis Example
Wikimedia Revision History
Wikimedia Revision history data from:
http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history.xml.7z
68. Using BigQuery Shell
Python DB API 2.0 + B. Clapper's sqlcmd
http://www.clapper.org/software/python/sqlcmd/
72. Google Fusion Tables
• Manage large collections of tabular data in the cloud
• 100 MB tables
• Filters, Aggregation, Merge
• ACL, Collaboration, Discuss Data
• Visualizations
• REST API
• Geo queries
• Maps Integration
• FusionTablesLayer
75. Google Visualization API
• Collection of JavaScript Visualization components
• Some from Google (Chart Tools)
• Some from other developers
• Share the same wire protocol for Data Sources
76. Example: Weather data
• US National Climatic Data Center
• weather data at stations around the globe since 1929
• Stored in Google Storage
• Created a table for BigQuery
• Uploaded weather station coordinates to Fusion Tables
• App Engine App
• Maps API to display weather station maps
• BigQuery to query average temperature in January
• A bit of Python to create a JSON Data Source
• Visualization API
• Just an example: rinse, repeat, enhance!
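The "bit of Python" step could look roughly like the sketch below, which turns query results into the cols/rows layout used by Visualization API data tables. The function name, column ids, and the station value are made up for illustration; the slides do not show the real code or data, and a complete data source would also wrap the table in the wire protocol's response envelope.

```python
import json


def to_data_table(rows):
    """Convert (station, temperature) pairs into a DataTable-style dictionary."""
    return {
        "cols": [
            {"id": "station", "label": "Station", "type": "string"},
            {"id": "avg_temp", "label": "Average January temperature",
             "type": "number"},
        ],
        # Each row is a list of cells, and each cell carries its value under "v".
        "rows": [{"c": [{"v": station}, {"v": temp}]} for station, temp in rows],
    }


# Placeholder row, not real weather data.
table = to_data_table([("STATION_A", 1.5)])
print(json.dumps(table))
```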
79. Google Refine
• Power tool for working with messy data
• Cleanup
• Transform
• Augment
• (Link with Freebase)
• Desktop software for now
• http://code.google.com/p/google-refine/
81. Recap
• Google App Engine
o Easy to build, deploy and manage web apps
• Google Storage
o High speed data storage on Google Cloud
• Prediction API
o Google's machine learning technology
• BigQuery
o Interactive analysis of very large data sets
• Google Fusion Tables
o Manage collections of tabular data in the cloud
• Google Refine
o Power tool for working with messy data
• Google Visualization
o Collection of JavaScript Visualization components
Does CLOUD COMPUTING just mean your servers are SOMEWHERE ELSE? Or is it SOMETHING MORE?
WHY put your servers in the cloud?
- Don’t want to MANAGE servers?
- Or is it the ELASTICITY and SCALABILITY of the cloud?
- If so, you NEED: DISTRIBUTED cloud computing
 * TODAY we’ll talk about why
EXISTING GOOGLE SERVICES made available to your App Engine apps

SPECIALIZATION: Do ONE THING. Do it WELL.
- Doing one thing well is EASIER than doing a lot of different things
- Less complexity: fewer corner cases, fewer bugs
- Offload App Servers - they SPECIALIZE in serving web requests
CloudSherpas (Google Apps management tools) porting to Google Storage
MediaBeacon publishes US Navy Image Services Media files to media outlets
SocialWork is "Facebook for the enterprise". Share image including Phone in demo