"From Big Data to Smart data"
Jie (Jack) Yang, Associate Research Fellow, SMART Infrastructure Facility, presented a summary of his research as part of the SMART Seminar Series on 28 April 2016.
For more information, visit the event page at: http://smart.uow.edu.au/events/UOW212890.html.
SMART Seminar Series: "From Big Data to Smart data"
1. From Big Data to
Smart data
Jie (Jack) Yang | April 2016
2. —What is Big Data?
—Challenge of Big Data processing
—Smart Learning framework
—Applications
—Conclusions
Outline
3. —No single standard definition
—5-V information assets that require innovative
techniques, algorithms, and analytics that enable
decision making, and process automation
Big Data definition
4. 1 – Scale (Volume)
12+ TBs
of tweet data
every day
25+ TBs of
log data every
day
? TBs of
data every day
2+ billion people on
the Web by end
2011
30 billion RFID tags
today
(1.3B in 2005)
4.6 billion
camera
phones
world wide
100s of
millions of
GPS enabled
devices sold
annually
76 million smart meters in
2009…
200M by 2014
5. The ability to manage, analyse, summarise, visualise,
and discover knowledge from the collected data in a
timely and scalable manner
2 – Speed (Velocity)
Social media and networks
(millions of active users)
Mobile devices
(tracking objects all the time)
Infrastructure sensors and/or
instruments
(measuring all kinds of data)
6. Various formats, types and structures:
— Text
— Numerical
— Multi-dim arrays
— Images, audio, video, sequences
— Time series
— Graph (network)
— Streaming data
— etc
3 – Complexity (Variety)
8. 5 – Benefit (Value)
Value ($, time, performance)
9. Beer & Diaper (Woolworths in Illawarra)
“A number of convenience store clerks noticed that men
often bought beer at the same time they bought diapers.
The store mined its receipts and proved the clerks'
observations correct. So, the store began stocking
diapers next to the beer coolers, and sales skyrocketed”
A simple example
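The beer-and-diapers story is the classic illustration of association rule mining. A minimal sketch of the three usual rule measures (support, confidence, lift) follows. The receipts are invented for illustration and are not data from the talk; `rule_stats` is a hypothetical helper name.

```python
# Hypothetical receipts; items and counts are invented for illustration only.
receipts = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers", "milk"},
]

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    cons = sum(1 for t in transactions if consequent in t)
    support = both / n                                # share of receipts with both items
    confidence = both / ante if ante else 0.0         # P(consequent | antecedent)
    lift = confidence / (cons / n) if cons else 0.0   # > 1 suggests a positive association
    return support, confidence, lift

support, confidence, lift = rule_stats(receipts, "beer", "diapers")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.3f}")
```

A lift above 1 means the two items appear together more often than they would if purchases were independent, which is the kind of signal the store in the anecdote acted on.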
14. Main features
— Collection across different platforms and formats
• APIs
• Web crawling
— 1 master and 6 workers
• distributing–working–waiting–reactivating
process
— Data volume (per day)
• 20K+ user activity records
• 25K+ records from social platforms
• 200K+ tweets around AU and EU
Data harvesting
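The master/worker arrangement above can be sketched with a shared task queue. This is an assumption about how such a harvester might be organised, not the implementation used in the project: `fetch`, `worker`, and `run_master` are hypothetical names, and the real collection step (an API call or a crawl) is replaced by a placeholder.

```python
import queue
import threading

NUM_WORKERS = 6  # from the slide: 1 master and 6 workers

def fetch(source):
    # Placeholder for an API call or web crawl (hypothetical).
    return f"records from {source}"

def worker(tasks, results):
    while True:
        source = tasks.get()        # waiting: blocks until the master hands out work,
                                    # then the worker is reactivated
        if source is None:          # sentinel from the master: shut down
            tasks.task_done()
            break
        results.put((source, fetch(source)))  # working
        tasks.task_done()

def run_master(sources):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for source in sources:          # distributing
        tasks.put(source)
    tasks.join()                    # wait until every task has been processed
    for _ in threads:               # one shutdown sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    harvested = {}
    while not results.empty():
        source, records = results.get()
        harvested[source] = records
    return harvested

if __name__ == "__main__":
    print(run_master(["api_a", "api_b", "crawl_c"]))
```

Idle workers block on the queue rather than polling, so a worker that has finished one source is picked up again as soon as the master distributes another.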
15. Main features
— Save data in different formats
• Pure TXT / CSV
• (NO)SQL
— Query across all
— Fast response
Data storage
SELECT * FROM
(SELECT * FROM /web/logs/CSV) t0
JOIN
( SELECT country, count(*)
FROM mysql.web.users
GROUP BY country) t1
JOIN
(SELECT timestamp
FROM s3.root.clicks.json
WHERE user_id = 'jdoe') t2
16. Main features
— Preprocessing (filtering, cleansing, feature
extraction)
— Event simulation
— Saving to DBs
— Running ML jobs on the fly
• Receiver throughput = 3 kb/sec
• Consumer throughput = 2 kb/sec
• Consumer latency = 0.23 sec
Data streaming
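Throughput and latency figures like those above could be measured by timestamping each message when it is received and again when it is consumed. The sketch below is an assumption about one way to do that with an in-process queue; it is not the streaming stack used in the talk, and `receiver`, `consumer`, and `run_stream` are hypothetical names.

```python
import queue
import threading
import time

def receiver(stream, messages):
    for payload in messages:
        stream.put((time.perf_counter(), payload))  # timestamp at receipt
    stream.put(None)                                # end-of-stream marker

def consumer(stream, stats):
    while True:
        item = stream.get()
        if item is None:
            break
        received_at, payload = item
        cleaned = payload.strip().lower()           # stand-in for filtering/cleansing
        stats["bytes"] += len(cleaned.encode("utf-8"))
        stats["latencies"].append(time.perf_counter() - received_at)
        stats["count"] += 1

def run_stream(messages):
    stream = queue.Queue()
    stats = {"bytes": 0, "count": 0, "latencies": []}
    start = time.perf_counter()
    producer = threading.Thread(target=receiver, args=(stream, messages))
    sink = threading.Thread(target=consumer, args=(stream, stats))
    producer.start()
    sink.start()
    producer.join()
    sink.join()
    elapsed = time.perf_counter() - start
    latencies = stats["latencies"]
    return {
        "count": stats["count"],
        "bytes": stats["bytes"],
        "throughput_bytes_per_sec": stats["bytes"] / elapsed if elapsed > 0 else 0.0,
        "mean_latency_sec": sum(latencies) / len(latencies) if latencies else 0.0,
    }

if __name__ == "__main__":
    print(run_stream(["  Hello ", "WORLD"]))
```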
17. Main features (35 online training jobs per day)
— Supervised (with a human assisting in classification) /
unsupervised machine learning techniques, to assist with
classification, clustering and prediction;
— Geospatial analysis: K-pop cluster in geographical regions;
— Network analysis to understand social connections between
consumers and producers;
— Other analysis including:
• More sophisticated number crunching of comments, such as
time series analysis to examine trends;
• Natural language processing techniques to assist with
sentiment analysis.
Data mining
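As an illustration of the unsupervised, geospatial side of this slide, here is a minimal k-means sketch that groups (latitude, longitude) points into regions. The slides do not name the clustering algorithm, so k-means is an assumption; the coordinates are invented, and plain Euclidean distance on degrees is only a rough approximation that is acceptable for nearby points.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; initial centroids are the first k points (deterministic)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical (lat, lon) points: three near Wollongong, three near Sydney.
points = [
    (-34.42, 150.89), (-34.43, 150.88), (-34.41, 150.90),
    (-33.87, 151.21), (-33.86, 151.20), (-33.88, 151.22),
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

For real geographic data, a haversine distance and a better initialisation (for example several random restarts) would be the usual refinements.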
18. Student behaviour analysis (OLPC, until Feb 2016):
— 153+ schools
— 20K+ active laptops
— 4.2M+ activity records
Application 1
[Figures: Most popular apps (per school); App usage (per school)]
25. Jie Yang, Jun Ma, A structure optimization algorithm of neural networks for large-scale data sets, Fuzz-IEEE, 2014
Jie Yang, Jun Ma, A Sparsity-Based Training Algorithm for Least Squares SVM, IEEE SSCI, 2014
Jie Yang, Jun Ma, A big-data processing framework for uncertainties in Transportation data, Fuzz-IEEE, 2015
Jie Yang, Jun Ma, Sarah K. Howard, A Structure Optimization Algorithm of Neural Networks for Pattern Learning from Educational Data, Springer Studies in Computational Intelligence ANN Modelling, 2015
Jie Yang, Jun Ma, A hybrid gene expression programming algorithm based on orthogonal design, International Journal of Computational Intelligence Systems, 2015
Jie Yang, Brian Yecies, Mining Chinese Social Media UGC: A SmartLearning Framework For Analyzing Douban Movie Reviews, Journal of Big Data, 2016
Jie Yang, Jun Ma, A structure optimization framework for feed-forward neural networks using sparse representation, Knowledge-Based Systems, 2016
Jie Yang, Jun Ma, Sarah K. Howard, Exploring Technology Integration in Education using Fuzzy Representation and Feature Selection, Fuzz-IEEE, 2016
Brian Yecies, Jie Yang, Matthew Berryman, Kai Soh, Marketing Bait: Using SMART Data to Identify E-guanxi Among China’s ‘Internet Aborigines’, Film Marketing in a Global Era, 2015
Brian Yecies, Jie Yang, Matthew Berryman, Aegyung Shim, Kai Soh, Korean Female Writer-Directors and SMART Analysis of Douban Commentary Among China’s Digital Natives, Women Screenwriters: An International Guide, 2015
Brian Yecies, Jie Yang, Matthew Berryman, Aegyung Shim, Kai Soh, Korean Female Writer–Directors and SMART Analysis of Douban Commentary Among China’s Digital Natives, Participations: International Journal of Audience Research, 2016
Sarah K. Howard, Jun Ma, Jie Yang, Kate Thompson, The use of data mining to explore factors of technology integration in learning and teaching, EARLI, 2015
Sarah K. Howard, Ellie Rennie, Jun Ma, Jie Yang, Big Data, Big Theory: Moving Beyond New Empiricism to Generate Powerful Explanations, The New Data “Revolution” in Sociology, 2016
Jun Ma, Jie Yang, Rohan W. Denagamage, Murad Safadi, A Conceptual Model for Clustering Local Government Areas using Complex Fuzzy Sets, Fuzz-IEEE, 2016
Publications
26. — OLPC (ARC-Linkage)
— NSW-DER
— CAAR
— China-South Korean Foundation
— Healthcare (PubMed, SEER)
— Tourism business project (UTS)
— MTR
Projects and grants
27. — Big Data processing:
• Data collection; streaming data; data storage; and machine
learning
• Open source libraries
— Other domains:
• Public transportation
• Business Intelligence
• Health care
Conclusions