SlideShare a Scribd company logo
1 of 32
A Hacking Toolset
for Big Tabular Files
2016-12-05 Toshiyuki Shimono
1IEEE Bigdata 2016, Washington DC
Uhuru Corp.
Tokyo, Japan
https://github.com/
tulamili/bin4tsv
A Hacking Toolset
for Big Tabular Files
2016-12-05 Toshiyuki Shimono
2IEEE Bigdata 2016, Washington DC
• Nowadays, so much data.
• Then, how do we
decipher/analyze data?
3IEEE Bigdata 2016, Washington DC
The Difficulties :
1. Deciphering / Understanding
given data.
2. Pre-Processing before
main-analysis, modeling.
3. Verifying the data.
Those are tedious. But, why?
4IEEE Bigdata 2016, Washington DC
The reasons
Though Excel, R, SQL, Python, ..
are widely used,
1. Many common routines are not implemented
for data-deciphering / pre-processing.
2. So, one needs to combine many functions.
3. Thus, testing/debugging
takes a ridiculously large amount of time 
5IEEE Bigdata 2016, Washington DC
IEEE Bigdata 2016, Washington DC 6
Almost no one can make a
computer program from scratch
if it combines more than 5 or 6
components.
You need 10 or 20 times test
processes usually!
Thus, providing useful common
functions should be developed
in a bunch!
Thus, I created a hacking toolset:
1. Focusing on
tabular data (TSV, CSV as well),
with a heading line (optionally).
2. Based on UNIX-philosophy.
3. Covering many functions carefully.
4. Imposing swiftness in many aspects.
7IEEE Bigdata 2016, Washington DC
Function Examples
IEEE Bigdata 2016, Washington DC 8
Coloring ; colorplus
• A CUI named “colorplus” puts color,
with various options :
-3 for number legibility, -4 for Asian number system,
-t number for column-wise coloring,
-s regex for specifying character string to put color,
-b color_name for specifying the color
9IEEE Bigdata 2016, Washington DC
colsummary : grasping what each column is.
1st (white) : column number
2nd (green) : how many distinct values in the column
3rd (blue) : numerical average (optionally can be omitted)
4th (yellow) : column name
5th (white) : value range ( min .. max )
6th (gray) : most frequent values ( quantity can be specified. )
7th (green) : frequency ( highest .. lowest, “x” means multiplicity )
When new data come, one should understand them.
Enough information to go on to next step is provided
by colsummary command.
10IEEE Bigdata 2016, Washington DC
“colsummary" command
11
olympic.tsv :
The numbers of
medals Japan got in
the past Olympic
Games (1912-2016).
The biggest hurdle at data deciphering quickly goes
away by colsummary command.
IEEE Bigdata 2016, Washington DC
“colsummary" command
12
number of
different values
Value
range
Numeric
average
Frequent
values
olympic.tsv :
The numbers of
medals Japan got in
the past Olympic
Games (1912-2016).
The biggest hurdle at data deciphering quickly goes
away by colsummary command.
IEEE Bigdata 2016, Washington DC
Column
name
Frequencies
highest .. lowest
(with multiplicity)
crosstable : 2-way contingency tables
• Almost impossible by a SQL query.
• Excel-pivot-table is an cumbersome alternative.
13IEEE Bigdata 2016, Washington DC
crosstable 2-way contingency table
Provides the
cross-table from 2
columned table
Add blue
color on “0”
Extract 3rd and 4th
columns
One may draw many cross-tables from a tabular data.
The crosstable commands provides cross-tables very quickly.
14IEEE Bigdata 2016, Washington DC
freq : How frequent each string appears?
Functionally equivalent to Unix/Linux “ sort | uniq -c ” .
Much more speedy in computation.
Note that many useful sub-functions are implemented on “freq”.
So you can easily analyze in many ways related to counting numbers.
15IEEE Bigdata 2016, Washington DC
“HTTP status code”
freq : with options
For freq,
-f for sorting in frequency number,
-r for reversing the order,
-% for showing the ratio among the whole,
-s for cumulative counting,
-= for recognizing the heading line (appeared in the previous page.)
16IEEE Bigdata 2016, Washington DC
• Knowing the cardinalities
helps your data analysis task well. 
• Thus, determine them before
your main analysis work.
17
You can easily refocus your data
analysis task not by the file names
but by the numbers, after breaks.
IEEE Bigdata 2016, Washington DC
This is for a technique for
data analysis people. The
file sizes, the line numbers,
column numbers carry you
much imagination,
especially Monday morning
in the middle of data
analysis projects.
How many elements overlap, and
how the inclusion-relations occur
are beneficial to understand data files.
Venn Diagrams
18IEEE Bigdata 2016, Washington DC
venn2 (Cardinalities for overlapping 2 sets)
n.g19 u.g19
646,408
19
92
506
IEEE Bigdata 2016, Washington DC
venn4 (Cardinalities for complex conditions)
A B
C
D
20
Compare with the
15 numbers below.
This 10 x 10 table
also helpful.
IEEE Bigdata 2016, Washington DC
A
B
C
D
2 163
929
79
4919
506
646164
4835
cols : extracting columns
• Easier than AWK and Unix-cut .
• Can avoid the geo-locale problems unlike Unix-cut.
cols –t 2 ⇒ moves the 2nd column to rightmost.
 cols –h -1 ⇒ moves the last column to leftmost.
cols –p 5,9..7 ⇒ shows 5th,9th,8th,7th columns.
 cols –d 6..9 ⇒ shows except 6th,.., 9th columns.
-d stands for deleting, -p for printing,
-h for head, -t for tail.
21IEEE Bigdata 2016, Washington DC
Useful CUI functions, more.
22
NAME FUNCTION DETAIL
xcol Table looking-up Similar to SQL-join, Unix-join, Excel-vlookup
sampler Probabilistic extraction Can specify random seeds, probability weight.
keyvalues Key multiplicity Inspects many aspects of a file of “key + value”.
kvcmp Check the KV relation When 2 KV files are given, check the sameness.
colgrep Column-wise grep More useful than Unix-grep
idmaker ID creation Second DB normalization
denomfind Finding common denom.Guessing the whole number from the fractions.
shuffler Shuffle lines Also an option with preserving the heading line.
alluniq Check the uniqueness Also print out how many same lines.
expskip Logarithmic extraction Useful for taking a glace at the first stage.
eofcheck Check the end of file Important at the initial stage of data analysis task.
dirhier Directory structure How many files on each depth.
timeput Time stamp on each line When one line comes, time added on its head.
IEEE Bigdata 2016, Washington DC
CUI already developed.
As of 2016-08-18.
23IEEE Bigdata 2016, Washington DC
How each program created?
1. One meets a routine
that may need to be combined
from many existing functions.
2. Generalize the routine to be used widely
that can be described within 2 words.
3. Design the user interface with option-switches
which specifies additional parameters etc.
4. Try not to make too many programs,
or necessary programs cannot be recalled.
24IEEE Bigdata 2016, Washington DC
Although the produced
programs are important for
exhibition, how to make
them is much important.
Comparisons
25IEEE Bigdata 2016, Washington DC
Comparison with existing software.
• Excel does not handle more than 1 million lines.
• R takes long time to load big data.
• Python with pandas also takes long time to load.
• SQL requires burdensome table designing.
• Elaborate Unix/Linux usage is also insufficient.
26IEEE Bigdata 2016, Washington DC
Comparison in detail.
Excel R Pandas SQL Hadoop Unix Our soft
Primary usage
Spread
Sheet
Statistics Matrix etc.
DB
manipulation
Distributed DB
File
manipulations
Initial deciphering
and Pre-processing
Ease of use ◎ ○ △ △ × ○ ○
Transparency × ○ ○ depends depends ○ ○
Forward
compatibility
△ △ Too new ○-◎ Too new ◎ Too new
Processing
speed
×
Skill
required
○ △
◎ in
narrow scope
○ ○
Large size data
handling
×
Skill
required
○ specific ◎ specific ◎
High quantity
file processing
× △ -- -- ○ ◎
Small data
handling
◎ ◎ ○
Need to know
SQL grammar
× ◎ ◎
Column
selection
(Alphabe
tically)
Name &
Number
Similar to R Column name Column name
Column number
(cut/AWK)
Name/Number
And also by range
27
◎: very good ; ○ good ; △ moderate ; ×bad.
IEEE Bigdata 2016, Washington DC
Application Examples
28IEEE Bigdata 2016, Washington DC
Ex-1) crosstable
1. About 3 million Twitter remarks are collected.
2. 2 columns of date and time are extracted and by using crosstable, they are counted.
3. Subsequently copied and pasted into Excel, they are also colored by “conditional
formatting” by Excel.
4. Followings have become clear: differences between morning and afternoon, the
downtime of data collecting server.
日付/時間 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2013-12-10 32 157 153 156 139 161 170 182 315 356 276 221 256
2013-12-11 127 93 64 58 39 72 90 144 200 149 170 116 177 144 170 149 156 268 318 253 216 187 203 235
2013-12-12 136 86 68 63 46 61 64 115 177 134 130 121 463 582 423 442 703 1285 1394 1446 1444 1611 2023 1974
2013-12-13 1532 958 634 478 396 403 526 837 1122 1097 1457 1386 1988 1816 2037 3025 3662 4125 4551 5062 4655 4993 5576 2969
2013-12-14 2873 1626 1139 1018 529 682 807 1164 1655 2115 2536 2715 2868 3454 3720 3812 3611 3438 4202 5005 2557 3133 5031 4274
2013-12-15 3617 2232 1694 1330 1051 974 1173 1732 2590 3549 4942 5871 5306 3056 3571 5247 6666 6798 6964 2597 1710 3382 7879 6425
2013-12-16 4623 2741 1631 1304 927 1094 1185 2053 2597 2474 3415 3614 4580 4836 4187 4999 4115 4122 5957 6427 6328 6793 6545 5963
2013-12-17 4045 2405 1545 1189 1007 946 1237 2692 3532 3435 4178 4925 6158 5049 4756 4375 3890 4138 5090 6265 7628 7077 6639 6325
2013-12-18 4458 2807 1713 1404 1131 1051 1347 1954 2299 2624 3412 2707 1962 1773 1910 3851 5263 6359 7177 6583 6936 6644 6234 6119
2013-12-19 4637 3064 1739 1509 1141 1100 1394 2362 2994 3897 3999 4344 5260 6844 5092 4459 5706 6004 7407 3817 1672 4132 6727 5514
2013-12-20 5393 3082 1977 1973 1321 1462 1667 974 1222 1423 1818 2954 5261 4082 4889 6194 6355 6755 9276 6215 6275 6152 5849 5959
2013-12-21 4924 3247 1859 1637 1331 1414 1617 1249 1327 2072 2385 2784 2927 2630 2740 2916 2871 3139 6222 5042 4680 4828 4829 4633
2013-12-22 3838 2702 2073 1836 1417 1186 1427 1442 1633 2043 3376 4120 4346 4159 4473 4759 4450 4773 4962 6876 7718 7327 7472 6768
2013-12-23 5260 3316 2134 1776 1258 1231 1672 1299 894 1023 1195 2402 3516 3584 5172 7778 8700 8331 8030 6035 7712 7669 8084 7450
2013-12-24 7098 4150 2293 1774 1205 1189 1589 1476 1145 1733 2023 2678 4647 3835 4571 6430 6943 7864 8863 5181 6080 7125 8609 6496
2013-12-25 4905 2734 1695 1604 1377 1360 1580 1200 1289 1335 2341 4287 5532 4332 5077 6474 6952 7888 7702 3655 6388 9370 9115 8470
2013-12-26 7468 3877 2269 1959 1370 1398 1908 467 39 140 155 143 131 418 1081 2148 8863 8775 8934 8498 9086 8356
2013-12-27 6419 4015 1833 1627 1421 1361 1590 2511 3768 3996 5245 6196 7894 8098 7373 7497 7296 7324 8437 8120 8059 8384 8435 8277
2013-12-28 7678 4543 2619 2041 1587 1568 1726 2539 3578 4911 5945 7082 7563 7343 7916 7529 7863 8234 8165 8076 8528 8763 8670 7892
2013-12-29 6661 4372 2621 1823 1336 1224 1433 2290 3223 4477 5873 7296 7842 7614 7975 8060 8639 8402 8889 8398 8457 8320 9454 8700
2013-12-30 7607 4656 2765 2100 1528 1440 1635 2511 3997 5140 7044 7658 9435 9147 9147 10058 9956 10529 9852 9662 10644 11448 12848 12136
2013-12-31 10349 6155 3885 2703 1900 1803 2110 3586 5362 8051 10713 12249 13035 13001 13298 15048 15950 17235 17573 15644 16229 15365 14898 17461
2014-01-01 22849 15263 4149 3157 4218 4828 7216 15086 7594 12145 27222 2067 2511 2540 2478 4938 9088 14816 21263 7463 1861 4411 10241 9638
2014-01-02 8627 4964 2749 4744 3729 4168 6328 1915 1342 1889 2848 3233 3197 3612 4528 4449 7701 12565 20139 6596 4637 6998 8677 8949
2014-01-03 6857 5227 6151 3906 2678 2656 2899 2227 1378 2470 2706 3105 3573 3287 3754 7378 10992 11090 10936 5549 6432 8894 9887 8318
2014-01-04 6833 4113 2371 2687 2248 2016 2138 1689 1079 1437 1921 2448 2451 4279 7620 8104 8364 8326 7957 6164 5549 5233 5850 8926
2014-01-05 9757 6208 3823 2627 2073 1782 1854 695 419 511 661 1073 1976 2815 3173 3884 9464 11131 11139 9671 9997 10924 9716 10967
2014-01-06 7623 4818 2873 2159 1656 1480 1733 2948 3538 4550 5638 6605 9118 7973 7552 7691 7392 8170 7714 8308 8974 9368 10272 9641
2014-01-07 7025 4124 2604 1800 1488 1417 1647 1235 1434 2369 3146 3886 5686 6811 6426 6926 7317 8297 8453 6858 9441 9961 9958 8910
2014-01-08 7551 3997 2606 1837 1379 1248 1872 886 647 1022 1465 1828 2463 2138 1909 2163 2298 2832 2944 5583 7953 9266 10316 10737
2014-01-09 9161 4801 2999 2361 1871 1734 2094 3674 5135 5177 6758 7622 11370 10181 10548 12154 13235 13557 11017 4679 5650 6648 7858 7154
2014-01-10 5273 3243 1795 1349 1206 1020 1556 673 526 505 698 827 1039 996 1127 1232 2925 5981 10609 1481
29IEEE Bigdata 2016, Washington DC
Ex-2) silhouette (integrated histogram)
1. Suppose you are interested in the distribution
of the ratio of the follower number and the
following number of each Twitter account.
2. To properly understand the ratio distribution,
stratifying according to the number of the
follower of each account is also considered :
≧5000, ≧ 500 from the rest, ≧ 50 ditto, and the others.
3. The silhouette command given the data
provides the PDF file as shown in right.
4. You can read that with 48% possibility, the
account with ≧5000 followers has followers
that are more than 2 times of the number of
following.
(On average, 95% twitter accounts does not.)
0% 25% 50% 75% 100%
◀
The command name
“silhouette” comes
from the image when
various height people
are aligned in the order
of height. See the curve
connecting the heads.
30IEEE Bigdata 2016, Washington DC
The ratio of following number
divided by followers number.
This commands depends on R installation to draw graph.
The cross grids are
carefully designed
so that one can
easily read the
numbers directly
from the graph.
Further perspective :
Stage I
1. Core engine for tabular data hacking tools as CUIs.
2. Sophisticated policy/philosophy
carefully covering the functions necessary.
3. Simple GUI with Wx and Tk, possibly.
Stage II
4. Parallel computation mechanism.
5. Full-fledged GUI.
6. API, with clouding services on the Internet.
Stage III
6. Ubiquitous “analysis” deployment
beyond spreadsheet and MapReduce model.
31IEEE Bigdata 2016, Washington DC
Features and summaries
Useful functions for analyses in TSV.
Especially in deciphering and pre-processing.
Handles a heading line properly if it exists.
Swiftness in computation, recalling, understanding.
Well-designed interface, based on Unix philosophy.
Can be used on any Perl5 installed machine.
 The software partly appears on GITHUB repository.
32IEEE Bigdata 2016, Washington DC

More Related Content

What's hot

Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentalsrjain51
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solrboorad
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentationAASTHA PANDEY
 
Open Source Tools for Big Data
Open Source Tools for Big DataOpen Source Tools for Big Data
Open Source Tools for Big DataTeemu Heikkilä
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolutionitnewsafrica
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - IntroductionAlex Meadows
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big datakk1718
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035Neelam Rawat
 
Big data deep learning: applications and challenges
Big data deep learning: applications and challengesBig data deep learning: applications and challenges
Big data deep learning: applications and challengesfazail amin
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAmpoolIO
 
10 Most Effective Big Data Technologies
10 Most Effective Big Data Technologies10 Most Effective Big Data Technologies
10 Most Effective Big Data TechnologiesMahindra Comviva
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big dataHari Priya
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1gauravsc36
 
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)Toshiyuki Shimono
 

What's hot (20)

Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentals
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solr
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentation
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Open Source Tools for Big Data
Open Source Tools for Big DataOpen Source Tools for Big Data
Open Source Tools for Big Data
 
Big data 101
Big data 101Big data 101
Big data 101
 
Bigdata
Bigdata Bigdata
Bigdata
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - Introduction
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Big data-ppt
Big data-pptBig data-ppt
Big data-ppt
 
Big data deep learning: applications and challenges
Big data deep learning: applications and challengesBig data deep learning: applications and challenges
Big data deep learning: applications and challenges
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
10 Most Effective Big Data Technologies
10 Most Effective Big Data Technologies10 Most Effective Big Data Technologies
10 Most Effective Big Data Technologies
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
 
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)
 

Similar to A Hacking Toolset for Big Tabular Files (3)

Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxMalla Reddy University
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poliivascucristian
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachMihai Criveti
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Rohit Srivastava
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
7 Database Mistakes YOU Are Making -- Linuxfest Northwest 2019
7 Database Mistakes YOU Are Making -- Linuxfest Northwest 20197 Database Mistakes YOU Are Making -- Linuxfest Northwest 2019
7 Database Mistakes YOU Are Making -- Linuxfest Northwest 2019Dave Stokes
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case StudyMongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case StudyMongoDB
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An OverviewArvind Kalyan
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingJonathan Dursi
 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...Facultad de Informática UCM
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchSylvain Wallez
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsJason Riedy
 

Similar to A Hacking Toolset for Big Tabular Files (3) (20)

Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
Big Data
Big DataBig Data
Big Data
 
Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Big Data Hadoop (Overview)
Big Data Hadoop (Overview)
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
7 Database Mistakes YOU Are Making -- Linuxfest Northwest 2019
7 Database Mistakes YOU Are Making -- Linuxfest Northwest 20197 Database Mistakes YOU Are Making -- Linuxfest Northwest 2019
7 Database Mistakes YOU Are Making -- Linuxfest Northwest 2019
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case StudyMongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
MongoDB World 2018: Overnight to 60 Seconds: An IOT ETL Performance Case Study
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
 

More from Toshiyuki Shimono

国際産業数理・応用数理会議のポスター(作成中)
国際産業数理・応用数理会議のポスター(作成中)国際産業数理・応用数理会議のポスター(作成中)
国際産業数理・応用数理会議のポスター(作成中)Toshiyuki Shimono
 
インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装
インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装
インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装Toshiyuki Shimono
 
extracting only a necessary file from a zip file
extracting only a necessary file from a zip fileextracting only a necessary file from a zip file
extracting only a necessary file from a zip fileToshiyuki Shimono
 
A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021
A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021
A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021Toshiyuki Shimono
 
新型コロナの感染者数 全国の状況 2021年2月上旬まで
新型コロナの感染者数 全国の状況 2021年2月上旬まで新型コロナの感染者数 全国の状況 2021年2月上旬まで
新型コロナの感染者数 全国の状況 2021年2月上旬までToshiyuki Shimono
 
Multiplicative Decompositions of Stochastic Distributions and Their Applicat...
 Multiplicative Decompositions of Stochastic Distributions and Their Applicat... Multiplicative Decompositions of Stochastic Distributions and Their Applicat...
Multiplicative Decompositions of Stochastic Distributions and Their Applicat...Toshiyuki Shimono
 
Theory to consider an inaccurate testing and how to determine the prior proba...
Theory to consider an inaccurate testing and how to determine the prior proba...Theory to consider an inaccurate testing and how to determine the prior proba...
Theory to consider an inaccurate testing and how to determine the prior proba...Toshiyuki Shimono
 
Interpreting Multiple Regression via an Ellipse Inscribed in a Square Extensi...
Interpreting Multiple Regressionvia an Ellipse Inscribed in a Square Extensi...Interpreting Multiple Regressionvia an Ellipse Inscribed in a Square Extensi...
Interpreting Multiple Regression via an Ellipse Inscribed in a Square Extensi...Toshiyuki Shimono
 
BigQueryを使ってみた(2018年2月)
BigQueryを使ってみた(2018年2月)BigQueryを使ってみた(2018年2月)
BigQueryを使ってみた(2018年2月)Toshiyuki Shimono
 
既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案
既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案
既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案Toshiyuki Shimono
 
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...Toshiyuki Shimono
 
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...Toshiyuki Shimono
 
企業等に蓄積されたデータを分析するための処理機能の提案
企業等に蓄積されたデータを分析するための処理機能の提案企業等に蓄積されたデータを分析するための処理機能の提案
企業等に蓄積されたデータを分析するための処理機能の提案Toshiyuki Shimono
 
新入社員の頃に教えて欲しかったようなことなど
新入社員の頃に教えて欲しかったようなことなど新入社員の頃に教えて欲しかったようなことなど
新入社員の頃に教えて欲しかったようなことなどToshiyuki Shimono
 
ページャ lessを使いこなす
ページャ lessを使いこなすページャ lessを使いこなす
ページャ lessを使いこなすToshiyuki Shimono
 
Guiを使わないテキストデータ処理
Guiを使わないテキストデータ処理Guiを使わないテキストデータ処理
Guiを使わないテキストデータ処理Toshiyuki Shimono
 
データ全貌把握の方法170324
データ全貌把握の方法170324データ全貌把握の方法170324
データ全貌把握の方法170324Toshiyuki Shimono
 
Macで開発環境を整える170420
Macで開発環境を整える170420Macで開発環境を整える170420
Macで開発環境を整える170420Toshiyuki Shimono
 
大きなテキストデータを閲覧するには
大きなテキストデータを閲覧するには大きなテキストデータを閲覧するには
大きなテキストデータを閲覧するにはToshiyuki Shimono
 

More from Toshiyuki Shimono (20)

国際産業数理・応用数理会議のポスター(作成中)
国際産業数理・応用数理会議のポスター(作成中)国際産業数理・応用数理会議のポスター(作成中)
国際産業数理・応用数理会議のポスター(作成中)
 
インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装
インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装
インターネット等からデータを自動収集するソフトウェアに必要な補助機能とその実装
 
extracting only a necessary file from a zip file
extracting only a necessary file from a zip fileextracting only a necessary file from a zip file
extracting only a necessary file from a zip file
 
A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021
A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021
A Hacking Toolset for Big Tabular Files -- JAPAN.PM 2021
 
新型コロナの感染者数 全国の状況 2021年2月上旬まで
新型コロナの感染者数 全国の状況 2021年2月上旬まで新型コロナの感染者数 全国の状況 2021年2月上旬まで
新型コロナの感染者数 全国の状況 2021年2月上旬まで
 
Multiplicative Decompositions of Stochastic Distributions and Their Applicat...
 Multiplicative Decompositions of Stochastic Distributions and Their Applicat... Multiplicative Decompositions of Stochastic Distributions and Their Applicat...
Multiplicative Decompositions of Stochastic Distributions and Their Applicat...
 
Theory to consider an inaccurate testing and how to determine the prior proba...
Theory to consider an inaccurate testing and how to determine the prior proba...Theory to consider an inaccurate testing and how to determine the prior proba...
Theory to consider an inaccurate testing and how to determine the prior proba...
 
Interpreting Multiple Regression via an Ellipse Inscribed in a Square Extensi...
Interpreting Multiple Regressionvia an Ellipse Inscribed in a Square Extensi...Interpreting Multiple Regressionvia an Ellipse Inscribed in a Square Extensi...
Interpreting Multiple Regression via an Ellipse Inscribed in a Square Extensi...
 
BigQueryを使ってみた(2018年2月)
BigQueryを使ってみた(2018年2月)BigQueryを使ってみた(2018年2月)
BigQueryを使ってみた(2018年2月)
 
Seminar0917
Seminar0917Seminar0917
Seminar0917
 
既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案
既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案
既存分析ソフトへ
データを投入する前に
簡便な分析するためのソフトの作り方の提案
 
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
 
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
To Make Graphs Such as Scatter Plots Numerically Readable (PacificVis 2018, K...
 
企業等に蓄積されたデータを分析するための処理機能の提案
企業等に蓄積されたデータを分析するための処理機能の提案企業等に蓄積されたデータを分析するための処理機能の提案
企業等に蓄積されたデータを分析するための処理機能の提案
 
新入社員の頃に教えて欲しかったようなことなど
新入社員の頃に教えて欲しかったようなことなど新入社員の頃に教えて欲しかったようなことなど
新入社員の頃に教えて欲しかったようなことなど
 
ページャ lessを使いこなす
ページャ lessを使いこなすページャ lessを使いこなす
ページャ lessを使いこなす
 
Guiを使わないテキストデータ処理
Guiを使わないテキストデータ処理Guiを使わないテキストデータ処理
Guiを使わないテキストデータ処理
 
データ全貌把握の方法170324
データ全貌把握の方法170324データ全貌把握の方法170324
データ全貌把握の方法170324
 
Macで開発環境を整える170420
Macで開発環境を整える170420Macで開発環境を整える170420
Macで開発環境を整える170420
 
大きなテキストデータを閲覧するには
大きなテキストデータを閲覧するには大きなテキストデータを閲覧するには
大きなテキストデータを閲覧するには
 

Recently uploaded

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 

Recently uploaded (20)

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 

A Hacking Toolset for Big Tabular Files (3)

  • 1. A Hacking Toolset for Big Tabular Files 2016-12-05 Toshiyuki Shimono 1IEEE Bigdata 2016, Washington DC Uhuru Corp. Tokyo, Japan https://github.com/ tulamili/bin4tsv
  • 2. A Hacking Toolset for Big Tabular Files 2016-12-05 Toshiyuki Shimono 2IEEE Bigdata 2016, Washington DC
  • 3. • Nowadays, so much data. • Then, how do we decipher/analyze data? 3IEEE Bigdata 2016, Washington DC
  • 4. The Difficulties : 1. Deciphering / Understanding given data. 2. Pre-Processing before main-analysis, modeling. 3. Verifying the data. Those are tedious. But, why? 4IEEE Bigdata 2016, Washington DC
  • 5. The reasons Though Excel, R, SQL, Python, .. are widely used, 1. Many common routines are not implemented for data-deciphering / pre-processing. 2. So, one needs to combine many functions. 3. Thus, testing/debugging takes a ridiculously large amount of time  5IEEE Bigdata 2016, Washington DC
  • 6. IEEE Bigdata 2016, Washington DC 6 Almost no one can make a computer program from scratch if it combines more than 5 or 6 components. You need 10 or 20 times test processes usually! Thus, providing useful common functions should be developed in a bunch!
  • 7. Thus, I created a hacking toolset: 1. Focusing on tabular data (TSV, CSV as well), with a heading line (optionally). 2. Based on UNIX-philosophy. 3. Covering many functions carefully. 4. Imposing swiftness in many aspects. 7IEEE Bigdata 2016, Washington DC
  • 8. Function Examples IEEE Bigdata 2016, Washington DC 8
  • 9. Coloring ; colorplus • A CUI named “colorplus” puts color, with various options : -3 for number legibility, -4 for Asian number system, -t number for column-wise coloring, -s regex for specifying character string to put color, -b color_name for specifying the color 9IEEE Bigdata 2016, Washington DC
  • 10. colsummary : grasping what each column is. 1st (white) : column number 2nd (green) : how many distinct values in the column 3rd (blue) : numerical average (optionally can be omitted) 4th (yellow) : column name 5th (white) : value range ( min .. max ) 6th (gray) : most frequent values ( quantity can be specified. ) 7th (green) : frequency ( highest .. lowest, “x” means multiplicity ) When new data come, one should understand them. Enough information to go on to next step is provided by colsummary command. 10IEEE Bigdata 2016, Washington DC
  • 11. “colsummary" command 11 olympic.tsv : The numbers of medals Japan got in the past Olympic Games (1912-2016). The biggest hurdle at data deciphering quickly goes away by colsummary command. IEEE Bigdata 2016, Washington DC
  • 12. “colsummary" command 12 number of different values Value range Numeric average Frequent values olympic.tsv : The numbers of medals Japan got in the past Olympic Games (1912-2016). The biggest hurdle at data deciphering quickly goes away by colsummary command. IEEE Bigdata 2016, Washington DC Column name Frequencies highest .. lowest (with multiplicity)
  • 13. crosstable : 2-way contingency tables • Almost impossible by a SQL query. • Excel-pivot-table is an cumbersome alternative. 13IEEE Bigdata 2016, Washington DC
  • 14. crosstable 2-way contingency table Provides the cross-table from 2 columned table Add blue color on “0” Extract 3rd and 4th columns One may draw many cross-tables from a tabular data. The crosstable commands provides cross-tables very quickly. 14IEEE Bigdata 2016, Washington DC
  • 15. freq : How frequent each string appears? Functionally equivalent to Unix/Linux “ sort | uniq -c ” . Much more speedy in computation. Note that many useful sub-functions are implemented on “freq”. So you can easily analyze in many ways related to counting numbers. 15IEEE Bigdata 2016, Washington DC “HTTP status code”
  • 16. freq : with options For freq, -f for sorting in frequency number, -r for reversing the order, -% for showing the ratio among the whole, -s for cumulative counting, -= for recognizing the heading line (appeared in the previous page.) 16IEEE Bigdata 2016, Washington DC
  • 17. • Knowing the cardinalities helps your data analysis task well.  • Thus, determine them before your main analysis work. 17 You can easily refocus your data analysis task not by the file names but by the numbers, after breaks. IEEE Bigdata 2016, Washington DC This is for a technique for data analysis people. The file sizes, the line numbers, column numbers carry you much imagination, especially Monday morning in the middle of data analysis projects.
  • 18. How many elements overlap, and how the inclusion-relations occur are beneficial to understand data files. Venn Diagrams 18IEEE Bigdata 2016, Washington DC
  • 19. venn2 (Cardinalities for overlapping 2 sets) n.g19 u.g19 646,408 19 92 506 IEEE Bigdata 2016, Washington DC
  • 20. venn4 (Cardinalities for complex conditions) A B C D 20 Compare with the 15 numbers below. This 10 x 10 table also helpful. IEEE Bigdata 2016, Washington DC A B C D 2 163 929 79 4919 506 646164 4835
  • 21. cols : extracting columns • Easier than AWK and Unix-cut . • Can avoid the geo-locale problems unlike Unix-cut. cols –t 2 ⇒ moves the 2nd column to rightmost.  cols –h -1 ⇒ moves the last column to leftmost. cols –p 5,9..7 ⇒ shows 5th,9th,8th,7th columns.  cols –d 6..9 ⇒ shows except 6th,.., 9th columns. -d stands for deleting, -p for printing, -h for head, -t for tail. 21IEEE Bigdata 2016, Washington DC
  • 22. Useful CUI functions, more. 22 NAME FUNCTION DETAIL xcol Table looking-up Similar to SQL-join, Unix-join, Excel-vlookup sampler Probabilistic extraction Can specify random seeds, probability weight. keyvalues Key multiplicity Inspects many aspects of a file of “key + value”. kvcmp Check the KV relation When 2 KV files are given, check the sameness. colgrep Column-wise grep More useful than Unix-grep idmaker ID creation Second DB normalization denomfind Finding common denom.Guessing the whole number from the fractions. shuffler Shuffle lines Also an option with preserving the heading line. alluniq Check the uniqueness Also print out how many same lines. expskip Logarithmic extraction Useful for taking a glace at the first stage. eofcheck Check the end of file Important at the initial stage of data analysis task. dirhier Directory structure How many files on each depth. timeput Time stamp on each line When one line comes, time added on its head. IEEE Bigdata 2016, Washington DC
  • 23. CUI already developed. As of 2016-08-18. 23IEEE Bigdata 2016, Washington DC
  • 24. How each program created? 1. One meets a routine that may need to be combined from many existing functions. 2. Generalize the routine to be used widely that can be described within 2 words. 3. Design the user interface with option-switches which specifies additional parameters etc. 4. Try not to make too many programs, or necessary programs cannot be recalled. 24IEEE Bigdata 2016, Washington DC Although the produced programs are important for exhibition, how to make them is much important.
  • 26. Comparison with existing software. • Excel does not handle more than 1 million lines. • R takes long time to load big data. • Python with pandas also takes long time to load. • SQL requires burdensome table designing. • Elaborate Unix/Linux usage is also insufficient. 26IEEE Bigdata 2016, Washington DC
  • 27. Comparison in detail. Excel R Pandas SQL Hadoop Unix Our soft Primary usage Spread Sheet Statistics Matrix etc. DB manipulation Distributed DB File manipulations Initial deciphering and Pre-processing Ease of use ◎ ○ △ △ × ○ ○ Transparency × ○ ○ depends depends ○ ○ Forward compatibility △ △ Too new ○-◎ Too new ◎ Too new Processing speed × Skill required ○ △ ◎ in narrow scope ○ ○ Large size data handling × Skill required ○ specific ◎ specific ◎ High quantity file processing × △ -- -- ○ ◎ Small data handling ◎ ◎ ○ Need to know SQL grammar × ◎ ◎ Column selection (Alphabe tically) Name & Number Similar to R Column name Column name Column number (cut/AWK) Name/Number And also by range 27 ◎: very good ; ○ good ; △ moderate ; ×bad. IEEE Bigdata 2016, Washington DC
  • 28. Application Examples 28IEEE Bigdata 2016, Washington DC
  • 29. Ex-1) crosstable 1. About 3 million Twitter remarks are collected. 2. 2 columns of date and time are extracted and by using crosstable, they are counted. 3. Subsequently copied and pasted into Excel, they are also colored by “conditional formatting” by Excel. 4. Followings have become clear: differences between morning and afternoon, the downtime of data collecting server. 日付/時間 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 2013-12-10 32 157 153 156 139 161 170 182 315 356 276 221 256 2013-12-11 127 93 64 58 39 72 90 144 200 149 170 116 177 144 170 149 156 268 318 253 216 187 203 235 2013-12-12 136 86 68 63 46 61 64 115 177 134 130 121 463 582 423 442 703 1285 1394 1446 1444 1611 2023 1974 2013-12-13 1532 958 634 478 396 403 526 837 1122 1097 1457 1386 1988 1816 2037 3025 3662 4125 4551 5062 4655 4993 5576 2969 2013-12-14 2873 1626 1139 1018 529 682 807 1164 1655 2115 2536 2715 2868 3454 3720 3812 3611 3438 4202 5005 2557 3133 5031 4274 2013-12-15 3617 2232 1694 1330 1051 974 1173 1732 2590 3549 4942 5871 5306 3056 3571 5247 6666 6798 6964 2597 1710 3382 7879 6425 2013-12-16 4623 2741 1631 1304 927 1094 1185 2053 2597 2474 3415 3614 4580 4836 4187 4999 4115 4122 5957 6427 6328 6793 6545 5963 2013-12-17 4045 2405 1545 1189 1007 946 1237 2692 3532 3435 4178 4925 6158 5049 4756 4375 3890 4138 5090 6265 7628 7077 6639 6325 2013-12-18 4458 2807 1713 1404 1131 1051 1347 1954 2299 2624 3412 2707 1962 1773 1910 3851 5263 6359 7177 6583 6936 6644 6234 6119 2013-12-19 4637 3064 1739 1509 1141 1100 1394 2362 2994 3897 3999 4344 5260 6844 5092 4459 5706 6004 7407 3817 1672 4132 6727 5514 2013-12-20 5393 3082 1977 1973 1321 1462 1667 974 1222 1423 1818 2954 5261 4082 4889 6194 6355 6755 9276 6215 6275 6152 5849 5959 2013-12-21 4924 3247 1859 1637 1331 1414 1617 1249 1327 2072 2385 2784 2927 2630 2740 2916 2871 3139 6222 5042 4680 4828 4829 4633 2013-12-22 3838 2702 2073 1836 1417 1186 1427 1442 1633 2043 3376 4120 4346 4159 4473 4759 4450 4773 4962 6876 7718 7327 7472 6768 2013-12-23 5260 3316 2134 1776 1258 1231 1672 1299 894 1023 1195 2402 3516 3584 5172 7778 8700 8331 8030 6035 7712 7669 8084 7450 2013-12-24 7098 4150 2293 1774 1205 1189 1589 1476 1145 1733 2023 2678 4647 3835 4571 6430 6943 7864 8863 5181 6080 7125 8609 6496 2013-12-25 4905 2734 1695 1604 1377 1360 1580 1200 1289 1335 2341 4287 5532 4332 5077 6474 6952 7888 7702 3655 6388 9370 9115 8470 2013-12-26 7468 3877 2269 1959 1370 1398 1908 467 39 140 155 143 131 418 1081 2148 8863 8775 8934 8498 9086 8356 2013-12-27 6419 4015 1833 1627 1421 1361 1590 2511 3768 3996 5245 6196 7894 8098 7373 7497 7296 7324 8437 8120 8059 8384 8435 8277 2013-12-28 7678 4543 2619 2041 1587 1568 1726 2539 3578 4911 5945 7082 7563 7343 7916 7529 7863 8234 8165 8076 8528 8763 8670 7892 2013-12-29 6661 4372 2621 1823 1336 1224 1433 2290 3223 4477 5873 7296 7842 7614 7975 8060 8639 8402 8889 8398 8457 8320 9454 8700 2013-12-30 7607 4656 2765 2100 1528 1440 1635 2511 3997 5140 7044 7658 9435 9147 9147 10058 9956 10529 9852 9662 10644 11448 12848 12136 2013-12-31 10349 6155 3885 2703 1900 1803 2110 3586 5362 8051 10713 12249 13035 13001 13298 15048 15950 17235 17573 15644 16229 15365 14898 17461 2014-01-01 22849 15263 4149 3157 4218 4828 7216 15086 7594 12145 27222 2067 2511 2540 2478 4938 9088 14816 21263 7463 1861 4411 10241 9638 2014-01-02 8627 4964 2749 4744 3729 4168 6328 1915 1342 1889 2848 3233 3197 3612 4528 4449 7701 12565 20139 6596 4637 6998 8677 8949 2014-01-03 6857 5227 6151 3906 2678 2656 2899 2227 1378 2470 2706 3105 3573 3287 3754 7378 10992 11090 10936 5549 6432 8894 9887 8318 2014-01-04 6833 4113 2371 2687 2248 2016 2138 1689 1079 1437 1921 2448 2451 4279 7620 8104 8364 8326 7957 6164 5549 5233 5850 8926 2014-01-05 9757 6208 3823 2627 2073 1782 1854 695 419 511 661 1073 1976 2815 3173 3884 9464 11131 11139 9671 9997 10924 9716 10967 2014-01-06 7623 4818 2873 2159 1656 1480 1733 2948 3538 4550 5638 6605 9118 7973 7552 7691 7392 8170 7714 8308 8974 9368 10272 9641 2014-01-07 7025 4124 2604 1800 1488 1417 1647 1235 1434 2369 3146 3886 5686 6811 6426 6926 7317 8297 8453 6858 9441 9961 9958 8910 2014-01-08 7551 3997 2606 1837 1379 1248 1872 886 647 1022 1465 1828 2463 2138 1909 2163 2298 2832 2944 5583 7953 9266 10316 10737 2014-01-09 9161 4801 2999 2361 1871 1734 2094 3674 5135 5177 6758 7622 11370 10181 10548 12154 13235 13557 11017 4679 5650 6648 7858 7154 2014-01-10 5273 3243 1795 1349 1206 1020 1556 673 526 505 698 827 1039 996 1127 1232 2925 5981 10609 1481 29IEEE Bigdata 2016, Washington DC
  • 30. Ex-2) silhouette (integrated histogram) 1. Suppose you are interested in the distribution of the ratio of the follower number and the following number of each Twitter account. 2. To properly understand the ratio distribution, stratifying according to the number of the follower of each account is also considered : ≧5000, ≧ 500 from the rest, ≧ 50 ditto, and the others. 3. The silhouette command given the data provides the PDF file as shown in right. 4. You can read that with 48% possibility, the account with ≧5000 followers has followers that are more than 2 times of the number of following. (On average, 95% twitter accounts does not.) 0% 25% 50% 75% 100% ◀ The command name “silhouette” comes from the image when various height people are aligned in the order of height. See the curve connecting the heads. 30IEEE Bigdata 2016, Washington DC The ratio of following number divided by followers number. This commands depends on R installation to draw graph. The cross grids are carefully designed so that one can easily read the numbers directly from the graph.
  • 31. Further perspective : Stage I 1. Core engine for tabular data hacking tools as CUIs. 2. Sophisticated policy/philosophy carefully covering the functions necessary. 3. Simple GUI with Wx and Tk, possibly. Stage II 4. Parallel computation mechanism. 5. Full-fledged GUI. 6. API, with clouding services on the Internet. Stage III 6. Ubiquitous “analysis” deployment beyond spreadsheet and MapReduce model. 31IEEE Bigdata 2016, Washington DC
  • 32. Features and summaries Useful functions for analyses in TSV. Especially in deciphering and pre-processing. Handles a heading line properly if it exists. Swiftness in computation, recalling, understanding. Well-designed interface, based on Unix philosophy. Can be used on any Perl5 installed machine.  The software partly appears on GITHUB repository. 32IEEE Bigdata 2016, Washington DC

Editor's Notes

  1. The Difficulties in Data Handling : Normalizing and reorganizing the given data
  2. Tkakes a ridiculous amount of time コロンブスの卵  ANALYSIS should be done there. Where is the …
  3. Tkakes a ridiculous amount of time コロンブスの卵  ANALYSIS should be done there. Where is the …
  4. Feature-rhch
  5. Without knowing anything above may cause tedious step-back.
  6. As a technique, determining the cardinalities of the records of your data files in the beginning greatly helps the following tasks. Because guessing what one of your data files is can be easily told from the number of the records. Also useful when you check the compatibility of master table with a transaction log.
  7. As a technique, determining the cardinalities of the records of your data files in the beginning greatly helps the following tasks. Because guessing what one of your data files is can be easily told from the number of the records. Also useful when you check the compatibility of master table with a transaction log.
  8. As a technique, determining the cardinalities of the records of your data files in the beginning greatly helps the following tasks. Because guessing what one of your data files is can be easily told from the number of the records.
  9. Speedy enough. ( awk > cols > cut ) Kaomoji, emoticon