Big data refers to data processing that scales to very large data sets, runs in a distributed, fault-tolerant manner, and stores data so it can be accessed from anywhere. While the size of data is important, big data is defined more by how the data is processed and accessed than by size alone. Hadoop has emerged as the common platform for working with big data because it provides an affordable, functional system that can handle enormous scales of data.
7. "Big Data is the amount of data that one single
machine cannot store and process"
OTN - 2014
"I have travelled the length and breadth of this
country and talked with the best people, and I
can assure you that data processing is a fad that
won't last out the year."
Editor, Prentice Hall - 1957
"Information is the oil of the 21st
century, and analytics is the
combustion engine."
Peter Søndergaard – Gartner group
"Data is the new
science. Big Data holds
the answers."
Pat Gelsinger - EMC
"Big Data is not the new oil."
Jes Thorp – Harvard busienss review
"Not everything that can be counted
counts, and not everything that counts
can be counted."
William Bruce Cameron
"You can have data without information, but
you cannot have information without data."
Daniel Keys Moran
8. As for "Big Data" I think that is also a concept. In living memory keeping
detailed sales by style, color, and size was too much to hold for most
retail chains and at least two that tried screwed themselves into
bankruptcy. By now we have mostly advanced to vendor managed inventory of
not only the inventory and sales, but the in store shelf locations and
dollar turn per cubic centimeter of shelf space to leverage the vendors
against each other in negotiating for shelf space. That was an epochal
change in how the world of retail works, which as a side effect helps
non-brick and mortar establishments negotiate with vendors as well.
Being able to keep even transiently more orders of magnitude of data and
analyze it in a way that even *might* give a competitive advantage is the
concept of "Big Data" that makes it something different. I completely
dislike the name, but I think the concept is extremely useful. I don't think
it has a single thing to do with the physical infrastructure that processes
the data. A big part of the concept is that it includes data collection from
non-transactional systems and behaviors where the Internet Of Things is
included in the search space.
Mark Farnham - Oaktable
My takeaway from OpenWorld was that you buy $1m in gear, harvest 27 billion tweets... and do the Hadoop equivalent of:
select count(*)
from shitloads_of_tweets
where text like '%you suck%'
and work from there?
Am I missing something?
Connor McDonald - Oaktable
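Tongue-in-cheek or not, the quip describes a real pattern: a map-only job that filters and counts via a job counter. Here is a minimal sketch in the same old-style org.apache.hadoop.mapred API as the WordCount example later in this section; the class name SuckCount, the job name, and the command-line input path are invented for illustration.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

// Map-only job: bump a counter for every matching tweet, write nothing.
public class SuckCount extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  static enum Counters { MATCHING_TWEETS }

  public void map(LongWritable key, Text value,
      OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
      throws IOException {
    // One input line per tweet; count it if it contains the phrase.
    if (value.toString().contains("you suck")) {
      reporter.incrCounter(Counters.MATCHING_TWEETS, 1);
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SuckCount.class);
    conf.setJobName("suckcount");
    conf.setMapperClass(SuckCount.class);
    conf.setNumReduceTasks(0);                    // no reduce phase needed
    conf.setOutputFormat(NullOutputFormat.class); // nothing to write out
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    RunningJob job = JobClient.runJob(conf);
    System.out.println("tweets matching: "
        + job.getCounters().getCounter(Counters.MATCHING_TWEETS));
  }
}

Because the mapper never emits records, there is no reduce phase and no output files; the answer comes back through the job's counters.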
9. Others can call Big Data whatever shit they want these days, but the only viable Big Data stack that is somewhat guaranteed to survive (though it will likely evolve a lot) is Hadoop. IMHO.
And there are TWO things it lets you do:
1. Thanks to the commodity software and hardware phenomena of the last years, you can now build scalable data processing systems affordable to pretty much ANY organization. At the few-TB scale you just use a Linux file system and MySQL or Postgres if needed, and maybe flash storage. Beyond that - it's Hadoop.
2. Since running a scalable Hadoop cluster is so cheap, efficiency of processing becomes secondary and value moves towards flexibility - how quickly you can try things, grow the system, and integrate new kinds of data into it. Agility is king - time to market is critical.
What most forget is that in its current state, Hadoop requires a shitload of really good engineering talent. This is why it's only justifiable at a certain scale, where the savings on h/w and s/w trump the cost of the additional engineering by an order of magnitude or two.
I'll take my coat...
--
Alex Gorbachev
Software and hardware must be affordable at scale, or you can go home. Oracle, EMC, Teradata, IBM, Netapp can all just forget about it.
Jeffrey Needham – One of the hadoops
10. Certainly 1000-node (or 5000-node, if you like) clusters are fully automated... The data science pipelines are not, nor is the surrounding ecosystem engineering, but the Hadoop cluster itself needs little more than a shopping cart to operate. Nobody "operates" or admins clusters at this scale. That would be pure insanity. XXXX operates 8 4000-node clusters with 10 people. These people mostly surf YouTube on their NOC screens, as there isn't much for them to do either.
My job was in production engineering - making sure all the grids worked across all colos (and for $100, no less). However, search engineering (or data science production engineering, which is probably what the new group will be called) has their back.
Everyone on Oak Table should figure out how to either build or be in a data science production group.
Don't bother learning how to operate HDFS and YARN (and the 8 zillion plugins). Hadoop 2.0 (be it HWX or CDH) will be the next OS/database kernel you need to learn.
And it's OK if you don't believe me...
12. DATA processing that scales
DATA processing with fault tolerance
DATA accessible from everywhere
13. “When there is an elephant in the room – introduce him”
Randy Pausch – The Last Lecture
https://www.youtube.com/watch?v=ji5_MqicxSo&t=0m45s
15. The Hadoop Distributed File System is not a complex, feature-rich, kitchen-sink file system, but it does two things very well: it’s economical and functional at enormous scale.
Affordable. At. Scale.
Maybe that’s all it should be. A big data reservoir should make it possible for traditional database products to directly access HDFS and still provide a canal for enterprises to channel their old data sources into the new reservoir.
Big data reservoirs must allow old and new data to coexist and intermingle. For example, DB2 currently supports table spaces on traditional OS file systems, but when it supports HDFS directly, it could provide customers with a built-in channel from the past to the future.
HDFS contains a feature called federation that, over time, could be used to create a reservoir of reservoirs, which will make it possible to create planetary file systems that can act locally but think globally.
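That "economical and functional" claim is visible in the client API itself: to a Java program, a planetary-scale reservoir looks like ordinary file I/O. A minimal sketch using the standard org.apache.hadoop.fs.FileSystem API (the /reservoir path and the class name are invented; the cluster address comes from core-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReservoirTouch {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS (the NameNode address) is read from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/reservoir/demo/hello.txt");

    // Writing: HDFS replicates the blocks behind this single call.
    FSDataOutputStream out = fs.create(file, true);
    out.writeUTF("old data, meet new data");
    out.close();

    // Reading: the same API whether the file lives on 1 node or 4000.
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();
  }
}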
19.
import java.io.*;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    static enum Counters { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive = true;
    private Set<String> patternsToSkip = new HashSet<String>();

    private long numRecords = 0;
    private String inputFile;

    public void configure(JobConf job) {
      caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
      inputFile = job.get("map.input.file");

      // Load the optional skip-patterns file distributed with the job.
      if (job.getBoolean("wordcount.skip.patterns", false)) {
        Path[] patternsFiles = new Path[0];
        try {
          patternsFiles = DistributedCache.getLocalCacheFiles(job);
        } catch (IOException ioe) {
          System.err.println("Caught exception while getting cached files: " + StringUtils.stringifyException(ioe));
        }
        for (Path patternsFile : patternsFiles) {
          parseSkipFile(patternsFile);
        }
      }
    }

    private void parseSkipFile(Path patternsFile) {
      try {
        BufferedReader fis = new BufferedReader(new FileReader(patternsFile.toString()));
        String pattern = null;
        while ((pattern = fis.readLine()) != null) {
          patternsToSkip.add(pattern);
        }
      } catch (IOException ioe) {
        System.err.println("Caught exception while parsing the cached file '" + patternsFile + "' : " + StringUtils.stringifyException(ioe));
      }
    }

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = (caseSensitive) ? value.toString() : value.toString().toLowerCase();

      for (String pattern : patternsToSkip) {
        line = line.replaceAll(pattern, "");
      }

      // Emit (word, 1) for every token on the line.
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
        reporter.incrCounter(Counters.INPUT_WORDS, 1);
      }

      if ((++numRecords % 100) == 0) {
        reporter.setStatus("Finished processing " + numRecords + " records " + "from the input file: " + inputFile);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      // Sum the ones emitted for this word.
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Peel off the -skip flag; everything else is input/output paths.
    List<String> otherArgs = new ArrayList<String>();
    for (int i = 0; i < args.length; ++i) {
      if ("-skip".equals(args[i])) {
        DistributedCache.addCacheFile(new Path(args[++i]).toUri(), conf);
        conf.setBoolean("wordcount.skip.patterns", true);
      } else {
        otherArgs.add(args[i]);
      }
    }

    FileInputFormat.setInputPaths(conf, new Path(otherArgs.get(0)));
    FileOutputFormat.setOutputPath(conf, new Path(otherArgs.get(1)));

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}
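To run it (the jar name and paths here are placeholders): hadoop jar wordcount.jar WordCount -Dwordcount.case.sensitive=false /user/joe/input /user/joe/output -skip /user/joe/patterns.txt. ToolRunner parses the -D generic options, while run() itself peels off the -skip flag and registers the patterns file in the DistributedCache.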
All of the code above is equivalent to:
select word, count(word)
from words_Table
group by word;
Most people did not think that was smart.
26. DSB vs. P3 (the Danish railway operator vs. the DR P3 radio channel)
Top artists to blame for delays:
Rihanna 3.46%
Medina 1.78%
Lady Gaga 1.26%
Others < 1%
Danish artists to blame for delays:
Medina 9.74%
Fallulah 4.31%
Panamah 2.83%
Pharfar 1.34%
Unknown artist 1.11%
Others < 1%
Big data – why is it called that?
1. Originally from DNA sequencing
2. Then came the big DW (data warehouse) installations
3. Then came the internet, and search engines in particular
It is not the size that does it.
If it were, are we then done with Big Data once a single disk holds 1000 PB? Or what?
What is unstructured data???
If it is not structured, how can we store and process it?