* Few companies have transactional data that qualifies as Big Data. Credit card companies and other financial services companies are about the only ones.
* With stock market data, we are talking about every stock trade and the bid and ask prices between the transactions, for every stock on multiple markets over a significant time period.
For many other companies the Big Data is sub-transactional: it is the events that lead up to transactions.
* Weblogs are semi-structured or badly structured. Consider the number of weblog entries created as you look for a book online: researching 5-10 books, reading reviews and comments. You might generate 1000 entries and may or may not buy a book, so there are potentially lots of entries for no transaction. We also want to enrich this data with metadata about the URLs and information about the location of the user.
* In an online game or virtual world, every interaction between participants and the system, and between the participants themselves, is logged. An individual participant might generate more than 1 million events for their one monthly transaction.
* A single phone call or text message generates many events within a telecoms company.
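To make the weblog case concrete, here is a minimal sketch of parsing and enriching sub-transactional events. It assumes the entries follow the Common Log Format, which real weblogs often do not. The `URL_METADATA` table, the sample lines and the function names are all hypothetical, invented for illustration. Location enrichment would be a similar lookup keyed on the IP address and is left out.

```python
import re
from collections import Counter

# Common Log Format pattern (an assumption; real weblogs vary widely).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical URL metadata used for enrichment.
URL_METADATA = {
    "/book/123": "product_page",
    "/book/123/reviews": "reviews",
    "/checkout": "purchase",
}


def parse_line(line):
    """Return a dict of fields, or None for a malformed entry."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def enrich(event):
    """Attach a page type looked up from the URL metadata."""
    enriched = dict(event)
    enriched["page_type"] = URL_METADATA.get(event["url"], "unknown")
    return enriched


def summarize(lines):
    """Count events per page type, skipping lines that fail to parse."""
    events = (parse_line(line) for line in lines)
    return Counter(enrich(e)["page_type"] for e in events if e is not None)


# Hypothetical sample: three usable entries, one garbled, no purchase.
SAMPLE = [
    '10.0.0.1 - - [10/Oct/2012:13:55:36 +0000] "GET /book/123 HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2012:13:55:40 +0000] "GET /book/123/reviews HTTP/1.1" 200 5120',
    'garbled entry with no structure',
    '10.0.0.1 - - [10/Oct/2012:13:56:02 +0000] "GET /book/456 HTTP/1.1" 200 1999',
]

counts = summarize(SAMPLE)
```

In this sample the visitor generates several events and `counts["purchase"]` stays at zero: entries, but no transaction. The garbled line is skipped rather than failing the whole run, which is usually the safer default for badly structured input.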
TAKE-AWAYS
Pentaho has many big data customers across a range of industries and big data platforms.
TAKE-AWAYS
Pentaho provides complete integrated DI+BI for every leading big data platform.
Big Data solutions are not databases. They don’t provide the capabilities that BI toolsets expect of a database.
Hadoop also has high latency: even the smallest possible query takes much longer to execute than it would on a database.
Hadoop is optimized for executing very intensive data processing tasks on very large amounts of data. It is not optimized for quick queries. Some Hadoop experts recommend configuring the workloads so that Hadoop jobs take an hour or more. This conflicts with OLAP performance criteria of 5-10 seconds per query.
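The latency point can be illustrated with a simple cost model: total time is a fixed startup cost plus the data volume divided by throughput. This is a rough sketch, not a benchmark. Every number below (startup costs, throughputs, data sizes) is a made-up assumption chosen only to show the shape of the trade-off, and `query_time_seconds` is a hypothetical helper.

```python
def query_time_seconds(data_gb, startup_s, throughput_gb_per_s):
    """Fixed startup latency plus scan time for the data volume."""
    return startup_s + data_gb / throughput_gb_per_s


# Illustrative assumptions only, not measurements of any real system.
BATCH = {"startup_s": 30.0, "throughput_gb_per_s": 10.0}     # batch cluster
DATABASE = {"startup_s": 0.01, "throughput_gb_per_s": 0.5}   # single database

SMALL_GB = 0.1        # a quick interactive query
LARGE_GB = 100_000.0  # a heavy processing job

batch_small = query_time_seconds(SMALL_GB, **BATCH)
db_small = query_time_seconds(SMALL_GB, **DATABASE)
batch_large = query_time_seconds(LARGE_GB, **BATCH)
db_large = query_time_seconds(LARGE_GB, **DATABASE)

OLAP_LIMIT_S = 10.0   # upper end of the 5-10 second OLAP target
```

Under these assumptions the small query on the batch system is dominated by startup cost and misses the OLAP target however little data it touches, while the database answers it well inside the limit. For the large job the picture reverses, which is the workload the batch system is built for.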
There are database implementations within the Hadoop world, such as Hive and HBase.
Unfortunately for developers who are used to working with data transformation tools, the productivity within the Hadoop environment is not what they are used to.
TAKE-AWAYS
The better choice is obviously visual development.