Tonight's agenda will basically be an overview of the different areas of BI, starting from the back-end with data warehousing and data integration and moving into the front-end with reporting, visualization, and statistical analysis.
So it's important for us to level-set and define what BI really is. It has quickly become one of the most important fields in the business world because it allows businesses to make better, faster decisions.
BI is not just one field, but many overlapping fields. One can't just look at IT and say that it is BI. It takes experts in data management, process modeling, and statistics to really make a BI program deliver the best return on investment.
Thomas Davenport discovered that businesses actually go through a very predictable pattern while developing the ability to make better business decisions through data. Analytically impaired companies are those that are more 'gut driven'. They make decisions based on conjecture and feeling, not on the actual data in their systems! At the top are the analytical competitors. These companies make all of their business decisions with good data to back them up. Some examples of those companies are: Amazon, Harrah's Entertainment, and Zynga.
In keeping with the same pyramid structure, there is also a clear path to the types of tools used when companies develop a BI program. Usually reporting is the start because companies need to know what happened. As mentioned, all of these tools could be used in silos throughout an analytically impaired company. A silo example would be one employee who builds complex spreadsheets because there's no other way to report on their department. As companies move up the Analytic Competitor pyramid, more of these tools are utilized and integrated throughout the organization. For their full potential to be met, not only does the company have to start using its data to make decisions, it also needs built-in systems that can take the data, filter it based on business-requirement criteria, and change their workflow automatically based on that data.
BI's area of focus boils down to essentially three areas: past, present, and future. Taking performance as an example, one could start utilizing reporting tools to answer questions like 'How was my server's performance last week?'. At this point, the data is probably still coming from production systems and can actually hinder the very performance the company is trying to report on. As the company matures, questions quickly arise not only about past performance, but also about how performance is trending and how well those systems are currently performing. Dashboards and other data visualization tools can report both trending and current performance. By this time, most companies would have at least started a rudimentary data warehouse for performance reasons. Many companies stop there at present performance. It takes a lot of effort to move into predictive analytics because more data-oriented skills are needed. Answering with confidence about future performance based on historical trends is the ultimate goal of BI.
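As a toy illustration of that 'future' end of the spectrum, here is a minimal trend forecast sketched in Python. The weekly response times are made-up numbers, and real predictive work would use a statistics tool such as R, Weka, or RapidMiner rather than hand-rolled math:

```python
# Fit a straight-line trend (ordinary least squares) to hypothetical
# weekly response times, then project one week ahead.
weeks = [1, 2, 3, 4, 5]
response_ms = [120, 130, 138, 150, 161]  # made-up historical measurements

n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(response_ms) / n

# slope = covariance(x, y) / variance(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, response_ms))
         / sum((x - mean_x) ** 2 for x in weeks))
intercept = mean_y - slope * mean_x

# 'Future': predicted response time for week 6 based on the trend.
next_week = slope * 6 + intercept
print(round(slope, 1), round(next_week, 1))
```

This only captures a linear trend, but it shows the shift in question: from 'what happened?' to 'what will likely happen next?'.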
Any good BI program starts with a data warehouse. You can think of a warehouse as a specialized database that offloads historical data from your production environment. It does a lot more than that as well – unlike a production environment, a data warehouse actually stores deltas, changes in the data set, that would otherwise be lost forever. For example, if you have a table that stores an employee's first name, the production system would only store the current value. If an employee changed his name from Bob to Robert and then to Sally, your production database would never remember the first two values. The data warehouse would store all three values, along with the time each change occurred and how long each value was valid. The other neat thing about data warehouses is how they integrate data from across an organization. If a company has an ERP, an online website, and an external data set, the warehouse can integrate those three systems' data into one cohesive data set. There are many different modeling styles for a data warehouse. The traditional methodologies are very similar to what is used in an ideal database environment: third normal form is the standard normalization you would see in a typical database, while data marts move the data into a format better suited for reporting and analysis by end users. In the “Data Warehousing 2.0” line, there is data vault modeling, which is a hybrid of the first two, and anchor modeling. Anchor modeling is interesting in that it is actually in sixth normal form and can get pretty complex.
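The Bob/Robert/Sally example is what warehousers often call a slowly changing dimension: each value is kept with the interval during which it was valid. Here is a minimal Python sketch of the idea, using toy data and hypothetical dates rather than any particular warehouse product:

```python
from datetime import date

# History for one employee. A production table would hold only the
# current value; the warehouse keeps every value plus its validity window.
history = [{"first_name": "Bob", "valid_from": date(2010, 1, 1), "valid_to": None}]

def record_change(history, new_name, change_date):
    """Close the current record and open a new one for the changed value."""
    history[-1]["valid_to"] = change_date          # the old value expires
    history.append({"first_name": new_name,
                    "valid_from": change_date,
                    "valid_to": None})             # open-ended current record

record_change(history, "Robert", date(2011, 6, 1))
record_change(history, "Sally", date(2012, 3, 15))

current = [r["first_name"] for r in history if r["valid_to"] is None]
print(current)       # only "Sally" is current...
print(len(history))  # ...but all three values survive in the warehouse
```

A production system overwrites in place; the warehouse appends, which is why it can answer questions like 'what was this employee's name in 2011?'.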
There are actually quite a few options for warehousing in OS, from more traditional databases that work well with 3NF to columnar data stores that are highly optimized for data marts. NoSQL has also become an option because it can store the unstructured and semi-structured data that never could be stored in a normal warehouse environment.
Columnar data stores basically flip the data from rows into columns. In a typical row-based database, filtering on the last name column means scanning entire rows, reading every column along the way. In a columnar store, only the last name column needs to be read to apply the filter, and other aggregations can be performed as fast as those columns can be scanned. The other neat thing about columnar databases is that many of them are smart enough to learn how users query their data sets. They can actually trim and grow their indexes accordingly, so users see huge performance gains.
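To make the row-versus-column distinction concrete, here is a toy Python sketch with a hypothetical three-row table. It is an illustration of the storage layouts, not a real columnar engine:

```python
# Row layout: each record is stored together, so a filter on last_name
# still drags every other column through memory.
rows = [
    {"first_name": "Ada", "last_name": "Smith", "salary": 50000},
    {"first_name": "Bob", "last_name": "Jones", "salary": 60000},
    {"first_name": "Cal", "last_name": "Smith", "salary": 70000},
]
total_row = sum(r["salary"] for r in rows if r["last_name"] == "Smith")

# Column layout: the same table stored as one list per column.
cols = {
    "first_name": ["Ada", "Bob", "Cal"],
    "last_name": ["Smith", "Jones", "Smith"],
    "salary": [50000, 60000, 70000],
}

# Only two columns are touched: last_name for the filter, salary for the sum.
matches = [i for i, ln in enumerate(cols["last_name"]) if ln == "Smith"]
total_col = sum(cols["salary"][i] for i in matches)

print(total_row, total_col)  # both layouts agree: 120000 120000
```

At three rows the difference is invisible, but on a wide table with billions of rows, reading two columns instead of dozens is where the performance gains come from.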
NoSQL tools are able to store 'documents' in a highly compressed way so that PB+ data sets can be quickly filtered through. This is the tool that warehousers have wanted for years, but is only now starting to go mainstream! Unstructured and semi-structured data sets have not been able to easily be searched through until now. It's easily the proverbial gold mine. Look at Facebook or Twitter and you can see where this could be a huge advantage for understanding customer bases.
Where data warehouses are the back-end storage system, data integration acts as the plumbing. DI moves data from source systems into a warehouse or other application. There are many types of DI, from ETL (extract, transform, load), which moves, cleans, and loads data, to MDM (master data management), which moves and syncs data across systems, and more. There are two big OS DI tools: Talend and Pentaho K.E.T.T.L.E.
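To sketch what an ETL pass actually does, here is a minimal extract-transform-load in Python over hypothetical CSV data. Real DI tools like Talend or Kettle express the same steps graphically and run them at scale:

```python
import csv
import io

# Hypothetical extract: a messy CSV feed from a production system,
# with stray whitespace, inconsistent casing, and a missing amount.
source = io.StringIO("id,name,amount\n1,  alice ,100\n2,BOB,\n3,carol,250\n")

def extract(fh):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(fh))

def transform(records):
    """Transform: trim and normalize names, default missing amounts to 0."""
    return [{
        "id": int(r["id"]),
        "name": r["name"].strip().title(),
        "amount": int(r["amount"] or 0),
    } for r in records]

def load(records, warehouse):
    """Load: write cleaned records into the target store, keyed by id."""
    for r in records:
        warehouse[r["id"]] = r

warehouse = {}
load(transform(extract(source)), warehouse)
print(warehouse[2])  # "BOB" cleaned to "Bob", missing amount defaulted to 0
```

The cleaning step is where most of the real-world effort lives; source systems rarely agree on formats, and the warehouse is only as trustworthy as this layer.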
Now that the back-end has been covered, we can start climbing the pyramid of front-end tools. Reporting is the start of this climb and usually where most organizations start since it is the easiest to implement.
There are quite a few options out there, and these are some of the more popular ones. The comparison only takes into account the actual reporting tool and not its server-side component, if applicable. BIRT is an Eclipse-based tool, so if you're using Eclipse you may want to consider it. Pentaho's Report Designer and JasperReports are stand-alone tools. All three use a style of design known as “banded” reports, where data elements are essentially dragged and dropped onto a palette. All three do have server-side components, and all three report designers can embed reports into existing applications (i.e. web apps, Java apps). The neat thing about Saiku and SQL Power Wabit is that they are both built to handle OLAP cubes as well as normal reporting. Saiku's Interactive Reporting tool is still in beta, but is looking very impressive. Saiku is a thin-client analytics tool that can be embedded with BI servers or live as its own stand-alone tool.
Some charts generated in BIRT.
Here is a screenshot of Pentaho's Report Designer. Each line of the report is the 'banded row' mentioned earlier.
Visualization is the next area of our tour. In a nutshell, visualizations take very complex data and make it very easy to interpret and take action.
This dashboard is from Stephen Few's Information Dashboard Design book. Notice how it is not flashy, with muted colors that really help to draw attention to the bright red circles. There is a lot of information packed into this space. From trends, to current performance and pacing, it's all here and in plain sight. Usually dashboards like this will also have a “drill through” ability. For example, clicking on an alert will take you to a more detailed report or view of the data so that a decision can be made on how to react.
Visualizations can also be fun, and even describe themselves. XKCD has quite a few such examples.
Notice how much information is packed into such a small space, yet can still be understood.
There is really only one OS tool that I have been able to find that builds dashboards akin to Few's. Pentaho's Community Dashboard Framework and Editor was designed by Webdetails and adopted by Pentaho. It is still a stand-alone library.
This is a sample dashboard that WebDetails built for a training course on the tools. Notice that the same principles used by Few are applied here.
We've reached the top of our tour of BI. Statistical and Predictive analysis is the goal, and OS provides quite a few options.
Here's a screenshot of RapidMiner at work.
Of note, there are three companies providing an OSBI suite of tools. The biggest differentiation between them is their communities. Jaspersoft's and SpagoBI's suites are not totally in their control because they have licensed Talend for their ETL and metadata tools. All three use Pentaho's Mondrian OLAP engine. Pentaho and SpagoBI license the use of Weka as part of their suites of tools.
Yes, I have to put in a shameless plug. I am the Community Leader for the local Pentaho User Group. We are currently on LinkedIn ( www.linkedin.com/groups/RTP-Pentaho-User-Group-3674498 ) and will soon be on Meetup. We're currently meeting quarterly and are looking for speakers.