6. Three skills for big data analysts
• Strategic data planning. Understand how data
is the new raw material for any modern
business.
• Analytical skills. Reporters have always been
smart about asking the right questions, but
now they have to dig through the data too.
• Technology skills. Embrace the technology
and make it a key part of your reporting skill
set.
20. Riot Games Current BI Stack
• Honu: Streaming log collection and event
processing pipeline
• Platfora: BI analysis and visualization
• Oozie: Workflow job scheduler
• Hive: Data warehouse and queries
• Chef: Code deployment and configuration
management
• GitHub: Versioning and tracking of programs
• Jenkins: Build system management
• Eureka: Service discovery process
30. Stay in touch
• Copies of this presentation:
http://slideshare.net/davidstrom
• My blog: http://strominator.com
• Follow me on Twitter: @dstrom
• Old school: david@strom.com
Editor's notes
I will have a link to these slides at the end of the presentation if you want to download them.
The Big Data tent is getting bigger, and this is just a small snapshot of the hundreds of vendors who are involved.
As an example of how big data is moving into the mainstream, take a look at this conference earlier this year put on by the Regional Arts Commission and several community groups on how the arts can tap into better analytic tools.
Your car has become a data hub, with USB ports, an SD card reader, Bluetooth connections to your phone and even a mobile Wi-Fi hotspot. This next picture is a shot of the latest MyFord Touch dashboard that can be found in many Ford cars. It provides all sorts of controls over what music you listen to and the climate inside your car, plus a connection to your phone so you can dial from your address book. Currently, Ford collects and aggregates data from the 4 million vehicles that use in-car sensing and remote app management software to create a virtuous cycle of information. The data allows Ford engineers to glean information on a range of issues, from how drivers are using their vehicles, to the driving environment, to electromagnetic forces affecting the vehicle, and feedback on other road conditions that could help them improve the quality, safety, fuel economy and emissions of the vehicle. Drivers willing to share how many miles they’ve traveled could get discounts of between 10 and 40 percent in exchange for providing State Farm with a more accurate picture of their vehicle-use habits, which the insurer obtains electronically, directly from the Sync telematics system in the car.
As Paul posted in one of his blog entries earlier this year, it is time each of us developed all three of these skills so we can become more valuable to our organizations. Paul posted: “Why don't we put stronger emphasis on one person having the breadth of skills to play multiple roles on a given project?” http://walkingoncoals.blogspot.com/2013/08/data-modelers-model-what-do-data.html http://www.readwriteweb.com/cloud/2012/02/strata-2012-3-essential-skills.php Diego Saenz of Data Driven CEO
Let’s move on to talking about maps. Maps can be extremely useful analysis tools: they can spot corporate trends ahead of other methods and can be part of a broader data analysis project that wins over your management for new business investments. This is a historic case: a doctor used a map of cholera outbreaks in central London in the 1850s to identify the infected water pumps that were the source, and the map was used in Tufte’s book on visualizing data. You would think mapping disease transmission would have taken off after this example, but surprisingly it has only recently emerged as something that epidemiologists use for their own analyses. There isn’t much understanding about the spatial factors for disease risk today, and it is a rich field of study.
And while Google Maps is certainly popular, there are other sites that make mapping even more powerful by drawing on the wisdom of the crowds. These efforts include Crowdmap and OpenStreetMap. Here is a crowdsourced map of a neighborhood outside of Nairobi, Kenya, which until this effort was pretty much uncharted territory, what mappers call outdoor white spaces. Thanks to this citizen effort, the community put together a map that locates all sorts of resources, such as water pumps and grocery stores. Other humanitarian efforts have been aided by open maps, using crowds to help people get more control over their local government and make their politicians more accountable. This illustrates a big trend in online mapping, where we are getting better and higher definition maps all the time. For example, mapping specialists once didn't care about where abandoned car tires were sitting on the ground by the sides of roads or in otherwise vacant lots. However, in certain parts of the world, these tires collect standing water and are places where insects can breed and carry disease. Now they are included in some maps.
StreetRx.com can be used to find the least expensive medications in your local area.
Let me put up the next slide showing you something a bit more palatable. David Smith put this map together from about 400 wineries in the Napa Valley area. Not only can you scroll and zoom the map, but clicking on one of the winery markers will tell you its address and whether an appointment is required for tastings. He worked with Barry Rowlingson, who used OpenStreetMap and his own R package to build this. And while 400 data points don't sound like a very big collection of data, what these guys did is noteworthy since they used a collection of APIs and open source code to produce the final product.
Some of the firewall vendors have taken mapping a step further. When you set up their firewall rules, you can exclude or monitor traffic based on the country of origin. This can be helpful if you examine your firewall logs and see unexpected and unwanted traffic, such as exploits, coming from particular countries. For example, let's say you are prohibited by law from doing business in certain export-controlled countries such as Cuba or North Korea. Wouldn't you like to know if your staff is handling support requests from Cubans? This could be a good indication that your products are entering those countries through grey markets. The vendors have also integrated geofencing with their own reputation management systems, so they can identify particular domains that are known to send malware and locate where lots of exploits originate. Here is an example using the McAfee Firewall and its TrustedSource.org reputation management service. You can select particular countries to deny or allow traffic, using a simple series of menus. McAfee comes with some preset groups, such as countries with US export controls.
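To make the country-of-origin idea concrete, here is a minimal sketch in Python of the kind of check you could run over your own firewall logs. It is not McAfee's rule engine or API. It assumes, hypothetically, that each log entry has already been tagged with a two-letter ISO country code by some GeoIP lookup, and the field names, sample addresses and deny list are made up for illustration.

```python
from collections import Counter

# Hypothetical deny list using ISO 3166-1 alpha-2 codes:
# CU = Cuba, KP = North Korea (the two examples from the talk).
DENIED_COUNTRIES = {"CU", "KP"}


def flag_denied_traffic(log_entries, denied=DENIED_COUNTRIES):
    """Count log entries per country for countries on the deny list.

    Each entry is assumed to be a dict with a "country" key that a
    GeoIP lookup (not shown here) has already filled in.
    """
    hits = Counter()
    for entry in log_entries:
        country = entry["country"]
        if country in denied:
            hits[country] += 1
    return hits


# Made-up sample log; the addresses come from reserved documentation ranges.
sample_log = [
    {"src_ip": "192.0.2.10", "country": "US"},
    {"src_ip": "198.51.100.7", "country": "CU"},
    {"src_ip": "203.0.113.5", "country": "KP"},
    {"src_ip": "198.51.100.9", "country": "CU"},
]

denied_hits = flag_denied_traffic(sample_log)
print(dict(denied_hits))
```

A real firewall applies this decision inline, per connection. A report like this one is only a way to see whether grey-market traffic is showing up at all.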
But that is just the great outdoors. The firm Aisle411.com is working with major retailers to produce custom indoor maps, to make it easier for shoppers to track down that odd piece of hardware at Lowe's or find the half-price jar of olives at the local supermarket. And others are in the process of creating inexpensive portable indoor sensors that can be distributed to building owners and occupants to collect information that could ultimately be used to improve business processes that happen in their buildings, such as changing production lines or environmental factors. What used to be done manually and took a lot of time and effort can now be done digitally, providing more insights in less time.
This article ran in Restaurant News earlier this year and described how several chains, including Boston Market, are using Big Data techniques to target particular store promotions, offering repeat customers prepaid debit cards as incentives to return.
Big Data is also being used in some of the world's largest corporations. We are looking at Procter & Gamble’s Business Sphere big data situation room in their Cincinnati HQ. A big data analyst drives these large screens that display data visualizations on sales, market share, ad spending and the like, so everyone in the meeting is seeing the same information based on 4 billion daily transactions of P&G products. P&G isn’t after new data types; it still wants to share and analyze point-of-sale, inventory, ad spending, and shipment data. What’s new is the higher frequency and speed at which P&G gets that data, and the finer granularity. Even with all this gear, P&G has only about two-thirds of the real-time data it needs.
This is an article in Forbes published last summer about Farmeron, a Web data service that farmers can use to aggregate the troves of information produced about their animals. It was started by two Croatian computer scientists. You can track animal physical characteristics along with milk production, medical treatments, and even particular feeding group schedules. You can view how the weight of your animals has changed based on certain feedlot procedures or keep up with the particulars of your animals' breeding schedules. So as soon as your animal is born or enters your farm, you can track all of these details in their database. And John Deere, the leading tractor company, isn’t just operating on idle either. Today’s tractors are pretty high tech affairs: a farmer can operate the machine without having hands on the steering wheel because the tractor is driven according to GPS coordinates. This improves precision in seeding and fertilizing, and allows for improved harvesting. Deere's tractors can collect significant data on what crops are planted, how they are fertilized and how much yield any portion of the field produces. You can even input curfew hours during which you don't expect your machinery to be operating. There is even a web portal to monitor all this data.
Traditional dairy operations are fairly labor intensive, requiring consistent milking regimens several times a day and seven days a week. That may be a thing of the past, thanks to a number of automatic milking machines that are available, such as this one from a Swedish company, DeLaval. The machines have various arms that handle different tasks, such as sterilization, the actual milking process, and tracking the RFID tags that are placed in each cow's ear. They come with optical sensors to place their milking collectors at just the right place on the cow (we'll let you imagine the anatomical details on your own). And given that there are more than eight million Holstein dairy cows in the United States, the potential Big Data uses are huge. You can see the small computer control station on the right, and there is even an Internet connection so that farmers can monitor the milk collection remotely and run their herd from a laptop. They can also milk their cows 24x7, which helps to increase production and is less stressful on both the farmer and the cows!
We will be hearing from Jeff Melching first hand, but here is a little preview. Monsanto is using Hadoop in many Big Data efforts besides keeping track of their crop genomes and other biological plant properties. They also have photographic imagery of crop fields. All told, there are several tens of petabytes that need storage and analysis, a number that’s doubling roughly every 16 months. They have also invested in FarmCare, which sends mobile phone alerts about real-time weather threats to farmers; North Star, a global supply chain transportation management system that has saved millions of dollars in overhead costs; and Precision Planting, which uses software to support farming techniques.
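To put "doubling roughly every 16 months" in perspective, here is a small back-of-the-envelope sketch in Python. The formula is plain exponential growth, size = start × 2^(months / 16). The 30 PB starting point is a hypothetical stand-in for "several tens of petabytes", not a figure from Monsanto.

```python
def projected_size(start_pb, months, doubling_months=16):
    """Project a data volume that doubles every `doubling_months` months."""
    return start_pb * 2 ** (months / doubling_months)


# Growth factor over one year: 2 ** (12 / 16), roughly 1.68x.
annual_growth = projected_size(1.0, 12)

# Hypothetical starting point of 30 PB, projected 32 months out
# (two doubling periods, so four times the starting size).
size_after_32_months = projected_size(30.0, 32)

print(f"Annual growth factor: {annual_growth:.2f}x")
print(f"Size after 32 months: {size_after_32_months:.0f} PB")
```

In other words, a 16-month doubling time means the store grows by about two-thirds every year, which is worth knowing when you budget for capacity.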
FieldScripts is the first in an evolution of farming software tools that will provide a lot more intelligence and real-time information to farmers. In the past, farmers stored their agronomic data on USB sticks that they mailed to Monsanto for analysis—a cumbersome process, to say the least. Now the cloud has become part of the equation, with Monsanto considering how to best leverage the growth in mobile connections to send data from a farm for analysis.
Riot Games began its operations with a monolithic SQL platform for its data warehouse. It required a great deal of manual, custom-coded processes. Queries were written in MySQL and most of the reporting was done in Excel. As you can imagine, this was causing them issues: the daily data extract update was approaching 24 hours to complete. Plus, debugging software errors meant digging deep into log dumps to figure out what went wrong.
They replaced their system with Hadoop along with a cloud-based data warehouse and an end-to-end automated software development pipeline, using some of these tools shown here. They now have 7 PB of data!
Germany’s largest online retailer, the Otto Group, gets about a million daily visitors to its fifty different Web storefronts. They set out last year on a project to better track their customers. Through a combination of tools including Hadoop and a massive Teradata data warehouse connected to their Intershop ecommerce system, they were able to sift through terabytes of website log files. They came up with what they call “Customer DNA” to identify how their customers come and go on their sites. Otto Group was able to mix SQL and NoSQL data collections effectively to focus their websites and boost traffic and sales. They pull more than 20 different databases into Hadoop for this analysis.
Hallmark Cards introduces 10,000 new greeting cards each year, and their BI team is trying to become more data-driven. They say data is something that marketing needs to use in its business processes.
PKO, Poland’s largest bank, was looking to roll out a new epayment app for its smartphone users and needed to identify those customers that were Internet-savvy and had the appropriate smartphones and were also comfortable with downloading apps. They used a combination of tools to comb their data warehouse and target the first 37,000 customers that fit their profile. But more importantly, they were able to measure the number of activations of their app by particular marketing campaigns to see which ones brought in the largest number of customers.
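Measuring activations per marketing campaign comes down to a simple group-and-count. Here is a minimal sketch in Python, not PKO's actual tooling: the record layout, campaign names and sample data are all hypothetical, and a real version would run as a query against the data warehouse.

```python
from collections import Counter


def activations_by_campaign(activations):
    """Rank campaigns by how many app activations each one produced.

    Each record is assumed to be a dict with a "campaign" key naming the
    campaign credited with that activation.
    """
    counts = Counter(record["campaign"] for record in activations)
    return counts.most_common()  # highest count first


# Made-up activation records for illustration only.
sample_activations = [
    {"customer_id": 1, "campaign": "email"},
    {"customer_id": 2, "campaign": "sms"},
    {"customer_id": 3, "campaign": "email"},
    {"customer_id": 4, "campaign": "branch_poster"},
    {"customer_id": 5, "campaign": "email"},
    {"customer_id": 6, "campaign": "sms"},
]

ranking = activations_by_campaign(sample_activations)
for campaign, count in ranking:
    print(f"{campaign}: {count}")
```

The hard part in practice is attribution, that is, deciding which campaign gets credit for a given activation. The sketch assumes that has already been settled.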
Williams Sonoma is a classy retailer that has tried to make its online presence just as commanding and satisfying for its customers. Its site has various triggers programmed to respond to particular customer actions: for example, customers who recently browsed an item that then goes on sale in stores are notified of the sale by email, targeted by geography. It could be borderline creepy, but it works. Their goal is to match great-looking Web pages with top-shelf analytics to keep track of customers. “Data science is brand building here,” said one IT manager. The more online visitors buy a particular item, the more the company stocks it at retail outlets. The BI team analyzes these purchases over time to help improve each store’s inventory moving forward.
So what are some important lessons to be learned from some of these examples that I have shown you today? Let’s cover a few recommendations on how you can improve your Big Data use. First, keep your Hadoop and related stacks current. As you can see from the slide with all the software that Riot Games uses, there is a lot of new software to deal with. The community is constantly making updates and you don’t want to be the one asking about a bug in the forums that has already been fixed in a later update.
Second, don’t be afraid to mix SQL and NoSQL data. No need to be religious about it. I think John from SpliceMachine will have something to say along these lines a bit later. Also automate your software development pipeline. Complexity only introduces error, so eliminate manual methods wherever you can.
The customer is always king, and data only serves to improve customer satisfaction. Many of the Big Data projects that I mentioned here were done for this purpose, rather than some rogue IT project.
When in doubt, use a map. Here is one from Healthmap.org that tracks modern-day disease outbreaks.
Finally, don’t be afraid to get help (meetups, various wiki documentation, and GitHub too).
Thanks everyone for listening to me and good luck with your own Big Data explorations.