
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer

1,400 views


Spark & Cassandra Use Case at Telefónica CyberSecurity (CBS) Antonio Alcocer antonio@stratio.com Oscar Mendez oscar@stratio.com @omendezsoto #CassandraSummit 2014 1

Published in: Data & Analytics


  1. Spark Use Case at Telefónica CyberSecurity (CBS). Antonio Alcocer (antonio@stratio.com), Oscar Mendez (oscar@stratio.com, @omendezsoto). #CassandraSummit 2014
  2. Who are we? STRATIO • Stratio is a Big Data company • Founded in 2013 • Commercially launched in 2014 • 50+ employees in Madrid • Office in San Francisco • Certified Spark distribution
  3. [Slide without text]
  4. General info • 1924-2014: 317+ customer with 130,000+ employees • 2nd European operator by revenues • 4th global integrated operator by accesses • 9th telco in the global ranking by market capitalization • 2nd global operator for investment in R&D
  5. Present in 24 countries
  6. Their main brands
  7. Other brands
  8. Telefónica Global Solutions. Global Security Services: a global infrastructure to safeguard your business
  9. Managed Security
  10. CyberSecurity??
  11. Why???? A picture is worth a thousand words - but a film clip, a million!
  12. Why???? A picture is worth a thousand words - but a film clip, a million!
  13. Why???? A picture is worth a thousand words - but a film clip, a million!
  14. Don’t worry…
  15. What is Cybersecurity? What does it mean for us? “Cybersecurity is the collection of tools, policies… capabilities to protect the cyber environment and organization and users’ assets. Cybersecurity strives to prevent unauthorized access to, manipulation of the integrity, confidentiality, or availability of information, or unauthorized exfiltration of information.” No rules, just guidelines.
  16. An example of threats: Cassandra OpsCenter, world map, WordPress
  17. C* OpsCenter + Shodan
  18. C* OpsCenter + Shodan
  19. Other threats
  20. CyberSecurity in numbers
  21. Numbers. Threats: • DDoS (23%) • SQLi (19%) • Defacement (14%) • Account Hijacking (9%) • Unknown (18%)
  22. Looking for unknown threats
  23. What did Telefónica need?
  24. Joining efforts
  25. Joining efforts
  26. Required skills
  27. Using […] in […]
  28. Use Case Architecture. We have three phases: • Ingestion: based on Apache Kafka • Data fusion: based on Apache Storm • Batch & Analytics: based on Cassandra and Spark
  29. Data Acquisition • Data come from several sources: • DNS traffic • IP • Social media • Underground sources • Government sources • … • Several source consumers pull the information and push it into a Kafka cluster • Sources are heterogeneous and their speed is variable. [Diagram: data sources → Kafka API]
  30. Data fusion • We use Storm to process and normalize the information. • The system must fire alerts to the analysts. • This use case required a Big Data component capable of processing the data and extracting its information in real time. • Warnings and alerts are time-sensitive in order to deal efficiently with security attacks.
  31. Batch • The data are saved in Cassandra. • We use Cassandra directly for the easy queries. • We use Spark to extract the information that is not directly accessible through Cassandra. [Diagram: data process, integration]
  32. Why did we use C*? Because we need its features: • P2P architecture • Read/write performance • Fault tolerance • Easy to deploy • CQL
  33. Why did we use C*? • We also needed to model the data: • The data are normalized by source in Storm. • The primary key combines the source key (e.g. IP) and a timestamp, with a time split as the clustering key. • Every data row has view tables holding the relationships between entities: IP, DNS, Domain… • Table IP: columns IP, timestamp, timesplit, …, Domain, …; Primary Key ((IP, timestamp), timesplit) • Table IP_Domain: columns Domain, timestamp, timesplit, IP1 … IPn; Primary Key ((Domain, timestamp), timesplit)
  34. Why did we use C*? • IP main table: table IP; columns IP, timestamp, timesplit, …, Domain, …; Primary Key ((IP, timestamp), timesplit) • IP view for domain: table IP_Domain; columns Domain, timestamp, timesplit, IP1 … IPn; Primary Key ((Domain, timestamp), timesplit) • Domain main table: table domain; columns Domain, timestamp, timesplit, …, IP, …; Primary Key ((domain, timestamp), timesplit) • Domain view for IP: table Domain_IP; columns IP, timestamp, timesplit, domain1 … domainn; Primary Key ((IP, timestamp), timesplit)
  35. What have we learned?
  36. The best of both worlds combined. “Two plus two is four? Sometimes… Sometimes it is five.” (G. Orwell) Combination wins
  37. Risk: combination = adding more and more products to the stack
  38. Complexity: hybrid Hadoop + Spark platforms. Hybrid = complexity
  39. One stack to rule them all: batch processing, interactive, stream processing, RDD-based matrices. Why Spark: batch, interactive (SQL), streaming, machine learning. Learn just one system. Develop within one framework. Deploy/manage just one system. Databricks co-founder & CTO Matei Zaharia (source)
  40. Be rational, not only emotional
  41. The only pure Spark processing. No Hadoop elements. 10+ year-old constraints
  42. Lean simplicity. Pure Spark platform. Former Hadoop or hybrid Hadoop-Spark platforms. Lean = easier deployment, management, and use of the system
  43. Making not a POC but a real project for a big company is very demanding. [Platform diagram: Stratio Admin, Stratio Datavis, Stratio Ingestion, Stratio Crossdata (Spark), Cassandra, MongoDB, Elasticsearch, HDFS, Stratio Streaming (Spark Streaming, Siddhi). Spark Certified]
  44. Multiple Combination. API. Elastic S. https://github.com/Stratio/stratio-meta
  45. Full text search + queries. [Diagram: C* nodes, each with a Lucene index] SELECT * FROM logs WHERE description MATCH '*Exception';
  46. Stratio Streaming • Start using Spark Streaming for Complex Event Processing operations. https://github.com/Stratio/stratio-streaming
  47. Data journey through time: past, present, future. Stored data. Real-time data. Streaming. ML algorithms. Ephemeral tables. Stored tables. SQL combination: done. SQL combination: in progress. Quantum tables
  48. Thanks in advance
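
Slides 29 and 30 describe heterogeneous sources being normalized and alerts being fired to the analysts. A minimal Python sketch of that step follows. It uses no Kafka or Storm API; the record layouts, the `Event` fields, the `normalize` and `check_alert` helpers, and the watchlist are all hypothetical names chosen for illustration, since the slides do not say how records look or how alerts are decided.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical watchlist; the slides do not say how alerts are decided.
WATCHLIST = {"203.0.113.7"}


@dataclass(frozen=True)
class Event:
    """Common shape for records after normalization (assumed fields)."""
    source: str
    ip: str
    domain: Optional[str]
    timestamp: int  # seconds since the epoch


def normalize(source: str, raw: dict) -> Event:
    """Map a source-specific record onto the common Event shape."""
    if source == "dns":
        # Assumed DNS record layout: {"client": ..., "query": ..., "ts": ...}
        return Event("dns", raw["client"], raw["query"].lower(), int(raw["ts"]))
    if source == "ip":
        # Assumed IP feed layout: {"addr": ..., "seen_at": ...}
        return Event("ip", raw["addr"], None, int(raw["seen_at"]))
    raise ValueError(f"unknown source: {source}")


def check_alert(event: Event) -> Optional[str]:
    """Return an alert message if the event's IP is on the watchlist."""
    if event.ip in WATCHLIST:
        return f"ALERT {event.source}: {event.ip} at {event.timestamp}"
    return None


incoming = [
    ("dns", {"client": "198.51.100.4", "query": "Example.COM", "ts": "1410000000"}),
    ("ip", {"addr": "203.0.113.7", "seen_at": 1410000060}),
]
events = [normalize(source, raw) for source, raw in incoming]
alerts = [msg for msg in map(check_alert, events) if msg is not None]
```

Normalizing first keeps the alert rule independent of each source's field names, which matters when sources are as varied as slide 29 suggests.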
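
Slide 31 separates easy queries, answered by Cassandra directly, from information that has to be extracted with Spark. One way to read that split: a lookup by the full partition key touches a single partition, while a question that spans keys needs a scan and an aggregation. The sketch below shows the two access patterns on a plain Python dictionary. It does not use Cassandra or Spark, and the sample rows and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (ip, bucket) -> list of (timesplit, domain); a stand-in for table partitions
rows: Dict[Tuple[str, int], List[Tuple[int, str]]] = {
    ("198.51.100.4", 0): [(10, "example.com"), (20, "example.org")],
    ("198.51.100.4", 86_400): [(5, "example.com")],
    ("203.0.113.7", 0): [(15, "example.com")],
}


def domains_for(ip: str, bucket: int) -> List[str]:
    """'Easy' query: the full partition key is known, so one lookup suffices."""
    return [domain for _, domain in rows.get((ip, bucket), [])]


def distinct_ips_per_domain() -> Counter:
    """Cross-partition question: every partition is scanned and aggregated.

    In the architecture described on the slides, this kind of job would
    presumably be handed to Spark rather than run as a direct query.
    """
    seen = {(domain, ip) for (ip, _), obs in rows.items() for _, domain in obs}
    return Counter(domain for domain, _ in seen)
```

Calling `domains_for("198.51.100.4", 0)` reads one partition, whereas `distinct_ips_per_domain()` has to visit all of them.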
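
Slides 33 and 34 give the key layout `((IP, timestamp), timesplit)` and pair each main table with a view table keyed by the related entity. The Python sketch below mimics that layout with in-memory dictionaries rather than Cassandra. It assumes that `timestamp` is a coarse bucket (one day here) and that `timesplit` is the offset inside that bucket; the slides define neither, and `BUCKET_SECONDS`, `split_time`, `write_observation`, and `ips_for_domain` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

BUCKET_SECONDS = 86_400  # assumption: one partition per entity per day

# partition key -> {clustering key -> value}, loosely mirroring a partition
ip_table: Dict[Tuple[str, int], Dict[int, str]] = defaultdict(dict)
ip_domain_view: Dict[Tuple[str, int], Dict[int, Set[str]]] = defaultdict(dict)


def split_time(epoch_seconds: int) -> Tuple[int, int]:
    """Split a time into (bucket start, offset inside the bucket)."""
    timesplit = epoch_seconds % BUCKET_SECONDS
    return epoch_seconds - timesplit, timesplit


def write_observation(ip: str, domain: str, epoch_seconds: int) -> None:
    """Write one observation to the main table and to its view table."""
    bucket, timesplit = split_time(epoch_seconds)
    ip_table[(ip, bucket)][timesplit] = domain
    # setdefault keeps every IP seen for this domain at the same instant
    ip_domain_view[(domain, bucket)].setdefault(timesplit, set()).add(ip)


def ips_for_domain(domain: str, bucket: int) -> Set[str]:
    """Single-partition lookup: all IPs seen for a domain in one bucket."""
    found: Set[str] = set()
    for ips in ip_domain_view.get((domain, bucket), {}).values():
        found |= ips
    return found


write_observation("198.51.100.4", "example.com", 1_410_000_000)
write_observation("203.0.113.7", "example.com", 1_410_000_060)
bucket, _ = split_time(1_410_000_000)
```

Writing each observation twice trades storage for read speed: the question "which IPs resolved to this domain?" becomes a lookup in one partition of the view table instead of a scan of the main table.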
