1. Big Data with Cassandra and Celery
#bbmnk November 2013
Santi Camps Taltavull
@santicamps
@socialvane
2. The Problem (Big Data)
➲ Large volume of information (terabytes)
➲ Unstructured information
➲ Low density of useful information
➲ Very high processing capacity required
➲ Small budget
3. The Solutions Applied
➲ Distributed database: Cassandra
➲ Distributed task manager: Celery
➲ Message broker: RabbitMQ
➲ Application --> RabbitMQ --> Celery <--> Cassandra
➲ 4 initial servers
➲ 12 TB of capacity
➲ 208 GB of RAM
➲ 44 CPU cores
➲ Fault tolerant
➲ Redundant
➲ Very easily scalable
➲ And cheap!
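The flow Application --> RabbitMQ --> Celery <--> Cassandra can be pictured with standard-library stand-ins. This is a minimal sketch, assuming a `queue.Queue` in place of the RabbitMQ broker, a thread in place of a Celery worker, and a dict in place of Cassandra. The names `broker`, `store`, `worker`, and `run_pipeline` are illustrative and do not come from the talk.

```python
import queue
import threading

broker = queue.Queue()   # stand-in for RabbitMQ (assumption)
store = {}               # stand-in for a Cassandra column family (assumption)
STOP = object()          # sentinel telling the worker to exit


def worker():
    """Stand-in for a Celery worker: consume messages, write results to the store."""
    while True:
        message = broker.get()
        if message is STOP:
            break
        key, text = message
        # "Processing" here is just normalising the text.
        store[key] = text.strip().lower()


def run_pipeline(messages):
    """Publish messages as the application would, then wait for the worker."""
    t = threading.Thread(target=worker)
    t.start()
    for message in messages:
        broker.put(message)   # the application only talks to the broker
    broker.put(STOP)
    t.join()
    return dict(store)


result = run_pipeline([("m1", "  Hello "), ("m2", "WORLD")])
```

The application never touches the store directly; it only publishes, which is what lets workers be added on other machines without changing the caller.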
4. Cassandra
➲ Born inside Facebook and released as open source
➲ Adopted by the Apache Software Foundation
➲ Twitter uses it too
➲ Written in Java
➲ It is a NoSQL database
➲ Data is stored as key -> value
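The key -> value model can be pictured as a nested mapping: a row key leads to a set of columns, and each column carries a value plus a write timestamp. This is a toy in-memory sketch of that shape, not Cassandra's storage engine; the `ColumnFamily` class is hypothetical, and the sample row reuses values shown later in the deck.

```python
import time


class ColumnFamily:
    """Toy model: row key -> {column name -> (value, timestamp)}."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_key, columns, timestamp=None):
        # Timestamps in the deck's output are microseconds since the epoch.
        ts = int(time.time() * 1_000_000) if timestamp is None else timestamp
        row = self.rows.setdefault(row_key, {})
        for name, value in columns.items():
            row[name] = (value, ts)

    def get(self, row_key):
        # Columns come back sorted by name, as with a UTF8Type comparator.
        row = self.rows.get(row_key, {})
        return [(name, value, ts) for name, (value, ts) in sorted(row.items())]


user_item = ColumnFamily()
user_item.insert(
    "facebook.santi.camps.58",
    {"source": "facebook", "name": "Santi Camps"},
    timestamp=1383782405981374,
)
columns = user_item.get("facebook.santi.camps.58")
```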
6. Cassandra - Drawbacks
➲ No transaction management
➲ Writes are coordinated using timestamps
➲ With RandomPartitioner, rows cannot be ordered by key
➲ With RandomPartitioner, filtering is difficult
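Two of these points can be sketched in a few lines. The first function shows timestamp-based reconciliation (the newest write wins, with no lock or transaction). The second shows why a hash-based partitioner loses key order. Both are simplified illustrations under stated assumptions: ties are resolved arbitrarily here, `token` only approximates RandomPartitioner's MD5 scheme, and the `user.*` keys are made up.

```python
import hashlib


def reconcile(replica_a, replica_b):
    """Merge two copies of a column, keeping the one with the newest timestamp."""
    # Each replica is a (value, timestamp) pair; no transaction is involved.
    return replica_a if replica_a[1] >= replica_b[1] else replica_b


def token(row_key):
    """MD5-based token in the spirit of RandomPartitioner (simplified)."""
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


winner = reconcile(("0", 1383782405981374), ("12", 1383782405981999))

keys = ["user.a", "user.b", "user.c"]   # hypothetical row keys
# Rows are placed by token, so a scan follows hash order, not key order.
by_token = sorted(keys, key=token)
```

Because placement follows the hash, range scans and ordering over row keys are not available; the indexing slides work around this with hand-built index rows.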
8. Cassandra - Example
create column family user_item
  with key_validation_class = 'UTF8Type'
  and comparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type'
  and column_metadata = [
    {column_name: source, validation_class: UTF8Type, index_type: KEYS},
    {column_name: user_name, validation_class: UTF8Type, index_type: KEYS},
    {column_name: type, validation_class: UTF8Type},
    {column_name: last_update, validation_class: UTF8Type},
    {column_name: id, validation_class: UTF8Type},
    {column_name: profile_image_url, validation_class: UTF8Type},
    {column_name: name, validation_class: UTF8Type},
    {column_name: friends_count, validation_class: UTF8Type},
    {column_name: followers_count, validation_class: UTF8Type},
    {column_name: location, validation_class: UTF8Type},
    {column_name: description, validation_class: UTF8Type},
    {column_name: lang, validation_class: UTF8Type},
    {column_name: geo_latitude, validation_class: FloatType, index_type: KEYS},
    {column_name: geo_longitude, validation_class: FloatType, index_type: KEYS},
    {column_name: geo_radious, validation_class: FloatType}
  ];
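Columns declared with `index_type: KEYS` get a built-in secondary index that can be queried by equality on the column value. A rough in-memory picture of that idea follows. It is an illustration only, not how Cassandra implements it; `insert`, `get_where`, and the row key `twitter.santicamps` are hypothetical.

```python
from collections import defaultdict

# Columns declared with index_type: KEYS in the definition above.
INDEXED = {"source", "user_name", "geo_latitude", "geo_longitude"}

rows = {}                                         # row key -> {column: value}
indexes = defaultdict(lambda: defaultdict(set))   # column -> value -> row keys


def insert(row_key, columns):
    """Store a row and keep the per-column value indexes in sync."""
    old = rows.get(row_key, {})
    for name, value in columns.items():
        if name in INDEXED:
            if name in old:
                # Drop the stale index entry before adding the new one.
                indexes[name][old[name]].discard(row_key)
            indexes[name][value].add(row_key)
    rows.setdefault(row_key, {}).update(columns)


def get_where(column, value):
    """Equality lookup on an indexed column; other columns are rejected."""
    if column not in INDEXED:
        raise ValueError(f"{column} is not indexed")
    return sorted(indexes[column][value])


insert("facebook.santi.camps.58", {"source": "facebook", "type": "user"})
insert("twitter.santicamps", {"source": "twitter", "type": "user"})
matches = get_where("source", "facebook")
```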
9. Cassandra - Example
get user_item['facebook.santi.camps.58']
... ;
=> (name=description, value=Me dedico a ..., timestamp=1383782405981374)
=> (name=followers_count, value=0, timestamp=1383782405981374)
=> (name=friends_count, value=, timestamp=1383782405981374)
=> (name=geo_latitude, value=4.264729, timestamp=1383782405981374)
=> (name=geo_longitude, value=39.88943, timestamp=1383782405981374)
=> (name=geo_radious, value=8.976159, timestamp=1383782405981374)
=> (name=id, value=100000444843078, timestamp=1383782405981374)
=> (name=lang, value=en_GB, timestamp=1383782405981374)
=> (name=last_update, value=2013-11-07T01:00:05.981352, timestamp=1383782405981374)
=> (name=location, value=Mahón, Islas Baleares, Spain, timestamp=1383782405981374)
=> (name=name, value=Santi Camps, timestamp=1383782405981374)
=> (name=profile_image_url, value=https://graph.facebook.com/santi.camps.58/picture, timestamp=1383782405981374)
=> (name=profile_url, value=https://www.facebook.com/santi.camps.58, timestamp=1383782405981374)
=> (name=source, value=facebook, timestamp=1383782405981374)
=> (name=type, value=user, timestamp=1383782405981374)
=> (name=user_name, value=santi.camps.58, timestamp=1383782405981374)
10. Cassandra - Indexing
get user_follower_index['santicamps58.facebook.current'];
=> (name=2013-10-29T11:09:01.979083, value=santicamps58.facebook.100000561127539, timestamp=1381823950979106)
=> (name=2013-10-27T09:59:07.980314, value=santicamps58.facebook.1810751517, timestamp=1381823950980330)
=> (name=2013-10-11T07:50:10.980547, value=santicamps58.facebook.100002326398873, timestamp=1381823950980559)
...
get user_follower_item['santicamps58.facebook.100002326398873'];
=> (name=fetch_date, value=2013-10-15, timestamp=1381823950980662)
=> (name=friend_count, value=134, timestamp=1381823950980662)
=> (name=id, value=100002326398873, timestamp=1381823950980662)
=> (name=lang, value=, timestamp=1381823950980662)
=> (name=name, value=Diego Izquierdo Carranza, timestamp=1381823950980662)
=> (name=profile_image_url, value=https://graph.facebook.com/diego.izquierdocarranza/picture, timestamp=1381823950980662)
=> (name=profile_url, value=https://www.facebook.com/diego.izquierdocarranza, timestamp=1381823950980662)
=> (name=source, value=facebook, timestamp=1381823950980662)
=> (name=start_date, value=2013-10-15, timestamp=1381823950980662)
=> (name=user_name, value=diego.izquierdocarranza, timestamp=1381823950980662)
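The two `get` calls above follow a hand-built index pattern: one wide row whose column names are timestamps and whose values are keys into a second column family. A minimal sketch with plain dicts, reusing the keys shown above, is given below. The helpers `latest_followers` and `fetch_items` are hypothetical names, and sorting is done in Python here, whereas Cassandra keeps the columns ordered by the comparator.

```python
# Index row: column names are ISO timestamps, values are keys of item rows.
user_follower_index = {
    "santicamps58.facebook.current": {
        "2013-10-11T07:50:10.980547": "santicamps58.facebook.100002326398873",
        "2013-10-29T11:09:01.979083": "santicamps58.facebook.100000561127539",
        "2013-10-27T09:59:07.980314": "santicamps58.facebook.1810751517",
    }
}
user_follower_item = {
    "santicamps58.facebook.100002326398873": {"name": "Diego Izquierdo Carranza"},
}


def latest_followers(index_key, limit):
    """Return item keys, newest first, by sorting the index column names."""
    columns = user_follower_index.get(index_key, {})
    # ISO 8601 strings with the same layout sort chronologically as text.
    newest_first = sorted(columns, reverse=True)[:limit]
    return [columns[name] for name in newest_first]


def fetch_items(item_keys):
    """Second step: one lookup per key in the item column family."""
    return [user_follower_item.get(key, {}) for key in item_keys]


keys = latest_followers("santicamps58.facebook.current", 2)
```

Ordering lives in the column names of a single row, so it does not depend on how the partitioner places row keys.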
11. Cassandra - Indexing
get mention_tag_source_index['803.possitive'];
...
=> (name=2013-11-08T02:00:27.361445, value=803__-UzkY7psQTYJ, timestamp=1383876396514768)
=> (name=2013-11-08T06:53:57, value=803__twitter.398704931630481408, timestamp=1383894677856944)
=> (name=2013-11-08T06:54:38, value=803__twitter.398705100648382464, timestamp=1383894677646453)
=> (name=2013-11-08T06:57:51, value=803__twitter.398705909511503872, timestamp=1383894677313681)
...
get mention_tag_source_index['803.possitive.google'];
=> (name=2012-12-01T00:00:00.395260, value=803__YfOIKwVseDkJ, timestamp=1381830781423739)
=> (name=2012-12-01T00:00:00.420936, value=803__YfOIKwVseDkJ, timestamp=1381867147942586)
=> (name=2012-12-01T00:00:00.633055, value=803__YfOIKwVseDkJ, timestamp=1381830436666804)
=> (name=2013-06-14T00:00:00.055140, value=803__5Bv2Eu9qk04J, timestamp=1381867142254676)
12. Cassandra - Indexing
get mention_item['803__twitter.398705909511503872'];
=> (name=body, value=@SocialVane INTERESANTÍSIMA HERRAMIENTA DE ANÁLISIS PARA REDES SOCIALES, timestamp=1383894677307778)
=> (name=body_norm, value=your_brand interesante herramienta analisis red your_brand, timestamp=1383894677307778)
=> (name=brand, value=103, timestamp=1383894677307778)
=> (name=checked, value=false, timestamp=1383894677307778)
=> (name=emissor, value=SebastianCamps, timestamp=1383894677307778)
=> (name=emissor_id, value=234140801, timestamp=1383894677307778)
=> (name=emissor_name, value=Sebastián Camps , timestamp=1383894677307778)
=> (name=geo, value=None, timestamp=1383894677307778)
=> (name=id, value=398705909511503872, timestamp=1383894677307778)
=> (name=in_reply_to_id, value=, timestamp=1383894677307778)
=> (name=interest, value=, timestamp=1383894677307778)
=> (name=interest_checked, value=False, timestamp=1383894677307778)
=> (name=lang, value=es, timestamp=1383894677307778)
=> (name=like_action_count, value=0, timestamp=1383894677307778)
=> (name=probability, value=0.482361909795, timestamp=1383894677307778)
=> (name=query, value=803, timestamp=1383894677307778)
=> (name=reply_action_count, value=0, timestamp=1383894677307778)
=> (name=retweeted, value=False, timestamp=1383894677307778)
=> (name=share_action_count, value=0, timestamp=1383894677307778)
=> (name=source, value=twitter, timestamp=1383894677307778)
=> (name=tag, value=possitive, timestamp=1383894677307778)
=> (name=time, value=2013-11-08T06:57:51, timestamp=1383894677307778)
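On the write side, each mention has to be stored once as an item and once per index row it belongs to. The sketch below infers the key layouts from the examples above (`query.tag` and `query.tag.source` for index rows, `query__source.id` for the Twitter item); those layouts are assumptions read off the slides, not a documented scheme, and `index_keys` and `save_mention` are hypothetical helpers.

```python
mention_item = {}               # item key -> columns
mention_tag_source_index = {}   # index key -> {timestamp: item key}


def index_keys(query, tag, source):
    """Index rows a mention belongs to: per tag, and per tag and source."""
    # The "query.tag.source" layout is inferred from the keys on the slides.
    return [f"{query}.{tag}", f"{query}.{tag}.{source}"]


def save_mention(mention):
    """Write the item row, then one column in each index row."""
    # Item key layout matches the Twitter example only (assumption).
    item_key = f"{mention['query']}__{mention['source']}.{mention['id']}"
    mention_item[item_key] = dict(mention)
    for key in index_keys(mention["query"], mention["tag"], mention["source"]):
        mention_tag_source_index.setdefault(key, {})[mention["time"]] = item_key
    return item_key


key = save_mention({
    "query": "803", "source": "twitter", "id": "398705909511503872",
    "tag": "possitive", "time": "2013-11-08T06:57:51",
})
```

With no transaction management, the item write and the index writes are separate operations, so the application has to tolerate an index entry whose item is missing, or the reverse.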
13. Celery
➲ Execution queues are configured
➲ N workers are started on M machines, each listening on a queue
➲ Distributable tasks are marked in the code
➲ Each task is assigned an execution queue
➲ Tasks can be called synchronously or asynchronously
➲ Very simple to deploy
➲ Very easy to scale
➲ Concurrency needs careful attention
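The queue and worker model can be mimicked with the standard library to make the moving parts concrete. This is a sketch with threads and `queue.Queue` standing in for Celery workers and broker queues. The queue names `slow` and `cpu` match the routing example in the deck, while `worker`, `run`, and the sample jobs are illustrative.

```python
import queue
import threading

queues = {"slow": queue.Queue(), "cpu": queue.Queue()}
results = []
results_lock = threading.Lock()   # shared state must be protected


def worker(q):
    """Consume (function, args) jobs from one queue until a None sentinel."""
    while True:
        job = q.get()
        if job is None:
            break
        func, args = job
        value = func(*args)
        with results_lock:
            results.append(value)


def run(jobs, workers_per_queue=2):
    """Start N workers per queue, route each job by queue name, wait for all."""
    threads = [threading.Thread(target=worker, args=(q,))
               for q in queues.values() for _ in range(workers_per_queue)]
    for t in threads:
        t.start()
    for queue_name, func, args in jobs:
        queues[queue_name].put((func, args))
    for q in queues.values():
        for _ in range(workers_per_queue):
            q.put(None)           # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)


done = run([("slow", len, ("abc",)), ("cpu", pow, (2, 10)), ("cpu", abs, (-7,))])
```

The lock around `results` is the small-scale version of the concurrency warning: several workers may finish at once, so anything they share needs coordination.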
14. Celery - Example
# Configuration: route each task to an execution queue
CELERY_ROUTES = {
    "celeryutils.track_all_users_followers": {"queue": "slow", "routing_key": "slow_task"},
    "userfollowers.bulk_insert": {"queue": "slow", "routing_key": "slow_task"},
    "extract_mentions_from_website": {"queue": "slow", "routing_key": "slow_task"},
    "LeadsClassifier.classify_untagged": {"queue": "cpu", "routing_key": "cpu_task"},
    ...
}

# Task module: mark the function as a distributable task
from celery.task import task  # import not shown on the slide (assumed, Celery 3.x era)

@task(name='extract_mentions_from_website', time_limit=300)
def extract_mentions_from_website(brand, query, ...):
    ...

# LOCAL CALL
extract_mentions_from_website(params)
# DISTRIBUTED ASYNCHRONOUS CALL
extract_mentions_from_website.delay(params)
# DISTRIBUTED SYNCHRONOUS CALL
extract_mentions_from_website.delay(params).get()
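The three call styles have a close analogue in the standard library, which may help when reasoning about them without a broker running. In this sketch `pool.submit(...)` plays the role of `.delay(...)` and `.result()` plays the role of `.get()`. The function `extract_mentions` and its return value are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_mentions(brand, query):
    """Hypothetical stand-in for a task body; returns a fake mention list."""
    return [f"{brand}:{query}:{n}" for n in range(2)]


with ThreadPoolExecutor(max_workers=2) as pool:
    # Local call: runs in the caller, like calling the task function directly.
    local = extract_mentions("103", "803")

    # Asynchronous call: returns a handle at once, like task.delay(...).
    future = pool.submit(extract_mentions, "103", "803")

    # Synchronous call: block on the handle, like task.delay(...).get().
    remote = pool.submit(extract_mentions, "103", "803").result()

    same = future.result() == local == remote
```

The asynchronous form is the one that scales: the caller keeps going and collects the result later, or never, if the task only writes to the database.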