Stream processing on mobile networks
1. Apache Flink in action – stream processing of mobile networks
Future of Data: Real Time Stream Processing with Apache Flink
2. Who we are
We are a company that deals with the
processing of data, its storage, distribution
and analysis. We combine advanced
technology with expert services in order to
obtain value for our customers.
Our main focus is on big data technologies
such as Hadoop, Kafka, NiFi, and Flink.
Web: http://triviadata.com/
3. What we're going to talk about
• Why mobile network operators need stream processing
• Architecture
• Business Challenges
• Operating Flink in Hadoop environment
• Stream processing challenges in our use case
6. Mobile operator’s data
Client’s transactions:
• SMS – simplest transaction (mostly a few records)
• Data – length of session = number of records
• Calls – most complex, requires joining of records
Operator's data:
• Network usage
• Billing events
7. Typical use cases in telco
Customer oriented
• fraud & security
• Customer Experience Management
• triggers alarms based on customer-related quality indicators
• CEM KPI
• Fast issue diagnosis & Customer support
• reduce the Average Handling Time and improve the First Call Resolution rate
• Data source for analysis:
• Community analysis
• Household detection
• Segmentation
• Churn prediction
• Behavioural analysis
Operation oriented
• network performance overview
• service management support
• precise problem geolocation
• end-to-end in-depth troubleshooting
• real-time fault detection
• automated troubleshooting (diagnosis,
recovery)
• QoS KPI trend analysis
Constant monitoring of network,
service and customer KPIs.
8. Use cases in action
• Network Analytics (web application)
• Cell
• User
• Device
• Getting raw data into HDFS for analysts – SQL queries via Impala
10. Challenges
• Conversion from binary format (e.g. ASN.1)
• Tightening the feedback loop
• Have solution ready for future use cases
• Anomaly detection
• Predictive maintenance
• Still allow people to run analytical queries on data
12. Apache Kafka
• De facto standard for stream processing
• Fault tolerant
• Highly scalable
• We use it with
• Avro (schema evolution)
• Schema registry
13. Apache Flink
• Very flexible window definitions
• Event time semantics
• Many deployment options
• Can handle large state
14. Challenges
• Running Flink on YARN
• Secured Hadoop & Kafka cluster
• Data onboarding
• Side inputs/data enrichment
• Storing data in Hadoop
15. Flink on YARN
• One big, fat, long-running YARN session
• Or a Flink cluster per job
${FLINK_HOME}/bin/flink run \
  -m yarn-cluster \
  -d \
  -ynm ${APPLICATION_NAME} \
  -yn 2 \
  -ys 2 \
  -yjm 2048 \
  -ytm 4096 \
  -c com.triviadata.streaming.job.SipVoiceStream ${JAR_PATH} \
  --kafkaServer ${KAFKA_SERVER} \
  --schemaRegistryUrl ${SCHEMA_REGISTRY_URL} \
  --sipVoiceTopic raw.SipVoice \
  --correlatedSipVoiceTopic result.SipVoiceCorrelated \
  --stateLocation ${FLINK_STATE_LOCATION} \
  --security-protocol SASL_PLAINTEXT \
  --sasl-kerberos-service-name kafka
16. Kerberized Hadoop & Kafka
• Easy & Straightforward Flink setup
• HBase/Phoenix privileges
• Hassle with Kafka ACLs
• ACL to read from the topic
• ACL to write to the topic
• ACL to join consumer group
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /home/appuser/appuser.keytab
security.kerberos.login.principal: appuser
security.kerberos.login.contexts: Client,KafkaClient
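On the client side, the two security arguments passed to the job correspond to standard Kafka client settings, and the consumer group ACL applies to whatever `group.id` the job uses. A minimal sketch of how those properties might be assembled, assuming a hypothetical helper `KafkaSecurityProps.build` and a hypothetical group name (neither is from the original job):

```scala
import java.util.Properties

object KafkaSecurityProps {
  // Builds Kafka client properties for a Kerberos-secured cluster.
  // `groupId` is a hypothetical parameter: the "join consumer group" ACL
  // has to be granted for whatever value is passed here.
  def build(kafkaServer: String, groupId: String): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", kafkaServer)
    props.setProperty("group.id", groupId)
    // Same values as --security-protocol and --sasl-kerberos-service-name.
    props.setProperty("security.protocol", "SASL_PLAINTEXT")
    props.setProperty("sasl.kerberos.service.name", "kafka")
    props
  }
}
```

The resulting `Properties` object could then be handed to a Kafka consumer or producer; the keytab and principal themselves come from the Flink configuration shown above.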
19. Side inputs/Data enrichment
• Read code lists from HDFS
• Store them in RocksDB on the local filesystem of the Data Node
• Ask RocksDB to translate code -> value
20. Side inputs/Data enrichment
• Code list files on HDFS updated once a day
• Command topic to notify jobs about new files
• Refresh code lists stored in RocksDB
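The refresh logic can be sketched without any Flink or RocksDB dependency. The example below is only an illustration: `CodeListStore` is a hypothetical class, an in-memory map stands in for RocksDB, and the version is assumed to be a timestamp carried by the refresh command.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for the RocksDB-backed code list store.
class CodeListStore {
  // (code list name, code) -> translated value
  private val values = mutable.Map.empty[(String, String), String]
  // code list name -> timestamp of the version currently loaded
  private val versions = mutable.Map.empty[String, Long]

  // Replaces a code list only if the offered version is newer than the
  // loaded one. Returns true if the list was replaced.
  def refresh(listName: String, version: Long, entries: Map[String, String]): Boolean =
    if (versions.get(listName).exists(_ >= version)) false
    else {
      // Each file is treated as a full snapshot, so old entries are dropped first.
      val stale = values.keys.filter(_._1 == listName).toList
      stale.foreach(k => values.remove(k))
      entries.foreach { case (code, value) => values.put((listName, code), value) }
      versions.put(listName, version)
      true
    }

  // Translates code -> value, or None if the code is unknown.
  def translate(listName: String, code: String): Option[String] =
    values.get((listName, code))
}
```

Comparing versions before reloading means that a refresh command delivered to every parallel instance does not trigger a second reload on instances that already hold the newer list.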
25. Correlation
• Merge together related messages coming from one stream
• Key stream by calling/called number
• Merge messages with the same key where start time difference is less
than X.
26. Correlation
override def processElement(
    value: SipVoice,
    ctx: KeyedProcessFunction[String, SipVoice, SipVoices]#Context,
    out: Collector[SipVoices]): Unit = {
  val startTime = parseTime(value.startTime)
  // Look for an existing group whose start time is within waitingTime of this message.
  val (key, values) =
    sipVoiceState
      .keys
      .asScala
      .find(s => math.abs(s - startTime) <= waitingTime)
      .map(k => (k, value :: sipVoiceState.get(k)))
      .getOrElse {
        // No matching group: start a new one and schedule its emission.
        val triggerTimeStamp =
          ctx.timerService().currentProcessingTime() + delayPeriod
        ctx
          .timerService
          .registerProcessingTimeTimer(triggerTimeStamp)
        // Remember which group this timer belongs to.
        sipVoiceTimers
          .put(triggerTimeStamp, startTime)
        (startTime, List(value))
      }
  sipVoiceState.put(key, values)
}

override def onTimer(
    timestamp: Long,
    ctx: KeyedProcessFunction[String, SipVoice, SipVoices]#OnTimerContext,
    out: Collector[SipVoices]): Unit = {
  if (sipVoiceTimers.contains(timestamp)) {
    val sipVoiceKey = sipVoiceTimers.get(timestamp)
    // All messages of one group share a freshly generated correlation ID.
    val correlationId = UUID.randomUUID().toString
    val correlatedSipVoices =
      sipVoiceState
        .get(sipVoiceKey)
        .map(_.toCorrelated(correlationId))
        .sortBy(_.startTime)
    out.collect(SipVoices(correlatedSipVoices))
    // Update metrics, then clear the state held for the emitted group.
    correlatedSipVoice.inc()
    inStateSipVoice.dec(correlatedSipVoices.size)
    sipVoiceTimers.remove(timestamp)
    sipVoiceState.remove(sipVoiceKey)
  }
}
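The grouping rule itself can be tried out offline, without Flink. The sketch below is a simplified model under stated assumptions: `Msg` is a hypothetical stand-in for `SipVoice`, and the timer that emits and clears a group is left out, so groups only ever grow.

```scala
object CorrelationSketch {
  // Hypothetical stand-in for SipVoice: only the key and the start time matter here.
  final case class Msg(number: String, startTime: Long)

  // Groups messages per number. A message joins an existing group when its
  // start time is within waitingTime of the start time that opened the group,
  // mirroring the find(...) over the state keys in processElement.
  def correlate(msgs: Seq[Msg], waitingTime: Long): Map[String, List[List[Msg]]] =
    msgs.groupBy(_.number).map { case (number, perKey) =>
      val groups = perKey.foldLeft(Vector.empty[(Long, List[Msg])]) { (acc, m) =>
        val idx = acc.indexWhere { case (anchor, _) =>
          math.abs(anchor - m.startTime) <= waitingTime
        }
        if (idx >= 0) acc.updated(idx, (acc(idx)._1, m :: acc(idx)._2))
        else acc :+ ((m.startTime, List(m)))
      }
      // Sort each group by start time, as onTimer does before emitting.
      number -> groups.map(_._2.sortBy(_.startTime)).toList
    }
}
```

With a `waitingTime` of 1000 ms, two messages for the same number starting 500 ms apart land in one group, while a message starting 8 seconds later opens a new one.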
27. Correlation
• Correlate messages across multiple streams
• Switching between networks during the call
• Call failure and reestablishment
• Event time semantics
• Lateness
• Out of order messages
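To make lateness and out-of-order handling concrete, here is a simplified model of a watermark that trails the highest event time seen so far by a fixed bound. It is a sketch of the general idea, not Flink's implementation; `WatermarkTracker` and `maxOutOfOrderness` are names chosen for this example.

```scala
object LatenessSketch {
  // Tracks a watermark that trails the highest event time seen by maxOutOfOrderness.
  final class WatermarkTracker(maxOutOfOrderness: Long) {
    private var maxSeen = Long.MinValue

    // Current watermark; Long.MinValue until the first event arrives.
    def watermark: Long =
      if (maxSeen == Long.MinValue) Long.MinValue else maxSeen - maxOutOfOrderness

    // Returns true if the event is on time, false if it arrived at or behind
    // the current watermark (a late event). The watermark is advanced afterwards.
    def observe(eventTime: Long): Boolean = {
      val onTime = eventTime > watermark
      maxSeen = math.max(maxSeen, eventTime)
      onTime
    }
  }
}
```

An out-of-order message that is still ahead of the watermark is processed normally; one that falls behind it has to be handled separately, for example dropped or routed elsewhere.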
28. Aggregations
• For example, for each cell we want to see:
• Number of errors
• Number of calls
• Number of intercell handovers
• …
30. Table API
table
  .window(Tumble over windowLengthInMinutes.minutes on 'timestamp as 'timeWindow)
  .groupBy(
    'lastCell,
    'cellName,
    'cellType,
    'cellBand,
    'cellBandwidthDownload4g,
    'cellBandwidthUpload4g,
    'cellSiteName,
    'cellSiteAddress,
    'timeWindow
  )
  .select(
    'lastCell,
    'cellName,
    'cellType,
    'cellBand,
    'cellBandwidthDownload4g,
    'cellBandwidthUpload4g,
    'cellSiteName,
    'cellSiteAddress,
    'voiceConnectAttempt.sum as 'voiceConnectAttempt,
    'voiceConnectSuccess.sum as 'voiceConnectSuccess,
    'interCellHandovers.sum as 'interCellHandovers,
    'srvccHandovers.sum as 'srvccHandovers,
    'timeWindow.start.cast(Types.LONG) as 'timeWindow
  )
Editor's notes
The picture shows only 2G and 3G.
4G is similar: NodeB is replaced by eNodeB, plus some new boxes.
Acronyms:
Base Station Controller (BSC)
Radio Network Controller (RNC)
Mobile Switching Center (MSC)
Short Message Service Center (SMSC)
Serving GPRS Support Node (SGSN)
Network Analytics portal
Network operations & development – detect and troubleshoot problems in the network.
Customer technical support – track the quality of service of a specific customer.
Based on batch jobs that transform and move data between different layers (pre-stage, stage, datamarts,...).
Cons:
- Data stored multiple times.
- Heavy to calculate correlations and aggregations.
- About one hour latency.
Avro allows us to generate Java/Scala classes for our projects. There are Maven/SBT plugins, DDL scripts
When we were choosing a stream processing framework, Flink was the only one that met our needs.
We considered Flink, Spark, and Kafka Streams.
Spark (1.6) -> did not handle large state well
Kafka Streams -> API not rich enough, and too new at that time
We have a different setup for each client.
Why?
Separation of concerns
More processors in the case of NiFi: copy from SFTP, parse, push to Kafka, copy raw data to HDFS, …
ASN.1 parsing had already been implemented for batch processing, where it generates CSV files. It has now been changed to also produce messages to Kafka.
AVOID NEW DB/CACHE – there is already a whole Hadoop ensemble to maintain.
PROBLEM: we don't get updates; we get a new version of each code list every day.
It took too long for new values to be reflected in the data stream.
Receive a command to refresh the code list,
broadcast the command to all parallel instances of the next component,
check the timestamp to see whether your code lists aren't already newer.
-> The command can be refresh all, refresh one, refresh from a different location…
So far it works. There is a possible problem if our code lists grow too big, e.g. a whole user profile with history for streaming machine learning algorithms etc.
Quite simple aggregations – usually SUM or COUNT
We have different jobs calculating different aggregations, each on a differently keyed stream.
We use tumbling windows of 5 minutes, which is our finest granularity.
Coarser granularities (15 minutes, 1 hour, 1 day) we calculate with SQL at query time.
It is also possible to define multiple windows of different lengths.
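How a 5-minute tumbling window is assigned, and how coarser granularities can be derived from it afterwards, can be sketched in a few lines. This is an illustration only: `WindowRollup` and its functions are hypothetical names, timestamps are assumed to be non-negative epoch milliseconds, and the roll-up is shown in Scala although in our setup it is done with SQL at query time.

```scala
object WindowRollup {
  val Minute: Long = 60000L

  // Start of the tumbling window of length sizeMs that contains ts.
  def windowStart(ts: Long, sizeMs: Long): Long = ts - (ts % sizeMs)

  // Sums the counts of fine-grained windows (keyed by window start)
  // into coarser windows of length coarseMs.
  def rollUp(fine: Map[Long, Long], coarseMs: Long): Map[Long, Long] =
    fine
      .groupBy { case (start, _) => windowStart(start, coarseMs) }
      .map { case (coarseStart, windows) => coarseStart -> windows.values.sum }
}
```

Because SUM and COUNT can be combined across windows, three 5-minute results add up to exactly the 15-minute result, which is why only the finest granularity needs to be computed in the stream.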
Very natural way to write SQL like syntax in Scala.
STREAMING API – reduce, aggregate, fold
TABLE API
SQL API – sql, window defined in group by