3. SWIFTRIVER IS FOR...
Improving information findability
Surfacing content you didn't know you were looking for
Understanding media from other parts of the world (translation)
Making urgent data more discoverable (structured, published and accessible)
Verifying eyewitness accounts
Using location as context
Expanding the grassroots reporting network
Preserving information (archiving)
4. SwiftRiver Web Services
⢠SiLCC - NLP for SMS and Twitter
⢠SULSa - Location Services
⢠SiCDS - Duplication Filtering
⢠River ID - Distributed Reputation
⢠Reverberations - Measures inďŹuence of online content
13. WHAT IS SILCC?
•Swift Language Computation Component
•One of the SwiftRiver Web Services
•Open Web API
•Semantic Tagging of Short Text
•Multilingual
•Multiple sources (Twitter, email, SMS, blogs, etc.)
•Active Learning capability
•Open Source
•Easy to Deploy, Modify and Run
14. SWIFTRIVER SILCC DATAFLOW
SiSLS (Swiftriver Source Library Service): Content items coming from the SiSLS have global trust values added to the object model where SiSLS integration is enabled.
SiLCC (Swiftriver Language Computational Core): The text of the content item is sent to the SiLCC, along with an API key to ensure that the SiLCC is not open to any malicious usage. Using NLP, the SiLCC extracts nouns and other keywords from the text. (There is still a bit of ambiguity around what the NLP should extract, but at its most simple, all the nouns would be a good start.) The SiLCC sends back a list of tags that are added to the Content Item, along with any tags that were extracted from the source data by the parser.
SLISa (Swiftriver Language Improvement Service): Although the NLP tags have now been applied, the SLISa is responsible for applying instance-specific tagging corrections.
15. OUR GOALS
•Simple tagging of short snippets of text
•Rapid tagging for high-volume environments
•Simple API, easy to use
•Learns from user feedback
•Routing of messages to upstream services
•Semantic classification
•Sorts rapid streams into buckets
•Clusters like messages
•Visual effects
•Cross-referencing
16. WHAT IT'S NOT
•Does not do deep analysis of text
•Only identifies words within the original text
17. HOW DOES IT WORK?
•Step 1: Lexical analysis
•Step 2: Parsing into constituent parts
•Step 3: Part-of-speech tagging
•Step 4: Feature extraction
•Step 5: Compute using feature weights
•Let's examine each one in turn...
18. STEP 1: LEXICAL ANALYSIS
•For news headlines and email subjects this is trivial: just split on spaces.
•For Twitter this is more complex...
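For the trivial case, the whole tokenizer is a one-liner (the function name here is illustrative, not the library's API):

```python
def tokenize_headline(text):
    """Split a headline or email subject into tokens on whitespace."""
    return text.split()

print(tokenize_headline("Earthquake relief efforts continue in Haiti"))
# ['Earthquake', 'relief', 'efforts', 'continue', 'in', 'Haiti']
```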
19. TWEET ANALYSIS
•Tweets are surprisingly complex
•Only 140 characters, but many features
•Emergent features from the community (e.g. hashtags)
•Let's take a look at a typical tweet...
20. TWEET ANALYSIS
The typical tweet: "RT @directrelief: RT @PIH: PBS @NewsHour addresses mental health needs in the aftermath of the #Haiti earthquake #health #earthquake... http://bit.ly/bNhyK6"
•RT indicates a "re-tweet"
•@name indicates who the original tweeter was
•Multiple embedded retweets
•Hashtags (e.g. #Haiti) can play two roles: as a tag and as part of the sentence
21. TWEET ANALYSIS 2
•Two or more hashtags can appear within a tweet (e.g. #health and #earthquake)
•Continuation dots "..." indicate that there was more text that didn't fit into the 140-character limit somewhere in its history
•URLs: many tweets contain one or more URLs
As we can see, this simple tweet contains no fewer than 7 different features, and that's not all!
22. TWEET ANALYSIS 3
We want to break up the tweet into the following
parts:
{
'text': ['PBS addresses mental health needs in the aftermath of the Haiti
earthquake'],
'hashtags': ['#Haiti', '#health', '#earthquake'],
'names': ['@directrelief', '@PIH', '@NewsHour'],
'urls': ['http://bit.ly/bNhyK6'],
}
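A first approximation of this parse can be written with a few regular expressions. This is an illustrative sketch, not the actual SiLCC parser; in particular it removes every hashtag from the sentence text, so it does not handle the case where a hashtag (like #Haiti above) also plays a grammatical role in the sentence:

```python
import re

def parse_tweet(tweet):
    """Split a raw tweet into plain text, hashtags, @names and URLs."""
    urls = re.findall(r'https?://\S+', tweet)
    names = re.findall(r'@\w+', tweet)
    hashtags = re.findall(r'#\w+', tweet)
    # Strip everything that is not ordinary sentence text:
    # URLs, @names, hashtags, RT markers, continuation dots, colons.
    text = tweet
    for pattern in (r'https?://\S+', r'@\w+', r'#\w+', r'\bRT\b', r'\.\.\.', r':'):
        text = re.sub(pattern, '', text)
    return {
        'text': [' '.join(text.split())],
        'hashtags': hashtags,
        'names': names,
        'urls': urls,
    }
```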
23. TWEET ANALYSIS 4
Why do we want to break the tweet up into parts (parsing)?
•Because we want to further process the grammatically correct English text
•Part-of-speech tagging would otherwise be corrupted by tokens it cannot recognize (e.g. URLs, hashtags, @names, etc.)
•We want to save the hashtags for later use
•Many of the features are irrelevant to the task of identifying tags (e.g. dots, punctuation, @names, RT)
24. TWEET ANALYSIS 5
•We now take the "text" portion of the tweet and perform part-of-speech tagging on it
•After part-of-speech tagging, we perform feature extraction
•The features are then passed through the keyword classifier, which returns a list of keywords/tags
•Finally, we combine these tags with the hashtags we saved earlier to give the complete tag set
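The final combination step might look like the sketch below, where `classify_keywords` is a hypothetical stand-in for the part-of-speech tagging, feature extraction and classification stages:

```python
def tag_tweet(parsed, classify_keywords):
    """Combine classifier keywords with the hashtags saved during parsing."""
    keywords = classify_keywords(parsed['text'][0])
    hashtags = [h.lstrip('#') for h in parsed['hashtags']]
    seen, tags = set(), []
    for tag in keywords + hashtags:
        if tag.lower() not in seen:  # de-duplicate, keeping the first spelling
            seen.add(tag.lower())
            tags.append(tag)
    return tags

# Toy classifier standing in for the real pipeline: keep capitalized words.
parsed = {'text': ['PBS addresses needs after the Haiti earthquake'],
          'hashtags': ['#Haiti', '#health', '#earthquake']}
print(tag_tweet(parsed, lambda t: [w for w in t.split() if w[:1].isupper()]))
# ['PBS', 'Haiti', 'health', 'earthquake']
```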
25. HEADLINE AND EMAIL
SUBJECT ANALYSIS
•This is much simpler to do
•It's a subset of the steps in tweet analysis
•There is no parsing step, since there are no hashtags, @names, etc.
26. FEATURE EXTRACTION
⢠For the active learning algorithm we need to extract features to use in classiďŹcation
⢠These features should be subject/domain independent
⢠We therefore never use the actual words as features
⢠This would for example give artiďŹcially high weights to words such as âearthquakeâ
⢠We don't want these artiďŹcial weights as we canât foresee future disasters and we
want to be as generic with classiďŹcation as possible
⢠The use of training sets does allow for domain customization if where necessary
27. FEATURE EXTRACTION
⢠Capitalization of individual words: Either ďŹrst caps, or all caps, this is an
important indicator of proper nouns or other important words that make good tag
candidates
⢠Position in text: Tags seem to have a greater preponderance near the
beginning of text
⢠Part of Speech: Nouns and proper nouns are particularly important but so are
some adjectives and adverbs
⢠Capitalization of entire text: sometimes the whole text is capitalized and
this should reduce overall weighting of other features
⢠Length of the text: In shorter texts the words are more likely to be tags
⢠The parts of speech of previous and next words (effectively this means we
are using trigrams; or a window of 3)
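The feature set above could be encoded roughly as follows (a sketch with illustrative feature names, not the library's actual extractor; note that the word itself never appears as a feature, keeping the classifier domain-independent):

```python
def word_features(words, pos_tags, i, text):
    """Domain-independent features for the word at position i."""
    w = words[i]
    return {
        'first_cap': w[:1].isupper(),            # first-letter capitalization
        'all_caps': w.isupper(),                 # whole-word capitalization
        'rel_position': i / len(words),          # earlier words are more tag-like
        'pos': pos_tags[i],                      # part of speech of this word
        'prev_pos': pos_tags[i - 1] if i > 0 else 'START',
        'next_pos': pos_tags[i + 1] if i < len(words) - 1 else 'END',
        'text_all_caps': text.isupper(),         # whole-text capitalization
        'text_length': len(words),               # shorter texts favour tags
    }
```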
28. TRAINING
⢠Requires user reviewed examples
⢠Lexical analysis, parsing and feature extraction on the examples
⢠Multinomial naïve Bayes algorithm
⢠NB: The granularity we are classifying is at the word level
⢠For each word in the text, we classify it as either a keyword or not
⢠This has pleasant side effect of providing several training examples from each user
reviewed text
⢠Even with less than 50 reviewed texts the results are comparable to the simple
approach of using nouns only
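A minimal word-level classifier along these lines might look as follows (a from-scratch multinomial naïve Bayes sketch with Laplace smoothing; class and method names are made up, not the SiLCC API):

```python
import math
from collections import defaultdict

class WordNB:
    """Naive Bayes over per-word feature dicts: keyword (True) or not (False)."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(lambda: defaultdict(int))

    def train(self, examples):
        """examples: iterable of (feature_dict, is_keyword) pairs."""
        for feats, label in examples:
            self.class_counts[label] += 1
            for item in feats.items():
                self.feat_counts[label][item] += 1

    def classify(self, feats):
        """Return the most probable label for one word's features."""
        total = sum(self.class_counts.values())
        best, best_score = None, float('-inf')
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for item in feats.items():
                # Laplace-smoothed log likelihood of each feature value.
                score += math.log((self.feat_counts[label][item] + 1) /
                                  (count + 2))
            if score > best_score:
                best, best_score = label, score
        return best
```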
29. ACTIVE LEARNING
•The API also provides a method for users to send back corrected text
•The corrected text is saved and then used in the next iteration of training
•Users may optionally specify a corpus for the example to go into
•Training can be performed using any combination of corpora
30. DEVELOPER FRIENDLY
•Two levels of API: the web API and the internal Python API
•Either one may be used, but most users will use the web API
•The design is highly modular and maintainable
•For very rapid backend processing, the native Python API can be used
31. PYTHON CLASSES
Most of the classes that make up the library are
divided into three types:
1) Tokenizers
2) Parsers
3) Taggers
All three types have consistent APIs and are interchangeable.
32. PYTHON API
•A tagger calls a parser
•A parser calls a tokenizer
•The output of the tokenizer goes into the parser
•The output of the parser goes into the tagger
•The output of the tagger goes to the user!
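That call chain might be sketched as follows (the class names echo the library's, but the bodies are illustrative stand-ins, with a toy capitalization rule in place of real part-of-speech tagging):

```python
class BasicTokenizer:
    def tokenize(self, text):
        return text.split()

class BasicParser:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def parse(self, text):
        return {'text': self.tokenizer.tokenize(text)}

class BasicTagger:
    def __init__(self, parser):
        self.parser = parser

    def tag(self, text):
        parsed = self.parser.parse(text)
        # Toy rule standing in for POS-based tagging: keep capitalized words.
        return [w for w in parsed['text'] if w[:1].isupper()]

# Tagger -> parser -> tokenizer, output flows back up to the user.
tagger = BasicTagger(BasicParser(BasicTokenizer()))
print(tagger.tag('Relief efforts continue in Haiti'))  # ['Relief', 'Haiti']
```

Because each layer only depends on the method of the layer below it (`tokenize`, `parse`, `tag`), any tokenizer or parser with the same interface can be swapped in.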
33. CLASSES
⢠BasicTokenizer â This is used for splitting basic (non-tweet) text into individual
words
⢠TweetTokenizer â This is used to tokenize a tweet, it may also be used to
tokenize plain text since plain text is a subset of tweets
⢠TweetParser â Calls the TweetTokenizer and the parses the output (see
previous example)
⢠TweetTagger â Calls the TweetTokenizer and then tags the output of the text
part and adds the hashtags
⢠BasicTagger â Calls the BasicTokenizer and then tags the text, should only be
used for non-tweet text, uses simple Part of Speech to identify tags
⢠BayesTagger â Same as BasicTagger but uses weights from the naĂŻve Bayes
training algorithm
34. DEPENDENCIES
•Part-of-speech tagging is currently performed by the Python NLTK
•The web API uses the Pylons web framework
35. CURRENT STATUS
•The tag method of the API is ready for use; individual deployments can choose between the BasicTagger and the BayesTagger
•The tell method (for user feedback) will be ready by the time you read this!
•Training is possible on corpora of tagged data in CSV format (see the examples in the distribution)
36. CURRENT LIMITATIONS
•Only English text is supported at the moment
•Tags are always one of the words in the supplied text, i.e. they can never be a word that is not in the supplied text
•Very few training examples exist at the moment
37. FUTURE WORK
•Multilingual support, using non-English part-of-speech taggers
•UTF-8 compatibility
•Experiment with different learning algorithms (e.g. neural networks)
•Perform external text analysis (e.g. if there is a URL, analyze the text at the URL as well as the text in the tweet)
•Allow users to specify the required density of tags
38. SWIFT RIVER
jon@ushahidi.com
http://swift.ushahidi.com
http://github.com/appfrica/silcc
An Ushahidi Initiative
by Neville Newey and Jon Gosier