2. Bruno Pedro
An experienced Web developer and
entrepreneur. Has extensive background in
large scale projects and technical writing.
http://tarpipe.com/user/bpedro
12. Content extraction
[Flowchart. Nodes: URL, fetch contents, redirect?, error?, retry?, save. Decisions branch on yes/no.]
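The fetch, redirect, error, and retry decisions on this slide can be sketched roughly as follows. This is only an illustration, not the code behind the talk: the names `fetch_once` and `extract`, the limits `max_retries` and `max_redirects`, and the choice to treat any non-200 status as an error are all assumptions.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the loop below can see them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def fetch_once(url, timeout=10):
    """Hypothetical helper: one GET, returning (status, location, body)."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, None, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location"), b""


def extract(url, fetch=fetch_once, max_retries=3, max_redirects=5):
    """Return (final_url, body) to save, or None when giving up."""
    retries = redirects = 0
    while True:
        try:
            status, location, body = fetch(url)
        except OSError:  # network failure (URLError is an OSError)
            status, location, body = None, None, b""
        # redirect? yes: go back to the URL step with the new location
        if status in (301, 302, 303, 307, 308) and location:
            redirects += 1
            if redirects > max_redirects:
                return None
            url = urljoin(url, location)
            continue
        # error? yes: retry until the (assumed) retry budget is spent
        if status != 200:
            retries += 1
            if retries > max_retries:
                return None
            continue
        # error? no: hand the contents over to be saved
        return url, body
```

Passing the fetcher in as a parameter keeps the decision logic separate from the network call, so the flow can be exercised with a fake fetcher. A real crawler would likely also wait between retries.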
13. Content analysis
• Assume malformed (X)HTML
• Regular expressions
or
• Convert to XHTML
• DOM traversing
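A small sketch of the two options, applied to pulling the title out of a malformed page. The second function is a stand-in rather than the approach named on the slide: instead of converting to XHTML and traversing a DOM, it uses Python's tolerant, event-based `html.parser`, because the standard library has no XHTML tidy step. The function and class names are illustrative.

```python
import re
from html.parser import HTMLParser

# Option 1: a regular expression that tolerates case and attribute noise.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def title_by_regex(html):
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


# Option 2 (stand-in): a forgiving parser that survives unclosed tags.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":  # tag names arrive lowercased
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def title_by_parser(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip() or None
```

The regular expression is quick for a single field, while the parser route scales better once several elements have to be collected from the same page.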
14. Content classification
Tag cloud:
• HTML title, description, keywords
• H1, H2, H3, ...
• Paragraphs
Graph placement:
• Who shared the link?
• Internal and external links
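One way the tag cloud could be derived from the title, description, keywords, headings, and paragraphs is a weighted word count. The sketch below is an assumption about how that might work: the weights, the stopword list, and the function name `tag_cloud` are all made up for illustration and do not come from the slides.

```python
import re
from collections import Counter

# Assumed weights: words from more prominent fields count for more.
WEIGHTS = {"title": 5, "keywords": 4, "description": 3, "headings": 2, "paragraphs": 1}
# Assumed, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}


def tag_cloud(fields, top=10):
    """Score words across the extracted fields and return the top ones."""
    scores = Counter()
    for name, text in fields.items():
        weight = WEIGHTS.get(name, 1)
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOPWORDS and len(word) > 2:
                scores[word] += weight
    return scores.most_common(top)
```

For example, `tag_cloud({"title": "Python news", "paragraphs": "The python community shares news and python tips"}, top=2)` ranks `python` first and `news` second.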
15. Content classification
[Flowchart. Nodes: (X)HTML, extract head, head found?, extract text elements, H1,H2,... found?, extract text paragraphs, paragraphs found?, save. Decisions branch on yes/no.]
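The head, heading, and paragraph extraction steps can be sketched with Python's tolerant `html.parser`. The class `PageClassifier` and function `classify` are hypothetical names, and dropping any part that was not found before saving is an assumption about what the "found?" decisions do.

```python
from html.parser import HTMLParser


class PageClassifier(HTMLParser):
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.head = {}        # title, description, keywords
        self.headings = []    # text of H1, H2, ...
        self.paragraphs = []  # text of P elements
        self._current = None  # element whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("description", "keywords"):
            self.head[name] = attrs.get("content") or ""
        elif tag in ("title", "p") or tag in self.HEADINGS:
            self._flush()  # an unclosed <p> ends where the next block starts
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        if text:
            if self._current == "title":
                self.head["title"] = text
            elif self._current == "p":
                self.paragraphs.append(text)
            else:
                self.headings.append(text)
        self._current = None
        self._buffer = []


def classify(html):
    """Return only the parts that were found, ready to be saved."""
    parser = PageClassifier()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing unclosed element
    found = {
        "head": parser.head,
        "headings": parser.headings,
        "paragraphs": parser.paragraphs,
    }
    return {key: value for key, value in found.items() if value}
```

Inline markup such as `<b>` inside a paragraph is ignored, so its text stays part of the surrounding paragraph.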
17. Food for thought
• PubsubHubbub
http://code.google.com/p/pubsubhubbub/
• Activity Streams
http://activitystrea.ms/
• Twitter Streaming API
http://dev.twitter.com/pages/streaming_api
18. "tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop."
Adam Pash, lifehacker
"tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload."
Rafe Needleman, CNET news
"Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation."
Jeff Barr, Amazon.com
thank you automated publishing