pa-pe-pi-po-
  Pure Python
Text Processing

Rodrigo Senra
rsenra@acm.org
PythonBrasil[7] - São Paulo
Anatomy of the Blah
• Me, You, and Python
• retrospective: PythonBrasil[7] years!
• pa-pe-pi-po-pure python text processing
• references
• 1 word from the sponsors
Who's out there?
✓IT professionals
✓Developers
✓Students
✓Teachers
✓First time at PyConBrasil
✓APyBr members
•   None of the above!
Scenes from previous episodes...
[1] 2005 - BigKahuna
[2] 2006 - Show Pyrotécnico
           Iterators, Generators, Hooks, Decorators
[3] 2007 - Show Pyrotécnico II
           Routing, RTSP, Twisted, GIS
[4] 2008 - ISIS-NBP
           Digital Libraries
[5] 2009 - Rest, Gateway and Compilers
           SFC (Petri net) + ST (Pascal) > Ladder
[6] 2010 - Potter vs Voldemort:
           Parseltongue lessons from Pythonic practice
>>> type("bla")
<type 'str'>
>>> "".join(['pa',"pe",'''pi''',"""po"""])
'papepipo'
>>> str(2**1024)[100:120]
'21120113879871393357'
>>> 2**1024
1797693134862315907729305190789024733617976978942306572734
30081157732675805500963132708477322407536021120113879871393
3576587897688144166224928474306394741243777678934248654852
7630221960124609411945308295208500576883815068234246288147
3913110540827237163350510684586298239947245938479716304835
356329624224137216L
>>> 'ariediod'[::-1]
'doideira'
>>> "    deu branco no prefixo e no sufixo, limpa com strip ".strip()
'deu branco no prefixo e no sufixo, limpa com strip'
>>> _.startswith("deu")
True
>>> "o rato roeu a roupa do rei de roma".partition("r")
('o ', 'r', 'ato roeu a roupa do rei de roma')
>>> "o rato roeu a roupa do rei de roma".split("r")
['o ', 'ato ', 'oeu a ', 'oupa do ', 'ei de ', 'oma']
>>> "o rato roeu a roupa do rei de roma".split()
['o', 'rato', 'roeu', 'a', 'roupa', 'do', 'rei', 'de', 'roma']
>>> r"W:\nao\precisa\de\escape"
'W:\\nao\\precisa\\de\\escape'
>>> type(r"W:\nao\precisa\de\escape")
<type 'str'>
>>> type(u"Unicode")
<type 'unicode'>
>>> print(u"\xc3\xa2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

>>> print(unicode('\xc3\xa1','iso-8859-1').encode('iso-8859-1'))
á
>>> import codecs, sys
>>> sys.stdout = codecs.lookup('iso-8859-1')[-1](sys.stdout)
>>> print(u"\xc3\xa1")
á
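For readers on Python 3, the same byte/character distinction can be sketched without touching `sys.stdout`. This is a minimal illustration, not part of the original slides; the variable names are made up for the example. It shows that the two-character string from the slide is what you get when the UTF-8 bytes of `á` are decoded as ISO-8859-1.

```python
# Python 3 sketch: str holds code points, bytes holds encoded octets.
text = "\u00e1"                      # 'á' as a single code point
utf8 = text.encode("utf-8")          # two bytes: C3 A1
latin1 = text.encode("iso-8859-1")   # one byte: E1

# Decoding the UTF-8 bytes as ISO-8859-1 yields two characters,
# the same pair written as u"\xc3\xa1" on the slide.
mojibake = utf8.decode("iso-8859-1")

# Re-encoding with the same (wrong) codec gives the original bytes back,
# which is why a UTF-8 terminal still shows 'á'.
roundtrip = mojibake.encode("iso-8859-1").decode("utf-8")

print(utf8, latin1, repr(mojibake), roundtrip)
```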
>>> b"String de 8-bit chars"
'String de 8-bit chars'

Python 2.6.1              Python 3.1.4
>>> b"Bla"                >>> b"Bla"
'Bla'                     b'Bla'
>>> b"Bla"=="Bla"         >>> type(b"Bla")
True                      <class 'bytes'>
>>> type(b"Bla")          >>> type("Bla")
<type 'str'>              <class 'str'>
                          >>> "Bla"==b"Bla"
                          False
>>> [ord(i) for i in "nulalexsedlex"]
[110, 117, 108, 97, 108, 101, 120, 115, 101, 100, 108, 101, 120]
>>> "".join([chr(i) for i in _])
'nulalexsedlex'
>>> 'lex' in _
True
>>> import string
>>> dir(string)
['Formatter', 'Template', '_TemplateMetaclass', '__builtins__',
'__doc__', '__file__', '__name__', '__package__', '_float', '_idmap',
'_idmapL', '_int', '_long', '_multimap', '_re', 'ascii_letters',
'ascii_lowercase', 'ascii_uppercase', 'atof', 'atof_error', 'atoi',
'atoi_error', 'atol', 'atol_error', 'capitalize', 'capwords', 'center', 'count',
'digits', 'expandtabs', 'find', 'hexdigits', 'index', 'index_error', 'join',
'joinfields', 'letters', 'ljust', 'lower', 'lowercase', 'lstrip', 'maketrans',
'octdigits', 'printable', 'punctuation', 'replace', 'rfind', 'rindex', 'rjust',
'rsplit', 'rstrip', 'split', 'splitfields', 'strip', 'swapcase', 'translate', 'upper',
'uppercase', 'whitespace', 'zfill']
>>> string.hexdigits
'0123456789abcdefABCDEF'
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.maketrans('','')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f
\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#
$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]
^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f
\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e
\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d
\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac
\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb
\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb
\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb
\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea
\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa
\xfb\xfc\xfd\xfe\xff'
>>> def t(x,y): return string.translate(x,string.maketrans('',''),y)
...
>>> t("O rato roeu. O que? A roupa! De quem? Do rei, de roma;",
string.punctuation)
'O rato roeu O que A roupa De quem Do rei de roma'
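`string.translate` and `string.maketrans` are Python 2 module functions. A rough Python 3 equivalent, assuming only the standard library, uses the three-argument form of `str.maketrans`, whose third argument lists the characters to delete. The helper name `strip_punctuation` is made up for this sketch.

```python
import string

def strip_punctuation(text):
    """Remove every ASCII punctuation character from text."""
    # Third argument of str.maketrans: characters mapped to None (deleted).
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

cleaned = strip_punctuation(
    "O rato roeu. O que? A roupa! De quem? Do rei, de roma;")
print(cleaned)
```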


>>> class Bla(object):
...   def __str__(self):
...       return "Belex"
...   def __repr__(self):
...       return "Bla()"
...
>>> b = Bla()
>>> for i in [b, eval(repr(b))]:
...   print(i, end='\t')
...
Belex Belex >>>
>>> class istr(str):
...     pass
...
>>> for name in 'eq lt le gt ge ne contains'.split():
...     meth = getattr(str, '__%s__' % name)
...     # Bind meth as a default argument so each wrapper keeps its own method.
...     def new_meth(self, param, meth=meth):
...         return meth(self.lower(), param.lower())
...     setattr(istr, '__%s__' % name, new_meth)
...
>>> istr("SomeCamelCase") == istr("sOmeCaMeLcase")
True
>>> 'Ec' in istr("SomeCamel")
True



                                          Adapted from Python Cookbook
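Functions defined inside a loop look up loop variables when they are called, not when they are defined, so every wrapper can end up using the method from the last iteration. One common way around that is a small factory function that captures the method once per call. The sketch below is written for Python 3; `make_case_insensitive` is a name chosen for this example, and it assumes both operands are strings.

```python
def make_case_insensitive(cls, name):
    """Install on cls a version of str.<name> that lowercases both operands."""
    str_meth = getattr(str, name)      # captured once per factory call

    def wrapper(self, other):
        # Assumption: other is a str (or subclass) and therefore has .lower().
        return str_meth(self.lower(), other.lower())

    setattr(cls, name, wrapper)


class istr(str):
    def __hash__(self):
        # Keep hashing consistent with case-insensitive equality.
        return hash(self.lower())


for name in "eq lt le gt ge ne contains".split():
    make_case_insensitive(istr, "__%s__" % name)

print(istr("SomeCamelCase") == istr("sOmeCaMeLcase"))
print("Ec" in istr("SomeCamel"))
print(istr("a") == istr("abc"))
```

The last comparison is the interesting one: a wrapper that had accidentally picked up `__contains__` instead of `__eq__` would report two different strings as equal whenever one contains the other.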
>>> import re
>>> pat = re.compile(re.escape("<strong>"))
>>> re.escape("<strong>")
'\\<strong\\>'
>>> pat.sub("_","<strong>Hasta la vista<strong> baby")
'_Hasta la vista_ baby'
>>> date = re.compile(r"(\d\d\d\d-\d\d-\d\d)\s(\w+)")
>>> date.findall("Em 2011-09-29 PythonBrasil na parada. Em 2010-10-21 curitiba hospedou")
[('2011-09-29', 'PythonBrasil'), ('2010-10-21', 'curitiba')]
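The same date-and-word extraction can be written with counted repetitions and named groups, which some readers find easier to follow than positional groups. This is only a sketch of an alternative spelling; the group names `date` and `word` and the constant `DATE_EVENT` are chosen for the example.

```python
import re

# \d{4} means "exactly four digits"; (?P<name>...) names a group.
# The raw string keeps \d, \s and \w intact for the regex engine.
DATE_EVENT = re.compile(r"(?P<date>\d{4}-\d{2}-\d{2})\s(?P<word>\w+)")

text = "Em 2011-09-29 PythonBrasil na parada. Em 2010-10-21 curitiba hospedou"
pairs = [(m.group("date"), m.group("word")) for m in DATE_EVENT.finditer(text)]
print(pairs)
```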
$ python -mtimeit -s "import re; n=re.compile(r'abra')" "n.search('abracadabra')"
1000000 loops, best of 3: 0.306 usec per loop

$ python -mtimeit -s "import re; n=r'abra'" "n in 'abracadabra'"
10000000 loops, best of 3: 0.0591 usec per loop

$ python -mtimeit -s "import re; n=re.compile(r'\d+$')" "n.match('0123456789')"
1000000 loops, best of 3: 0.511 usec per loop

$ python -mtimeit -s "import re" "'0123456789'.isdigit()"
10000000 loops, best of 3: 0.0945 usec per loop



                                      Extracted from PyMag Jan 2008
$ python -mtimeit -s \
"import re;r=re.compile('pa|pe|pi|po|pu');h='patapetapitapotapuxa'" \
"r.search(h)"
1000000 loops, best of 3: 0.383 usec per loop

$ python -mtimeit -s \
"import re;n=['pa','pe','pi','po','pu'];h='patapetapitapotapuxa'" \
"any(x in h for x in n)"
1000000 loops, best of 3: 0.914 usec per loop




                                          Extracted from PyMag Jan 2008
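The command-line measurements above can also be taken from inside a script with the standard `timeit` module. The sketch below repeats the first comparison (compiled regex versus the `in` operator); the loop count of 100000 is an arbitrary choice for the example, and absolute timings will differ from the figures on the slides depending on machine and Python version.

```python
import timeit

haystack = "abracadabra"

# setup runs once; the statement runs `number` times.
t_regex = timeit.timeit(
    "n.search(h)",
    setup="import re; n = re.compile(r'abra'); h = %r" % haystack,
    number=100000,
)
t_in = timeit.timeit(
    "'abra' in h",
    setup="h = %r" % haystack,
    number=100000,
)

print("regex: %.4fs   in: %.4fs" % (t_regex, t_in))
```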
from pyparsing import Word, Literal, Combine
import string
def doSum(s,l,tokens):
    return int(tokens[0]) + int(tokens[2])
integer = Word(string.digits)
addition = Combine(integer) + Literal('+') + Combine(integer)
addition.setParseAction(doSum)


>>> addition.parseString("5+7")
([12], {})
import ply.lex as lex
tokens = 'NUMBER', 'PLUS'
t_PLUS = r'\+'
def t_NUMBER(t):
   r'\d+'
   t.value = int(t.value)
   return t
t_ignore = ' \t\n\w'
def t_error(t): t.lexer.skip(1)
lexer = lex.lex()




                                  Adapted from http://www.dabeaz.com
import ply.yacc as yacc
def p_expression_plus(p):
   'expression : expression PLUS expression'
   p[0] = p[1] + p[3]
def p_factor_num(p):
   'expression : NUMBER'
   p[0] = p[1]
def p_error(p):
   print("Syntax error in input!")
parser = yacc.yacc()




                                     Adapted from http://www.dabeaz.com
>>> parser.parse("1+2 + 45 \n + 10")
58
>>> parser.parse("Quanto vale 2 + 7")
9
>>> parser.parse("A soma 2 + 7 resulta em 9")
Syntax error in input!
>>> parser.parse("2 + 7 9")
Syntax error in input!




                                     Adapted from http://www.dabeaz.com
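In the spirit of the talk's title, the behaviour of this tiny adder can be approximated with the standard library alone. The sketch below is a hand-rolled stand-in, not how PLY works internally: a regex picks out NUMBER and PLUS tokens and skips everything else, and a loop checks that the tokens alternate NUMBER, PLUS, NUMBER. The names `TOKEN` and `parse_sum` are made up for the example, and it returns `None` where the PLY version prints a syntax error.

```python
import re

TOKEN = re.compile(r"\d+|\+")   # NUMBER or PLUS; any other character is skipped

def parse_sum(text):
    """Return the sum described by text, or None on a syntax error."""
    tokens = TOKEN.findall(text)
    # A valid input alternates NUMBER, PLUS, NUMBER, ... and ends on a NUMBER,
    # so it always has an odd number of tokens.
    if len(tokens) % 2 == 0:
        return None
    total = 0
    for i, tok in enumerate(tokens):
        expect_number = (i % 2 == 0)
        if expect_number != tok.isdigit():
            return None
        if expect_number:
            total += int(tok)
    return total

print(parse_sum("1+2 + 45 \n + 10"))
print(parse_sum("Quanto vale 2 + 7"))
print(parse_sum("A soma 2 + 7 resulta em 9"))
print(parse_sum("2 + 7 9"))
```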
from nltk.tokenize import sent_tokenize, word_tokenize
msg = ("Congratulations to Erico and his team. PythonBrasil gets better "
       "every year. You are now the BiggestKahuna.")
>>> sent_tokenize(msg)
['Congratulations to Erico and his team.', 'PythonBrasil gets better every
year.', 'You are now the BiggestKahuna.']
>>> word_tokenize(msg)
['Congratulations', 'to', 'Erico', 'and', 'his', 'team.', 'PythonBrasil', 'gets',
'better', 'every', 'year.', 'You', 'are', 'now', 'the', 'BiggestKahuna', '.']




                                             Extracted from NLP with Python
>>> def gender_features(word):
...    return {"last_letter": word[-1]}
...
>>> from nltk.corpus import names
>>> len(names.words("male.txt"))
2943
>>> names = ([(name,'male') for name in names.words('male.txt')] +
...        [(name,'female') for name in names.words('female.txt')])
>>> import random
>>> random.shuffle(names)
>>> featuresets = [(gender_features(n),g) for n,g in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> classifier.classify(gender_features("Dorneles"))
'male'
>>> classifier.classify(gender_features("Magali"))
'female'
                                        Extracted from NLP with Python
References
A word from the sponsors...
Thank you all
                         for your attention.

                            Rodrigo Dias Arruda Senra
                                 http://rodrigo.senra.nom.br
                                      rsenra@acm.org
The opinions and conclusions expressed in this presentation are the sole responsibility of Rodrigo Senra.

Permission from the author is not required to use all or part of this presentation, provided that
no changes are made to the reused content and that this notice appears in full in the resulting
material.

Images and references to other works in this presentation remain the property of those who hold
their copyright.

What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

pa-pe-pi-po-pure Python Text Processing

  • 1. pa-pe-pi-po- Pure Python Text Processing Rodrigo Senra rsenra@acm.org PythonBrasil[7] - São Paulo
  • 2. Anatomy of the Blah
      • Me, you, and Python
      • a retrospective of PythonBrasil[7] years!
      • pa-pe-pi-po-pure python text processing
      • references
      • a word from the sponsors
  • 3. Who is out there?
      ✓ IT professionals
      ✓ Developers
      ✓ Students
      ✓ Teachers
      ✓ First time at PyConBrasil
      ✓ APyBr members
      • None of the above!
  • 4. Previously on...
      [1] 2005 - BigKahuna
      [2] 2006 - Show Pyrotécnico (Pyrotechnic Show)
                 Iterators, Generators, Hooks, Decorators
      [3] 2007 - Show Pyrotécnico II
                 Routing, RTSP, Twisted, GIS
      [4] 2008 - ISIS-NBP
                 Digital Libraries
      [5] 2009 - Rest, Gtw and Compilers
                 SFC (Petri net) + ST (Pascal) > Ladder
      [6] 2010 - Potter vs Voldemort:
                 Parseltongue lessons from Pythonic practice
  • 5.
      >>> type("bla")
      <type 'str'>
      >>> "".join(['pa',"pe",'''pi''',"""po"""])
      'papepipo'
      >>> str(2**1024)[100:120]
      '21120113879871393357'
      >>> 2**1024
      1797693134862315907729305190789024733617976978942306572734
      30081157732675805500963132708477322407536021120113879871393
      3576587897688144166224928474306394741243777678934248654852
      7630221960124609411945308295208500576883815068234246288147
      3913110540827237163350510684586298239947245938479716304835
      356329624224137216L
      >>> 'ariediod'[::-1]
      'doideira'
  • 6.
      >>> "    deu branco no prefixo e no sufixo, limpa com strip ".strip()
      'deu branco no prefixo e no sufixo, limpa com strip'
      >>> _.startswith("deu")
      True
      >>> "o rato roeu a roupa do rei de roma".partition("r")
      ('o ', 'r', 'ato roeu a roupa do rei de roma')
      >>> "o rato roeu a roupa do rei de roma".split("r")
      ['o ', 'ato ', 'oeu a ', 'oupa do ', 'ei de ', 'oma']
      >>> "o rato roeu a roupa do rei de roma".split()
      ['o', 'rato', 'roeu', 'a', 'roupa', 'do', 'rei', 'de', 'roma']
  • 7.
      >>> r"W:naoprecisadeescape"
      'W:naoprecisadeescape'
      >>> type(r"W:naoprecisadeescape")
      <type 'str'>
      >>> type(u"Unicode")
      <type 'unicode'>
      >>> print(u"\xc3\xa2")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
      >>> print(unicode('\xc3\xa1','iso-8859-1').encode('iso-8859-1'))
      á
      >>> import codecs, sys
      >>> sys.stdout = codecs.lookup('iso-8859-1')[-1](sys.stdout)
      >>> print(u"\xc3\xa1")
      á
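The session on slide 7 is Python 2, where `str` holds bytes and `unicode` holds text. As a minimal sketch, assuming Python 3 (where `str` is already Unicode and `bytes` is a separate type), the same encode/decode round trip might look like this. The variable names are illustrative and not taken from the slides.

```python
# Python 3: str holds Unicode code points, bytes holds raw 8-bit data.
text = "\u00e1"  # the single character 'á'

utf8_bytes = text.encode("utf-8")         # two bytes
latin1_bytes = text.encode("iso-8859-1")  # one byte

# Decoding UTF-8 bytes with the wrong codec does not fail; it yields mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")

# Encoding to ASCII fails, much like the Python 2 traceback on the slide.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes, latin1_bytes, repr(mojibake), ascii_failed)
```

Choosing the codec explicitly at the boundary (when reading or writing bytes) avoids relying on whatever default encoding the terminal happens to use.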
  • 8.
      >>> b"String de 8-bit chars"
      'String de 8-bit chars'

      Python 2.6.1              Python 3.1.4
      >>> b"Bla"                >>> b"Bla"
      'Bla'                     b'Bla'
      >>> b"Bla"=="Bla"         >>> type(b"Bla")
      True                      <class 'bytes'>
      >>> type(b"Bla")          >>> type("Bla")
      <type 'str'>              <class 'str'>
                                >>> "Bla"==b"Bla"
                                False
  • 9.
      >>> [ord(i) for i in "nulalexsedlex"]
      [110, 117, 108, 97, 108, 101, 120, 115, 101, 100, 108, 101, 120]
      >>> "".join([chr(i) for i in _])
      'nulalexsedlex'
      >>> 'lex' in _
      True
      >>> import string
      >>> dir(string)
      ['Formatter', 'Template', '_TemplateMetaclass', '__builtins__', '__doc__',
      '__file__', '__name__', '__package__', '_float', '_idmap', '_idmapL', '_int',
      '_long', '_multimap', '_re', 'ascii_letters', 'ascii_lowercase',
      'ascii_uppercase', 'atof', 'atof_error', 'atoi', 'atoi_error', 'atol',
      'atol_error', 'capitalize', 'capwords', 'center', 'count', 'digits',
      'expandtabs', 'find', 'hexdigits', 'index', 'index_error', 'join',
      'joinfields', 'letters', 'ljust', 'lower', 'lowercase', 'lstrip', 'maketrans',
      'octdigits', 'printable', 'punctuation', 'replace', 'rfind', 'rindex', 'rjust',
      'rsplit', 'rstrip', 'split', 'splitfields', 'strip', 'swapcase', 'translate',
      'upper', 'uppercase', 'whitespace', 'zfill']
  • 10.
      >>> string.hexdigits
      '0123456789abcdefABCDEF'
      >>> string.punctuation
      '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
      >>> string.maketrans('','')
      '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f
      \x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#
      $%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]
      ^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f
      \x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e
      \x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d
      \x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac
      \xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb
      \xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb
      \xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb
      \xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea
      \xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa
      \xfb\xfc\xfd\xfe\xff'
  • 11.
      >>> def t(x,y): return string.translate(x,string.maketrans('',''),y)
      ...
      >>> t("O rato roeu. O que? A roupa! De quem? Do rei, de roma;", string.punctuation)
      'O rato roeu O que A roupa De quem Do rei de roma'
      >>> class Bla(object):
      ...     def __str__(self):
      ...         return "Belex"
      ...     def __repr__(self):
      ...         return "Bla()"
      ...
      >>> b = Bla()
      >>> for i in [b, eval(repr(b))]:
      ...     print(i, end='\t')
      ...
      Belex   Belex   >>>
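The `string.translate` and `string.maketrans('', '')` calls on slide 11 are Python 2 idioms. A minimal sketch of the same punctuation stripping, assuming Python 3, uses `str.maketrans` with a third argument listing the characters to delete. The helper name `strip_punctuation` is hypothetical.

```python
import string


def strip_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    # The third argument of str.maketrans names characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


phrase = "O rato roeu. O que? A roupa! De quem? Do rei, de roma;"
cleaned = strip_punctuation(phrase)
print(cleaned)
```

Building the table once and reusing it is worthwhile when many strings are cleaned in a loop.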
  • 12.
      >>> class istr(str):
      ...     pass
      >>> for name in 'eq lt le gt ge ne cmp contains'.split():
      ...     meth = getattr(str, '__%s__' % name)
      ...     def new_meth(self, param, *args):
      ...         return meth(self.lower(), param.lower(), *args)
      ...     setattr(istr, '__%s__'% name, new_meth)
      ...
      >>> istr("SomeCamelCase") == istr("sOmeCaMeLcase")
      True
      >>> 'Ec' in istr("SomeCamel")
      True
      Adapted from Python Cookbook
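One thing to watch in a loop like the one on slide 12 is that a function defined inside the loop looks up `meth` when it is called, not when it is defined, so every wrapper can end up sharing the last value. A minimal sketch of a case-insensitive string, assuming Python 3 (which has no `__cmp__`), binds each method through a small factory function. The helper `_wrap` is hypothetical and this is not the Cookbook's exact recipe.

```python
class istr(str):
    """str subclass whose comparisons and 'in' tests ignore case."""

    def __hash__(self):
        # Keep hashing consistent with case-insensitive equality.
        return hash(self.lower())


def _wrap(name):
    """Build a case-insensitive version of the str method called name."""
    original = getattr(str, name)

    def method(self, other):
        if isinstance(other, str):
            other = other.lower()
        # self.lower() returns a plain str, so this does not recurse.
        return original(self.lower(), other)

    method.__name__ = name
    return method


for _name in "eq ne lt le gt ge contains".split():
    setattr(istr, "__%s__" % _name, _wrap("__%s__" % _name))

print(istr("SomeCamelCase") == istr("sOmeCaMeLcase"))
print("Ec" in istr("SomeCamel"))
```

The factory gives each wrapper its own `original`, which is the usual way around late binding in closures.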
  • 13.
      >>> import re
      >>> pat = re.compile(re.escape("<strong>"))
      >>> re.escape("<strong>")
      '\\<strong\\>'
      >>> pat.sub("_","<strong>Hasta la vista<strong> baby")
      '_Hasta la vista_ baby'
      >>> date = re.compile(r"(\d\d\d\d-\d\d-\d\d)\s(\w+)")
      >>> date.findall("Em 2011-09-29 PythonBrasil na parada. Em 2010-10-21 curitiba hospedou")
      [('2011-09-29', 'PythonBrasil'), ('2010-10-21', 'curitiba')]
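The date pattern on slide 13 repeats `\d` for each digit. As a small sketch, the same extraction can be written with counted quantifiers, which some readers find easier to scan. The variable names are illustrative.

```python
import re

# An ISO-style date, one whitespace character, then the following word.
date = re.compile(r"(\d{4}-\d{2}-\d{2})\s(\w+)")

text = "Em 2011-09-29 PythonBrasil na parada. Em 2010-10-21 curitiba hospedou"
matches = date.findall(text)
print(matches)
```

With two capturing groups, `findall` returns a list of tuples, one per match.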
  • 14.
      $ python -mtimeit -s "import re; n=re.compile(r'abra')" "n.search('abracadabra')"
      1000000 loops, best of 3: 0.306 usec per loop
      $ python -mtimeit -s "import re; n=r'abra'" "n in 'abracadabra'"
      10000000 loops, best of 3: 0.0591 usec per loop
      $ python -mtimeit -s "import re; n=re.compile(r'\d+$')" "n.match('0123456789')"
      1000000 loops, best of 3: 0.511 usec per loop
      $ python -mtimeit -s "import re" "'0123456789'.isdigit()"
      10000000 loops, best of 3: 0.0945 usec per loop
      Extracted from PyMag Jan 2008
  • 15.
      $ python -mtimeit -s "import re;r=re.compile('pa|pe|pi|po|pu');h='patapetapitapotapuxa'" "r.search(h)"
      1000000 loops, best of 3: 0.383 usec per loop
      $ python -mtimeit -s "import re;n=['pa','pe','pi','po','pu'];h='patapetapitapotapuxa'" "any(x in h for x in n)"
      1000000 loops, best of 3: 0.914 usec per loop
      Extracted from PyMag Jan 2008
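The measurements on slides 14 and 15 were taken from the command line. A rough sketch of repeating the slide 15 comparison from inside a script, assuming only the standard `timeit` module, is shown here. The function names are illustrative, and absolute timings depend on the machine and Python version, so no expected numbers are given.

```python
import re
import timeit

haystack = "patapetapitapotapuxa"
needles = ["pa", "pe", "pi", "po", "pu"]
pattern = re.compile("|".join(needles))


def with_regex():
    """Single compiled alternation, as in the first command on slide 15."""
    return pattern.search(haystack) is not None


def with_any():
    """Plain substring tests, as in the second command on slide 15."""
    return any(needle in haystack for needle in needles)


regex_time = timeit.timeit(with_regex, number=100000)
any_time = timeit.timeit(with_any, number=100000)
print("regex: %.4fs  any/in: %.4fs" % (regex_time, any_time))
```

Both functions return the same answer for this input, so the comparison is about speed only.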
  • 16.
      from pyparsing import Word, Literal, Combine
      import string

      def doSum(s,l,tokens):
          return int(tokens[0]) + int(tokens[2])

      integer = Word(string.digits)
      addition = Combine(integer) + Literal('+') + Combine(integer)
      addition.setParseAction(doSum)

      >>> addition.parseString("5+7")
      ([12], {})
  • 17.
      import ply.lex as lex

      tokens = 'NUMBER', 'PLUS'
      t_PLUS = r'\+'

      def t_NUMBER(t):
          r'\d+'
          t.value = int(t.value)
          return t

      t_ignore = ' \t\n\w'

      def t_error(t):
          t.lexer.skip(1)

      lexer = lex.lex()
      Adapted from http://www.dabeaz.com
  • 18.
      import ply.yacc as yacc

      def p_expression_plus(p):
          'expression : expression PLUS expression'
          p[0] = p[1] + p[3]

      def p_factor_num(p):
          'expression : NUMBER'
          p[0] = p[1]

      def p_error(p):
          print "Syntax error in input!"

      parser = yacc.yacc()
      Adapted from http://www.dabeaz.com
  • 19.
      >>> parser.parse("1+2 + 45 \n + 10")
      58
      >>> parser.parse("Quanto vale 2 + 7")
      9
      >>> parser.parse("A soma 2 + 7 resulta em 9")
      Syntax error in input!
      >>> parser.parse("2 + 7 9")
      Syntax error in input!
      Adapted from http://www.dabeaz.com
  • 21.
      from nltk.tokenize import sent_tokenize, word_tokenize
      msg = "Congratulations to Erico and his team. PythonBrasil gets better every year. You are now the BiggestKahuna."

      >>> sent_tokenize(msg)
      ['Congratulations to Erico and his team.', 'PythonBrasil gets better every year.', 'You are now the BiggestKahuna.']
      >>> word_tokenize(msg)
      ['Congratulations', 'to', 'Erico', 'and', 'his', 'team.', 'PythonBrasil', 'gets', 'better', 'every', 'year.', 'You', 'are', 'now', 'the', 'BiggestKahuna', '.']
      Extracted from NLP with Python
  • 22.
      >>> def gender_features(word):
      ...     return {"last_letter": word[-1]}
      ...
      >>> import nltk
      >>> from nltk.corpus import names
      >>> len(names.words("male.txt"))
      2943
      >>> names = ([(name,'male') for name in names.words('male.txt')] +
      ...          [(name,'female') for name in names.words('female.txt')])
      >>> import random
      >>> random.shuffle(names)
      >>> featuresets = [(gender_features(n),g) for n,g in names]
      >>> train_set, test_set = featuresets[500:], featuresets[:500]
      >>> classifier = nltk.NaiveBayesClassifier.train(train_set)
      >>> classifier.classify(gender_features("Dorneles"))
      'male'
      >>> classifier.classify(gender_features("Magali"))
      'female'
      Extracted from NLP with Python
  • 24. A word from the sponsors...
  • 25. Thank you all for your attention.
      Rodrigo Dias Arruda Senra
      http://rodrigo.senra.nom.br
      rsenra@acm.org
      The opinions and conclusions expressed in this presentation are the sole responsibility of Rodrigo Senra. It is not necessary to request the author's permission to use part or all of this presentation, provided that no changes are made to the reused content and that this notice appears in full in the resulting material. Images and references to other works in this presentation remain the property of those who hold their copyrights.