The talk describes the field of Data Science and gives examples of several multi-model (tables, text, and graphs) and multidisciplinary (biology, nursing, education) applications.
2. Agenda
●
Data Science: History and Definitions
●
Data deluge, information economy
●
Spectrum of structure and data models
– Tabular data
– Graphs
– Text
●
Examples of multi-model and multidisciplinary research
3. Introduction
●
Professor of Databases and Data Science at UTFPR
●
Research interests: Big Data, NLP, IR, complex networks, ML, privacy
●
Degree in Computer Science, “frustrated biologist”
5. World of Data
●
Likes on social networks
●
Web pages
●
Student grades
●
Instagram photos
●
Pokémon locations
●
Television signals
●
Checking account balances
●
Products for sale
●
Satellite images
●
Medical exams
●
Measurements of methane levels in the atmosphere of Mars
●
Telemetry from an F1 car
8. Data Science - History
●
Science has always been Data Science
●
Tycho Brahe (1546-1601) and Johannes Kepler (1571-1630)
discovered the laws of planetary motion by collecting
and analysing a large volume of observational data
●
What has changed now:
– The amount of data being generated
– The new methods and tech for analysis
– The dependency of our society on the generated
knowledge
11. Computers
●
Digital production and processing of data
●
Database Management Systems (DBMSs)
●
Data Analysis limited to large corporations
12. Internet, cellphones,
sensors, data storage...
●
Fast and cheap
communication for
everyone
●
Massive data production
and consumption
●
Commercial drive for new
data management tech
●
Data-driven economy
●
Data-driven science
14. Information Deluge
●
1 billion users connected on Facebook (23/08/2015)
●
2 billion smartphones in the world, 1 billion websites
●
300 hours of video uploaded to YouTube every minute
●
Google, Amazon, Microsoft and Facebook =
1,200 petabytes =
1,200,000,000,000,000,000 bytes =
5 stacks of CDs reaching the International Space Station
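The CD comparison can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a 700 MB disc that is 1.2 mm thick and an orbital altitude of about 400 km for the Space Station. None of these three values appear on the slide, so the result is only a plausibility check.

```python
# Back-of-the-envelope check of the "5 stacks of CDs" comparison.
# Only TOTAL_BYTES comes from the slide; the other constants are assumptions.
TOTAL_BYTES = 1_200 * 10**15      # 1,200 petabytes (1 PB = 10^15 bytes)
CD_CAPACITY_BYTES = 700 * 10**6   # assumption: 700 MB per CD
CD_THICKNESS_M = 1.2e-3           # assumption: 1.2 mm per disc
ISS_ALTITUDE_M = 400_000          # assumption: roughly 400 km


def cd_stacks_to_iss(total_bytes: int) -> float:
    """Return how many ISS-high stacks of CDs would hold total_bytes."""
    n_discs = total_bytes / CD_CAPACITY_BYTES
    stack_height_m = n_discs * CD_THICKNESS_M
    return stack_height_m / ISS_ALTITUDE_M


stacks = cd_stacks_to_iss(TOTAL_BYTES)
print(f"{stacks:.1f} stacks")
```

Under these assumptions the result lands close to five stacks, which agrees with the figure on the slide.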
17. Big Data
●
Data sets that are so large or complex
that traditional data processing
applications are inadequate
●
Challenges: analysis, capture, data
curation, search, sharing, storage,
transfer, visualization, querying, updating
and information privacy
●
Predictive analytics, user behavior
analytics
18. Information economy
●
“The world’s most valuable resource is no
longer oil, but data”
●
Data companies are the most valuable
listed firms in the world
●
The nature of data makes the antitrust
remedies of the past less useful
The Economist - May 6th 2017
20. The Fourth Paradigm
●
A thousand years ago: science was empirical;
describing natural phenomena
●
Last few hundred years: theoretical branch;
using models, generalizations
●
Last few decades: a computational branch;
simulating complex phenomena
●
Today: data exploration (eScience), unifying
theory, experiment, and simulation
(Jim Gray, 2007)
22. Data Science
Data Science is an interdisciplinary field
about scientific methods, processes and
systems to extract knowledge or insights
from data in various forms [1].
Data science is a "concept to unify statistics,
data analysis and their related methods" in
order to "understand and analyze actual
phenomena" with data [2].
[1] Dhar, V. (2013). "Data science and prediction"
[2] Hayashi, Chikio (1998). "What is Data Science?"
23. Data Science vs. Statistics
Dictionary definitions of statistical inference
tend to equate it with the entire discipline. This
has become less satisfactory in the “big data”
era of immense computer-based processing
algorithms. [...]
Very broadly speaking, algorithms are what
statisticians do while inference says why they
do them. A particularly energetic brand of the
statistical enterprise has flourished in the new
century, data science, emphasizing algorithmic
thinking rather than its inferential justification.
26. Tables
Red List of Threatened Species
(International Union for Conservation of Nature)
27. ON THE ECOLOGY OF HUMAN
CARNIVORY
ZULMIRA COIMBRA
ADVISOR: FERNANDO
FERNANDEZ
43. Text
Major threats to the species include cattle grazing,
agriculture activities and mining activities throughout
its range. A museum specimen collected from Reserva
Forestal de Yotoco in 1996 tested positive for
Batrachochytrium dendrobatidis (Velasquez et al.
2007). The presence of chytrid in this species in 1996 is
consistent with the timing of the declines observed in
the Yotoco subpopulations at the end of the 1990s, as
well as the timing of other Bd declines in montane
Andean species, suggesting it as a plausible, but
unconfirmed cause. However, the species can still be
found within Reserva Forestal de Yotoco (Velasquez et
al. 2007).
44. NLP - Levels of
Representation
Morphology
Syntax
Explicit
Semantics
Full Semantics
Words
Also, higher representations require the lower ones
49. Research
●
Analysis of the association between in-class
social networks and academic performance
●
Goal: understand how the circle of friends
may influence students' grades
50. Case 1: In-class social networks
and academic performance
class social graph
grades spreadsheet
51. Results - Classes
Mean number of connections vs. mean final grade: correlation 0.75 (p = 0.087)
Mean clustering coefficient vs. mean assignment grade: correlation 0.64 (p = 0.167)
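A correlation like the ones reported on this slide can be computed from one pair of values per class. The slides do not say which coefficient was used, so the sketch below assumes Pearson's r. The per-class numbers are invented for illustration and are not the study's data. The p-values on the slide would need a separate significance test that is not shown here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-class values (one entry per class), for illustration only.
mean_connections = [2.0, 3.5, 4.0, 5.5, 6.0]
mean_final_grade = [5.5, 6.0, 7.0, 6.5, 8.0]

r = pearson_r(mean_connections, mean_final_grade)
print(f"r = {r:.2f}")
```

With only a handful of classes, even a fairly high coefficient can fail the usual 0.05 threshold, as both p-values on the slide do.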
56. Student improvement
Students with friends who performed poorly on Exam 1 improved by only 0.5 points
on average on Exam 2 (p<0.01). Students with friends who performed well, in
contrast, improved by 1.9 points on average, almost a 4-fold gain
compared with the other group. This suggests that having friends with good
academic performance has a direct impact on students' grades.
57. Results - Students
●
Significant correlation between students' final grade
and eigenvector centrality (correlation: 0.48),
higher than with average neighbor degree
(correlation: 0.40), suggesting the importance
of network topology
●
Weak negative correlation between grades and
betweenness centrality
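The two node-level measures compared above can be computed with a few lines of code. This is a minimal sketch, not the study's implementation. It assumes an undirected friendship graph stored as a dictionary of neighbor sets, uses power iteration on the adjacency matrix plus the identity for eigenvector centrality, and runs on a made-up five-student graph with hypothetical names.

```python
import math


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on (A + I); adj maps each node to its set of neighbors."""
    nodes = list(adj)
    x = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        # Adding x[v] (the identity shift) avoids oscillation on bipartite graphs.
        new = {v: x[v] + sum(x[u] for u in adj[v]) for v in nodes}
        norm = math.sqrt(sum(val * val for val in new.values()))
        new = {v: val / norm for v, val in new.items()}
        if sum(abs(new[v] - x[v]) for v in nodes) < tol:
            return new
        x = new
    return x


def average_neighbor_degree(adj):
    """Mean degree of each node's neighbors (0.0 for isolated nodes)."""
    return {
        v: sum(len(adj[u]) for u in adj[v]) / len(adj[v]) if adj[v] else 0.0
        for v in adj
    }


# Hypothetical in-class friendship graph, for illustration only.
friends = {
    "Ana": {"Bruno", "Carla", "Davi"},
    "Bruno": {"Ana", "Carla"},
    "Carla": {"Ana", "Bruno"},
    "Davi": {"Ana", "Eva"},
    "Eva": {"Davi"},
}

centrality = eigenvector_centrality(friends)
neighbor_degree = average_neighbor_degree(friends)
for student in friends:
    print(student, round(centrality[student], 3), neighbor_degree[student])
```

Eigenvector centrality rewards being connected to well-connected students across the whole network, while average neighbor degree looks only one step away. That difference is one way to read the slide's remark about network topology.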
59. Research
●
Analysis of networks of entities cited in fake news
●
Goal: understand how entities are
mentioned and related in fake news and
how their prevalence correlates with
political events
60. Text → Elks → Graph
...Moro investigates Lula in
Operation Lava-Jato...
...Lula meets with
Dilma to discuss...
Entities in the first excerpt: Moro, Lula, Lava-Jato
Entities in the second excerpt: Dilma, Lula
Graph nodes: Moro, Lula, Lava-Jato, Dilma
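The step from text to graph can be sketched as a co-citation count. The slides do not state how edges are defined, so the sketch below assumes that two entities are linked whenever they appear in the same excerpt. It also assumes entity extraction has already run, and starts from the entity lists of the two excerpts on this slide. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(entity_lists):
    """Count how often each unordered pair of entities shares an excerpt."""
    edges = Counter()
    for entities in entity_lists:
        # sorted(set(...)) gives each pair a single canonical orientation.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Entities per excerpt, taken from the two examples on the slide.
mentions = [
    ["Moro", "Lula", "Lava-Jato"],
    ["Lula", "Dilma"],
]

graph = cooccurrence_graph(mentions)
for (a, b), weight in sorted(graph.items()):
    print(a, "--", b, weight)
```

Here Lula ends up connected to all three other entities, while Dilma and Moro are not linked directly because they never share an excerpt.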
63. Topic identification
●
Graph clustering algorithm (modularity)
●
Clusters represent entities that are frequently co-cited
●
Clusters are used as representatives of topics
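Modularity-based clustering searches for the grouping of nodes that maximizes a modularity score. The sketch below only evaluates that score for a given grouping, using Newman's definition for an undirected, unweighted graph. It does not perform the search, and the slides do not say which optimization algorithm was used. The six-node graph and its labels are hypothetical.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition: sum over communities of L_c/m - (d_c/2m)^2."""
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    internal = {}    # community -> number of edges with both ends inside it
    degree_sum = {}  # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Hypothetical graph: two triangles joined by a single bridge edge (C-D).
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"),
         ("C", "D")]
two_topics = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_topic = {node: 0 for node in two_topics}

q_two = modularity(edges, two_topics)
q_one = modularity(edges, one_topic)
print(f"two topics: {q_two:.3f}, one topic: {q_one:.3f}")
```

Splitting the graph at the bridge scores higher than lumping everything together, which is the behavior a modularity-based method relies on to separate groups of co-cited entities.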
68. Ongoing work
●
Understand intentionality in the use of
metaphors in fake news
●
Use the text of the IUCN table to
automatically classify threats
●
Compare the evolution of fake news topics
with political events
●
Study the formation of student groups
●
Assess the impact of endangered species
on the food web
69. Data Science for a Cause Project
(Ciência de Dados por uma Causa)
Website: http://dainf.ct.utfpr.edu.br/umacausa
Facebook: https://fb.me/cienciadadoscausa/