A base de dados SciELO armazena e disponibiliza registros de metadados e de textos completos de aproximadamente 700 mil artigos de diferentes disciplinas e idiomas, originadas das coleções nacionais de periódicos da Rede SciELO de 16 países. Os registros de metadados contém os campos de dados das referências bibliográficas dos artigos (título, autor, periódico fonte, data, resumo e palavras chaves) e das referências dos documentos citados nos textos artigos (título, autor, fonte, data) que são disponibilizados pelo SciELO em acesso aberto com atribuição CC-BY.
Ao mesmo tempo, os metadados de todos os artigos indexados no SciELO, publicados nos últimos 10 anos, estão também armazenados e disponibilizados na base de dados do SciELO Citation Index na plataforma WoS e na base de dados Dimensions. Da mesma, forma estão disponíveis nas bases de dados Scopus e WoS os metadados dos artigos dos periódicos indexados por esta base.
O SciELO constitui portanto uma notável fonte de dados bibliométricos para o estudo das produções científicas dos países da Rede SciELO.
O escopo deste grupo de trabalho / workshop é compartilhar metodologias e tecnologias de acesso e exploração dos dados da base de dados SciELO, com destaque para a introdução ao uso de técnicas de ciência de dados com o concurso da linguagem Python, o acesso aos dados com a linguagem R com vistas a análises estatísticas e técnicas de acesso e uso dos dados das bases de dados SciELO Citation Index e Dimensions.
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
Danilo Bellini - GT6: Workshop sobre uso dos dados da base de dados SciELO
1. WG6 Report
SciELO 20 Years
WG6 presentation title:
Lecturer/speaker/rapporteur:
Group coordinator:
Executive secretary:
Workshop on the use of datafrom
the SciELO database
Danilo J. S. Bellini
Gustavo Fonseca
Carolina Tanigushi
WG6 date:
Report date:
2018-09-24
2018-09-25
Venue: Tivoli Mofarrej São Paulo Hotel
WG6 Report S c iE L O 20 Years 1 / 8
2. Summary
During the workshop, these had been done:
Brief introduction to the data analysis processes, emphasizing
data access, data cleaning and exploratory data analysis
Hands-on examples using Python and R, with emphasis in the
Pandas library resources, showing how data munging,
normalization, data visualization and other data analysis
processes can beperformed
Explanation of data analyses previously performed on data
coming from SciELO and external sources
Most of the time was spent on exploratory data analysis,
interpretation of descriptive statistics and visualization. Some
highlights of what had been studied in more depth include:
Hirsch index
Google Scholar’s h5-index and h5-median
Calculation from raw Dimensions’ data
SCImagoJR’s H index
Field Citation Ratio (FCR) from Dimensions
WG6 Report S c iE L O 20 Years 2 / 8
3. Tools
Two programming languages had been used, besides several:
Python
IPython / Jupyter Notebook
Python built-in modules (csv, s t a t i s t i c s , urllib, json,
glob, os, re, collections, itertools, pprint)
numpy
pandas
matplotlib
seaborn
openpyxl, to open XLSX files
scipy.stats, to calculate the Pearson’s correlation coefficient
NetworkX, a graph manipulation library including an API to
draw graphs with matplotlib
R
R built-in modules (base, utils, stats, graphics)
R Studio, an IDE for R, for creating R Markdown notebooks
dplyr, to perform grouping operations similar to SQL’s GROUP
BYand Pandas’ DataFrame.groupby
WG6 Report S c iE L O 20 Years 3 / 8
4. Data sources
Several analyses were performed before the workshop, whose
processes and results were part of it. The data that had been
studied in every analysis performed during and before the workshop
came from thesesources:
SciELO’s JSON APIs (RESTful) from:
ArticleMeta, to get journal metadata
Ratchet, to get access data
SciELO’s articlemeta Python library, an alternative way to
access the ArticleMeta API
Reports from the SciELO Analytics
SciELO Citation Index entries from the Web of Science
Dimensions data regarding two journals: Nauplius and
Brazilian Journal of Plant Physiology
SCImagoJR’s CSV with all SJR and H indices for 2017
Scopus’ XLSX with all the data they make available
WG6 Report S c iE L O 20 Years 4 / 8
5. Introduction to the analysis
Besides:
Identifying which collections have data in SciELO analytics (all
certified and development collections, besides the active
independent collections)
Downloading all SciELO analytics reports
Evaluating if the network reports have everything from the
remaining reports
Simplifying the column names
Normalizing/cleaning the ISSN when dealing with multiple
collections
Normalizing the thematic area (dealing with unfilled data)
It had been seenhow to plot data, with a strong emphasis on data
interpretation and multiple types of plots (bar plots, line plots,
box-and-whisker plots, heat maps, scatterplots, etc.), as well as
subplot splitting/grouping.
WG6 Report S c iE L O 20 Years 5 / 8
6. Previously prepared analyses
Number of indexed journals in the SciELO network
Deindexing reason in the SciELO Brazil collection
Evaluating the daily access in the SciELO Brazil collection
Three indices in Scopus 2017: CiteScore, SNIP and SJR
SCImago Journal Rank in 2017, including SJR and H index
FCR and H index in Dimensions
Google Scholar indices
Languages of research articles in SciELO Brazil, by thematic
area, document publication year and journal indexing year
Citations in the SciELO CI
Proportion of Brazil in affiliation institutions in research
articles from journals in the SciELO Brazil collection
WG6 Report S c iE L O 20 Years 6 / 8
7. Results
The proportion of Brazilian affiliations of research articles in
the SciELO Brazil collection is decreasing
Most citations of research articles in the SciELO network come
from documents/journals that aren’t in the SciELO network.
For research articles written in English, 76%of the received
citations comes from documents external to the SciELO
network
The normalization step when calculating the FCR and its
non-standard average calculation can easily push down the
result (e.g. a journal with 10 documents receiving 15 citations
and 3 documents with zero citations would have an average of
citations of less than 7.5, before this number gets normalized
by the year and field of research), making it an index best fit to
evaluate older journals that are no longer publishing
We should always look for the mathematics that defines an
index, as that evaluation can already give us some insights
regarding its bias towards some documents/journals
WG6 Report S c iE L O 20 Years 7 / 8
8. Results
Scopus indices should be taken with care: mixing the data from
all countries makes it hard to compare data from SciELO and
from other journals
In SciELO Brazil, 95% of the journals marked asdeceased
were actually just renamed to a new journal title/ISSN
Matching data with external sources is difficult without a
common standardized index such as the ISSN and DOI
0.8% of SciELO journals are marked in Scopus as not open,
which seem to be an issue regarding Scopus data
WG6 Report S c iE L O 20 Years 8 / 8