SlideShare une entreprise Scribd logo
1  sur  19
Télécharger pour lire hors ligne
Scraping and mining
XML data with SAS
Herve J. Momo
Manulife Financial
TASS, SEP-25th, 2015
Toronto
Motivation
O Breast MRI Literature analysis on
PUBMED
• Which journals actually publish in the field?
• Which countries publish the most?
• Where is the journal located?
• What is the publication status in the field
o Initial publication,
o Recent publication,
o Evolution through the years.
What is XML data
O XML = eXtensible Markup Language
O Use as a medium for data exchange
O Just another Text file but with hierarchical
structure
O Understanding the structure is the key to
successful processing
O PUBMED data available in XML format
PUBMED is an archive of biomedical and life sciences
journal literature
What is XML data
<Books>
<Article>
<Title>SAS and Machine Learning for Beginner</Title>
<Year>2015</Year>
<Month>9</Month>
<Day>25</Day>
</Article>
<Article>
<Title>Sam Teach Yourself SAS in 21 days</Title>
<Year>2013</Year>
<Month>9</Month>
<Day>25</Day>
</Article>
</Books>
Title Year Month Day
SAS and Machine Learning for Beginner 2015 9 25
Sam Teach Yourself SAS in 21 days 2013 3 6
Thinking Excel?
O File contains about 2 millions
records
tools for reading XML files
O Using XML engine on a libname statement
libname xmlread xml "Absolute path to xmlfile";
O XML Mapper for more complex file
O GUI interface that create map for reading
O SAS Programing for even more complex
file
O The option available to me
Four Steps for Reading
complex XML
O Reading in complicated XML files into
SAS can be accomplished using a four
step process:
1. Explore and understand the structure
2. Decide what you want to extract
3. Write the program
4. Submit your program and validate the
extract
Explore and understand
Decide what you want
O Publication date
O Name of journal
O Title of Article
O Authors affiliation
O Country of affiliation
Writing your program: SAS
Function
O Some important character processing
functions for the program
o Index,
o Scan,
o Substr,
o Tranwrd
o Strip
o compress
o Prxparse,
o Prxmatch
Writing your program:
Algorithm
O Reading a record
filename inputfile "C:UsersmomoherDownloadspubmed_result.xml" lrecl=2048 ;
data pmidc;
INFILE &inputfile truncover;
INPUT record $2048.;
retain rtclid 1;
if index(record, '<ArticleId IdType="pubmed">')=1 then output ;
if index(record, '</PubmedArticle>')=1 then rtclid+1;
run;
O Cleaning a record
data pmidn(keep= pmid rtclid);
set pmidc;
pos1=length('<ArticleId IdType="pubmed">');
pos2=find(record, '</ArticleId>');
pmid1=substr(record, pos1+1, pos2-pos1-1);
pmid=input(pmid1, 12.);
run;
Very
important
Writing your program: Pattern
Matching
data affiliation;
set affiliationtemp;
length eml_add $60 affiliation $512;
if _n_=1 then do;
rx1=prxparse("/( +w*.ww*@ww*)|( +ww*@ww*)/");
rx2=prxparse("/Electronic address:/");
end;
retain rx1 rx2;
posn1=prxmatch(rx1, record); posn2=prxmatch(rx2, record);
if posn1 > 0 and posn2 >0 then do;
eml_add=strip(substr(record, posn2+length("Electronic address:")+1));
toremove= cat("Electronic address:","",eml_add);
affiliation=strip(TRANWRD(record, toremove, ''));
end;
else if posn1 > 0 then do;
eml_add=strip(substr(record, posn1));
affiliation=strip(TRANWRD(record, eml_add, ''));
end;
else do;
affiliation=strip(record); email="";
end;
run;
Match
email
pattern
Remove
email from
records
Validating the extract:
Geographical Data
/*Join all tables */
proc sql;
create table &outputlib..alldata as
select p.*, a.*, j.*, d.*, t.articletitle
from &outputlib..pmidn as p, author_country as a, &outputlib..journal
as j, &outputlib..pubdaten as d, &outputlib..article_title as t
where p.rtclid=a.rtclid and a.rtclid=j.rtclid and j.rtclid=d.rtclid and
d.rtclid=t.rtclid ;
quit;
/*Validate countries*/
proc sort data=&outputlib..alldata out=alldata nodupkey;
by country pmid journal;
run;
data real_country unknown_country;
merge alldata (in=f1) GEOGRAPHICAL_LOOKUP(in=f2);
by country;
if f1;
if f1 and f2 then output known_country;
else output unknown_country;
run;
Require
several
iterations
Sample Data
Some challenges encountered
O Affiliation challenges
o Same author with different contacts in the
same country
o Affiliation ending with email address after
country
o Affiliation ending with country
o Country information missing
o Several authors in the same country for the
same article
Country Analytics
Row
Country_Nam
e numb
1 United States 2568
2 Japan 553
3 Germany 552
4 Italy 471
5 United Kingdom 398
6 France 229
7 Netherlands 223
8 Canada 218
9 China 213
10 Turkey 162
11 Australia 104
12 Spain 97
13 Belgium 89
14 India 82
15 Israel 76
16 Austria 74
17 Taiwan 55
18 Switzerland 55
19 Greece 54
20 Sweden 47
0
500
1000
1500
2000
2500
numb(Mean)
A
ustraliaA
ustriaBelgiumCanadaChina
FranceGermanyGreeceIndia
Israel
Italy
Japan
Netherlands
Spain
SwedenSwitzerland
TaiwanTurkeyUnited
Kingdom
United
States
Country_Name
It’s Official USA publishes more than
anyone else
Journal Analytics
0
100
200
300
numb(Mean)
A
JR
A
m
JRoentgenol
A
cad
Radiol
A
ctaRadiol
A
nn.Surg.Oncol.
BrJRadiol
BreastBreastCancer
BreastCancerRes.Treat.
BreastJEurJRadiol
EurRadiol
JComputA
ssistTomogr
JM
agn
Reson
Imaging
J.Clin.Oncol.
M
agn
Reson
Imaging
M
agn
Reson
M
ed
M
ed
Phys
Plast.Reconstr.Surg.
Radiology
Rofojournal
Ro
w Journal Numb
1Radiology 327
2AJR Am J Roentgenol 323
3J Magn Reson Imaging 302
4Eur Radiol 236
5Eur J Radiol 225
6Breast J 190
7Magn Reson Med 161
8Breast Cancer 138
9Breast Cancer Res. Treat. 129
10Magn Reson Imaging 117
11Rofo 116
12Ann. Surg. Oncol. 102
13Acad Radiol 101
14Acta Radiol 97
15Med Phys 94
16J Comput Assist Tomogr 91
17Plast. Reconstr. Surg. 86
18Breast 83
19J. Clin. Oncol. 82
20Br J Radiol 82
Publication Date Analytics
0.0
2.5
5.0
7.5
10.0
12.5
15.0
Percent
1980 1990 2000 2010
pubdate
Thank you
O Herve Momo
o SAS Certified Base programmer for SAS 9
o SAS Certified Advanced Programmer for SAS 9
o Herve.momo@live.com
o http://www.hervemomo.info/
o Tel: 647-308-6075

Contenu connexe

Tendances

Another RDF Encoding Form
Another RDF Encoding FormAnother RDF Encoding Form
Another RDF Encoding FormJakob .
 
A Simple Introduction to R for Market Researchers
A Simple Introduction to R for Market ResearchersA Simple Introduction to R for Market Researchers
A Simple Introduction to R for Market ResearchersRay Poynter
 
Semantic web for ontology chapter4 bynk
Semantic web for ontology chapter4 bynkSemantic web for ontology chapter4 bynk
Semantic web for ontology chapter4 bynkNamgee Lee
 
Fine-grained Evaluation of SPARQL Endpoint Federation Systems
Fine-grained Evaluation of SPARQL Endpoint Federation SystemsFine-grained Evaluation of SPARQL Endpoint Federation Systems
Fine-grained Evaluation of SPARQL Endpoint Federation SystemsMuhammad Saleem
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
2013 CrossRef Workshops Boot Camp Introduction Patricia Feeney
2013 CrossRef Workshops Boot Camp Introduction Patricia Feeney2013 CrossRef Workshops Boot Camp Introduction Patricia Feeney
2013 CrossRef Workshops Boot Camp Introduction Patricia FeeneyCrossref
 
Efficient Query Answering against Dynamic RDF Databases
Efficient Query Answering against Dynamic RDF DatabasesEfficient Query Answering against Dynamic RDF Databases
Efficient Query Answering against Dynamic RDF DatabasesAlexandra Roatiș
 
Ks2008 Semanticweb In Action
Ks2008 Semanticweb In ActionKs2008 Semanticweb In Action
Ks2008 Semanticweb In ActionRinke Hoekstra
 
Find your way in Graph labyrinths
Find your way in Graph labyrinthsFind your way in Graph labyrinths
Find your way in Graph labyrinthsDaniel Camarda
 
Present Paper: Protecting Free Expression Online on Freenet
Present Paper: Protecting Free Expression Online on FreenetPresent Paper: Protecting Free Expression Online on Freenet
Present Paper: Protecting Free Expression Online on FreenetSivadon Chaisiri
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talkrtelmore
 

Tendances (14)

Another RDF Encoding Form
Another RDF Encoding FormAnother RDF Encoding Form
Another RDF Encoding Form
 
A Simple Introduction to R for Market Researchers
A Simple Introduction to R for Market ResearchersA Simple Introduction to R for Market Researchers
A Simple Introduction to R for Market Researchers
 
Semantic web for ontology chapter4 bynk
Semantic web for ontology chapter4 bynkSemantic web for ontology chapter4 bynk
Semantic web for ontology chapter4 bynk
 
Fine-grained Evaluation of SPARQL Endpoint Federation Systems
Fine-grained Evaluation of SPARQL Endpoint Federation SystemsFine-grained Evaluation of SPARQL Endpoint Federation Systems
Fine-grained Evaluation of SPARQL Endpoint Federation Systems
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
Sparql
SparqlSparql
Sparql
 
2013 CrossRef Workshops Boot Camp Introduction Patricia Feeney
2013 CrossRef Workshops Boot Camp Introduction Patricia Feeney2013 CrossRef Workshops Boot Camp Introduction Patricia Feeney
2013 CrossRef Workshops Boot Camp Introduction Patricia Feeney
 
Enhanced publication
Enhanced publicationEnhanced publication
Enhanced publication
 
Efficient Query Answering against Dynamic RDF Databases
Efficient Query Answering against Dynamic RDF DatabasesEfficient Query Answering against Dynamic RDF Databases
Efficient Query Answering against Dynamic RDF Databases
 
Ks2008 Semanticweb In Action
Ks2008 Semanticweb In ActionKs2008 Semanticweb In Action
Ks2008 Semanticweb In Action
 
7 advanced uses of rdfs
7 advanced uses of rdfs7 advanced uses of rdfs
7 advanced uses of rdfs
 
Find your way in Graph labyrinths
Find your way in Graph labyrinthsFind your way in Graph labyrinths
Find your way in Graph labyrinths
 
Present Paper: Protecting Free Expression Online on Freenet
Present Paper: Protecting Free Expression Online on FreenetPresent Paper: Protecting Free Expression Online on Freenet
Present Paper: Protecting Free Expression Online on Freenet
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talk
 

En vedette

How the English Media handled the Horsemeat Scandal?
How the English Media handled the Horsemeat Scandal?How the English Media handled the Horsemeat Scandal?
How the English Media handled the Horsemeat Scandal?KATERINADIM8
 
Matt Gregerson Resume - PDF
Matt Gregerson Resume - PDFMatt Gregerson Resume - PDF
Matt Gregerson Resume - PDFMatt Gregerson
 
M1 u2s4 ai_mauh_sist jurid mexico-alemania
M1 u2s4 ai_mauh_sist jurid mexico-alemaniaM1 u2s4 ai_mauh_sist jurid mexico-alemania
M1 u2s4 ai_mauh_sist jurid mexico-alemaniaMartin Ugalde Hdez
 
El desarrollo humano a través de los índices e indicadores de desarrollo asoc...
El desarrollo humano a través de los índices e indicadores de desarrollo asoc...El desarrollo humano a través de los índices e indicadores de desarrollo asoc...
El desarrollo humano a través de los índices e indicadores de desarrollo asoc...danielaanoriega
 
садок вишневий коло хати
садок вишневий коло хатисадок вишневий коло хати
садок вишневий коло хатиkotsiubasv
 
ΟΝΟΜΑΤΙΚΟΙ ΕΤΕΡΟΠΤΩΤΟΙ ΠΡΟΣΔΙΟΡΙΣΜΟΙ
ΟΝΟΜΑΤΙΚΟΙ ΕΤΕΡΟΠΤΩΤΟΙ ΠΡΟΣΔΙΟΡΙΣΜΟΙΟΝΟΜΑΤΙΚΟΙ ΕΤΕΡΟΠΤΩΤΟΙ ΠΡΟΣΔΙΟΡΙΣΜΟΙ
ΟΝΟΜΑΤΙΚΟΙ ΕΤΕΡΟΠΤΩΤΟΙ ΠΡΟΣΔΙΟΡΙΣΜΟΙemathites
 
Restaurant Crisis Communications and Reputation Management
Restaurant Crisis Communications and Reputation ManagementRestaurant Crisis Communications and Reputation Management
Restaurant Crisis Communications and Reputation ManagementAaron Allen
 

En vedette (11)

IPOL_STU(2015)542193_EN
IPOL_STU(2015)542193_ENIPOL_STU(2015)542193_EN
IPOL_STU(2015)542193_EN
 
How the English Media handled the Horsemeat Scandal?
How the English Media handled the Horsemeat Scandal?How the English Media handled the Horsemeat Scandal?
How the English Media handled the Horsemeat Scandal?
 
Matt Gregerson Resume - PDF
Matt Gregerson Resume - PDFMatt Gregerson Resume - PDF
Matt Gregerson Resume - PDF
 
M1 u2s4 ai_mauh_sist jurid mexico-alemania
M1 u2s4 ai_mauh_sist jurid mexico-alemaniaM1 u2s4 ai_mauh_sist jurid mexico-alemania
M1 u2s4 ai_mauh_sist jurid mexico-alemania
 
A tool to empower people to help themselves
A tool to empower people to help themselvesA tool to empower people to help themselves
A tool to empower people to help themselves
 
Update_11-18
Update_11-18Update_11-18
Update_11-18
 
El desarrollo humano a través de los índices e indicadores de desarrollo asoc...
El desarrollo humano a través de los índices e indicadores de desarrollo asoc...El desarrollo humano a través de los índices e indicadores de desarrollo asoc...
El desarrollo humano a través de los índices e indicadores de desarrollo asoc...
 
Paseo santa lucia monterrey
Paseo santa lucia monterreyPaseo santa lucia monterrey
Paseo santa lucia monterrey
 
садок вишневий коло хати
садок вишневий коло хатисадок вишневий коло хати
садок вишневий коло хати
 
ΟΝΟΜΑΤΙΚΟΙ ΕΤΕΡΟΠΤΩΤΟΙ ΠΡΟΣΔΙΟΡΙΣΜΟΙ
ΟΝΟΜΑΤΙΚΟΙ ΕΤΕΡΟΠΤΩΤΟΙ ΠΡΟΣΔΙΟΡΙΣΜΟΙΟΝΟΜΑΤΙΚΟΙ ΕΤΕΡΟΠΤΩΤΟΙ ΠΡΟΣΔΙΟΡΙΣΜΟΙ
ΟΝΟΜΑΤΙΚΟΙ ΕΤΕΡΟΠΤΩΤΟΙ ΠΡΟΣΔΙΟΡΙΣΜΟΙ
 
Restaurant Crisis Communications and Reputation Management
Restaurant Crisis Communications and Reputation ManagementRestaurant Crisis Communications and Reputation Management
Restaurant Crisis Communications and Reputation Management
 

Similaire à Scraping and mining XML data with SAS

The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataOntotext
 
Data Science Process.pptx
Data Science Process.pptxData Science Process.pptx
Data Science Process.pptxWidsoulDevil
 
Data Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystemData Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystemDataWorks Summit/Hadoop Summit
 
Utilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword researchUtilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword researchErudite
 
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Jim Czuprynski
 
Diary of a Wimpy Model Manager
Diary of a Wimpy Model ManagerDiary of a Wimpy Model Manager
Diary of a Wimpy Model ManagerEric Stephan
 
Skillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSkillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSchool of Data
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxMalla Reddy University
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)Toshiyuki Shimono
 
So your boss says you need to learn data science
So your boss says you need to learn data scienceSo your boss says you need to learn data science
So your boss says you need to learn data scienceSusan Ibach
 
Crystal reports seminar
Crystal reports seminarCrystal reports seminar
Crystal reports seminarteope_ruvina
 
Structured writing presentation to London Content Strategy Meetup
Structured writing presentation to London Content Strategy MeetupStructured writing presentation to London Content Strategy Meetup
Structured writing presentation to London Content Strategy MeetupEllis Pratt
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackDenodo
 
apidays LIVE Paris 2021 - Stargate.io, An OSS Api Layer for your Cassandra by...
apidays LIVE Paris 2021 - Stargate.io, An OSS Api Layer for your Cassandra by...apidays LIVE Paris 2021 - Stargate.io, An OSS Api Layer for your Cassandra by...
apidays LIVE Paris 2021 - Stargate.io, An OSS Api Layer for your Cassandra by...apidays
 
Mobile data collection and d viz presentation
Mobile data collection and d viz presentationMobile data collection and d viz presentation
Mobile data collection and d viz presentationMyo Min Oo
 

Similaire à Scraping and mining XML data with SAS (20)

The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
 
Data Science Course.pdf
Data Science Course.pdfData Science Course.pdf
Data Science Course.pdf
 
Data Science Process.pptx
Data Science Process.pptxData Science Process.pptx
Data Science Process.pptx
 
Expanding the content categories at JaLC
Expanding the content categories at JaLCExpanding the content categories at JaLC
Expanding the content categories at JaLC
 
Oracle.xml.publisher
Oracle.xml.publisherOracle.xml.publisher
Oracle.xml.publisher
 
Data Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystemData Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystem
 
Utilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword researchUtilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword research
 
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
 
Diary of a Wimpy Model Manager
Diary of a Wimpy Model ManagerDiary of a Wimpy Model Manager
Diary of a Wimpy Model Manager
 
pm1
pm1pm1
pm1
 
Skillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSkillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data Journalism
 
Drupal 7 and SolR
Drupal 7 and SolRDrupal 7 and SolR
Drupal 7 and SolR
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
 
So your boss says you need to learn data science
So your boss says you need to learn data scienceSo your boss says you need to learn data science
So your boss says you need to learn data science
 
Crystal reports seminar
Crystal reports seminarCrystal reports seminar
Crystal reports seminar
 
Structured writing presentation to London Content Strategy Meetup
Structured writing presentation to London Content Strategy MeetupStructured writing presentation to London Content Strategy Meetup
Structured writing presentation to London Content Strategy Meetup
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
 
apidays LIVE Paris 2021 - Stargate.io, An OSS Api Layer for your Cassandra by...
apidays LIVE Paris 2021 - Stargate.io, An OSS Api Layer for your Cassandra by...apidays LIVE Paris 2021 - Stargate.io, An OSS Api Layer for your Cassandra by...
apidays LIVE Paris 2021 - Stargate.io, An OSS Api Layer for your Cassandra by...
 
Mobile data collection and d viz presentation
Mobile data collection and d viz presentationMobile data collection and d viz presentation
Mobile data collection and d viz presentation
 

Scraping and mining XML data with SAS

  • 1. Scraping and mining XML data with SAS Herve J. Momo Manulife Financial TASS, SEP-25th, 2015 Toronto
  • 2. Motivation O Breast MRI Literature analysis on PUBMED • Which journals actually publish in the field? • Which countries publish the most? • Where is the journal located? • What is the publication status in the field o Initial publication, o Recent publication, o Evolution through the years.
  • 3. What is XML data O XML = eXtensible Markup Language O Use as a medium for data exchange O Just another Text file but with hierarchical structure O Understanding the structure is the key to successful processing O PUBMED data available in XML format PUBMED is an archive of biomedical and life sciences journal literature
  • 4. What is XML data <Books> <Article> <Title>SAS and Machine Learning for Beginner</Title> <Year>2015</Year> <Month>9</Month> <Day>25</Day> </Article> <Article> <Title>Sam Teach Yourself SAS in 21 days</Title> <Year>2013</Year> <Month>9</Month> <Day>25</Day> </Article> </Books> Title Year Month Day SAS and Machine Learning for Beginner 2015 9 25 Sam Teach Yourself SAS in 21 days 2013 3 6
  • 5. Thinking Excel? O File contains about 2 millions records
  • 6. tools for reading XML files O Using XML engine on a libname statement libname xmlread xml "Absolute path to xmlfile"; O XML Mapper for more complex file O GUI interface that create map for reading O SAS Programing for even more complex file O The option available to me
  • 7. Four Steps for Reading complex XML O Reading in complicated XML files into SAS can be accomplished using a four step process: 1. Explore and understand the structure 2. Decide what you want to extract 3. Write the program 4. Submit your program and validate the extract
  • 9. Decide what you want O Publication date O Name of journal O Title of Article O Authors affiliation O Country of affiliation
  • 10. Writing your program: SAS Function O Some important character processing functions for the program o Index, o Scan, o Substr, o Tranwrd o Strip o compress o Prxparse, o Prxmatch
  • 11. Writing your program: Algorithm O Reading a record filename inputfile "C:UsersmomoherDownloadspubmed_result.xml" lrecl=2048 ; data pmidc; INFILE &inputfile truncover; INPUT record $2048.; retain rtclid 1; if index(record, '<ArticleId IdType="pubmed">')=1 then output ; if index(record, '</PubmedArticle>')=1 then rtclid+1; run; O Cleaning a record data pmidn(keep= pmid rtclid); set pmidc; pos1=length('<ArticleId IdType="pubmed">'); pos2=find(record, '</ArticleId>'); pmid1=substr(record, pos1+1, pos2-pos1-1); pmid=input(pmid1, 12.); run; Very important
  • 12. Writing your program: Pattern Matching data affiliation; set affiliationtemp; length eml_add $60 affiliation $512; if _n_=1 then do; rx1=prxparse("/( +w*.ww*@ww*)|( +ww*@ww*)/"); rx2=prxparse("/Electronic address:/"); end; retain rx1 rx2; posn1=prxmatch(rx1, record); posn2=prxmatch(rx2, record); if posn1 > 0 and posn2 >0 then do; eml_add=strip(substr(record, posn2+length("Electronic address:")+1)); toremove= cat("Electronic address:","",eml_add); affiliation=strip(TRANWRD(record, toremove, '')); end; else if posn1 > 0 then do; eml_add=strip(substr(record, posn1)); affiliation=strip(TRANWRD(record, eml_add, '')); end; else do; affiliation=strip(record); email=""; end; run; Match email pattern Remove email from records
  • 13. Validating the extract: Geographical Data /*Join all tables */ proc sql; create table &outputlib..alldata as select p.*, a.*, j.*, d.*, t.articletitle from &outputlib..pmidn as p, author_country as a, &outputlib..journal as j, &outputlib..pubdaten as d, &outputlib..article_title as t where p.rtclid=a.rtclid and a.rtclid=j.rtclid and j.rtclid=d.rtclid and d.rtclid=t.rtclid ; quit; /*Validate countries*/ proc sort data=&outputlib..alldata out=alldata nodupkey; by country pmid journal; run; data real_country unknown_country; merge alldata (in=f1) GEOGRAPHICAL_LOOKUP(in=f2); by country; if f1; if f1 and f2 then output known_country; else output unknown_country; run; Require several iterations
  • 15. Some challenges encountered O Affiliation challenges o Same author with different contacts in the same country o Affiliation ending with email address after country o Affiliation ending with country o Country information missing o Several authors in the same country for the same article
  • 16. Country Analytics Row Country_Nam e numb 1 United States 2568 2 Japan 553 3 Germany 552 4 Italy 471 5 United Kingdom 398 6 France 229 7 Netherlands 223 8 Canada 218 9 China 213 10 Turkey 162 11 Australia 104 12 Spain 97 13 Belgium 89 14 India 82 15 Israel 76 16 Austria 74 17 Taiwan 55 18 Switzerland 55 19 Greece 54 20 Sweden 47 0 500 1000 1500 2000 2500 numb(Mean) A ustraliaA ustriaBelgiumCanadaChina FranceGermanyGreeceIndia Israel Italy Japan Netherlands Spain SwedenSwitzerland TaiwanTurkeyUnited Kingdom United States Country_Name It’s Official USA publishes more than anyone else
  • 17. Journal Analytics 0 100 200 300 numb(Mean) A JR A m JRoentgenol A cad Radiol A ctaRadiol A nn.Surg.Oncol. BrJRadiol BreastBreastCancer BreastCancerRes.Treat. BreastJEurJRadiol EurRadiol JComputA ssistTomogr JM agn Reson Imaging J.Clin.Oncol. M agn Reson Imaging M agn Reson M ed M ed Phys Plast.Reconstr.Surg. Radiology Rofojournal Ro w Journal Numb 1Radiology 327 2AJR Am J Roentgenol 323 3J Magn Reson Imaging 302 4Eur Radiol 236 5Eur J Radiol 225 6Breast J 190 7Magn Reson Med 161 8Breast Cancer 138 9Breast Cancer Res. Treat. 129 10Magn Reson Imaging 117 11Rofo 116 12Ann. Surg. Oncol. 102 13Acad Radiol 101 14Acta Radiol 97 15Med Phys 94 16J Comput Assist Tomogr 91 17Plast. Reconstr. Surg. 86 18Breast 83 19J. Clin. Oncol. 82 20Br J Radiol 82
  • 19. Thank you O Herve Momo o SAS Certified Base programmer for SAS 9 o SAS Certified Advanced Programmer for SAS 9 o Herve.momo@live.com o http://www.hervemomo.info/ o Tel: 647-308-6075