NAPE ICE 2019
Outline
● Why Process Petrophysical Big Data?
● What Big Data processing challenges?
● ETL Workflow
● Conclusion
● References
Why Process Petrophysical Big Data?
● Re-evaluate old well logs for missed opportunities
● Conduct pre-drill analysis of offset wells
● Assess well / field reserves more effectively
● Infer geological features from the logs
What Big Data Processing Challenges?
● For 1 to 10 well log files?
- copying each link and pasting it into the browser is straightforward
- log data downloads quickly
- ETL is easy to perform on this amount of data
● For 1000 well log files ???
● Only entry point: Excel sheets with links to ~1000 well log files from 5 fields
Extract Transform Load
● Download each well log file individually from the web
● Read the log data from each file
● Enrich the metadata and actual data, and save in the Apache Arrow format before loading to an AWS S3 bucket
● GOAL: make the data ready for an Apache Spark ML and TensorFlow Deep Learning pipeline
ETL Workflow
● Link to ~1000 well log files from 5 fields in Excel sheets
● Download each well log file individually from the web
- get the links to the files
- append all the extracted links to a list
- account for errors
- save each file
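The download step above can be sketched in plain Python. This is a minimal version assuming the Excel sheets have already been read into rows of cells; the `download_dir` name and file layout are hypothetical:

```python
import os
import urllib.error
import urllib.request


def collect_links(rows):
    """Append every well-log URL found in the spreadsheet rows to a list."""
    links = []
    for row in rows:
        for cell in row:
            if isinstance(cell, str) and cell.startswith("http"):
                links.append(cell)
    return links


def download_all(links, download_dir="logs"):
    """Download each file, accounting for errors, and save it locally."""
    os.makedirs(download_dir, exist_ok=True)
    failed = []
    for url in links:
        name = url.rsplit("/", 1)[-1] or "unnamed.las"
        try:
            urllib.request.urlretrieve(url, os.path.join(download_dir, name))
        except (urllib.error.URLError, OSError):
            failed.append(url)  # record the failure and continue, don't abort
    return failed
```

Collecting failures in a list rather than raising keeps one dead link from stopping a ~1000-file batch.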
● Read the log data from each file
- extract the actual data and the metadata / header data
- account for errors
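The read step can be sketched as a minimal pure-Python parser for LAS-style log files (in practice a library such as lasio would do this); the section handling here is deliberately simplified:

```python
def parse_las(text):
    """Split a LAS file into sections, then pull out header metadata and data rows."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                       # skip blanks and comments
        if line.startswith("~"):
            current = line[1:2].upper()    # section flag: V, W, C, A, ...
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)

    # Header lines in the ~Well section look like "MNEM.UNIT  VALUE : DESCRIPTION"
    meta = {}
    for line in sections.get("W", []):
        if "." not in line or ":" not in line:
            continue                       # not a standard header line
        mnem, _, rest = line.partition(".")
        head = rest.partition(":")[0]
        if head[:1].isspace():
            value = head.strip()           # no unit directly after the dot
        else:
            parts = head.split(None, 1)    # first token is the unit
            value = parts[1].strip() if len(parts) > 1 else ""
        meta[mnem.strip()] = value

    # Numeric rows live in the ~ASCII section
    data = []
    for line in sections.get("A", []):
        try:
            data.append([float(x) for x in line.split()])
        except ValueError:
            pass                           # account for malformed rows, don't abort
    return meta, data
```

Returning metadata and data separately mirrors the enrichment step that follows, where header fields get joined back onto the curve data.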
● Enrich the metadata and actual data, and save in the Apache Arrow format before loading to an AWS S3 bucket
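The enrichment step can be sketched as tagging each data record with its well's header metadata before serialisation. The `FLD` / `API` mnemonics are assumptions, and the Arrow/S3 calls are shown only as comments since they need `pyarrow` and `boto3`:

```python
def enrich(records, meta):
    """Attach header metadata (field name, API number) to every data record."""
    tagged = []
    for rec in records:
        row = dict(rec)
        row["field"] = meta.get("FLD", "UNKNOWN")  # FLD/API mnemonics assumed
        row["api"] = meta.get("API", "UNKNOWN")
        tagged.append(row)
    return tagged


# With pyarrow and boto3 installed, the enriched rows would then be written
# out and uploaded, roughly (sketch, bucket/key names hypothetical):
#   import pyarrow as pa, pyarrow.parquet as pq, boto3
#   table = pa.Table.from_pylist(tagged)
#   pq.write_table(table, "well.parquet")
#   boto3.client("s3").upload_file("well.parquet", "my-bucket", "well.parquet")
```

Carrying field and API on every row is what makes the later group-by-field-and-API step possible once the files are merged.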
● Make the data ready for an Apache Spark ML / Keras Deep Learning pipeline
- drop columns: 152 down to 13; drop duplicates; handle null / NA and missing values
- split-apply-combine on data grouped by field and API: @pandas_udf
- cache the DataFrame
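In pandas terms, the clean-up above can be sketched as follows; in Spark the same operations run on a cached DataFrame with the group-wise step wrapped in a `@pandas_udf`. The `api` / `GR` column names and the per-well normalisation are illustrative assumptions:

```python
import pandas as pd


def prepare(df, keep_cols):
    """Reduce the curve set, drop duplicates, and handle missing values."""
    df = df[keep_cols].copy()              # e.g. 152 columns down to 13
    df = df.drop_duplicates()
    df = df.dropna(how="all")              # drop rows with no readings at all
    df = df.fillna(df.mean(numeric_only=True))  # fill remaining gaps
    # Split-apply-combine on data grouped per well (in Spark this body would
    # sit inside a @pandas_udf): min-max normalise gamma ray within each well.
    df["GR_norm"] = df.groupby("api")["GR"].transform(
        lambda s: (s - s.min()) / (s.max() - s.min())
    )
    return df
```

Mean-filling and min-max scaling are just one reasonable choice here; the point is that per-group work stays vectorised so the same code moves into a Spark `@pandas_udf` unchanged.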
Conclusion
● Apache Airflow to orchestrate the ETL process
● Moving towards real-time data processing:
- WITSML data processing
- candidate streaming frameworks: Apache Kafka, Apache Flink, Apache Storm, Apache Spark