GI2016 ppt shi (big data analytics on the internet)


  1. BIG DATA ANALYTICS ON THE INTERNET. Dr. Shaozhong SHI, drshishaozhong@gmail.com
  2. Drawing data from geographically dispersed data stores over the Internet
     - This showcase demonstrates remote, international access to open data and applications over the Internet.
     - It shows how big data analytics can be automated over the Internet.
     - It shows the importance of standardised, accessible data.
     - It illustrates, with a live example, how Open Source tools can be used to advance big data analytics.
  3. Drawing data from geographically dispersed data stores over the Internet
     - It presents the design of a new application built with Open Source tools such as Pandas, NumPy and Matplotlib.
     - It explains how sourcing data, processing it and generating analytical output can be fully automated.
     - It shows the importance of data standardisation and the role of geographical identifiers in automated data processing.
  4. Some key solutions for working across multiple Pandas dataframes (tables)
     - This presentation covers key solutions for data linkage, data integration, working across multiple Pandas dataframes (tables) and automated processing.
     - These solutions make automated, exact processing of records possible.
     - The showcase implementation is provided in an IPython notebook: http://dev.mapofagriculture.com:9999/ipython/notebooks/sshaozhong/2016-05-16_Automatic_Aggregation_Disaggregation_Showcase.ipynb
  5. Original Online Data from USGS
     - The original data is a large, well-structured Excel sheet on the USGS website: http://water.usgs.gov/pubs/sir/2006/5012/excel/Nutrient_Inputs_1982-2001jan06.xls
     - It is the input to the program and is geo-indexed with Federal Information Processing Standards (FIPS) codes.
     - The newly developed program reads the data and stores it as a Pandas dataframe table.
     - A subset of the data was extracted into a second Pandas dataframe table, which serves as the input table.
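
The reading and subsetting steps on this slide could be sketched as follows. This is a minimal illustration, not the notebook's code: the function names, the `FIPS` column and the `farm_N_` / `nonfarm_N_` column prefixes are assumptions, since the slides do not list the spreadsheet's actual column names.

```python
import pandas as pd

# URL as printed on the slide, with the line-break space removed.
USGS_URL = (
    "http://water.usgs.gov/pubs/sir/2006/5012/excel/"
    "Nutrient_Inputs_1982-2001jan06.xls"
)


def read_source(source=USGS_URL):
    """Read the workbook into a DataFrame (reading .xls files needs xlrd)."""
    return pd.read_excel(source)


def extract_subset(table, id_columns, value_prefixes, years):
    """Keep the identifier columns plus one value column per prefix and year.

    Column names are assumed to look like '<prefix><year>'; this naming
    is hypothetical.
    """
    wanted = list(id_columns)
    for year in years:
        for prefix in value_prefixes:
            wanted.append(f"{prefix}{year}")
    return table.loc[:, wanted].copy()


# Tiny made-up table standing in for the downloaded sheet.
demo = pd.DataFrame(
    {
        "FIPS": ["01001", "01003"],
        "farm_N_1986": [9.0, 9.0],
        "farm_N_1987": [1.0, 2.0],
        "nonfarm_N_1987": [0.1, 0.2],
    }
)
subset = extract_subset(demo, ["FIPS"], ["farm_N_", "nonfarm_N_"], [1987])
print(subset)
```

Calling `read_source()` with no argument would download the workbook, so the demo above works on a small in-memory table instead.
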
  6. Original data: nitrogen input from fertilizer use (kilograms) for each year from 1987 to 2001. A subset of a large spreadsheet.
  7. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - The primary questions this work set out to answer are whether automated means can be designed and developed for data integration and integrated processing of agricultural census datasets, and whether values can be automatically aggregated by state and state-level values dis-aggregated into county-level values.
     - To this end, exploratory design, development and testing were carried out. An integrated set of algorithms was researched, designed, implemented and tested on the Map of Agriculture platform.
     - The integrated algorithms are collectively called Data Linkage for Data Integration and Automated Aggregation and Dis-aggregation.
  8. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - Automated Aggregation and Dis-aggregation is a prototype program developed to enable rapid development of data integration and integrated processing with Open Source Python tools and libraries.
     - It uses Python with the Pandas and NumPy libraries.
     - Its characteristics are efficient and exact data integration, data inflow to and outflow from Pandas dataframe tables, and integrated processing.
  9. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - Two sets of algorithmic solutions are implemented: automatic online sourcing of structured data, and Automated Aggregation and Dis-aggregation itself.
     - The first accesses and reads in a dataset.
     - The second carries out integrated processing: it aggregates county-level statistics into state-level statistics and dis-aggregates state-level statistics into county-level statistics by rule.
     - Automated aggregation uses addition and summation.
     - A loop sums the farm and non-farm statistics at county level for each year from 1987 to 2001.
     - Aggregated state-level statistics are produced by using the State FIPS codes as the key.
  10. Working of the processing. Aggregation: input
  11. Working of the processing. Aggregation: output of adding farm and non-farm statistics, repeated for all years
  12. Working of the processing. Aggregation: output of applying groupby with StateFIPS as the key
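
The aggregation described on slides 9 to 12 (a loop that adds farm and non-farm values per year, then a groupby on the State FIPS code) might look roughly like this. The column names `farm_<year>`, `nonfarm_<year>` and `total_<year>` are hypothetical, and the numbers are made up for illustration.

```python
import pandas as pd

YEARS = range(1987, 2002)  # 1987 to 2001 inclusive


def aggregate_by_state(counties, years=YEARS):
    """Add farm and non-farm values per year, then sum counties by state."""
    table = counties.copy()
    total_cols = []
    for year in years:
        total = f"total_{year}"
        # County-level total for this year (hypothetical column naming).
        table[total] = table[f"farm_{year}"] + table[f"nonfarm_{year}"]
        total_cols.append(total)
    # StateFIPS is the key; the result is indexed by it.
    return table.groupby("StateFIPS")[total_cols].sum()


# Made-up county rows for two states and a single year.
counties = pd.DataFrame(
    {
        "StateFIPS": ["01", "01", "02"],
        "FIPS": ["01001", "01003", "02013"],
        "farm_1987": [1.0, 2.0, 3.0],
        "nonfarm_1987": [10.0, 20.0, 30.0],
    }
)
state_totals = aggregate_by_state(counties, years=[1987])
print(state_totals)
```
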
  13. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - The output of automatic aggregation is a Pandas dataframe table indexed by State FIPS code.
     - Dis-aggregation: the showcase uses a rule which assumes that each county contributes to the state-level statistic in proportion to its share of the state's area.
     - The total land area of each state is collected from the aggregated output table through a vLookup-style lookup and stored in a Python dictionary as a geo-referenced dataset.
     - These totals are mapped into the correct positions in a new column of the intermediary table used to produce the dis-aggregated statistics.
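
One way to express the lookup on this slide in Pandas is to build a dictionary keyed by State FIPS code and map it back onto the county rows. The sketch below assumes a hypothetical `area` column and made-up values; it is not taken from the notebook.

```python
import pandas as pd


def attach_state_totals(counties, value_col, key="StateFIPS"):
    """Return the county table with a new column holding its state's total."""
    # Dictionary of state totals, keyed by State FIPS code.
    state_totals = counties.groupby(key)[value_col].sum().to_dict()
    out = counties.copy()
    # vLookup-style step: each row receives the total of its own state.
    out[f"state_{value_col}"] = out[key].map(state_totals)
    return out, state_totals


counties = pd.DataFrame(
    {
        "StateFIPS": ["01", "01", "02"],
        "FIPS": ["01001", "01003", "02013"],
        "area": [10.0, 30.0, 5.0],  # made-up land areas
    }
)
with_totals, totals = attach_state_totals(counties, "area")
print(with_totals)
```

Because the lookup goes through the key rather than through row positions, the order of rows in either table does not matter.
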
  14. Output of dis-aggregating statistics on nitrogen input
  15. Characterisation of the new algorithm
     - The ratio between each county and its state is then calculated.
     - A loop calculates dis-aggregated statistics for all counties for each year from 1987 to 2001.
     - The result is a Pandas dataframe table of dis-aggregated statistics.
  16. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - New approach to dis-aggregating tabular statistics into smaller geographical units (no intersection of geometric objects is required):
     - The ratio between each county and its state is calculated.
     - A loop calculates dis-aggregated nitrogen input statistics for all counties for each year from 1987 to 2001.
     - The state total multiplied by the ratio yields the dis-aggregated value for the county.
     - This logic has been used in areal interpolation as a technique for spatial disaggregation (Flowerdew and Green, 1992, 1994; Goodchild, Anselin and Deichmann, 1993).
     - The result is a Pandas dataframe table of dis-aggregated statistics.
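
The ratio-and-multiply rule on slides 15 and 16 can be illustrated with a short sketch. It assumes the ratio is each county's area divided by its state's total area, as stated on slide 13; the column names (`area`, `total_<year>`, `est_<year>`) and all numbers are hypothetical.

```python
import pandas as pd


def disaggregate(counties, state_totals, years, area_col="area", key="StateFIPS"):
    """Estimate county values as state total times the county's area share."""
    out = counties[[key, "FIPS", area_col]].copy()
    # Ratio between each county and its state (share of the state's area).
    state_area = out.groupby(key)[area_col].transform("sum")
    out["ratio"] = out[area_col] / state_area
    for year in years:
        # Look up the state total by State FIPS code, then apply the ratio.
        state_value = out[key].map(state_totals[f"total_{year}"])
        out[f"est_{year}"] = state_value * out["ratio"]
    return out


counties = pd.DataFrame(
    {
        "StateFIPS": ["01", "01", "02"],
        "FIPS": ["01001", "01003", "02013"],
        "area": [10.0, 30.0, 5.0],  # made-up land areas
    }
)
# Made-up state totals, indexed by State FIPS code.
state_totals = pd.DataFrame({"total_1987": [100.0, 50.0]}, index=["01", "02"])
estimates = disaggregate(counties, state_totals, years=[1987])
print(estimates)
```

A useful sanity check for this rule is that the county estimates of each state sum back to the state total.
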
  17. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - To date, areal interpolation and dasymetric mapping (Flowerdew and Green, 1992, 1994; Goodchild, Anselin and Deichmann, 1993) are the only known approaches to spatially dis-aggregating statistics that are relevant to the current work, particularly for processing tabular statistics in vector GIS datasets.
     - The current work uses the logic of areal interpolation as far as the datasets involved currently allow.
     - Unlike areal interpolation, the current implementation does not intersect area features (polygons).
  18. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - The estimates carry a degree of uncertainty, and improving them requires further research.
     - Nevertheless, the approach is a step forward: it enables estimation where no data are collected at county level and provides a quantitative indication.
     - It is particularly useful for processing tabular statistics or when patterns need to be visualised at large scales.
  19. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - The algorithmic solutions are characterised by their ability to track geo-referenced data entries throughout processing cycles, and by exact geo-referenced data retrieval and mapping, that is, data inflow to and outflow from Pandas dataframe tables.
     - The dis-aggregation procedure can process tabular statistics directly, without intersecting polygons, particularly when neatly nested geospatial boundary files of US states and counties are used.
  20. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - The new algorithm can carry out automatic online sourcing of datasets and integrated processing with Open Source Python libraries.
     - It can be extended to link geodata from various sources and to create indexed tabular datasets with geographical identifiers.
     - It can carry out automatic aggregation and dis-aggregation of agricultural census datasets for all states and counties in the USA.
  21. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
     - Federal Information Processing Standards (FIPS) codes were used as geographical identifiers for geo-referenced data entries. They play a critical role in retrieving data from databases and mapping data into the correct positions, and they make vLookup-style solutions efficient for retrieving data and mapping it to exact positions in tables.
     - Geographical identifiers serve as the key and are critically important in linking data between tables and creating geo-indexed tabular datasets. They tie attribute data entries to geospatial objects.
     - This vLookup solution can be adapted for other geodata projects.
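
As an illustration of FIPS codes acting as the linking key between tables, the sketch below joins two tables with `DataFrame.merge`. This is one possible Pandas idiom for a vLookup-style link, not necessarily the one used in the notebook; the function name and the area values are made up.

```python
import pandas as pd


def link_tables(left, right, key="FIPS"):
    """Link two tables on a geographical identifier.

    validate="one_to_one" raises an error if the key is duplicated in
    either table; indicator=True adds a '_merge' column that shows which
    rows found a match.
    """
    return left.merge(
        right, on=key, how="left", validate="one_to_one", indicator=True
    )


names = pd.DataFrame(
    {"FIPS": ["01001", "01003"], "County": ["Autauga", "Baldwin"]}
)
# Same counties in a different row order, with made-up area values.
areas = pd.DataFrame({"FIPS": ["01003", "01001"], "area": [30.0, 10.0]})

linked = link_tables(names, areas)
unmatched = linked[linked["_merge"] != "both"]
print(linked)
```

The rows are matched by code, so the differing row order of the two inputs has no effect on the result.
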
  22. Output
     - The program outputs an aggregated statistical table by state and a dis-aggregated table by county.
  23. Dis-aggregating wheat statistics into all counties
     - The StateFIPS, state abbreviation, county name, county FIPS and ratio columns are taken from the dis-aggregated nitrogen input table to form a new Pandas DataFrame table.
     - Wheat data extracted from QuickStats are used. These are state-level statistics, which are dis-aggregated into all counties.
  24. Dis-aggregating wheat statistics into all counties: output of dis-aggregating wheat statistics
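
The wheat step on slides 23 and 24 reuses the ratio column from the nitrogen table. A minimal sketch, assuming hypothetical column names (`State`, `County`, `FIPS`, `ratio`, `wheat_state`) and made-up values rather than real QuickStats figures:

```python
import pandas as pd


def disaggregate_wheat(nitrogen_table, wheat_by_state):
    """Spread state-level wheat values over counties using existing ratios."""
    # Identifier and ratio columns taken from the dis-aggregated nitrogen table.
    keys = nitrogen_table[["StateFIPS", "State", "County", "FIPS", "ratio"]].copy()
    # Many counties link to one state row.
    merged = keys.merge(
        wheat_by_state, on="StateFIPS", how="left", validate="many_to_one"
    )
    merged["wheat_county"] = merged["wheat_state"] * merged["ratio"]
    return merged


nitrogen_table = pd.DataFrame(
    {
        "StateFIPS": ["01", "01", "02"],
        "State": ["AL", "AL", "AK"],
        "County": ["Autauga", "Baldwin", "Aleutians East"],
        "FIPS": ["01001", "01003", "02013"],
        "ratio": [0.25, 0.75, 1.0],  # made-up ratios
        "est_1987": [25.0, 75.0, 50.0],
    }
)
# Made-up state-level wheat values.
wheat_by_state = pd.DataFrame({"StateFIPS": ["01", "02"], "wheat_state": [200.0, 8.0]})

wheat_counties = disaggregate_wheat(nitrogen_table, wheat_by_state)
print(wheat_counties)
```
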
  25. Issues encountered
     - Data type issues were encountered and resolved.
     - A clear understanding of data types, and of the methods for converting and handling them, is required.
     - After a groupby operation on a Pandas dataframe, the original row index is no longer meaningful. Using FIPS codes keeps data indexing and record linkage intact throughout processing cycles. Mapping geo-referenced data into exact positions in columns is very important.
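
The slides do not say which data type issues occurred. A common one with FIPS codes is that a spreadsheet reader turns them into numbers and drops leading zeros, so the sketch below shows one way to normalise them to fixed-width text, and how `as_index=False` keeps the State FIPS code as an ordinary column after a groupby. Both helper functions are illustrative assumptions.

```python
import pandas as pd


def normalise_fips(values, width):
    """Convert FIPS codes to zero-padded strings of a fixed width."""
    numbers = pd.to_numeric(pd.Series(values))
    return numbers.astype(int).astype(str).str.zfill(width)


def sum_by_state(table, value_cols):
    """Group by StateFIPS but keep the code as a column for later linkage."""
    return table.groupby("StateFIPS", as_index=False)[value_cols].sum()


county_codes = normalise_fips([1001, 1003], width=5)

table = pd.DataFrame(
    {"StateFIPS": ["02", "01", "01"], "value": [5.0, 1.0, 2.0]}  # made-up values
)
by_state = sum_by_state(table, ["value"])
print(county_codes.tolist())
print(by_state)
```

After the groupby the row index is a fresh 0, 1, ... range that says nothing about the original rows, which is why the FIPS column, not the index, is used for linkage.
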
  26. Update geo-databases and create digital models in Geographical Information Systems to visualise spatial variation
     - A standard Geographical Information System associates a digital map with a tabular database of records.
     - Areal interpolation and dasymetric mapping techniques have gained popularity for combining tabular records with area boundary files to create map models.
     - The approach presented in this talk is based on neatly nested area boundary files in the administrative hierarchy of areas of the USA.
     - No intersection of digital boundaries is needed.
  27. Analytical example: change over time
  28. Analytical example: rate of change
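
Slides 27 and 28 show only charts, so the exact calculations are not given. As a hedged sketch, assuming "change" means the year-on-year difference and "rate of change" means that difference divided by the previous year's value, both could be derived from a table with one column per year (values below are made up):

```python
import pandas as pd


def change_and_rate(table):
    """Return year-on-year change and relative rate of change.

    The table is assumed to have one row per area and one column per
    year, ordered chronologically.
    """
    change = table.diff(axis=1)            # value minus previous year's value
    rate = change / table.shift(axis=1)    # change relative to previous year
    return change, rate


# Made-up state totals for three years, indexed by State FIPS code.
totals = pd.DataFrame({1987: [100.0], 1988: [110.0], 1989: [99.0]}, index=["01"])
change, rate = change_and_rate(totals)
print(change)
print(rate)
```

The first year has no predecessor, so its change and rate are missing values.
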
  29. References
     - https://www.nass.usda.gov/Quick_Stats/
     - https://www.python.org/downloads/
     - https://www.scipy.org/scipylib/download.html
     - http://matplotlib.org/downloads.html
     - https://pypi.python.org/pypi/pylab

     Contact
     - 4 Haythrop Close, Downhead Park, Milton Keynes, Buckinghamshire, United Kingdom, MK15 9DD
     - Mobile: +44-7909844462
     - EMail: drshishaozhong@gmail.com
