Maria Esteva


Lucas Giraldo

Maria Esteva, Texas Advanced Computing Center, University of Texas at Austin
PANEL

Cyberinfrastructure for research data management
Maria Esteva, Texas Advanced Computing Center, University of Texas at Austin
2Eie
May 2013, Cali, Colombia

Data & research
• Data-intensive science
– Theory, experiments, and simulations in the context of massive data
• Sustainable data
– Documented, stable, authentic
• Data for disseminating knowledge, citing, and reusing

Collection building
• Research projects are complex and constantly evolving
• Technology and expertise change continuously
• Research funding is unstable
• Collections are most vulnerable during the research process
• A collection's architecture and functionality may involve several technologies

Perspectives
• Data curation centers on the problem that the research addresses
• Information science approach
• Infrastructure approach
– Consider infrastructure and services from the planning of the research project and throughout the project lifecycle

Data infrastructure @ TACC
• Multidisciplinary team
• Corral
• 6 petabytes of online disk
• Lustre parallel file system
• Data transfer at 1-10 GB/s
• Web access
• Flexible configuration
• Open source libraries
• 24/7 security and system maintenance

Databases
• Relational databases: MySQL, PostgreSQL, SQL Server
– Pecan Street Project
• ARK and Specify
• GIS (geographic information systems)
– FASTI
– Institute of Classical Archaeology

Flexibility
• Center for Space Research (CSR)
– Storage of data from satellites, radar, and sensors
– Haiti earthquake, 2010
– The CSR data repository was turned into a web repository for sharing data with rescue teams.

Multiple possibilities
• Data management during the research project
• Temporary data storage for computational processes
• Access to research collections
• Dark archive
• The researcher is the curator
• The TACC team offers and implements technical solutions for the curation process and collaborates on data organization, standardization, and access

Implementing collections
• TACC manages access to the systems and installs the servers, databases, libraries, and dependencies.
• Users have access to their code
• Collection triage
– ICA, 5 petabytes of disorganized data
• Users from different domains
• Users with different levels of technical expertise

Workflows
– Different data flows
– Seamless transition between storage and analysis systems.

Access
• Open web access for the public
• Closed access during the embargo period
• WebDAV
• Password protected
• Access restricted to the research team
• From TACC's visualization systems

Preservation
• iRODS: distributed file broker
• File replication to Ranch, a tape archive, plus geographical replication
• Security and maintenance
• Data authenticity checks
• Automatic capture of technical metadata
• Perspective on what …

Administrative model
• 5 TB of free storage for University of Texas researchers
• Annual fee structure based on staff time
– Consulting, data curation, databases, and web applications
• Serves as a dark archive to cover hardware costs
• We participate in research grants

Data@TACC
• Weijia Xu
• Christopher Jordan
• David Walling
• Tomislav Urban
• Siva Kulaskerian



12. Metadata and integration


Editor's notes

Today I am going to present
We are going through a shift in the research paradigm that emphasizes discovering patterns and phenomena in large amounts of data, and the possibility of integrating data from different domains to relate, cross-reference, and contextualize those phenomena. Data-intensive science combines theory, experimentation, and simulation in relation to massive data. We know that posing a hypothesis opens up the dimensions of the research. Since these dimensions cannot be explored in advance, one must be prepared to carry out many experiments and simulations, which may in turn provide answers or raise new questions. The big difference is that the data available for studying a phenomenon are not limited to what is assumed to be relevant to the original hypothesis, but …
To practice science. Data-intensive science draws on inductive reasoning, or thought which turns a simple observation or thought into a general theory. In other words, it takes one piece of information and tries to generalise it from there. A researcher's thought path goes from the specific to the general and a hypothesis is formed. One of the basic concepts of informatics is the 'future value of primary data'. It is envisioned that the primary data (and, if possible, the actual samples) collected by one investigator will be archived and made available to other investigators, who may re-analyse the data from a different point of view, employ part of the data set not relevant to the first investigator, pool data with other studies or conduct new measurements on the original samples (Koslow, 2000). This is entirely compatible with hypothesis-driven research. Indeed, a good hypothesis is not one that is likely to be correct, but one that opens up a new arena of investigation. Since this arena cannot be fully perceived in advance, one must be prepared to carry out new analyses not included in the original hypothesis.
Yet, most current experimental design simply ignores this fact: the investigator collects only those data that are deemed relevant to the original hypothesis, and when new information causes the original hypothesis to change, the investigator must plan a new experiment from scratch.
Could be long
• Decentralized
• Research data & processes are complex evolving systems
• Expertise and technologies supporting data change at a fast pace
• Lifecycle of research may include technologies to aid each stage
• Transition from research data to data collection
• Data risk increases as data is being processed
• Data can be easily disorganized during the research process
• Interoperability issues emerge
• Security
• Provenance and metadata
• Analysis and interpretation
• Sharing and access
• Overlapping processes
Data-intensive research processes are complex, dynamic, and constantly evolving. On top of changing technologies and expertise comes the economic instability of … It is in this process …
The evolving collections cyberinfrastructure proposed here could be considered a cloud service (a cloud whose location we know). It gives users the flexibility to develop and archive collections in a continuum and at any point in their research lifecycle, while they, other users, or both perform data analysis and visualization tasks across seamless storage and computing environments. In addition, we can provide the same interfaces and/or functionality as DuraCloud on top of our resources.
• Incorporate the theories and practices of the scientific domain into data management activities
• Map the collections' resources, networks, services, and functions to …
• Encoded texts
• Works of art
• C Borgman call for action challenges due to lack of funding
• IPR
• Evolving collection's architecture
• Map resources, network, services, metadata to the collection's goals, functions
• Multi-disciplinary
• Paleo – CSR -
Collections activities are centralized in a data applications facility that provides different technical environments as collection building blocks that users may select according to the functionalities that they need. It consists of 1.2 Petabytes of online disk and a number of servers providing high-performance storage for all types of digital data. A high-performance parallel file system is directly accessible from all of TACC's High Performance Computing (HPC) resources, enabling mathematical computation and visual analyses of petabyte-scale datasets. (IMPORTANCE). It has web-based access and other network protocols for storage and retrieval of data. For database collections, the facility provides flexibility in terms of the RDBMSs that users may choose from. We also maintain open source domain-specific databases such as ARK [7] and Specify [8], which support Archaeology and Natural History collections respectively. Collections requiring long-term preservation are managed within iRODS [9]. Off-site replication is done in a Mass Storage tape system with a capacity of 10 PB, and geographical replication is accomplished through an agreement with Indiana University's Research Computing Division [10]. Close supervision, parts replacement contracts, and a frequent schedule of upgrades are in place for maintaining the infrastructure. This model is based on TACC's experience managing systems to assure 24/7 services and data security.
• High Performance Computing (HPC)
• Petabytes - distributed storage
• Networks – parallel access and processes
• Integration to compute nodes
• Remote access
• Mechanisms for sharing and for restricting data access
• RDBMS, GIS, Web
• Flexibility to configure systems
• Open source libraries
• Automation and generalization
• 24/7 administration
Corral consists of 6 Petabytes of online disk and a number of servers providing high-performance storage for all types of digital data.
It supports MySQL and Postgres databases, a high-performance parallel file system, web-based access, and other network protocols for storage and retrieval of data to and from sophisticated instruments, HPC simulations, and visualization laboratories. A high-performance parallel file system is accessible directly from TACC's world-class computational resources, Stampede and Lonestar, as well as Stallion, the world's largest tile display, enabling both mathematical and visual analysis of petabyte-scale datasets.
An example of what such infrastructure allows is the work of the UT Center for Space Research (www.csr.utexas.edu), which stores very large sensor, satellite, aerial, and radar datasets that it curates for dissemination purposes. Within three days of the 2010 earthquake in Haiti, in collaboration with TACC, the repository/file system used for managing CSR data in Corral was turned into a web repository for sharing data. This allowed CSR to access, organize, retrieve, and post the data required by the emergency operations in the region [11]. This type of quick repurposing allowed a multi-terabyte collection managed through one application to instantly become accessible through a password-protected web application on another server.
Improve here
Collections that
When possible, to facilitate collections organization and avoid manual metadata entry, descriptive metadata is automatically extracted from the collection's record-keeping system at ingest. This process requires an informative and regular file naming and/or directory labeling scheme. It also involves previous work mapping the descriptive data points to standard metadata schemas such as Dublin Core (DC) or Visual Resources Association (VRA) Core. The latter results from consulting with our team and training users on the required standards and practices. Implemented as an iRODS rule, a Jython script parses directory labels and file naming conventions as files are ingested to iRODS. The extracted descriptive metadata is packaged along with the technical metadata as a METS document and registered with the iRODS metadata catalog [13]. This process is being implemented in ICA and in McFarland's collection, which for a long time has used a systematic naming convention including image title/terms, geographical location and type of camera codes, and a version control number. To access his files, he may search by any of these elements.
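The parsing step described above can be illustrated with a minimal Python sketch. It does not reproduce the Jython script or the iRODS rule mentioned in the notes: the file naming pattern, the field mapping, and the function name `extract_descriptive_metadata` are all assumptions made for illustration, and the actual conventions used by the collections are not given in the source.

```python
import re
from pathlib import PurePosixPath

# Hypothetical naming convention (an assumption, not the collection's real one):
#   <title-terms>_<LOCATION>_<CAMERA>_v<version>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<title>[a-z0-9-]+)_(?P<location>[A-Z0-9]+)_(?P<camera>[A-Z0-9]+)_v(?P<version>\d+)$"
)


def extract_descriptive_metadata(path):
    """Map a file path to descriptive fields, or return None if the
    file name does not follow the assumed convention."""
    p = PurePosixPath(path)
    match = NAME_PATTERN.match(p.stem)
    if match is None:
        return None
    return {
        # Illustrative mapping to Dublin Core element names.
        "dc:title": match["title"].replace("-", " "),
        "dc:coverage": match["location"],
        # Fields kept outside Dublin Core in this sketch.
        "camera_code": match["camera"],
        "version": int(match["version"]),
        "directory_label": p.parent.name,
    }


record = extract_descriptive_metadata("site-a/amphora-handle_MET_D70_v03.tif")
```

Returning `None` for names that do not match keeps irregular files visible to a curator instead of silently registering partial metadata, which fits the note that the process depends on regular file naming.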
Preservation services for the collections stored in iRODS include rules to generate file checksums, automatic off-site and geographical replication, and massive extraction of metadata using FITS [12], which is encoded as Preservation Metadata (PREMIS) within a METS document and finally registered in the iRODS catalogue. Beyond basic bit-level preservation and the services mentioned above, we address the domain scientists' conception of data preservation. In the case of archaeology collections, preservation is associated with maintaining the relationships between the objects found in the same context in the excavation. To assure that the archived data could render a representation of the site, the Institute of Classical Archaeology chose to have two collection instances within the storage facility. A presentation instance resides on the ARK database and web site, which provides interactivity features and the possibility for users to study data objects in relation to their geospatial location and to the researchers' interpretations. The archival instance, stored in a hierarchical directory structure in iRODS and replicated off site, preserves contextual relationships between the raw images of the objects found on the excavation and the corresponding image versions and documentation generated by researchers on the site and through the research lifecycle. These relationships are gathered and preserved through a complex metadata system reflected in METS documents so that when one object is retrieved, all of the related objects are retrieved as well. In this way, if the ARK database ceases to be supported, the archival instance will serve to reconstruct the site.
Benson workflow. IPRES.
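The checksum generation and later verification mentioned in the preservation notes above can be sketched in plain Python. This is a standalone stand-in that uses only the standard library and does not call iRODS. The choice of SHA-256 and the function names `build_manifest` and `verify_manifest` are assumptions; the source does not say which algorithm or tooling the iRODS rules use.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record a checksum for every file under root, keyed by relative path."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose checksum changed."""
    root = Path(root)
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

A manifest built at ingest can be stored alongside the collection and re-checked on a schedule or after replication; an empty list from `verify_manifest` means every recorded file is present and unchanged.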
As a service and research organization, TACC offers up to five terabytes of free storage space and basic collection services to researchers on campus, and there is a fee structure for collections requiring more storage space and/or consulting services. We are currently considering following Princeton's institutional repository business model and … To support complex collection services, the group faces the same limitations and possibilities as the researchers that create the collections. Thus, the group participates in grant proposals with research partners, and provides services in exchange for funding staff hours. In addition, campus organizations use the data facilities as a dark archive for annual fees, which are used to purchase hardware. TACC's cyberinfrastructure intends to surpass the uncertainties of future research funding by embracing the notion that if a collection is built soundly, it will be used and supported, or it can be easily transferred to other archives or managed within other systems. After 4 years we are expanding to 5 petabytes of data. We have so much
$500/TB/year,
"Data Consulting" - This would just involve some simple assessment efforts and some recommendations as to proper data management practices, perhaps a willingness to help customize a template data management plan, setup of data replication scenarios and provision of tools to do metadata extraction and/or format conversion (but not ongoing involvement of staff in such metadata or format-related efforts). The basic idea here is that we help with the initial setup, and perhaps with periodic re-assessments, but don't actively manage data on behalf of users. I would propose this go at a rate of $750 per TB/year.
"Data Management or Curation" - This would be something more along the lines of what we've done for ICA, where we really get into the guts of the collection, help with reorganizing the data, develop tools or workflows for metadata extraction and/or format conversion, derivative generation, etc. It would again be more likely to be heavily frontloaded in terms of the work, but could involve things like monthly checksum verification and other reporting efforts, active collaboration with researchers to help categorize new data types and improve metadata or search mechanisms, and so on. I would propose this go for a rate of $1000/TB/year. In addition to these two levels of data service, I think we also have to have another category for database and/or web applications, which can be relatively small in terms of absolute size but very complex in terms of data structures, and require a lot more time on our part to help develop schemas, improve existing web frameworks, import external sources of data, and so on. Here I would propose that we start at a rate of $5000 for the basic service, with costs going up from there based either on the number of tasks we need to perform, the number of "moving parts" in the overall application framework, or other factors. I think for most projects we'd have to do an assessment and figure out what it will really cost, so the goal of the base price is really just to establish a level where "you must be at least this tall to ride", i.e. if you aren't willing to contemplate an investment of at least $5K, you should be looking to do it yourself or find someone else.
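The per-terabyte arithmetic behind the proposed tiers can be illustrated with a small Python sketch. The rates of $750 and $1000 per TB per year and the $5000 starting price come from the notes above; everything else is an assumption. In particular, the notes do not say whether the free five terabytes offsets a paid tier, so the allowance is an explicit, hypothetical `free_tb` parameter that defaults to zero, and the $500/TB/year figure is left out because its context is unclear.

```python
# Proposed rates quoted in the notes above (USD per terabyte per year).
RATES_PER_TB_YEAR = {
    "consulting": 750,   # "Data Consulting"
    "curation": 1000,    # "Data Management or Curation"
}

# Proposed starting price for database and/or web application projects (USD).
APPLICATION_BASE_FEE = 5000


def annual_storage_fee(terabytes, tier, free_tb=0.0):
    """Estimate a yearly fee for a collection of the given size.

    free_tb is a hypothetical allowance; whether any free storage
    offsets a paid tier is an assumption the caller makes explicitly.
    """
    if tier not in RATES_PER_TB_YEAR:
        raise ValueError(f"unknown tier: {tier!r}")
    if terabytes < 0 or free_tb < 0:
        raise ValueError("sizes must be non-negative")
    billable_tb = max(0.0, terabytes - free_tb)
    return billable_tb * RATES_PER_TB_YEAR[tier]
```

Under these assumptions, a 20 TB collection at the curation rate comes to $20,000 per year, and the same collection at the consulting rate with a 5 TB allowance comes to $11,250. Database and web application projects are not priced per terabyte in the notes, so they are represented only by the base constant.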
