White Paper: What Is DataStage?
Sometimes DataStage is sold to and installed in an organization and its IT support staff are expected to
maintain it and to solve DataStage users' problems. In some cases IT support is outsourced and may not
become aware of DataStage until it has been installed. Then two questions immediately arise: "what is
DataStage?" and "how do we support DataStage?"
This white paper addresses the first of those questions, from the point of view of the IT support
provider. Manuals, web-based resources and instructor-led training are available to help to answer the
second.
DataStage is actually two separate things.
• In production (and, of course, in development and test environments) DataStage is just another
application on the server, an application which connects to data sources and targets and
processes ("transforms") the data as they move through the application. Therefore DataStage is
classed as an "ETL tool", the initials standing for extract, transform and load respectively.
DataStage "jobs", as they are known, can execute on a single server or on multiple machines in
a cluster or grid environment. Like all applications, DataStage jobs consume resources: CPU,
memory, disk space, I/O bandwidth and network bandwidth.
• DataStage also has a set of Windows-based graphical tools that allow ETL processes to be
designed, the metadata associated with them managed, and the ETL processes monitored.
These client tools connect to the DataStage server because all of the design information and
metadata are stored on the server. On the DataStage server, work is organized into one or more
"projects". There are also two DataStage engines, the "server engine" and the "parallel engine".
• The server engine is located in a directory called DSEngine whose location is recorded in a
hidden file called /.dshome (that is, a hidden file called .dshome in the root directory) and/or as
the value of the environment variable DSHOME. (On Windows-based DataStage servers the
folder name is Engine, not DSEngine, and its location is recorded in the Windows registry rather
than in /.dshome.) The parallel engine is located in a sibling directory called PXEngine whose
location is recorded in the environment variable APT_ORCHHOME and/or in the environment
variable PXHOME; the sketch below shows how to confirm these locations.
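A minimal shell sketch (UNIX) for confirming these locations; the paths shown in the comments are typical defaults, not guaranteed:

    # Server engine location, as recorded in the hidden file and/or DSHOME
    cat /.dshome          # e.g. /opt/IBM/InformationServer/Server/DSEngine
    echo "$DSHOME"        # should agree with /.dshome when both are set

    # Parallel engine location
    echo "$APT_ORCHHOME"  # e.g. /opt/IBM/InformationServer/Server/PXEngine
    echo "$PXHOME"        # usually the same value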
DataStage Engines
The server engine is the original DataStage engine and, as its name suggests, is restricted to running jobs
on the server. The parallel engine results from the acquisition of Orchestrate, a parallel execution
technology developed by Torrent Systems, in 2003. This technology enables work (and data) to be
distributed over multiple logical "processing nodes" whether these are in a single machine or multiple
machines in a cluster or grid configuration. It also allows the degree of parallelism to be changed
without change to the design of the job.
Design-Time Architecture
Let us take a look at the design-time infrastructure. At its simplest, there is a DataStage server and a
local area network on which one or more DataStage client machines may be connected. When clients
are remote from the server, a wide area network may be used or some form of tunneling protocol (such
as Citrix MetaFrame) may be used instead. Note that the Repository Manager and Version Control clients
do not exist in version 8.0 and later; different mechanisms, discussed later, replace them.
Connection to data sources and targets can use many different techniques, primarily direct access (for
example directly reading/writing text files), industry-standard protocols such as ODBC, and vendor-specific
APIs for connecting to databases and packages such as Siebel, SAP, Oracle Financials, etc.
DataStage Client/Server Connectivity
Connection from a DataStage client to a DataStage server is managed through a mechanism based upon
the UNIX remote procedure call mechanism. DataStage uses a proprietary protocol called DataStage RPC
which consists of an RPC daemon (dsrpcd) listening on TCP port number 31538 for connection requests
from DataStage clients. Before dsrpcd gets involved, the connection request goes through an
authentication process. Prior to version 8.0, this was the standard operating system authentication
based on a supplied user ID and password (an option existed on Windows-based DataStage servers to
authenticate using Windows LAN Manager, supplying the same credentials as being used on the
DataStage client machine – this option was removed for version 8.0). From version 8.0 onward,
authentication is handled by the Information Server through its login and security service. Each
connection request from a DataStage client asks for connection to the dscs (DataStage Common Server)
service.
The dsrpcd daemon checks its dsrpc services file to determine whether there is an
entry for that service and, if there is, establishes whether the requesting machine's IP address is
authorized to request the service. If all is well, then the executable associated with the dscs service
(dsapi_server) is invoked.
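A quick way to verify this chain on a UNIX server is to confirm that dsrpcd is running and listening; a minimal sketch (the port shown is the default 31538, which an administrator may have changed):

    # Is the DataStage RPC daemon running?
    # (the [d] trick stops grep from matching its own process)
    ps -ef | grep '[d]srpcd'

    # Is it listening on the default port?
    netstat -an | grep 31538    # expect a LISTEN entry on TCP 31538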
DataStage Processes and Shared Memory
Each dsapi_server process acts as the "agent" on the DataStage server for its own particular client
connection, among other things managing traffic and the inactivity timeout. If the client requests access
to the Repository, then the dsapi_server process will fork a child process called dsapi_slave to perform
that work. Typically, therefore, one would expect to see one dsapi_server and one dsapi_slave process
for each connected DataStage client. Processes may be viewed with the ps -ef command (UNIX) or with
Windows Task Manager. Every DataStage process attaches to a shared memory segment that contains
lock tables and various other inter-process communication structures. Further, each DataStage process is
allocated its own private shared memory segment. At the discretion of the DataStage administrator
there may also be shared memory segments for routines written in the DataStage BASIC language and
for character maps used for National Language Support (NLS). Shared memory allocation may be viewed
using the ipcs command (UNIX) or the shrdump command (Windows). The shrdump command ships
with DataStage; it is not a native Windows command.
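As a minimal sketch on UNIX (process names as described above; exact output varies by platform and version):

    # Expect one dsapi_server/dsapi_slave pair per connected client
    ps -ef | grep -E '[d]sapi_server|[d]sapi_slave'

    # Shared memory segments, including those attached by DataStage processes
    ipcs -m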
DataStage Projects
Talking about "the Repository" is a little misleading. As noted earlier, DataStage is
organized into a number of work areas called "projects". Each project has its own individual local
Repository in which its own designs and technical and process metadata are stored. Each project has its
own directory on the server, and its local Repository is a separate instance of the database associated
with the DataStage server engine. The name of the project and the schema name of the database
instance are the same. System tables in the DataStage engine record the existence and location of each
project. The location of any particular project may be determined through the Administrator client, by
selecting that project from the list of available projects; the pathname of the project directory is
displayed in the status bar.
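The dsjob command line interface offers a quick way to see which projects exist on a server; a hedged sketch, in which the project name dstage1 is purely illustrative:

    # List the projects known to this DataStage server
    dsjob -lprojects

    # List the jobs within one project (project name is hypothetical)
    dsjob -ljobs dstage1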
When there are no connected DataStage clients, dsrpcd may be the only DataStage process running on
the DataStage server. In practice, however, there are one or two more. The DataStage deadlock daemon
(dsdlockd) wakes periodically to check for deadlocks in the DataStage database and, secondarily, to
clean up locks held by defunct processes – usually improperly disconnected DataStage clients. The job
monitor is a Java application that captures "performance" data (row counts and times) from running
DataStage jobs; it runs as a process called JobMonApp.
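On an otherwise idle server, then, a process check might look like this (a sketch; process names as described above):

    # With no clients connected, expect only the daemons and the job monitor
    ps -ef | grep -E '[d]srpcd|[d]sdlockd|JobMonApp'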
Changes in Version 8.0
In version 8.0 all of the above still exists, but is layered on top of a set of services known collectively as IBM
Information Server. Among other things, Information Server stores metadata centrally so that the metadata are
accessible to many products, not just DataStage, and exposes a number of services including the
metadata delivery service, parallel execution services, connectivity services, administration services and
the already-mentioned login/security service. These services are managed through an instance of a
WebSphere Application Server.
The common repository, by default called XMETA, may be installed in DB2 version 9 or later, Oracle
version 10g or later, or Microsoft SQL Server 2005 or later. The DataStage server, WebSphere
Application server and Information Server may be installed on separate machines, or all on the same
machine, or some combination of machines. The main difference noticed by DataStage users when they
move to version 8 is that the initial connection request names the Information Server (not the DataStage
server) and must specify the port number (default 9080); on successful authentication they are then
presented with a list of DataStage server and project combinations being managed by that particular
Information Server.
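The dsjob command line reflects the same change; a sketch in which host names and credentials are illustrative only:

    # Version 8.x: authenticate against the Information Server domain (default
    # port 9080), then name the DataStage server being managed by it
    dsjob -domain ishost:9080 -user dsadm -password secret \
          -server dshost -lprojects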
Run-Time Architecture
Now let us turn our attention to run time, when DataStage jobs are executing. The concept is a
straightforward one: DataStage jobs can run even when no clients are connected; there is a
command line interface (dsjob) for requesting job execution and for performing various other tasks.
However, server jobs and parallel jobs execute totally differently. A job sequence is a special case of a
server job, and executes in the same way as a server job.
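As a minimal sketch of that command line interface (project and job names are illustrative):

    # Run a job and wait for it to finish; with -jobstatus the exit code
    # reflects the job's finishing status
    dsjob -run -jobstatus dstage1 LoadWarehouse
    echo "job finished with status $?"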
Server Job Execution
Server jobs execute on the DataStage server (only) and execute in a shell called uvsh (or dssh, a
synonym). The main process that runs the job executes a DataStage BASIC routine called DSD.RUN – the
name of this program shows in a ps -ef listing (UNIX). This program interrogates the local Repository to
determine the runtime configuration of the job, what stages are to be run and their interdependencies.
When a server job includes a Transformer stage, a child process is forked from uvsh; it also runs uvsh,
but this time executes a DataStage BASIC routine called DSD.StageRun. Server jobs only ever have uvsh
processes at run time, except where the job design specifies opening a new shell (for example sh in UNIX
or DOS in Windows) to perform some specific task; these will be child processes of uvsh.
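A hedged sketch for observing this at run time on UNIX:

    # While a server job runs, look for uvsh processes executing DSD.RUN
    # (and DSD.StageRun where the design includes Transformer stages)
    ps -ef | grep -E 'DSD\.RUN|DSD\.StageRun'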
Parallel Job Execution
Parallel job execution is rather more complex. When the job is initiated the primary process (called the
“conductor”) reads the job design, which is a generated Orchestrate shell (osh) script. The conductor
also reads the parallel execution configuration file specified by the current setting of the
APT_CONFIG_FILE environment variable. Based on these two inputs, the conductor process composes
the “score”, another osh script that specifies what will actually be executed. (Note that parallelism is not
determined until run time – the same job might run on 12 nodes in one run and 16 nodes in another run.
This automatic scalability is one of the features of the parallel execution technology underpinning
Information Server, and therefore DataStage.) Once the execution nodes are known (from the
configuration file) the conductor causes a coordinating process called a “section leader” to be started on
each, either by forking a child process if the node is on the same machine as the conductor or by remote shell
execution if the node is on a different machine from the conductor (things are a little more dynamic in a
grid configuration, but essentially this is what happens). Each section leader process is passed the score
and executes it on its own node, and is visible as a process running osh. Section leaders’ stdout and
stderr are redirected to the conductor, which is solely responsible for logging entries from the job.
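For illustration, here is what a minimal two-node configuration file of the kind APT_CONFIG_FILE points to might look like; the host names and resource paths are assumptions, not defaults:

    # Write a hypothetical two-node parallel configuration and point
    # APT_CONFIG_FILE at it
    cat > /tmp/two_node.apt <<'EOF'
    {
      node "node1"
      {
        fastname "etlhost1"
        pools ""
        resource disk "/data/ds/node1" {pools ""}
        resource scratchdisk "/scratch/node1" {pools ""}
      }
      node "node2"
      {
        fastname "etlhost2"
        pools ""
        resource disk "/data/ds/node2" {pools ""}
        resource scratchdisk "/scratch/node2" {pools ""}
      }
    }
    EOF
    export APT_CONFIG_FILE=/tmp/two_node.apt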
The score contains a number of Orchestrate operators. Each of these runs in a separate process, called a
“player” (the metaphor clearly is one of an orchestra). Player processes’ stdout and stderr are redirected
to their parent section leader. Player processes also run the osh executable. Communication between
the conductor, section leaders and player processes in a parallel job is effected via TCP. The port
numbers are configurable using environment variables. By default, communication between conductor
and section leader processes uses port number 10000 (APT_PM_STARTUP_PORT) and communication
between player processes and player processes on other nodes uses port number 11000
(APT_PLAYER_CONNECTION_PORT).
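A sketch of overriding those defaults (the port numbers chosen here are arbitrary examples):

    # Move the conductor/section-leader and player ports away from their
    # defaults of 10000 and 11000, e.g. to avoid a clash with other software
    export APT_PM_STARTUP_PORT=10500
    export APT_PLAYER_CONNECTION_PORT=11500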
To find all the processes involved in executing a parallel job (they all run osh) you need to know the
configuration file that was used. This can be found from the job's log, which is viewable using the
Director client or the dsjob command line interface.
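A sketch using dsjob (project and job names are illustrative):

    # Summarize the log for the job; the configuration file used is reported
    # in the job's startup messages
    dsjob -logsum dstage1 LoadWarehouse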
What is scalability about?
A few years ago Ascential Software had one of the most popular ETL tools on the market in DataStage,
which integrated data with “Server Jobs” – programs that ran on the DataStage server. The problem was that
these programs were single-process, single-threaded and hard to scale across a cluster or grid of
servers. Ascential bought Torrent Systems and created a new type of DataStage with the same look and feel
but running on Torrent's parallel engine. This benchmark shows the result of the effort to create
DataStage Parallel Jobs.
This whitepaper provides results of a benchmark test performed on InfoSphere DataStage to illustrate
its capabilities to address customers’ high performance and scalability needs. The scenario focuses on a
typical – that is, robust – set of requirements customers face when loading a data warehouse from
source systems. In particular, the benchmark is designed to use the profiled situation to provide insight
about how InfoSphere DataStage addresses three fundamental questions customers ask when designing
their information integration architecture to achieve the performance and scalability characteristics
previously described:
• How will InfoSphere DataStage scale with realistic workloads?
• What are its typical performance characteristics for a customer’s integration tasks?
• What is the predictability of the performance in support of capacity planning?
ETL Scalability
The benchmark aims to demonstrate ETL scalability, which IBM has described in another whitepaper
called The seven essential elements to achieve highest performance & scalability in information
integration:
1. A dataflow architecture supporting data pipelining
2. Dynamic data partitioning and in-flight repartitioning of data
3. Adapting to scalable hardware without requiring modifications of the data flow design
4. Support for, and exploitation of, the available parallelism of leading databases
5. High performance and scalability for bulk and real-time data processing
6. Extensive tooling to support resource estimation, performance analysis and optimization
7. Extensible framework to incorporate in-house and third-party software
DataStage is an ETL (Extract, Transform and Load) software tool for building data integration jobs, which
IBM pitches as up to 80% faster than manual coding. It is part of the IBM InfoSphere Information Server
suite - later this week I will have a post about what is new in the other Information Server 9.1 products.
DataStage 9.1 comes with an unstructured text stage that makes Excel easier to read even if it is not a
perfectly formatted table. DataStage can go in and find column headings whether they are on row 1 or
row 10. It can parse the columns and turn them into relational data and even add on extra text strings
such as a single comment field. DataStage 9.1 can overcome a lot of the problems you get from Excel as
a source when people muck around with the spreadsheet format.
Faster Database Writes
DataStage 9.1 is faster when writing to DB2 and Oracle thanks to improved buffering of data. Oracle
Bulk Load and DB2 Bulk Load are faster. Oracle Upsert is faster. The Connector stages are better.
Faster Netezza Table Copies
IBM has introduced a new interface called Data Click - it promises to load tables from a source Oracle or DB2
table into Netezza with just two clicks. Two clicks! The new console automatically generates the
DataStage job required to move the selected tables into Netezza. Great for a fast PoC or agile BI
development. It can repeat the data movement later or automatically generate InfoSphere CDC table
subscriptions to keep those tables up to date in real time.
Write to Multiple Files
DataStage 9.1 makes it easier to take a data stream and write it to a bunch of output files, such as
splitting data up by customer.
New Expressions
A couple of old Server Edition functions finally make their way onto the Parallel canvas - such as the
popular EReplace function for replacing values in text strings.
Workload Management
The DataStage Operations Console, the web application introduced in DataStage 8.7, has a new tab for
workload management. Included in the bundles of goodies is the ability to cap the number of jobs
running at once, the amount of CPU being used and the amount of RAM being used by parallel jobs.
This should help avoid the problems you get when you pass 100% RAM utilisation, by staggering the start-up
of hundreds of simultaneous job requests. There is also the ability to create queues and apply workload
priorities to those queues.
What DataStage 9.1 Does Not Have
It is with some disappointment that I report that the following oft-requested features will not be in
DataStage 9.1: DataStage will not have a plugin that chops fruit and vegetables. It will not be possible to
chop up an entire soup in three seconds using parallel chop-o-matics. I'm afraid we will need to
continue to chop using the old-fashioned methods. DataStage 9.1 National Language Support lets you
process data in different character sets such as Mandarin and Japanese. Unfortunately still no support
for Klingon and it is still not possible to process customer sentiments on Twitter that are written in
Klingon, which as we all know has a much richer dialect for insults.
