2. Topics
• Characters and Encoding
• ASCII standards
• Glyphs and Fonts
• Extended ASCII and issues
• Character Sets and Code Pages
• Little and Big Endian
• Unicode
• Unicode Transformation Formats [UTF-8, etc]
• Unicode SAP system
• SAP Unicode Overhead
• SAP File Interface
• SAP Authorization for File Access
• Files on the Application Server
• File Interface Statements (Open, Transfer, Read, Get, Set, etc)
• Error Handling
• Attributes and Other commands
• Files on the Presentation Server
3. Characters and Encoding
• Characters are represented by character
codes
• This coding is called character encoding
• Character codes are generated and stored
when a user inputs and saves a document
• When a document is read by the system, it
interprets the character codes that were
stored and displays them as characters in
the format that we understand
4. ASCII standards
• The American Standards Association (ASA, today the American National
Standards Institute, ANSI) published the American Standard Code for
Information Interchange (ASCII) standard
• For example in ASCII, character ‘A’ is represented by
decimal code 65 or hexadecimal code 41 and is stored
as binary code 01000001
• Single-Byte character sets provide 256 character codes.
This is an adequate number to encode most of the
characters needed for Western Europe
• Note: Extended Binary Coded Decimal Interchange
Code [EBCDIC], an 8-bit character encoding from the same
era as ASCII that is used on IBM mainframe operating
systems, is not discussed here
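The 'A' example above can be checked with a minimal Python sketch (Python is used here for illustration only; it is not part of the original slides):

```python
# The character 'A' in the three notations used on the slide.
code = ord("A")                      # character -> numeric character code

decimal_form = str(code)             # decimal code
hex_form = format(code, "X")         # hexadecimal code
binary_form = format(code, "08b")    # the 8 bits actually stored

# Encoding the character with the ASCII codec yields exactly one byte.
ascii_bytes = "A".encode("ascii")

print(decimal_form, hex_form, binary_form, ascii_bytes)
```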
5. Glyphs and Fonts
• A Glyph (pronounced "glif") is a visual representation of a
character – example: A A A A A A A A (the same letter drawn in different typefaces)
• Users don't view or print characters; they view
or print Glyphs
• The character "Capital Letter A" represented by
the Glyph in Times New Roman Bold is different
from the Glyph in Arial Bold (each Glyph looks
visually different)
• A single character can be represented by
several different Glyphs in a Font
• A Font is a collection of glyphs
6. Extended ASCII and issues
• ASCII represents every printable character using a number between 32 and 126
(codes 0 to 31 and 127 are control characters). Space is 32, the letter "A" is 65, etc.
This could conveniently be stored in 7 bits because the whole set has only 128 (2^7) codes
• Historically most computers used 8-bit bytes, therefore there was still 1 bit
to spare
• Extended ASCII that made use of this spare bit was not standardized all
over the world
• The IBM-PC had something that came to be known as the OEM [Original
Equipment Manufacturer] character set which provided some accented
characters for European languages and text-mode PCs could display and
print vertical and horizontal line drawing characters
• An assortment of 256-character Windows ANSI character sets cover all the
8-bit languages targeted by Windows
• Programmers from Israel, Russia (USSR) and Asia used the 8th bit to represent
their own language characters, so there was no universal standard left for
the characters from 128 and up – confusion prevailed with the 8th bit
• Something was required to map the various character codes created and used,
not only for Extended ASCII but also for any new mapping developed
7. Character Sets and Code Pages
• A Character Set is any specific collection of characters
• Code Page is a list of selected character codes for a Character Set
in a particular order
• Code Page is another name for encoding of each character in a
Character Set (Fonts could have their own Character Set)
• Code Page is a character set encoding that can include numbers,
punctuation marks, and other glyphs. Code Pages are not the same
for each language
• Many Code Pages are single-byte Character Sets - that is, they
contain no more than 256 characters.
• A Code Page is a representation of Character Set used by a
computer (OS) to support a specific language or set of languages.
    Character Set      Windows Code Page
    US-ASCII           20127
    German (IA5)       20106
    Korean (ISO)       50225
• Some languages, such as Japanese have multi-byte characters,
while others, like English and German, only need one byte to
represent each character
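A small Python sketch of the single-byte versus multi-byte point. The sample characters and the codec names (`ascii`, `cp1252`, `shift_jis`) are illustrative choices, not taken from the slides:

```python
# How many bytes one character needs in a code page typical for its language.
samples = {
    "English": ("A", "ascii"),
    "German": ("ß", "cp1252"),        # Windows Western European code page
    "Japanese": ("日", "shift_jis"),   # a multi-byte Japanese encoding
}

byte_counts = {
    language: len(char.encode(codec))
    for language, (char, codec) in samples.items()
}

for language, count in byte_counts.items():
    print(language, count, "byte(s)")
```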
8. Character Sets and Code Pages (cont…)
Within each Code Page, the characters from the Character Set
are mapped to the character codes (encoded)
9. Character Sets and Code Pages (cont…)
So potentially we could have hundreds of
Character Sets and these have to be mapped
to numerous Code Pages which is a
maintenance nightmare
10. Character Sets and Code Pages (cont…)
• Not all Code Pages exist on all
computers, or they can be different on
different computers, or they can be
changed for a single computer.
• This will result in confusion and emails like
these:
– Dear □ □ ??? Thank □□□ █ █ █ █ ???
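Why such garbled emails appear can be sketched in Python: the same stored byte is shown as a different character depending on the code page used to read it. The byte value 0xE0 and the four code pages are illustrative choices:

```python
# One stored byte, interpreted through four different code pages.
raw = bytes([0xE0])

code_pages = ["cp437", "cp1252", "cp1255", "cp866"]
decoded = {cp: raw.decode(cp) for cp in code_pages}

for cp, char in decoded.items():
    # Print the character together with its Unicode code point.
    print(f"{cp}: {char} (U+{ord(char):04X})")
```

The byte reads as a Greek alpha under the IBM-PC OEM code page (cp437), as an accented "a" under Windows Western European (cp1252), as a Hebrew alef under Windows Hebrew (cp1255) and as a Cyrillic letter under DOS Cyrillic (cp866).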
11. Little and Big Endian
• Some examples of ABAP built-in Data Types are:
    b    1 Byte     1-byte integer (internal)
    i    4 Bytes    4-byte integer
    f    8 Bytes    Floating point number
• Question: For the multi-byte data (say, i or f shown above), where does the biggest
(most significant or highest-order) byte appear in the memory?
• Little Endian: as used in Intel processors, stores the low-order byte of a number in
memory at the lowest address
• Big Endian: as used by Motorola processors, IBM's 370 mainframes and many
RISC-based computers, stores the high-order byte of a number in memory at the
lowest address
(Example 1: 4-byte Long Int [Byte3 Byte2 Byte1 Byte0]. In the memory the arrangement is as shown)
                  Base Address+0   +1      +2      +3
  Little Endian   Byte0            Byte1   Byte2   Byte3
  Big Endian      Byte3            Byte2   Byte1   Byte0
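The layout of Example 1 can be reproduced with Python's `struct` module. The value 0x03020100 is an illustrative choice so that each byte carries its own index (Byte3 = 03 ... Byte0 = 00):

```python
import struct
import sys

value = 0x03020100  # Byte3=0x03, Byte2=0x02, Byte1=0x01, Byte0=0x00

little = struct.pack("<I", value)  # "<" forces little-endian, "I" = 4-byte unsigned int
big = struct.pack(">I", value)     # ">" forces big-endian

print("little endian:", little.hex(" "))  # lowest address first
print("big endian:   ", big.hex(" "))
print("this machine: ", sys.byteorder)    # byte order of the host CPU
```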
12. Little and Big Endian (cont..)
• Example 2: storing the two bytes of the hexadecimal number 4F52 looks as
follows under the two methods (this number equals 2*16^0
+ 5*16^1 + 15*16^2 + 4*16^3 = 20306 in decimal)
• Little Endian – representation in memory:
Base Address+0 52
Base Address+1 4F
• Big Endian – representation in memory:
Base Address+0 4F
Base Address+1 52
• Big Endian is easy to understand, because it is consistent with the order we use
naturally - when we read and write text and numbers.
• Irrespective of the BYTE order, which depends on the Big Endian or Little Endian
representation, the BITs within each byte are conventionally written most significant bit first:
01001001 = 0 + 2^6 + 0 + 0 + 2^3 + 0 + 0 + 2^0 = 64 + 8 + 1 = 73
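The arithmetic of Example 2 and of the bit pattern can be verified with a short Python sketch:

```python
value = 0x4F52

# Positional expansion from the slide: 2*16^0 + 5*16^1 + 15*16^2 + 4*16^3
expanded = 2 * 16**0 + 5 * 16**1 + 15 * 16**2 + 4 * 16**3

little = value.to_bytes(2, "little")  # low-order byte at Base Address+0
big = value.to_bytes(2, "big")        # high-order byte at Base Address+0

# Bit pattern written most significant bit first.
bits = int("01001001", 2)

print(expanded, little.hex(" "), big.hex(" "), bits)
```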
13. Need for Standards - Unicode
• We have seen the confusion that arises when every party (hardware
manufacturers, software companies, regions, countries, groups) creates
Code Pages as per its own requirements and for its own
Character Sets
• Without any set standards, and with the advent of the Internet, sharing of
information could be almost impossible
• What if we have one standard Code Page, having a set of all possible
character codes that any computer or software could decipher?
• Well, Unicode is the answer. It is not a Code Page, but more like a “meta-
Code Page”
• Unicode is a brave effort to create a single character set that includes every
reasonable writing system on the planet
• Think of Unicode as a set of all possible character codes.
• Unicode is a single very large (and still growing) character set and
encoding, which encompasses essentially all the standard computer
character sets that predated it.
14. Unicode
• Unicode provides a unique number (a code
point) for every character
NO matter what the platform
NO matter what the program
NO matter what the language
• Unicode is an international standard that assigns a
unique number to characters from virtually every
language and script
• Unicode defines well over 140,000 characters in recent versions
(the figure was above 90,000 in the 4.x versions), with room for more
than 1 million characters. With Unicode, all characters used in
business-relevant languages can be represented
15. Unicode (cont…)
• Almost any computer Code Page can be mapped to Unicode and
back. However, in computer systems Unicode is largely replacing
Code Page based approaches
• Instead of having dozens of Code Pages each using and re-using
the same numbered slots for different characters, each character
gets its own unique numbered slot in Unicode
• Think of Unicode as a label attached to the character via which the
character can be accessed by applications and operating systems
• Example: The English letter A is U+0041, Hebrew letter alef is
U+05D0, Greek letter alpha (α) would be U+03B1, etc – basically we
have covered them all
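The three code points quoted above can be looked up with Python's standard `unicodedata` module. The helper name `describe` is made up for this sketch:

```python
import unicodedata

def describe(char: str) -> str:
    """Return the code point in U+XXXX notation plus the official character name."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

# English A, Hebrew alef, Greek alpha (written as escapes to avoid font issues).
for char in ("A", "\u05D0", "\u03B1"):
    print(describe(char))
```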
16. Does Unicode encode Language, Font, Size, Positioning, Glyphs?
• The Unicode Standard does not attempt to encode features such as
language, font, size, positioning, glyphs, and so forth. For example, it does
not preserve language as a part of character encoding: French i
grec, German ypsilon, and English wye are all represented by the same
character code, U+0059 “Y”. The Unicode Standard deals only with
character codes.
• Glyphs represent the shapes that characters can have when they are
rendered or displayed. In contrast to characters, glyphs appear on the
screen or paper as particular representations of one or more characters. A
repertoire of glyphs makes up a font. Glyph shape and methods of
identifying and selecting glyphs are the responsibility of individual font
vendors and of appropriate standards and are not part of the Unicode
Standard.
• AAAAAAAA All represented by Latin capital letter A (U+0041)
• aaaaaaaaaa All represented by Latin small letter a (U+0061)
17. Unicode Challenges
• But, have we addressed all the issues?
• Of course not. Unicode has mapped all the
characters uniquely, but how do we store this in
memory or represent it in an email message?
The English letter A would be U+0041, but in
memory should it be stored as [00 41] or as [41
00] – Endianness?
• What about all those zeros. Are we doubling the
disk space, resulting in more cooling costs and
more greenhouse issues? [TX okay, but CA?]
• Welcome to the UTF-8 Standards!
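Both concerns, the byte order and "all those zeros", show up in a short Python sketch. The sample word "Hello" is an illustrative choice:

```python
# The letter A (U+0041) as a 16-bit code unit in both byte orders.
a_big = "A".encode("utf-16-be")     # bytes 00 41
a_little = "A".encode("utf-16-le")  # bytes 41 00
a_utf8 = "A".encode("utf-8")        # single byte 41, no byte order involved

# Plain English text doubles in size with a 2-byte encoding.
text = "Hello"
size_utf16 = len(text.encode("utf-16-le"))
size_utf8 = len(text.encode("utf-8"))

print(a_big.hex(" "), "|", a_little.hex(" "), "|", a_utf8.hex(" "))
print(size_utf16, "bytes vs", size_utf8, "bytes")
```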
18. Unicode UTF-8 standard
• UTF-8 (8-bit UCS/UTF) is a variable-length character encoding for
Unicode. In UTF-8, every code point from 0-127 is stored in a single
byte. Only code points 128 and above are stored using 2, 3 or 4 bytes
(the original design allowed up to 6 bytes; the current standard stops at 4)
• If a legacy system can understand ASCII, it can understand the
English portion of UTF-8, therefore old programs can still
decipher English text from UTF-8. They cannot decipher any other
language in UTF-8 that needs two or more bytes (they were not
designed to read other languages, so are basically not affected)
• With the UTF-8 standard, memory and disk space are conserved
for text that is mostly ASCII
• UTF-8 is interpreted as a sequence of bytes, so there is no endian
problem as there is for encoding forms that use 16-bit or 32-bit code
units.
• UCS stands for Universal Character Set
• UTF stands for Unicode Transformation Format
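The variable length of UTF-8 can be observed with a minimal Python sketch. The sample characters are illustrative choices:

```python
# Number of bytes UTF-8 needs for characters from different ranges.
samples = {
    "A": "Latin letter (ASCII range)",
    "\u03B1": "Greek alpha",
    "\u20AC": "Euro sign",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

utf8_lengths = {char: len(char.encode("utf-8")) for char in samples}

for char, note in samples.items():
    print(f"U+{ord(char):04X} {note}: {utf8_lengths[char]} byte(s)")

# ASCII compatibility: pure English text is byte-for-byte identical in both.
same_bytes = "Hello".encode("utf-8") == "Hello".encode("ascii")
```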
19. Unicode other standards
• UCS-2 (always 2 bytes per character) or UTF-16 (16-bit code units;
characters beyond U+FFFF take two units)
– Big Endian UCS-2 or Little Endian UCS-2
• UTF-7 (similar to UTF-8 but guarantees that the high bit
will always be zero, for compatibility with old programs
that require it)
• UTF-32 (32 bits)
• UTF-8 is the most popular encoding today
• A byte order mark (BOM) consists of the character code
U+FEFF at the beginning of a data stream, where it can
be used as a signature defining the byte order
• Where a BOM is used with UTF-8, it is only used as an
encoding signature to distinguish UTF-8 from other
encodings — it has nothing to do with byte order
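A Python sketch of BOM detection using the constants in the standard `codecs` module. The function name `detect_bom` and its return strings are made up for illustration, and UTF-32 is deliberately left out (a UTF-32 little-endian BOM would be reported as UTF-16 here):

```python
import codecs

def detect_bom(data: bytes) -> str:
    """Classify the first bytes of a stream by its byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF: signature only
        return "utf-8 signature"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16 big endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16 little endian"
    return "no BOM"

# U+FEFF written in front of the letter A, in different encodings.
print(detect_bom("\ufeffA".encode("utf-16-be")))
print(detect_bom("\ufeffA".encode("utf-16-le")))
print(detect_bom("A".encode("utf-8-sig")))   # this codec prepends the signature
print(detect_bom(b"A"))
```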
20. Conveying the Encoding used
• How do we preserve this information about what
encoding a string uses?
– For an email message, you are expected to have a string in the
header of the form
Content-Type: text/plain; charset="UTF-8"
– For an HTML page, a meta tag in the head section declares the encoding:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
• For the most consistent results, any new applications
developed should use a Unicode encoding, such as UTF-8 or
UTF-16, instead of a specific code page
• For Unicode UTF-8, the Windows Code Page is 65001
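How a receiving program picks up the declared encoding of an email can be sketched with Python's standard `email` package. The message body "Hello" is a made-up example; the header line is the one shown above:

```python
from email import message_from_string

# A minimal message: one header, a blank line, then the body.
raw_message = 'Content-Type: text/plain; charset="UTF-8"\n\nHello'

msg = message_from_string(raw_message)

content_type = msg.get_content_type()   # media type without parameters
charset = msg.get_content_charset()     # charset parameter, lower-cased

print(content_type, charset)
```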
21. Unicode SAP system
• Enables you to harness Internet technologies better
• Allows better integration with non-SAP products and
seamless integration with existing SAP systems
• Offers a superior platform for collaborative, cross-system
business applications
• Lets you work with all languages and language combinations in
the world
• Allows you to install a central system for worldwide
business processes, e.g. to gather and store aggregate
customer data
• Enables you to optimize your system landscape and
reduce your costs
22. Unicode SAP system (cont…)
• Unicode Program: A Unicode program is an
ABAP program in which the Unicode checks are
in effect and in which certain statements
have different semantics from those that apply
in non-Unicode programs.
• Unicode System: Single-code-page system in
which characters are coded in Unicode
character representation.
• The Unicode check was tightened as of Release
6.10
23. SAP Unicode Overhead
• Main Memory:
– Average increase +40...50% -> Reason: Application
servers are based on UTF-16
• Network load:
– ~0% -> Almost no change due to efficient
compression.
• Database size: Average increase
– UTF-8: +10% (smaller systems (< 200GB) might grow
more)
– UTF-16: +20...60%
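The per-character effect behind these figures can be sketched in Python. This only compares the storage of two made-up sample strings; the percentages on the slide are system-wide averages that also include non-character data, so the numbers below are not expected to match them:

```python
def sizes(text: str) -> dict:
    """Bytes needed for the same text in a single-byte code page, UTF-8 and UTF-16."""
    return {enc: len(text.encode(enc)) for enc in ("latin-1", "utf-8", "utf-16-le")}

ascii_sizes = sizes("Walldorf")     # pure ASCII text
umlaut_sizes = sizes("M\u00fcller")  # "Müller": one non-ASCII character

print(ascii_sizes)
print(umlaut_sizes)
```

ASCII text keeps its size in UTF-8 and doubles in UTF-16, while the umlaut costs one extra byte in UTF-8.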
24. SAP File Interface and Unicode
• It is possible to exchange files between Unicode and non-Unicode
systems, between different Unicode systems
and between different non-Unicode systems with
different code pages
• Instead of implicit programming with standard settings over
which we have no control, programmers are required to
program explicitly and to specify all important parameters
(with stringent requirements to
maintain good programming practice)
• Examples of explicit programming are: a file must be
opened before each read/write, the access type and type of
data storage need to be specified, a file opened with
read-only access remains that way throughout the
program, a file opened as text can contain text only, etc.
25. SAP Authorization for File Access
• Operating system check
The system automatically checks the entries in the database
table SPTH for access to individual files. Neither of the
following checks (S_PATH / S_DATASET) can override this.
• Program independent authorization check
The check against the authorization object S_PATH is independent of ABAP
program used and is not restricted to an individual file but all files in the
PATH/folder.
• User and program authorization check
The check against the authorization object S_DATASET is based on
the program name, filename and activity (Delete, Read, Write, Read with
Filter and Write with Filter).
26. File Interface Statements
• OPEN DATASET
• TRANSFER
• READ DATASET
• GET DATASET
• SET DATASET
• TRUNCATE DATASET
• CLOSE DATASET
• DELETE DATASET
27. Opening a File
• OPEN DATASET dset FOR access IN mode
[position] [os_addition] [error_handling].
– dset is the file name including path (/usr/tmp/test.dat)
– access can be
• INPUT (opens the file for reading only; the file pointer is set at the start of the file. If the
file does not exist, sy-subrc is set to 8. In a Unicode program, it is not
possible to write to a file opened for reading, whereas a non-Unicode program
allows both)
• OUTPUT (opens a new file for writing; if the file already exists, its content is
deleted. Read access is permitted)
• APPENDING (opens the file for appending, with the file pointer set at the end
of the file; if the file does not exist, it is created. A read attempt fails and sy-subrc
is set to 4)
• UPDATE (opens the file for updating, with the file pointer set at the start of
the file; if the file does not exist, sy-subrc is set to 8)
28. Opening a File (continued)
– Syntax of mode
• BINARY MODE (opens the file as a binary file; the
binary content of a data object is transferred unchanged)
• TEXT MODE ENCODING code (opens the file as a text file.
When writing, the content of a data object is converted
to the representation specified after code [UTF-8 or
non-Unicode] and transferred to the file; trailing blanks are
truncated for character-type fields, but not for strings. When
reading, the content of the file is read up to the next end-of-line
marking, converted from the format specified after code
into the current character format [UTF-8, or the non-Unicode
code page specified in database table TCP0C] and transferred to a
data object)
• LEGACY BINARY MODE [endian] [codepage]
• LEGACY TEXT MODE [endian] [codepage]
29. Opening a File (continued)
• AT POSITION pos
When a file is opened with this option, pos defines where the file
pointer is positioned in bytes (0 means start of file, -1 means end
of file and any value i means i bytes from the start of the file)
• TYPE attr
For non-MS O/S, attr can contain O/S-specific parameters for the
file to be opened (OS/400 'blksize=8000', etc). On MS O/S, if attr
contains "NT" the end-of-line is marked by "CRLF", and if it
contains "UNIX" the end-of-line is marked by "LF".
• FILTER opcom
With the FILTER option, opcom can be an OS command that is started
when OPEN DATASET is executed, example: FILTER 'compress' or
FILTER 'uncompress'
OPEN DATASET filexyz FOR OUTPUT IN BINARY MODE FILTER 'compress'.
OPEN DATASET filexyz FOR INPUT IN BINARY MODE FILTER 'uncompress'.
30. Error Handling
• [MESSAGE msg]
When an error occurs, the O/S error message is assigned to the data
object msg so that the ABAP program can display it to the user
• [IGNORING CONVERSION ERRORS]
This addition can suppress treatable exceptions defined by class
CX_SY_CONVERSION_CODEPAGE, each unconvertible character is
replaced by literal ‘#’
• [REPLACEMENT CHARACTER rc]
Same as above, except that each unconvertible character is
replaced by the single character specified by rc – not applicable
for binary files
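The idea of replacing unconvertible characters can be illustrated outside ABAP with a Python analogy. This is not SAP code and says nothing about how the ABAP runtime implements it; the error handler name `hash_replace` and the sample text are made up for the sketch:

```python
import codecs

def hash_replace(error):
    """Replace each unconvertible character with '#', similar to the default above."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # error.start..error.end spans the run of characters that cannot be encoded.
    return ("#" * (error.end - error.start), error.end)

codecs.register_error("hash_replace", hash_replace)

text = "A\u03b1\u05d0"  # A, Greek alpha, Hebrew alef

# Without any replacement rule the conversion raises an exception.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With the handler, conversion succeeds and marks the lost characters.
replaced = text.encode("latin-1", errors="hash_replace")

print(strict_failed, replaced)
```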
31. TRANSFER and READ Commands
• TRANSFER dobj TO dset [LENGTH len]
[NO END OF LINE]
The content is written to the file starting at the current file
pointer. LENGTH determines how many characters/bytes are
written to the file. NO END OF LINE prevents the end-of-line
marking from being appended to the data transferred
• READ DATASET dset INTO dobj [MAXIMUM
LENGTH mlen] [[ACTUAL] LENGTH alen]
This reads the data from the file specified in dset into the
data object dobj, starting from the current file pointer. Using
the MAXIMUM LENGTH addition, the number of characters or
bytes to be read from the file can be limited. Using
ACTUAL LENGTH, the number of characters or bytes actually
read can be determined (mlen can be 100, but the actual number can
be 60 if the file is small, so alen is returned with 60)
32. GET and SET Commands
• GET DATASET dset [POSITION pos]
[ATTRIBUTES attr]
POSITION returns the current position of the file pointer in pos. ATTRIBUTES
enables us to read/get the values of fixed and changeable file
attributes
• SET DATASET dset [POSITION pos|{END OF
FILE}] [ATTRIBUTES attr]
Position sets the position of the file pointer to new position
indicated by pos. Attributes enables us to update the value
of changeable file attributes
33. ATTRIBUTES
• Fixed Attributes
– Indicator (sub-structure with the following fields and indicates ‘X’ if the following
are significant)
– Mode (Text (T), Binary (B), Legacy Binary (LB) and Legacy Text (LT))
– Access_type (Reading (I), writing (O), appending (A) and editing (U))
– Encoding (UTF-8 and NON-UNICODE)
– Filter (filter command, example ‘compress’)
• Changeable Attributes
– Indicator (sub-structure with the following fields and indicates ‘X’ if the following
are significant)
– Repl_char (replacement character rc)
– Conv_error (contains ‘I’ if the IGNORING CONVERSION ERRORS addition was used, ‘R’
otherwise)
– Code_page (code page that was specified, initial otherwise)
– Endian (B for Big Endian, L for Little Endian, initial otherwise)
Example:
DATA attr TYPE dset_attributes. " dset_attributes is defined by SAP in type group DSET
GET DATASET dset ATTRIBUTES attr.
IF attr-fixed-indicator-filter <> 'X'.
… ENDIF.
34. Other commands
• TRUNCATE DATASET dset AT
{CURRENT POSITION} | {POSITION pos}
File size is modified by setting the end of the file indicator at the
current or pos position. When shortened the file is truncated after
the new end of file, when extended (pos > current file size) the
file is filled with hexadecimal null from the old to the new end of
file.
• CLOSE DATASET dset
Closes file on the application server.
• DELETE DATASET dset
Deletes file on the application server.
35. Files on the Presentation Server
• The CL_GUI_FRONTEND_SERVICES class of the
class library contains the required methods for
processing files on the presentation server
(client/PC). There are no ABAP statements
available for processing files here.
– GUI_DOWNLOAD for writing files
– GUI_UPLOAD for reading files
– DIRECTORY_CREATE and DIRECTORY_DELETE for
creating and deleting a directory
– FILE_DELETE, FILE_COPY, FILE_EXIST, etc., for file
operations
• The methods above belong to the class; the function modules
GUI_DOWNLOAD and GUI_UPLOAD can also be
used.