Oracle’s ASCIISTR() and Unicode Characters

Oracle provides an interesting function, ASCIISTR(), to return ASCII strings from a VARCHAR2 or CLOB column, and in general it does an admirable job. There are non-printing characters however, that ‘put a spanner in the works’, returning HEX strings instead of characters. Since each HEX string is five bytes long, such strings can significantly increase the overall length of the text so that it won’t ‘fit’ in the allocated buffer space. Let’s look at one of those ‘annoying’ characters, look at its source and provide a possible solution.

Word-processing programs (which will remain unnamed), spreadsheets, PDFs and HTML documents are great tools, that is until text from any of those sources is inserted into a text or CLOB column in an Oracle database using the popular ‘copy and paste’ procedure. Many times, those programs use what is called a ‘non-breaking space’ to separate words in a line of text, and if the Oracle database in question is configured to use UTF-8 character encoding those non-breaking spaces get inserted into the column without error or translation. A simple select of the column data won’t reveal these insidious pests, but ASCIISTR() will. This is mentioned in the Oracle online documentation; the description from that document is reproduced below:


ASCIISTR takes as its argument a string, or an expression that resolves to a string, in any character set and returns an ASCII
version of the string in the database character set. Non-ASCII characters are converted to the form xxxx, where xxxx represents
a UTF-16 code unit.

So what is a ‘non-breaking space’? According to Wikipedia:


In word processing and digital typesetting, a non-breaking space (" "), also called no-break space, non-breakable space (NBSP),
hard space, or fixed space,[1] is a space character that prevents an automatic line break at its position. In some formats,
including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.

In the database world the ASCIISTR() Function can’t translate it into an ASCII character so it returns a text string of the HEX representation of it, which is ‘0A0’. Concatenate enough of those spaces into a single line and the line length expands, sometimes dramatically. When there are many of these in a line, paragraph, or block of text Oracle can throw the following errors from ASCIISTR():


ORA-24920: column size too large for client. 
ORA-22835: Buffer too small for CLOB to CHAR or BLOB to RAW conversion (actual: %s, maximum: %s)

Reports can fail because of those errors, and to the user, the cause is unknown because these ‘non-breaking spaces’ look like regular spaces.

Can this issue be fixed? Yes, but it may not be a really simple fix, depending upon the configured database block size. For block sizes of 8192 or less the string will need to be ‘walked’ in ‘chunks’ to update it completely. If the block size is 16K, successful single-statement updates are possible. Looking at an example that searches the first 2400 characters of a string for ‘0A0’ the PL/SQL block might look like this:


declare
        v_chunk         number:=2400;
        v_start_pos     number:=1;
        v_next_pos      number:=0;
        v_passes        number:=0;
begin
        --
	-- Set the number of passes through the loop
        --
        select round(length(col)/v_chunk,0) into v_passes from mytable;
        --
	-- Execute the loop, walking the string in pre-defined blocks
        --
        for i in 1..v_passes loop
                v_next_pos := v_start_pos + v_chunk;
                update mytable
                set col = replace(asciistr(substr(col, v_start_pos, v_chunk)), 'Z','')
                where id in
                        (select id from mytable where asciistr(substr(col, v_start_pos, v_chunk)) like '%Z%');
        --
        -- Set the new starting position
        --

                v_start_pos := v_next_pos;
        end loop;
        --
        -- Commit the changes
        --
        commit;
end;
/

Feel free to use this code, making any changes to table and column names or chunk size.

The above script works because it calculates the number of ‘steps’ it will take to update the entire string, based on the string length and the ‘chunk’ size. This determines the number of steps required to ‘walk’ the entire string. The loop itself calculates the starting position and positions the ‘pointer’ where it needs to be to pick up from the previous update. Eventually the entire string is ‘walked’ and any errant characters replaced.

Of course, not everyone will have need of this script; this problem arises when the copy and paste operations described above are in use at insurance companies, medical offices, law offices and others who regularly insert text into a VARCHAR2 or CLOB column that is found online or in word processing/spreadsheet/PDF files.

Although this may seem like a remote issue, to some it can be a problem on a daily, weekly, or monthly basis depending upon when reports where this character is liberally sprinkled are generated. It’s nice to know, though, that if this problem ever does appear there’s a possible solution handy.

See all articles by David Fitzjarrell

David Fitzjarrell
David Fitzjarrell has more than 20 years of administration experience with various releases of the Oracle DBMS. He has installed the Oracle software on many platforms, including UNIX, Windows and Linux, and monitored and tuned performance in those environments. He is knowledgeable in the traditional tools for performance tuning – the Oracle Wait Interface, Statspack, event 10046 and 10053 traces, tkprof, explain plan and autotrace – and has used these to great advantage at the U.S. Postal Service, American Airlines/SABRE, ConocoPhilips and SiriusXM Radio, among others, to increase throughput and improve the quality of the production system. He has also set up scripts to regularly monitor available space and set thresholds to notify DBAs of impending space shortages before they affect the production environment. These scripts generate data which can also used to trend database growth over time, aiding in capacity planning. He has used RMAN, Streams, RAC and Data Guard in Oracle installations to ensure full recoverability and failover capabilities as well as high availability, and has configured a 'cascading' set of DR databases using the primary DR databases as the source, managing the archivelog transfers manually and montoring, through scripts, the health of these secondary DR databases. He has also used ASM, ASMM and ASSM to improve performance and manage storage and shared memory.

Latest Articles