Oracle provides an interesting function, ASCIISTR(), to return ASCII strings from a VARCHAR2 or CLOB column, and in general it does an admirable job. There are non-printing characters however, that ‘put a spanner in the works’, returning HEX strings instead of characters. Since each HEX string is five bytes long, such strings can significantly increase the overall length of the text so that it won’t ‘fit’ in the allocated buffer space. Let’s look at one of those ‘annoying’ characters, look at its source and provide a possible solution.
Word-processing programs (which will remain unnamed), spreadsheets, PDFs and HTML documents are great tools, that is until text from any of those sources is inserted into a text or CLOB column in an Oracle database using the popular ‘copy and paste’ procedure. Many times, those programs use what is called a ‘non-breaking space’ to separate words in a line of text, and if the Oracle database in question is configured to use UTF-8 character encoding those non-breaking spaces get inserted into the column without error or translation. A simple select of the column data won’t reveal these insidious pests, but ASCIISTR() will. This is mentioned in the Oracle online documentation; the description from that document is reproduced below:
ASCIISTR takes as its argument a string, or an expression that resolves to a string, in any character set and returns an ASCII
version of the string in the database character set. Non-ASCII characters are converted to the form xxxx, where xxxx represents
a UTF-16 code unit.
So what is a ‘non-breaking space’? According to Wikipedia:
In word processing and digital typesetting, a non-breaking space (" "), also called no-break space, non-breakable space (NBSP),
hard space, or fixed space,[1] is a space character that prevents an automatic line break at its position. In some formats,
including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.
In the database world the ASCIISTR() Function can’t translate it into an ASCII character so it returns a text string of the HEX representation of it, which is ‘