March 17, 2015

[SAS Tip] Handling Chinese Strings with K Functions

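In the Facebook group 「SAS戰術應用精研社」, members occasionally report problems when they use character functions on Chinese text. Ordinary character functions such as compress, scan, substr, and index may return garbled output, fail to find the text being searched for, or match a character that was never expected. Here is an example raised by a member surnamed 鄭: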


/* Sample data: six Chinese strings, none of which contains 孕 or 嬰 */
data a1;
input string $20.;
cards;
洛瓦
北極光
野生動物
黃色小鴨
窈窕曲線包
手藝有限
;
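/* Search each string with INDEX, which compares byte by byte */
data a2;
set a1;
idx1=index(string,'孕');  /* position of 孕, 0 if not found */
idx2=index(string,'嬰');  /* position of 嬰, 0 if not found */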
run;
proc print data=a2;
run;


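None of these six strings visibly contains 「孕」 or 「嬰」, yet the index function reports that the first three strings contain 孕 and the last three contain 嬰. The cause is that Chinese text is stored in a double-byte encoding, while the traditional SAS character functions are designed for single-byte encodings. They compare one byte at a time, so the second byte of one character followed by the first byte of the next character can happen to equal the two bytes of a completely different character.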
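To see why the false matches occur, it can help to look at the raw bytes. The following is a minimal sketch, assuming the SAS session runs in a double-byte encoding such as Big5 (the post does not state the encoding, and under UTF-8 the byte patterns would differ). The variable name `s` and the choice of strings are illustrative only.

```sas
/* Assumption: double-byte session encoding (for example Big5). */
data _null_;
  length s $8;
  /* Two of the sample strings, then the character being searched for */
  do s = '洛瓦', '北極光', '孕';
    /* Print the text, then its first 8 bytes in hexadecimal */
    put s $8. +1 s $hex16.;
  end;
run;
```

If the two bytes printed for 孕 also appear in the hex dump of 洛瓦 or 北極光, but straddling the boundary between two characters, that is exactly the pattern a byte-based search such as index will report as a match.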

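Base SAS includes built-in functions for handling double-byte characters. The documentation refers to this family as 『K Functions』; for the full list and details, see the official SAS website.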
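As a rough illustration of how a K function changes the result, the sketch below runs index and kindex side by side. It assumes a double-byte session encoding such as Big5. The data set names `k1` and `k2`, the variable names, and the extra row 孕婦裝 (added as a string that really does contain 孕) are hypothetical and not part of the original question.

```sas
/* Assumption: double-byte session encoding (for example Big5). */
data k1;
  input string $20.;
  cards;
洛瓦
黃色小鴨
孕婦裝
;

data k2;
  set k1;
  idx_byte = index(string, '孕');   /* byte-based search */
  idx_char = kindex(string, '孕');  /* character-based search */
run;

proc print data=k2;
run;
```

Under that assumption, kindex should report 0 for the strings that do not contain 孕 and a nonzero position only for 孕婦裝, whereas index may still report a hit for 洛瓦.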

Finally, here are descriptions of some of the K functions, excerpted from the documentation:

Function    Description
KCOMPARE    Returns the result of a comparison of character expressions.
KCOMPRESS   Removes specified characters from a character expression.
KCOUNT      Returns the number of double-byte characters in an expression.
KINDEX      Searches a character expression for a string of characters.
KINDEXC     Searches a character expression for specified characters.
KLEFT       Left-aligns a character expression by removing unnecessary leading DBCS blanks and SO/SI.
KLENGTH     Returns the length of an argument.
KLOWCASE    Converts all letters in an argument to lowercase.
KPROPCASE   Converts Chinese, Japanese, Korean, Taiwanese (CJKT) characters.
KPROPCHAR   Converts special characters to normal characters.
KPROPDATA   Removes or converts unprintable characters.
KREVERSE    Reverses a character expression.
KRIGHT      Right-aligns a character expression by trimming trailing DBCS blanks and SO/SI.
KSCAN       Selects a specified word from a character expression.
KSTRCAT     Concatenates two or more character expressions.
KSUBSTR     Extracts a substring from an argument.
KSUBSTRB    Extracts a substring from an argument according to the byte position of the substring in the argument.
KTRANSLATE  Replaces specific characters in a character expression.
KTRIM       Removes trailing DBCS blanks and SO/SI from character expressions.
KTRUNCATE   Truncates a string to a specified length in byte unit without breaking multibyte characters.
KUPCASE     Converts all letters in an argument to uppercase.
KUPDATE     Inserts, deletes, and replaces character value contents.
KUPDATEB    Inserts, deletes, and replaces the contents of the character value according to the byte position of the character value in the argument.
KVERIFY     Returns the position of the first character that is unique to an expression.
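
A few of these functions can be compared with their byte-based counterparts in a short sketch. It assumes a double-byte session encoding such as Big5; the variable names are illustrative only.

```sas
/* Assumption: double-byte session encoding (for example Big5). */
data _null_;
  s = '黃色小鴨';
  n_bytes = length(s);        /* length counted in bytes */
  n_chars = klength(s);       /* length counted in characters */
  first2  = ksubstr(s, 1, 2); /* first two characters */
  put n_bytes= n_chars= first2=;
run;
```

In such a session, each of these Chinese characters occupies two bytes, so length should report twice the value of klength, and ksubstr extracts whole characters rather than individual bytes.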
