Unusual Chinese characters crash several open-source sas7bdat parsers

Recently, while parsing sas7bdat files with open-source software, I kept running into errors like this one (haven, ReadStat):

ReadStat: Error parsing page 0, bytes 262144-524287
Error processing test.sas7bdat: Unable to convert string to the requested encoding (invalid byte sequence)

or this one (parso):

 --- exec-maven-plugin:1.2.1:exec (default-cli) @ sas ---
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" org.h2.jdbc.JdbcSQLException: Exception calling user-defined function: "sasRead(conn1: url=jdbc:columnlist:connection user=SA, /Users/steven/sas/data/test.sas7bdat, null): String index out of range: 44"; SQL statement:
create table sas as SELECT * FROM SASREAD('/Users/steven/sas/data/test.sas7bdat', null) [90105-191]
 at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
 at org.h2.message.DbException.get(DbException.java:168)
 at org.h2.message.DbException.convertInvocation(DbException.java:312)
 at org.h2.engine.FunctionAlias$JavaMethod.getValue(FunctionAlias.java:493)
 at org.h2.expression.JavaFunction.getValueForColumnList(JavaFunction.java:126)
 at org.h2.table.FunctionTable.<init>(FunctionTable.java:66)
 at org.h2.command.Parser.readTableFilter(Parser.java:1237)
 at org.h2.command.Parser.parseSelectSimpleFromPart(Parser.java:1884)
 at org.h2.command.Parser.parseSelectSimple(Parser.java:2032)
 at org.h2.command.Parser.parseSelectSub(Parser.java:1878)
 at org.h2.command.Parser.parseSelectUnion(Parser.java:1699)
 at org.h2.command.Parser.parseSelect(Parser.java:1687)
 at org.h2.command.Parser.parseCreateTable(Parser.java:6007)
 at org.h2.command.Parser.parseCreate(Parser.java:4217)
 at org.h2.command.Parser.parsePrepared(Parser.java:360)
 at org.h2.command.Parser.parse(Parser.java:315)
 at org.h2.command.Parser.parse(Parser.java:287)
 at org.h2.command.Parser.prepareCommand(Parser.java:252)
 at org.h2.engine.Session.prepareLocal(Session.java:560)
 at org.h2.engine.Session.prepareCommand(Session.java:501)
 at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1188)
 at org.h2.jdbc.JdbcStatement.executeInternal(JdbcStatement.java:170)
 at org.h2.jdbc.JdbcStatement.execute(JdbcStatement.java:158)
 at com.alitrack.h2.Function.main(Function.java:40)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 44
 at java.lang.String.substring(String.java:1950)
 at com.epam.parso.impl.SasFileParser$ColumnNameSubheader.processSubheader(SasFileParser.java:1209)
 at com.epam.parso.impl.SasFileParser.processPageMetadata(SasFileParser.java:361)
 at com.epam.parso.impl.SasFileParser.processSasFilePageMeta(SasFileParser.java:331)
 at com.epam.parso.impl.SasFileParser.getMetadataFromSasFile(SasFileParser.java:231)
 at com.epam.parso.impl.SasFileParser.<init>(SasFileParser.java:208)
 at com.epam.parso.impl.SasFileParser.<init>(SasFileParser.java:44)
 at com.epam.parso.impl.SasFileParser$Builder.build(SasFileParser.java:996)
 at com.epam.parso.impl.SasFileReaderImpl.<init>(SasFileReaderImpl.java:63)
 at com.alitrack.h2.Function.sasRead(Function.java:51)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at org.h2.engine.FunctionAlias$JavaMethod.getValue(FunctionAlias.java:481)
 ... 20 more

 

The sas7bdat file that triggers these errors was generated by the following code:

libname lib "Z:\data";
options validvarname=any;

/* The generated SAS file fails to parse */
data lib.fails;
柯妮丝麗=234;
run;

/* No error; the variable name can be other Chinese characters
data lib.works;
测试=123;
run;
*/
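The post does not say why one name fails and the other works. One possible explanation, offered here only as an assumption, is that the failing name contains a character (麗) that exists in GBK but not in the older GB2312 character set, so a parser that treats the file as GB2312/EUC-CN would hit an invalid byte sequence. The sketch below uses only Python's standard codecs to check that hypothesis against the two variable names from the SAS code; the helper names `chars_outside` and `can_decode` are made up for illustration.

```python
def chars_outside(text, encoding):
    """Return the characters of `text` that cannot be encoded in `encoding`."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


def can_decode(raw, encoding):
    """Return True if the bytes `raw` decode cleanly with `encoding`."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for name in ("柯妮丝麗", "测试"):
        raw = name.encode("gbk")  # assumption: the file stores names as GBK bytes
        print(name,
              "| outside GB2312:", chars_outside(name, "gb2312"),
              "| decodes as GB2312:", can_decode(raw, "gb2312"),
              "| decodes as GBK:", can_decode(raw, "gbk"))
```

If the hypothesis holds, it would also explain why passing `encoding="GBK"` explicitly helps in the readers tried later in this post, although that link is not confirmed here.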

I then tried sas7bdat.py, which read the file successfully:

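from sas7bdat import SAS7BDAT

# Open the file, decoding strings as GBK
f = SAS7BDAT('c:/soft/fails.sas7bdat', encoding='GBK')
df = f.to_data_frame()  # load the data into a pandas DataFrame
df
# Also export the data to CSV
f.convert_file('c:/soft/fails.csv')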


P.S.

I just tested the latest haven on a Mac, and it reads the file successfully:

>library(haven)
>read_sas("/data/fails.sas7bdat",encoding = "GBK")
# A tibble: 1 x 1
  柯妮丝麗
     <dbl>
1      234

 

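sas7bdat (the R package) can parse the file, but it does not output the Chinese variable name; instead it substitutes a string of special characters (for example, X.bf..c2..c4..dd...fb..90):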

>library(sas7bdat)
>read.sas7bdat("/sas/data/fails.sas7bdat",debug = F)
X.bf..c2..c4..dd...fb..90.
1                        234