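Odd Chinese characters crash several open-source sas7bdat parsers
Recently, when parsing sas7bdat files with open-source software, I have often hit errors like this one (haven, ReadStat):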
    ReadStat: Error parsing page 0, bytes 262144-524287
    Error processing test.sas7bdat: Unable to convert string to the requested encoding (invalid byte sequence)
or this one (parso):
    --- exec-maven-plugin:1.2.1:exec (default-cli) @ sas ---
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    Exception in thread "main" org.h2.jdbc.JdbcSQLException: Exception calling user-defined function: "sasRead(conn1: url=jdbc:columnlist:connection user=SA, /Users/steven/sas/data/test.sas7bdat, null): String index out of range: 44"; SQL statement:
    create table sas as SELECT * FROM SASREAD('/Users/steven/sas/data/test.sas7bdat', null) [90105-191]
        at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
        at org.h2.message.DbException.get(DbException.java:168)
        at org.h2.message.DbException.convertInvocation(DbException.java:312)
        at org.h2.engine.FunctionAlias$JavaMethod.getValue(FunctionAlias.java:493)
        at org.h2.expression.JavaFunction.getValueForColumnList(JavaFunction.java:126)
        at org.h2.table.FunctionTable.<init>(FunctionTable.java:66)
        at org.h2.command.Parser.readTableFilter(Parser.java:1237)
        at org.h2.command.Parser.parseSelectSimpleFromPart(Parser.java:1884)
        at org.h2.command.Parser.parseSelectSimple(Parser.java:2032)
        at org.h2.command.Parser.parseSelectSub(Parser.java:1878)
        at org.h2.command.Parser.parseSelectUnion(Parser.java:1699)
        at org.h2.command.Parser.parseSelect(Parser.java:1687)
        at org.h2.command.Parser.parseCreateTable(Parser.java:6007)
        at org.h2.command.Parser.parseCreate(Parser.java:4217)
        at org.h2.command.Parser.parsePrepared(Parser.java:360)
        at org.h2.command.Parser.parse(Parser.java:315)
        at org.h2.command.Parser.parse(Parser.java:287)
        at org.h2.command.Parser.prepareCommand(Parser.java:252)
        at org.h2.engine.Session.prepareLocal(Session.java:560)
        at org.h2.engine.Session.prepareCommand(Session.java:501)
        at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1188)
        at org.h2.jdbc.JdbcStatement.executeInternal(JdbcStatement.java:170)
        at org.h2.jdbc.JdbcStatement.execute(JdbcStatement.java:158)
        at com.alitrack.h2.Function.main(Function.java:40)
    Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 44
        at java.lang.String.substring(String.java:1950)
        at com.epam.parso.impl.SasFileParser$ColumnNameSubheader.processSubheader(SasFileParser.java:1209)
        at com.epam.parso.impl.SasFileParser.processPageMetadata(SasFileParser.java:361)
        at com.epam.parso.impl.SasFileParser.processSasFilePageMeta(SasFileParser.java:331)
        at com.epam.parso.impl.SasFileParser.getMetadataFromSasFile(SasFileParser.java:231)
        at com.epam.parso.impl.SasFileParser.<init>(SasFileParser.java:208)
        at com.epam.parso.impl.SasFileParser.<init>(SasFileParser.java:44)
        at com.epam.parso.impl.SasFileParser$Builder.build(SasFileParser.java:996)
        at com.epam.parso.impl.SasFileReaderImpl.<init>(SasFileReaderImpl.java:63)
        at com.alitrack.h2.Function.sasRead(Function.java:51)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.h2.engine.FunctionAlias$JavaMethod.getValue(FunctionAlias.java:481)
        ... 20 more
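The sas7bdat file that triggers these errors was generated by the following code:
    libname lib "Z:\data";
    options validvarname=any;

    /* The SAS file generated here fails to parse */
    data lib.fails;
      柯妮丝麗=234;
    run;

    /* No error; the variable name can use other Chinese characters
    data lib.works;
      测试=123;
    run;
    */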
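One difference between the two variable names may be relevant. 麗 is a traditional character that exists in GBK but not in the older GB2312 set, while both characters of 测试 are in GB2312. This post does not establish that this is what trips the parsers, so treat the following as a quick check rather than a diagnosis. The helper name `chars_outside` is hypothetical and introduced only for illustration; it relies solely on Python's built-in codecs.

```python
def chars_outside(name, codec):
    """Return the characters of `name` that `codec` cannot encode."""
    missing = []
    for ch in name:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


failing_name = "柯妮丝麗"  # variable name in the file that fails to parse
working_name = "测试"      # variable name in the file that parses fine

for name in (failing_name, working_name):
    print(name,
          "| not in GB2312:", chars_outside(name, "gb2312"),
          "| not in GBK:", chars_outside(name, "gbk"))
```

If a parser maps the file's declared encoding to a strict GB2312 converter instead of GBK, a name containing 麗 would hit an invalid byte sequence, which would be consistent with the ReadStat message shown earlier. That mapping is an assumption here, not something verified against the parsers' source.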
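I then tried sas7bdat.py, which parsed the file successfully:
    from sas7bdat import SAS7BDAT

    # Open the file, decoding strings as GBK
    f = SAS7BDAT('c:/soft/fails.sas7bdat', encoding='GBK')
    df = f.to_data_frame()
    df  # show the DataFrame in an interactive session
    # Export the same data to CSV
    f.convert_file('c:/soft/fails.csv')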
P.S.
I just tested the latest haven on a Mac, and it read the file without errors:
    > library(haven)
    > read_sas("/data/fails.sas7bdat", encoding = "GBK")
    # A tibble: 1 x 1
      柯妮丝麗
         <dbl>
    1      234
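sas7bdat (the R package) can parse the file, but it does not output the Chinese variable name; it substitutes a string of special characters instead (for example, X.bf..c2..c4..dd...fb..90).
    > library(sas7bdat)
    > read.sas7bdat("/sas/data/fails.sas7bdat", debug = F)
      X.bf..c2..c4..dd...fb..90.
    1                        234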
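The odd column name printed by the R sas7bdat package looks like the raw bytes of the variable name rendered as hex pairs. Below is a minimal sketch for comparing the two, assuming the name was stored as GBK (as the `encoding = "GBK"` calls in this post suggest). The variable names in the snippet are illustrative only.

```python
import re

# Column name as printed by the R sas7bdat package
mangled = "X.bf..c2..c4..dd...fb..90."

# Pull out every two-digit hex pair that sits between dots
hex_pairs = re.findall(r"\.([0-9a-f]{2})\.", mangled)

# GBK bytes of the original variable name, as hex pairs
gbk_pairs = [f"{b:02x}" for b in "柯妮丝麗".encode("gbk")]

print("from mangled name:", hex_pairs)
print("from GBK encoding:", gbk_pairs)

# Decode the two runs of bytes that survive in the mangled name
head = bytes.fromhex("".join(hex_pairs[:4])).decode("gbk")
tail = bytes.fromhex("".join(hex_pairs[4:])).decode("gbk")
print("recovered pieces:", head, tail)
```

The pairs that survive in the mangled name line up with the first two characters and the last one. The two bytes for 丝 do not appear as hex pairs, and the reason for that is not established here.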