本文来自网易云社区
作者:董添
使用Druid连接池和PreparedStatement Cache后,应用访问数据库的效率提高了,针对连接池的监控手段也变多了,但是随之而来也出现了一些新的问题,最近发现在给表增加字段后,应用不断地抛出"ORA-17401:违反协议"异常。这个问题影响了系统的可扩展性,别的组也遇到了,遂解决之,最终发现是连接池一个很不起眼的配置错误导致了这个问题。
我们在线上库中给表加入了一个字段,然后线上应用开始不断抛出"ORA-17401:违反协议"异常,起初涉及的SQL语句是跟增加字段的表相关的,后来无关的SQL也开始不断抛出该异常。异常一直持续,直到我们把应用重启。
我们首先想到的是上网搜索"ORA-17401:违反协议"到底是什么含义。查阅了Oracle的官方文档后发现这个异常是TTC(Two-Task Common)的一个错误。TTC是oracle通信协议里较为靠上的一层,用于处理客户端与数据库服务器通信时的协议差异,比如转换字符集和数据类型。有了这一层,oracle就能确保自己的客户端和服务器在不同的网络交互模式(比如TCP/IP、共享内存)下可以透明地交互。
但是,oracle官方并没有提供更多关于这个错误的描述了。而网上出现相同错误的情况也是五花八门,甚至连oracle密码过期都有,这些显然不是我们要找的答案。没办法,重新梳理出问题的过程,我们发现:
那么接下来,我们希望通过在测试环境重现这个错误来确定我们的猜测。
实验验证了我们的猜测,确实是PreparedStatement Cache导致了异常的出现。我们知道,PreparedStatment对象是ojdbc提供的,而它并不是开源项目,所以我们不能通过修改它的实现来避免异常的出现。那难道为了数据库环境的可扩展性,我们就必须关闭PreparedStatement Cache了吗?这显然不是我们希望的结果,所以只能继续探索。
仔细想想,增加字段这种行为在线上环境并不常见,我们只是不希望每次为了增加字段做太多额外的工作。那我们就换一个思路,让这个异常出现以后应用能够马上自动修复它。查阅了Druid github上的issues和release history,我们发现1.0.27版本有一个有趣的修复:
难道出错的PreparedStatement关闭后这个异常能够自动修复?我们使用的Druid版本是1.0.25,连接池确实没有这个功能,所以升级以后异常自动修复的可能性很高。遂升级版本,重复之前的测试过程,但不幸的是,这次异常并没有修复。将这个问题在github上反馈给作者后,也一直没有得到一个很好的答复。没办法,还得靠自己动手,所以我们又看了一下这个修复涉及的源码:
从原理上来看,就是对每个PreparedStatement加了一个计数器,如果出现异常就加1,最后在用完PreparedStatement时检查计数器,如果有异常,就不再把PreparedStatement放回池子,而是直接关闭。看上去并没有问题,但是上图红框中的代码引起了我们的注意:PreparedStatement出错时,除了计数+1,还会调用DruidPooledConnection
类的checkException()
方法进行处理。原先我们在研究Druid时,曾经看到其一个优势是自动抛弃一些不可修复的连接,那有没有可能是这个异常没有引起连接的抛弃呢?遂下载Druid的源码,分析源码,并增加日志进行调试:
com.alibaba.druid.pool.DruidPooledConnection:
public SQLException handleException(Throwable t) throws SQLException {
final DruidConnectionHolder holder = this.holder;
if (holder != null) {
DruidAbstractDataSource dataSource = holder.getDataSource();
dataSource.handleConnectionException(this, t); //调用数据源类的异常处理类
}
if (t instanceof SQLException) {
throw (SQLException) t;
}
throw new SQLException("Error", t);
}
com.alibaba.druid.pool.DruidDataSource:
public void handleConnectionException(DruidPooledConnection pooledConnection, Throwable t) throws SQLException {
final DruidConnectionHolder holder = pooledConnection.getConnectionHolder();
errorCount.incrementAndGet();
lastError = t;
lastErrorTimeMillis = System.currentTimeMillis();
if (t instanceof SQLException) { //抛出的异常确实是SQLException
SQLException sqlEx = (SQLException) t;
// broadcastConnectionError
ConnectionEvent event = new ConnectionEvent(pooledConnection, sqlEx);
for (ConnectionEventListener eventListener : holder.getConnectionEventListeners()) {
eventListener.connectionErrorOccurred(event);
}
// exceptionSorter.isExceptionFatal
//这里ExceptionSorter是null,看来问题出在这里
if (exceptionSorter != null && exceptionSorter.isExceptionFatal(sqlEx)) {
if (pooledConnection.isTraceEnable()) {
synchronized (activeConnections) {
if (pooledConnection.isTraceEnable()) {
activeConnections.remove(pooledConnection);
pooledConnection.setTraceEnable(false);
}
}
}
//发生Fatal异常的连接在这里被标记为抛弃
boolean requireDiscard = false;
synchronized (pooledConnection) {
if ((!pooledConnection.isClosed()) || !pooledConnection.isDisable()) {
holder.setDiscard(true);
pooledConnection.disable(t);
requireDiscard = true;
}
}
if (requireDiscard) {
this.discardConnection(holder.getConnection());
holder.setDiscard(true);
}
LOG.error("discard connection", sqlEx); //这一行日志并没有打印
}
throw sqlEx;
} else {
throw new SQLException("Error", t);
}
}
com.alibaba.druid.pool.vendor.OracleExceptionSorter:
public boolean isExceptionFatal(final SQLException e) {
final int error_code = Math.abs(e.getErrorCode()); // I can't remember if the errors are negative or positive.
switch (error_code) {
case 28: // your session has been killed
case 600: // Internal oracle error
case 1012: // not logged on
case 1014: // Oracle shutdown in progress
case 1033: // Oracle initialization or shutdown in progress
case 1034: // Oracle not available
case 1035: // ORACLE only available to users with RESTRICTED SESSION privilege
case 1089: // immediate shutdown in progress - no operations are permitted
case 1090: // shutdown in progress - connection is not permitted
case 1092: // ORACLE instance terminated. Disconnection forced
case 1094: // ALTER DATABASE CLOSE in progress. Connections not permitted
case 2396: // exceeded maximum idle time, please connect again
case 3106: // fatal two-task communication protocol error
case 3111: // break received on communication channel
case 3113: // end-of-file on communication channel
case 3114: // not connected to ORACLE
case 3134: // Connections to this server version are no longer supported.
case 3135: // connection lost contact
case 3136: // inbound connection timed out
case 3138: // Connection terminated due to security policy violation
case 3142: // Connection was lost for the specified session and serial number. This is either due to session
// being killed or network problems.
case 3143: // Connection was lost for the specified process ID and thread ID. This is either due to session
// being killed or network problems.
case 3144: // Connection was lost for the specified process ID. This is either due to session being killed
// or network problems.
case 3145: // I/O streaming direction error
case 3149: // Invalid Oracle error code, Cause: An invalid Oracle error code was received by the server.
case 6801: // TLI Driver: listen for SPX server reconnect failed
case 6802: // TLI Driver: could not open the /etc/netware/yellowpages file
case 6805: // TLI Driver: could not send datagram SAP packet for SPX
case 9918: // Unable to get user privileges from SQL*Net
case 9920: // Unable to get sensitivity label from connection
case 9921: // Unable to get information label from connection
// TTC(Two-Task Common) ERROR CODE
case 17001: // Internal Error
case 17002: // Io exception
case 17008: // Closed Connection
case 17024: // No data read
case 17089: // internal error
case 17409: // invalid buffer length
case 17401: // Protocol violation //ora-17401确实是fatal类异常,正常情况下遇到这个异常连接会被抛弃
case 17410: // No more data to read from socket
case 17416: // FATAl
case 17438: // Internal - Unexpected value
case 17442: // Refcursor value is invalid
case 25407: // connection terminated
case 25408: // can not safely replay call
case 25409: // failover happened during the network operation,cannot continue
case 25425: // connection lost during rollback
case 29276: // transfer timeout
case 30676: // socket read or write failed
return true;
default:
if (error_code >= 12100 && error_code <= 12299) { // TNS issues
return true;
}
break;
}
final String error_text = (e.getMessage()).toUpperCase();
// Exclude oracle user defined error codes (20000 through 20999) from consideration when looking for
// certain strings.
if ((error_code < 20000 || error_code >= 21000)) {
if ((error_text.contains("SOCKET")) // for control socket error
|| (error_text.contains("套接字")) // for control socket error
|| (error_text.contains("CONNECTION HAS ALREADY BEEN CLOSED")) //
|| (error_text.contains("BROKEN PIPE")) //
|| (error_text.contains("管道已结束")) //
) {
return true;
}
}
return fatalErrorCodes.contains(error_code);
}
原来,Druid针对不同的数据库,都有一套自己的ExceptionSorter来对错误信息进行分类,分类成fatal的异常,只要出现,连接就会被抛弃。而我们遇到的ora-17401错误就是这类fatal异常。但奇怪的是,运行时发现ExceptionSorter是null,所以整个连接抛弃机制都无法工作了。为什么会是null呢?我们想到的是去查看这个类的初始化过程:
com.alibaba.druid.pool.DruidDataSource:
private void initValidConnectionChecker() { //除了ExceptionSorter,ValidConnectionChecker也按照同样的方式初始化
String realDriverClassName = driver.getClass().getName();
if (realDriverClassName.equals(JdbcConstants.MYSQL_DRIVER)
|| realDriverClassName.equals(JdbcConstants.MYSQL_DRIVER_6)) {
this.validConnectionChecker = new MySqlValidConnectionChecker();
} else if (realDriverClassName.equals(JdbcConstants.ORACLE_DRIVER)) {
this.validConnectionChecker = new OracleValidConnectionChecker();
} else if (realDriverClassName.equals(JdbcConstants.SQL_SERVER_DRIVER)
|| realDriverClassName.equals(JdbcConstants.SQL_SERVER_DRIVER_SQLJDBC4)
|| realDriverClassName.equals(JdbcConstants.SQL_SERVER_DRIVER_JTDS)) {
this.validConnectionChecker = new MSSQLValidConnectionChecker();
} else if (realDriverClassName.equals(JdbcConstants.POSTGRESQL_DRIVER)) {
this.validConnectionChecker = new PGValidConnectionChecker();
}
}
private void initExceptionSorter() { //初始化ExceptionSorter
if (exceptionSorter instanceof NullExceptionSorter) {
if (driver instanceof MockDriver) {
return;
}
} else if (this.exceptionSorter != null) {
return;
}
String realDriverClassName = driver.getClass().getName();
if (realDriverClassName.equals(JdbcConstants.MYSQL_DRIVER)
|| realDriverClassName.equals(JdbcConstants.MYSQL_DRIVER_6)) {
this.exceptionSorter = new MySqlExceptionSorter();
} else if (realDriverClassName.equals(JdbcConstants.ORACLE_DRIVER)) {
this.exceptionSorter = new OracleExceptionSorter(); //OracleExceptionSorter初始化在这里
} else if (realDriverClassName.equals("com.informix.jdbc.IfxDriver")) {
this.exceptionSorter = new InformixExceptionSorter();
} else if (realDriverClassName.equals("com.sybase.jdbc2.jdbc.SybDriver")) {
this.exceptionSorter = new SybaseExceptionSorter();
} else if (realDriverClassName.equals(JdbcConstants.POSTGRESQL_DRIVER)) {
this.exceptionSorter = new PGExceptionSorter();
} else if (realDriverClassName.equals("com.alibaba.druid.mock.MockDriver")) {
this.exceptionSorter = new MockExceptionSorter();
} else if (realDriverClassName.contains("DB2")) {
this.exceptionSorter = new DB2ExceptionSorter();
}
}
com.alibaba.druid.util.JdbcConstants:
public static final String ORACLE = "oracle";
//问题出在这里,我们的配置文件里面oracle driver class name配置的是oracle.jdbc.driver.OracleDriver ...
public static final String ORACLE_DRIVER = "oracle.jdbc.OracleDriver";
我们自己的druid.properties:
jdbc-1.druid.driverClassName=oracle.jdbc.driver.OracleDriver
原来竟然是driver class name配置的不一样!!!这实在是一个意想不到的原因。那druid写得有问题还是我们写得有问题呢?来看这一段说明:
看来还是我们自己的问题啊!这个配置是一个遗留配置,没人注意到oracle.jdbc.driver.OracleDriver
已经deprecated了,虽然oracle.jdbc.OracleDriver
里面只维护了一份版本信息,但却成为了现在外界普遍认可的driver入口。而且,因为这个配置不对,Druid的ValidConnectionChecker
也没有正常启动,看来这确实是一个必须尽快修复的问题。
修改了这个配置以后,ExceptionSorter
和ValidConnectionChecker
都正常启动了,而产生ora-17401异常的连接也终于能够被正确地抛弃了。至此,增加数据库表字段终于对使用Druid PreparedStatement Cache的应用不再产生负作用了。
增加数据库字段没问题了,那减少和修改数据库字段呢?道理是一样的,所以我们推测也会有异常抛出。实际尝试后,我们发现抛出的异常是"ora-01007:变量不在选择列表中",这个异常码并不在OracleExceptionSorter默认的fatal异常码里。那怎么才能让druid自动抛弃抛出这个异常的连接呢?还是看源码:
com.alibaba.druid.pool.vendor.OracleExceptionSorter:
public void configFromProperties(Properties properties) {
String property = properties.getProperty("druid.oracle.fatalErrorCodes");
if (property != null) {
String[] items = property.split("\\,");
for (String item : items) {
if (item != null && item.length() > 0) {
try {
int code = Integer.parseInt(item);
//fatalErrorCodes中的错误码也会被当成fatal异常,它通过druid.oracle.fatalErrorCodes这个System Property来配置
fatalErrorCodes.add(code);
} catch (NumberFormatException e) {
LOG.error("parse druid.oracle.fatalErrorCodes error", e);
}
}
}
}
}
所以,如果想要在线上环境删除或者更改字段,就必须增加JVM参数-Ddruid.oracle.fatalErrorCodes=01007
或者调用System.setProperty("druid.oracle.fatalErrorCodes","01007")
。
该问题已经提交到Druid社区,作者已经在最新的1.0.29版本添加了对于oracle旧驱动类的支持,如果不想修改配置文件,可以直接升级到1.0.29来解决这个问题。