Wednesday, July 21, 2010

A tale of two SCNs

A few weeks ago, I had the opportunity to investigate an issue on a live system where a regular J2EE web application was hitting the well-known error ORA-01555: snapshot too old: rollback segment number <number> with name <name> too small. But in this particular case, it was accompanied by an ORA-02063, which means that the error was received from a database link. Folks with Oracle dblink experience are probably aware that unexpected traps lurk beneath the attractive abstraction that dblinks provide. A snapshot too old on a dblink - that had to be interesting!



As per the description provided by the admins, the errors were appearing in the logs under two different conditions:
  • almost every night, during the execution of a background job scheduled to run once a day at midnight
  • once in a while, upon clicking a particular action button in one of the application pages

But things worked fine immediately after the error was encountered. If the error was seen from the UI action, a repeat click would not produce any error and things worked fine from then on. A manual re-execution of the background job immediately after the failed midnight run would succeed, and the error would recur only the next midnight. The undo tablespace was reported to be large enough and no abnormal activity was seen in it.

So how do we get to the root of the problem?

Most Java developers would probably start thinking about data sources and the app server connection pools, but a snapshot too old across a dblink was likely to be caused by something deeper down in the database layer. It turned out that the failing queries were similar SELECTs, each joining a local table to a local view defined over a remote table referenced through a dblink.
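To make the shape of the problem concrete, here is a rough sketch of what such a statement can look like (all object and link names are hypothetical, not the actual application schema):

    -- a local view wrapping a table that lives in the remote database
    CREATE OR REPLACE VIEW remote_orders_v AS
      SELECT * FROM orders@remote_link;

    -- the kind of distributed query that was failing with ORA-01555 / ORA-02063
    SELECT c.customer_name, o.order_total
    FROM   customers c
    JOIN   remote_orders_v o ON o.customer_id = c.customer_id;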

The "old" snapshot in question is from the remote database, since the error was received over the link. Now how would that snapshot be constructed, i.e. which SCN is used here? There are two possibilities: the SCN of the local database or that of the remote database. Great, looks like we have caught a scent here! The two SCNs could be very different since they belong to two different databases. It is likely that the local SCN is used (think Oracle statement-level read consistency) based on how logical timestamps are defined. If that is the case, a snapshot too old could occur if the remote SCN is way ahead: unintentionally, an attempt will be made to reconstruct a very old snapshot (at a much lower SCN) on the remote side, and it is quite possible that undo that old would already have been evicted from the undo tablespace. But if such SCN differences were common in practice, there would be frequent and constant problems on most queries involving dblinks. Surely there must be more to this.

Let us get into a bit more detail on why we think the local SCN is likely to be used. It does not sound entirely convincing yet, and as we already saw, it can lead to problems. It might be helpful to take a short detour and reinforce some of the important ideas related to transaction concurrency in Oracle.

Oracle was one of the earliest adopters of MVCC (Multi Version Concurrency Control) and they implemented it in a manner that is often considered unconventional. The familiar ANSI transaction isolation levels READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ and SERIALIZABLE were defined with respect to pure lock-based implementations, the basic idea being that shared and exclusive locks are placed on rows for reads and writes respectively. Oracle does not use the pure lock-based implementation, and its MVCC ensures that readers do not block writers and vice versa. Writers can indeed block writers. Despite this, Oracle does claim to support the isolation levels READ_COMMITTED and SERIALIZABLE in addition to a special READ_ONLY level. But they interpret these two terms to mean something slightly different, and the more appropriate names for them are statement-level read consistency and transaction-level read consistency respectively.

Statement-level read consistency is the default in Oracle and means that for any particular statement, all data are looked up as they existed at the point of time when the statement started executing. In other words, all reads performed during the execution of the statement are consistent with respect to a single point in time. This also means that any updates committed by other transactions while the statement is executing would not be "seen" by the statement. But a subsequent statement in the same transaction that starts after the commits were performed would see the new data.
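A minimal two-session sketch of statement-level read consistency (table and values made up for illustration):

    -- Session 1: a long-running statement; every row it reads is consistent
    -- with the SCN recorded when the SELECT began
    SELECT SUM(amount) FROM orders;

    -- Session 2: commits while the SELECT above is still running
    UPDATE orders SET amount = amount + 100 WHERE order_id = 42;
    COMMIT;

    -- Session 1: the running SUM does not include the extra 100; only a
    -- statement started after session 2's commit would see the new value
    SELECT SUM(amount) FROM orders;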

This would seem to imply that the database either needs to make a private workspace for each transaction into which row data is copied over, or it needs to store the different versions of each row along with their timestamps. Actually, the latter is not essential and older versions of rows (i.e. at past points in time) can be reconstructed by applying a sequence of undo records in reverse order on the current row data, and that is what Oracle does. Of course, there is a performance penalty associated with this form of runtime reconstruction and potentially we may not be able to find undo records prior to a certain point of time as they may have been evicted from the undo tablespace and written over by newer undo records. This is exactly when the snapshot too old error is encountered.
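If one wanted to see how much undo history a database can actually reconstruct, a rough way (a sketch, assuming DBA access) is to look at the undo configuration and recent undo statistics:

    -- undo configuration
    SELECT name, value
    FROM   v$parameter
    WHERE  name IN ('undo_tablespace', 'undo_retention');

    -- how far back undo has actually been retained in recent intervals
    SELECT begin_time, end_time, tuned_undoretention, maxquerylen
    FROM   v$undostat
    ORDER  BY begin_time;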

Some similarities with source code version control systems are apparent but it should be kept in mind that, unlike a database which has to be highly performant and make best use of its buffer pool in memory, these systems can afford to be a bit slower and do diff merges on files for generating old versions.

Transaction-level read consistency has a similar meaning - all data used by all queries in the transaction are looked up as they existed when the transaction started. This too can produce snapshot too old errors, as one would no doubt realize. But what happens if this transaction updates rows that were modified and committed by a different transaction after this transaction started? We have a conflict, and our transaction fails with the error ORA-08177: can't serialize access for this transaction as soon as it attempts the conflicting update. Surprised? That is what optimistic concurrency control implies - you need to design and implement any serializable transactions in such a way that you can tackle such failures and retry the operation, with any data modifications based on your application logic.
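As a hedged sketch (the table and retry policy are illustrative, not from the application in question), a serializable transaction typically has to be written with that conflict error in mind:

    DECLARE
      cannot_serialize EXCEPTION;
      PRAGMA EXCEPTION_INIT(cannot_serialize, -8177);
    BEGIN
      SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
      UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
      COMMIT;
    EXCEPTION
      WHEN cannot_serialize THEN
        ROLLBACK;
        NULL;  -- re-read the current data and retry per application logic
    END;
    /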

The SCN (System Change Number) is a logical timestamp at the per-database level that gets incremented on commits. Each value of the SCN can be thought of as identifying a consistent state of the database. When a statement starts executing, it records the current SCN and uses it for all reads. Writes place this SCN on the undo records. During commit, the SCN is incremented and this new SCN gets associated with the redo log records and the modified data blocks. Thus, a consistent read needs to generate a version of the data at an SCN strictly less than its read-SCN. If the SCN on the current data block is greater than or equal to the read-SCN, the necessary version is generated by applying undo records in descending SCN order until one with SCN strictly less than the read SCN is encountered. Things can get slightly different if the row was deleted etc. but this is the basic idea.
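For reference, the current SCN of a 10g database can be observed directly (the DBMS_FLASHBACK call requires the corresponding execute privilege):

    -- current SCN of the database
    SELECT current_scn FROM v$database;

    -- equivalent, via the flashback package
    SELECT dbms_flashback.get_system_change_number FROM dual;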

Now it is time to return to our main story. We were trying to convince ourselves on whether the local SCN or the remote SCN would be used for getting data from the remote table. For the remote SCN to be used, there would have to be a mapping maintained between the local and the remote SCNs. In Oracle RAC, there is global SCN co-ordination through Lamport clocks, SCN broadcasts on commit etc., but here we are dealing with single-instance databases connected by a dblink. Come to think of it, maintaining a mapping between SCNs is quite unlikely. How would each database keep track of what is going on in the other? To complicate matters, you could have a link from A to B and another from B to A, implying that the mapping has to be maintained at both ends and in a consistent fashion. Or you could even have links like A->B, B->C, C->A. Very quickly it starts getting messy.

Maybe it is easier to sync the SCNs periodically. Hmmm, would that work well? Probably it would work more or less fine, but not perfectly, as there will be windows between the syncs when the SCNs can diverge. Wait ... we may be on to something here. If the SCN sync is very infrequent and the remote database is much more heavily loaded than the local one, there will be a wide SCN divergence due to many more commits on the remote side and voila, we will get the exact problem that we are trying to analyze! The local SCN will lag the remote SCN by a big margin, and a read using the local SCN would attempt to reconstruct a very old version on the remote database. An undo record that old could very well have disappeared from the undo tablespace.

A quick check with the app guys confirms that the distributed queries are very infrequent, maybe at most a few times in a day, sometimes even just the midnight query. The remote database sees relatively heavy and constant transactional activity all day.
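A quick way to verify the divergence hypothesis is to look at the SCN on each side, using sessions connected directly to each database - going through the dblink itself would trigger the very SCN sync we are discussing and mask the gap:

    -- run in a session on the local database
    SELECT current_scn FROM v$database;

    -- run in a session connected directly to the remote database
    SELECT current_scn FROM v$database;

A large gap between the two numbers, with the remote side way ahead, is what we would expect here.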

A few Google searches later, this is what comes up - the last section, Managing Read Consistency, in the chapter titled Managing Distributed Transactions in the Oracle 10g Database Administrator's Guide. It validates the deductions pertaining to the SCNs; the only thing it does not explicitly mention is the snapshot too old error, but we have covered that convincingly enough. It also gives the suggested fix/workaround. We also understand why the error goes away immediately every time it appears and recurs only after a long time - the failing query itself triggers the SCN sync.

All that remains is to reproduce the error on a simple setup with a small undo tablespace on the remote database, and then try out the potential fixes - either do a dummy query over the link before every query involving it, or have a job that does the dummy query periodically in a transaction. The first option is more robust but needs application code changes - the second could still result in some recently committed remote rows being missed (the SCNs can drift between syncs), but it can be handled just by creating a job with dbms_scheduler. Using some neat scripts, the problem was recreated and the potential fixes were tried out successfully.
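Both workarounds boil down to issuing a cheap dummy distributed statement so that the SCNs get synced before the real work. A rough sketch of each (the link name, job name and five-minute interval are assumptions, not the values actually used):

    -- Option 1: in the application, just before the real distributed query
    SELECT 1 FROM dual@remote_link;
    COMMIT;

    -- Option 2: a periodic job doing the same dummy round trip
    BEGIN
      dbms_scheduler.create_job(
        job_name        => 'sync_remote_scn_job',
        job_type        => 'PLSQL_BLOCK',
        job_action      => 'DECLARE x NUMBER; BEGIN SELECT 1 INTO x FROM dual@remote_link; COMMIT; END;',
        repeat_interval => 'FREQ=MINUTELY;INTERVAL=5',
        enabled         => TRUE);
    END;
    /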

But I still don't fully understand why Oracle does not do the SCN syncup itself before a distributed query. Maybe there are complications doing it in a general way.
