I’ve got OCP

It is hard for me to keep writing on this blog; fortunately, this morning I found a comment on the post “RAC Virtual IP”, and after answering it I decided to post a little update. On 20 November 2008 I took and passed the Oracle 10g R2 Database OCP exam, so I’m an OCP!

I’m still alive

… but not OCP. Yes, lately I’ve been very busy and very lazy, so I haven’t written anything here. I’m still studying (?) to get the OCP, but work does not leave me much time to study or to write something useful or interesting here. I’m writing this useless post only to say that I’m here and I’m still working with Oracle. By the way, one customer of my company has recently upgraded from 9iR2 to 11g. It’s incredible, but it seems that 11g works fine.

OCP: Oracle Certified Professional

On 6 December 2007 I got my OCA certification. Now I dream of getting the OCP certification, so I’ve started on the “OCP Oracle 10g Administration II” book by Sybex. I like challenges.

Oracle Certification

I’m studying for Oracle certification: OCA (Oracle Certified Associate), the first level of Oracle 10g database certification. So I’ve created a new page on this blog (certification) where I put my personal notes, mostly on the topics where I’m least prepared. This means that the page probably doesn’t contain much interesting information; on the other hand, all the information needed to get certified is contained in Oracle’s public documentation, the manuals.

Writing that page helps me remember things.

Another Interesting Issue

On the same database I’ve already blogged about, I’ve encountered another interesting issue. Sometimes the database goes down without any message in alert.log. The database is used by two applications, and in the log of one of these applications there was:

java.sql.SQLException: ORA-01114: IO error writing block to file 201 (block # 188712)
ORA-27070: skgfdisp: async read/write failed
OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 2) The system cannot find the file specified.
ORA-01114: IO error writing block to file 201 (block # 188712)
ORA-27070: skgfdisp: async read/write failed
OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 2) The system cannot find the file specified.
ORA-01114: IO error writing block to file 201 (block # 188712)
ORA-27070: skgfdisp: async read/write failed
OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 2) The system cannot find the file specified.

I also opened an SR on Metalink, but the only suggestion was that it was a hardware problem. Two things were strange:

  1. There was nothing in alert.log
  2. The file indicated by the message in the application log (#201) did not exist in the database

After a while we were able to reproduce the problem: it was a query with a GROUP BY on a large data set. After a couple of tests my intuition was that the problem was the TEMPORARY tablespace, so I created a new TEMPORARY tablespace, set it as the new default temporary tablespace and retried the test, with success. It seems clear that there is a bug that causes the Oracle database (9.2.0.1) to crash on a particular corruption of the TEMPORARY tablespace.
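
The workaround described above can be sketched roughly as follows. This is only a sketch under assumptions: the tablespace name TEMP2, the tempfile path and the sizes are made up for illustration, and TEMP is a guess at the name of the old tablespace. The first two queries are included because Oracle error messages usually number a tempfile as DB_FILES plus the tempfile number, which might explain why file #201 was not among the datafiles; whether that applies to this database is an assumption.

```sql
-- Assumption: error messages report tempfiles as DB_FILES + tempfile number,
-- so with DB_FILES = 200, "file 201" would be tempfile 1.
SELECT value FROM v$parameter WHERE name = 'db_files';
SELECT file#, name, status FROM v$tempfile;

-- Create a replacement temporary tablespace
-- (name, path and sizes are hypothetical).
CREATE TEMPORARY TABLESPACE temp2
  TEMPFILE 'D:\oradata\mydb\temp02.dbf' SIZE 512M
  AUTOEXTEND ON NEXT 64M MAXSIZE 4096M;

-- Make it the database default, so that sorts and GROUP BY work areas
-- spill to the new tempfile instead of the old one.
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp2;

-- Once no session is still using it, the old tablespace can be dropped
-- (TEMP is an assumed name).
DROP TABLESPACE temp INCLUDING CONTENTS AND DATAFILES;
```

Users that were explicitly assigned the old temporary tablespace may also need an ALTER USER ... TEMPORARY TABLESPACE before the old one can be dropped.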

Undo Tablespace Corruption

Some time ago I encountered a problem with a customer’s database. It is Oracle 9.2.0.1 on Windows 2000 with Oracle Fail Safe.

In alert.log we found:

KCF: write/open error block=0x4351 online=1
file=2 O:\ORACLE\ORADATA\GEOP\UNDOTBS01.DBF
error=27070 txt: 'OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 2) The system cannot find the file specified.'
Automatic datafile offline due to write error on
file 2: O:\ORACLE\ORADATA\GEOP\UNDOTBS01.DBF
Tue Jul 10 02:08:42 2007
Errors in file o:\oracle\admin\geop\udump\geop_ora_844.trc:
ORA-00376: file 2 cannot be read at this time
ORA-01110: data file 2: 'O:\ORACLE\ORADATA\GEOP\UNDOTBS01.DBF'
ORA-00372: file 2 cannot be modified at this time
ORA-01110: data file 2: 'O:\ORACLE\ORADATA\GEOP\UNDOTBS01.DBF'

where the two lines

ORA-00376: file 2 cannot be read at this time
ORA-01110: data file 2: 'O:\ORACLE\ORADATA\GEOP\UNDOTBS01.DBF'

were repeated thousands of times.

At the same time, in the Windows Event Viewer:

Event Type: Warning
Event Source: Ftdisk
Event Category: None
Event ID: 50
Date: 07/07/2007
Time: 07:55:15
User: N/A
Computer: GEOCALL2
Description:
{Lost Delayed-Write Data} The system was attempting to transfer file data from buffers to \Device\HarddiskVolume5. The write operation failed, and only some of the data may have been written to the file.
Data:
0000: 00 00 04 00 02 00 56 00 ......V.
0008: 00 00 00 00 32 00 04 80 ....2..€
0010: 00 00 00 00 00 00 00 00 ........
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 0e 00 00 c0 ...À

and repeated messages:

Event Type: Warning
Event Source: Disk
Event Category: None
Event ID: 51
Date: 07/10/2007
Time: 02:21:01
User: N/A
Computer: GEOCALL2
Description:
An error was detected on device \Device\Harddisk3\DR3 during a paging operation.
Data:
0000: 04 00 22 00 01 00 72 00 .."...r.
0008: 00 00 00 00 33 00 04 80 ....3..€
0010: 2d 01 00 00 0e 00 00 c0 -......À
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 04 00 00 00 03 00 00 00 ........
0030: 00 00 00 00 2a 00 00 00 ....*...
0038: 00 08 00 00 00 00 00 00 ........
0040: 2a 00 02 4f db 2f 00 00 *..OÛ/..
0048: 08 00 ..

I have to say that this Oracle installation is not very lucky: disk-problem messages show up in the Event Viewer from time to time, but the hardware vendor tells us that there are no problems with the hardware.

Another thing I have to remember is to read the alert.log with my own eyes. In fact, I was called in by a colleague and I did not see the line

Automatic datafile offline due to write error on

I immediately thought the file was corrupted, so I created a new UNDO tablespace and changed the UNDO_TABLESPACE parameter to point to it. Then we tried to drop the old tablespace, but we got a message that a rollback segment was “active”. There were no records in V$TRANSACTION, and I could not understand why Oracle was telling us that. The database could be opened, but at one step the application still got an Oracle error stating that the old undo tablespace’s datafile was not available. So we decided to restore the tablespace from backup and recover it. After that I brought the tablespace online and ran a “select count(*)” on one table (the table used by the application step that had raised the error). After that I was able to drop the tablespace together with its datafile.
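
The sequence above might look roughly like this in SQL*Plus. It is a sketch under assumptions, not a transcript of what was typed: UNDOTBS2, its datafile path and its size are hypothetical, UNDOTBS1 is inferred from the file name UNDOTBS01.DBF, SOME_APP_TABLE stands in for the unnamed application table, and the recovery is shown at datafile level because alert.log reports an automatic datafile offline.

```sql
-- 1. Create a new undo tablespace (name, path and size are hypothetical).
CREATE UNDO TABLESPACE undotbs2
  DATAFILE 'O:\ORACLE\ORADATA\GEOP\UNDOTBS02.DBF' SIZE 512M;

-- 2. Point the instance at the new undo tablespace.
ALTER SYSTEM SET undo_tablespace = UNDOTBS2;

-- 3. After restoring UNDOTBS01.DBF from backup, apply redo and bring
--    the file back online (RECOVER is a SQL*Plus command).
RECOVER DATAFILE 2;
ALTER DATABASE DATAFILE 2 ONLINE;

-- 4. Read the table whose blocks still needed the old undo
--    (SOME_APP_TABLE is a placeholder name).
SELECT COUNT(*) FROM some_app_table;

-- 5. Drop the old undo tablespace together with its datafile
--    (UNDOTBS1 is an assumed name).
DROP TABLESPACE undotbs1 INCLUDING CONTENTS AND DATAFILES;
```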

Conclusion

My description has been confused, but the conclusion, and the lesson I’ve learned, is that an UNDO tablespace may contain data needed for the integrity of the database. I think this is a case of “delayed block cleanout”. If there are active transactions, that is obvious and visible in the V$TRANSACTION system view, but in the case of “delayed block cleanout” I think that information is not easily available.
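
Two dictionary queries may help when looking at this kind of situation. This is a minimal sketch, assuming the old undo tablespace is named UNDOTBS1 (the post only gives the datafile name). It does not show delayed block cleanout directly; it only shows whether Oracle still considers any undo segment in the old tablespace to be in use.

```sql
-- Active transactions (this returned no rows in the case described).
SELECT addr, xidusn, status, start_time
  FROM v$transaction;

-- Status of the undo segments in the old tablespace
-- (UNDOTBS1 is an assumed tablespace name).
SELECT segment_name, status
  FROM dba_rollback_segs
 WHERE tablespace_name = 'UNDOTBS1';
```

A segment whose status is something other than OFFLINE in the second query could be one reason why the drop is refused even when V$TRANSACTION is empty.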

English Dictionary

I’ve been busy for a while, so I haven’t written new posts here. But two days ago I bought a fantastic new English dictionary: my battle with the English language is very hard, but I’ll never give up; please forgive me for this. In the meantime I still write on my original blog.

Windows or *NIX?

I found very interesting the discussion on comp.databases.oracle.server that started from a migration question and continued into a comparison between Windows and Linux or Unix, at least as Oracle platforms. The battle ended as yet another contest with no winner. What the posters say is very interesting to me. The OS religious wars are really nonsense, but a lot of things in this world are nonsense, including me writing here.

So I’m back to what Tom has already said: there is no better OS; you must choose the one with which YOU are more comfortable, and there are no other scientific and absolute reasons to prefer one over another. I feel better with Linux, but I’ve learned a lot of things about Windows that make me better.

Personally, I have to say that Windows has good performance, but Oracle’s threaded architecture on Windows does not seem as good as the multi-process architecture on Unix.

In favour of Unix, I always want to remember its more consolidated architecture. (Unix has been here since 1970 🙂 )

To RAC or not to RAC

Today I re-read last month’s post on high availability by mogens nogard (his name is unwritable for me, but also for him). What mogens says is very interesting. I recently read the book “Oracle Insights”, from which I deduce that mogens is a step above the others.

I completely agree with the quote “Complexity is the enemy of availability”; I’m conservative. So I think that technologies pushed to simplify management may become a boomerang. On the other hand, I think that RAC with 10g Standard Edition has a reason to exist. It gives us a little scalability at a competitive cost.

What mogens says in a comment is the real point:

“For political reasons you might have to implement all sorts of things, and I still haven’t found an effective way of preventing that from happening.”

Marketing and politics really drive our (my) managers, so I have to implement things that, for me, bring no advantages.

However, I have to say that my little experience with RAC, not regarding HA, is, in the end, good. What remains are all the bugs that we can encounter, with RAC as with a standalone instance.

Bug Number 4323868 (index full scan instead of index range scan)

In an attempt to solve a performance problem encountered on Oracle 10.2.0.2 with the CBO, I ran into another problem.

Let’s start from the beginning. We have an application developed and tuned on Oracle 9iR2 with the RBO. We started with 9.2.0.1 about four years ago and there were no great problems with performance. The application is OLTP. The RBO has a great advantage: stability. On the other hand, tuning an application with the CBO is complex. One day our manager decided to install a two-node RAC on Standard Edition with 10g (the application would have 400 users). It was 10gR1, and we made the great jump into the dark by migrating to the CBO. Things went well (apart from the installation). Then we made another jump in the dark by installing 10gR2, and there the problems started. In my experience, the optimizer in 10gR2 seems worse than the optimizer in 10gR1.

The problem

Well, I’m not guilty: our application has a table (I will call it X) that can join alternatively with three other tables (I repeat, I’m not guilty; I think it is not a good idea and a bad design. I will call these tables A, B and C). Table X has three columns: ID_A, a foreign key (?) to table A; ID_B, a foreign key to table B; and ID_C, a foreign key to table C. Our application is a customizable product, so for one client it happens that this table has millions of records that join only to table C, and the foreign keys for tables A and B have the value “-1”. So the application issues a query like this:

SELECT * FROM X
WHERE
ID_A = n1 or ID_B = n2 or ID_C = n3

where n1>0, n2>0, n3>0 and, in table X, all values of ID_A and ID_B are “-1”. There are three indexes, on columns ID_A, ID_B and ID_C. Using histograms, the optimizer knows that the three values n1, n2, n3 (>0) are very selective, so on 10gR1 it uses the three indexes. It does the same on 10gR2 if we use literal values. Our application is a Java application that uses bind variables. So it happened that suddenly, on 10gR2, the optimizer decided to do a triple full scan of the table, making the database server hang on I/O (the table has over 20 million records). I think it is a problem with bind peeking, but I have not been able to reproduce it, so we tried STORED OUTLINES to stabilize the execution plan on the indexes.
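
To make the two techniques mentioned above concrete, here is a minimal sketch of gathering histograms on the three columns and of creating a stored outline for the query. The outline name X_OR_LOOKUP and the category APP_OUTLINES are hypothetical, and the statement text is an assumption: an outline is only picked up when the text matches what the application really sends, and a JDBC application typically sends binds as :1, :2, :3 rather than the named binds shown here.

```sql
-- Histograms on the three foreign key columns of X
-- (SIZE 254 asks for up to 254 buckets per column).
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'X',
    method_opt => 'FOR COLUMNS SIZE 254 ID_A, ID_B, ID_C',
    cascade    => TRUE);  -- also gather statistics on the indexes
END;
/

-- Store the plan chosen at creation time as an outline
-- (outline and category names are hypothetical).
CREATE OR REPLACE OUTLINE x_or_lookup
  FOR CATEGORY app_outlines
  ON SELECT * FROM X WHERE ID_A = :n1 OR ID_B = :n2 OR ID_C = :n3;

-- Tell the optimizer to use outlines from that category.
ALTER SYSTEM SET use_stored_outlines = app_outlines;
```

Note that USE_STORED_OUTLINES is not an initialization parameter that survives a restart, so the ALTER SYSTEM would have to be reissued after each instance startup, for example from a startup trigger.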

So I ran into BUG 4323868: the optimizer used the indexes, as required by the hints in the stored outlines, but with an index full scan instead of an index range scan, again making the database server hang on I/O.

It is interesting to note what Oracle has done: I have not found a patch for this bug, but with patchset 10.2.0.3 Oracle introduced a “code improvement”, adding two new hints: INDEX_RS_ASC and INDEX_RS_DESC.
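
As an illustration only, this is how the new hint might be written for one branch of the query. The index name X_ID_C_IDX is hypothetical (the post does not name the indexes), and whether the hint would have cured the stored outline problem described above is not something the post confirms.

```sql
-- Ask explicitly for an ascending index RANGE scan on the ID_C index,
-- rather than leaving the choice between range scan and full scan open
-- as the plain INDEX hint does. X_ID_C_IDX is a hypothetical index name.
SELECT /*+ INDEX_RS_ASC(X X_ID_C_IDX) */ *
  FROM X
 WHERE ID_C = :n3;
```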