DIAGNOSING ORACLE DATABASE HANGING ISSUES

 

    This bulletin describes the methods and tools used in diagnosing

database hanging issues and performance issue).  These issues may be due to 

tuning problems, design problems or Oracle bugs such as latching issues. 

We determine the cause by narrowing down the problem and getting as much

information as possible WHILE THE DATABASE IS HANGING.

 

    The following diagnostic steps are discussed. 

 

     1. Describe the Problems. Under all situations go through this sections and 

        describe the problem.

     2  Look for errors.

     3. Do the requested queries.

     4. Gather OS level data.

     5. Get systemstate and hanganalyze dumps.

     6. Get BSTAT-ESTAT OR STATPACK output.

     7. Generate a core dump and run a debugger (contact support

        first).

     8. If certain process get PROCESSSTATE dumps

     9. Check List for diagnostics.

 

Note: It may not be necessary to shutdown the database to stop the

hanging.  But if you MUST shutdown before speaking to support, please

attempt to get these diagnostics so that we can debug the problem. 

Without these diagnostics it may be impossible for us to determine the cause of 

the problem.

 

STEPS:

 

1. DESCRIBE THE PROBLEM

   What releases of all involved products are you running?

      - complete release like database release 8.0.5.2.

 

   Is the "hanging" a true freeze or is activity just very slow?

   (Are logs switching in the alert log? Be sure to check cpu, I/O and memory 

    utilization -  see step 4).

 

   When does the hanging start?

 

   Does it stop?  How long does it last?

 

   Does the hanging happen suddenly or is it a gradual performance

   degradation, perhaps precipitated by increased activity?

 

   How many users are connected?

 

   Has load on system increased recently?

 

   Check for OTRACE creating files. (see [NOTE:45482.1])

          Unix                              NT 

          ~~~~                              ~~ 

          cd $ORACLE_HOME/otrace/admin      cd orant\otrace73\admin 

          ls *.dat                          dir *.dat 

  

          Note: If .dat files are being created and OTRACE utility is not being used,

          disable this option by setting the init.ora parameter oracle_trace_enable = FALSE.

 

   Are any events set in the init<sid>.ora?

        

   Which ones are affected?

 

   What kind of work are they doing?

 

   How much data are we talking about?

 

   Are you running Oracle Parallel Server (OPS)?

     - Does the problem persist if you shutdown all but one instance?

     - Please note:  Some of the solutions discussed here are useful for

       OPS; many are not.  Different tuning rules apply; for example, a

       "smaller" buffer cache often means better performance in OPS.  Also,

       the most common sources of hanging issues in OPS are not discussed

       here. These include: PCM locking problems, pinging, space management

       problems, lack of work partitioning, internode parallel query tuning

       requirements, shared disk or virtual shared disk problems, network

       problems, and DLM problems.

 

   Are you running Multi-Threaded Server (MTS)?

     - Does the problem persist without it?

     - How many users?

 Are you using any Oracle applications or tools

   (financials, forms, etc.)?

 

   Did you recently upgrade the rdbms, application or tool,

   operating system, memory or hardware?

 

   Does the problem reproduce?

 

   How often?

 

   Can you reproduce it at will?

 

   When does it NOT reproduce?

 

   What EXACTLY is hanging? What is NOT?  Get all the information

   requested for each area that applies:

 

 

   (a)  The entire database?

 

        - All instances?

        - All connections?

        - All operations?

        - All nodes?

 

        - Can you start a session and select * from dual?

 

       - How often are log switches occurring?

 

        - If there are archive-related errors in the alert log

          (see step 2), stop reading this bulletin:  Archiving

          problems always freeze the database.  Review the error(s)

          and resolve the problem.  Example: the device may be full.

          Another common cause is that the ARCH process is stopped while

          the database is in ARCHIVELOG mode.  Check this by connecting

          internal and typing ARCHIVE LOG LIST.  If ARCHIVELOG mode and

          automatic archival is DISABLED, that's the problem. Either

          restart the ARCH process by typing ALTER SYSTEM ARCHIVE LOG

          START (or log_archive_start = true in init.ora) or shutdown,

          startup mount exclusive and ALTER DATABASE NOARCHIVELOG.

 

        - If archiving problems have been ruled out and the database is 

          truly frozen, complete steps 1, 2, 3, 4, 5 and 6.

 

        - If database is not really frozen but is suffering performance 

          degradation due to increased activity, complete steps 1, 2, 3, 4 and 6.

          Do not forget 1(d).

 

 

   (b)  A specific SQL operation? Be sure you also do step 8.

   Get a tkprof output with  timed_statistics and explain plan. 

   See [NOTE:10585.1] QUERY AND APPLICATION TUNING USING EXPLAIN 

                                 AND TKPROF  UTILITY

          [NOTE:41634.1] TKPROF and Problem Solving

 

          [NOTE:1020282.6] TFTS FORMATTED SELECT OF EXPLAIN PLAN TABLE

          [NOTE:1019631.6] TFTS SCRIPT TO SIMPLIFY THE USE OF EXPLAIN PLAN

        - Select statement?

            - It may need tuning. (This is useful for more than sql syntax.)

            - If a complex sql statement is hanging, can you break successfully?

            - See 1 (c)

 

        - Parallel query?

 

            - See [NOTE:240762.1] script to show Explain Plan for Parallel Queries

            - Could be space transaction contention.  If there are

              ora-1575 errors in the alert log, try setting

              pct_increase = 0 in the temporary tablespace to

              disable SMON from coalescing contiguous extents,

              thus reducing contention with query slaves. Also

              try spreading datafiles over several disks, increasing

              sort_area_size (but don't introduce paging) and

              possibly "reducing" degree of parallelism. Refer

              to additional tuning documentation.

 

        - Update, insert or delete?

            - possibly locking.  Be sure to get v$lock

              output in step 3.  There are the following scripts available

              to help debug locking problems.

             [NOTE:1020012.6] TFTS SCRIPT TO RETURN MEDIUM DETAIL LOCKING INFO

             [NOTE:1020008.6] TFTS FULLY DECODED LOCKING SCRIPT

             [NOTE:1019527.6] TFTS CHECK FOR FOREIGN KEY LOCKING

             [NOTE:1020007.6] TFTS: DISPLAY LOCKS & GIVE SID AND SERIAL# TO KILL

            - Do you have constraints and/or triggers on the object(s)

              you're updating?  Could be a cascading locking problem.

              Try putting index(es) on relevant foreign key columns.

              This modifies the locking behavior on the parent. (See

              Application Developers Guide for more info).

              See [NOTE:33453.1] REFERENTIAL INTEGRITY AND LOCKING

            - also see steps 1(c) and 1(d).

 

        - DDL?

            - possibly a data dictionary related problem.  See step 1(c).

            - if "create index" it might be a space transaction contention issue.

 

        - Does the operation run faster the second time around?

            - Try tuning i/o.  The following could be helpful:

              clustering, increasing db_block_size, distributing

              i/o (i.e. striping, separating indexes and data,

              parallelizing).  Also try setting pre_page_sga = true

              in the init.ora.

            - Is the operation always slower immediately after a large

              update? This is delayed block cleanout.

 

   (c)  Specific object?

 

        - Can you do ANY operations on the object?

            - Try select count(*)

            - If updates only are the problem, could be locking.

            Scripts are available to help resolve these issues see:

            - Be sure to do get lock information

              1(b) and 1(d).

              There are the following scripts available

              to help debug locking problems.

              [NOTE:1020012.6] TFTS SCRIPT TO RETURN MEDIUM DETAIL LOCKING INFO

              [NOTE:1020008.6] TFTS FULLY DECODED LOCKING SCRIPT

              [NOTE:1019527.6] TFTS CHECK FOR FOREIGN KEY LOCKING

              [NOTE:1020007.6] TFTS: DISPLAY LOCKS & GIVE SID AND SERIAL# TO KILL

 

 

        - Did you preallocate extents to the object?  This will bump

          up the high water mark and cause full table scans to appear

          to "hang". Full table scans always scan up to the high

          water mark, even when the table contains only a few rows.  The

          solution is to avoid preallocating extents except immediately

          before a large parallel procedural insert or conventional load.

          Don't preallocate extents for a direct load.

 

        - Try ANALYZE TABLE <tablename> VALIDATE STRUCTURE CASCADE;

            - any errors?  Errors signify table or index corruption.

              Call support.

 

        - If no errors after ANALYZE, run these queries and send the output and 

          trace files to support:

 

            SPOOL analyze.log;

 

            SELECT *

            FROM sys.dba dba_tables

            WHERE table_name = '<TABLENAME>';

 

            - block level space utilization (i.e. initrans, maxtrans,

              pctfree), a high chain count, etc., could be part of

              the problem.

 

            SELECT i.*

            FROM sys.index_stats i, sys.dba_indexes d

            WHERE i.name = d.index_name

            AND d.table_name = '<TABLENAME>';

 

            - an unwieldy index could be part of the

              problem.  Do you do a lot of updates/deletes?

 

            SPOOL OFF;

 

        - Get tkprof output as in 1(b).

 

        - If a view, do the above for the underlying tables.

 

        - To identify the underlying tables:

 

                SELECT text

                FROM sys.dba_views

                WHERE view_name = '<VIEWNAME>';

 

                - you may need to SET ARRAYSIZE 1 / SET LONG 2000 to

                  see all the text.

                - definitely get tkprof output as in 1(b).

 

        - Data dictionary objects might also be affected by the above.

 

 

  (d)   Large scale update operation (i.e. perhaps using sqlloader,

        import or batch job)?   

 

        - Do you have index(es) on the table?

 

        - Does the update run during periods of peak activity?

 

        - Do you see "checkpoint not complete" messages in the alert.log?

 

        - This signifies that the redo logs are too small; they

              are having trouble keeping up.

 

        - Is the tablespace wherein the table resides in hot backup mode?

          (see v$backup)

 

         - A tablespace in hot backup mode generates redo "records"

            rather than "vectors". During a large update this

            can cause severe contention and performance degradation.

 

        - Are any of the following init.ora parameters set to

          non-default values:

 

                db_block_checkpoint_batch

                db_block_checksum

                db_block_lru_statistics

                db_block_lru_extended_statistics

                row_locking

                serializable

                timed_statistics

 

        - If using sqlloader, are you loading with conventional

          path?  Using REPLACE option?  (Try using TRUNCATE instead).

          Any sql functions in the control file?  Have you tried tuning

          with READBUFFERS, BINDSIZE, ROWS, PARALLEL?

 

        - If using import, are you using commit=y?  indexes=y?

          constraints = y? Have you tried increasing BUFFER?  Any

          info in LOG?

 

        - Try increasing the number of system buffers and tuning i/o

          (see 1(b)).

 

        - If many users are active during the update, resource

          contention could be contributing to system slowdown.

          Rollback segments, redo latches, i/o and db_block_buffers

          are some possible areas of contention.  V$session_wait

          and Bstat Estat output are good places to start researching

          contention issues. See section on estat bstat.

 

        -Do steps 1(b), 1(c), 3, 4  and 5.

 

 

   (e)  Specific package, procedure or PRO*C application?

 

        - What does it do?

 

        - What SQL statement is it stuck on? see step.  Can you

          remove the statement and get it to run successfully in sqlplus?

 

        - Get tkprof output on the application, procedure or isolated

          sql statement with timed_statistics and explain plan.

 

        - Optional: if a procedure, utilize DBMS_ALERT to pinpoint where

          the hanging starts.

 

        - If a PRO*C program, and tkprof identifies "parsing" as the

          bottleneck of the embedded sql statement, tune this with

          precompiler parameters hold_cursor and release_cursor.  See

          bulletin on precompiler features.

 

        - If a package, can you execute any of the individual procedures

          successfully?

 

        - Is the package/procedure swapping in and out of the shared pool?

          Pinning might be advisable (a package would have to be created

          for a standalone procedure).

 

            SPOOL objcache.log

 

            SELECT *

            FROM v$db_object_cache

            WHERE name = '<NAME>';

 

            SPOOL OFF;

 

        - Resource contention could be contributing to system

          slowdown.  For example, library cache contention

          (affected by shared_pool_size) is possible in addition

          to the short list of contention areas mentioned in 1(d).

          Refer to additional tuning documentation.

 

        - Be sure to do step 3.

 

        - What is the cpu / io utilization? (get this from

          v$sesstat above or see step 4)

 

        - Is the process swapping in and out of memory? (see step 4)

 

        - Did you change the process priority or "nice" it?

 

        - Steps 3 and 4 recommended.

 

   (f)  remote access only?

 

        - Can you do select * from dual@db_link?

 

        - Can you connect to the remote machine and perform the operation

          locally?  Can you select * from dual locally?

 

        - If you can do these tests and the hanging reproduces, you

              should answer all these questions for the REMOTE machine.

 

        - Are you doing a distributed update?

 

        - What is distributed_lock_timeout set to in the init.ora.

 

        - Are you refreshing snapshots?  using symmetric replication?

 

        - Is it a remote select statement that's hanging?

 

        - Get a tkprof output with timed_statistics and explain

              plan.  On the explain plan,  you'll see that certain

              statements just say "REMOTE"; the actual statements sent

              across the network are recorded in the OTHERS column

              of the PLAN_TABLE - spool a select of that column.

        - If joining two tables on a remote node, try putting a

              'joining view' on that node and then selecting against the

              view.

              See [NOTE:41634.1] TKPROF and Problem Solving

                  [NOTE:10585.1] QUERY AND APPLICATION TUNING USING EXPLAIN 

                  AND TKPROF UTILITY

 

        - The following may reduce any network overhead that's contributing

          to the problem: setting ARRAYSIZE in sql operations, using

          anonymous pl/sql blocks rather than independent sql statements,

          using explicit rather than implicit cursors. Refer to additional

          tuning documentation.

 

   (g)  Operations using 3rd party application (Power Builder, etc).

 

        - Can you get the problem to reproduce in sqlplus?  (If you can

          do this test but can't reproduce the problem, you need to

          call the 3rd party vendor).

 

   (h)  Database shutdown/startup?

 

        - How was it shutdown? normal? immediate? abort? did it crash?

 

            - If startup hangs and the database went down abnormally (crash

              or shutdown abort)and there are no errors in the alert log, 

              this could be normal instance recovery, an internal matter 

              (if errors),system problems (see system logs), inadequate cleanup

              at shutdown (try shutdown abort / startup), unsupported

              software/ os combination or (less likely) data corruption.

              Wait 30 minutes then call support if it doesn't open.

 

              During normal instance recovery an example of the messages

              that will be written to the alert log are listed below:

                   Starting ORACLE instance (normal)

                   Starting up ORACLE RDBMS Version: 8.1.6.3.0.

                   System parameters with non-default values:

                      processes                = 50

                      ...

                   Crash recovery completed successfully

                   Tue May 22 16:15:26 2001

                   Thread 1 advanced to log sequence 7139

                   Thread 1 opened at log sequence 7139

                   Current log# 2 seq# 7139 mem# 0: /u02/oradata/V817/redo02.log

                   Successful open of redo thread 1.

                   Tue May 22 16:15:26 2001

                   SMON: enabling cache recovery

                   SMON: enabling tx recovery

                   Tue May 22 16:15:51 2001

                   Completed: alter database open

 

            - If shutdown normal hangs, this means Oracle is waiting

              for active sessions to log off.

 

            - If shutdown immediate hangs, this could be for

                  the same reasons as a hang at startup.

 

        - Be sure to check for errors in step 2. Tracing the process

          may also be useful:  see step 4.  V$session_wait may be

          useful:  see step 3.

 

        - Note: if using a system debugger to trace a hanging startup

          or shutdown operation, be sure to get the relevant svrmgr or 

          sqlplus "shadow" (child) pid.   On unix boxes:

 

                ps -ef | grep svrmgrl

                usupport 26615 19866    ...etc...       svrmgrl

                ps -ef | grep 26615

                usupport 52472 26615    ...etc...       oracleV732 (DESCR...

 

           Pid 52472 is what you want to trace.

 

 2. LOOK FOR ERRORS

 

   (a) Check the tail of the alert_<sid>.log (or portion of the alert

       corresponding to the hang).  Check for ANY messages other than

       log switches ("advanced to log sequence...")

 

       Note: to find the alert.log directory, check the

       background_dump_dest parameter in the init<sid>.ora or

       config.ora. Or do the following command.

 

       $ sqlplus "/as sysdba"

         show parameter dest

 

   (b) Check the same directory for trace files generated around the time

       of the hang.  Look in them for error messages. Don't search

       whole files; relevant messages appear towards the beginning.

       On unix ls -ltr will show you files that have been updated lately last.

 

   (c) If having remote access problems, check the sql*net trace directory

       for traces as well.

   (d) Check system error logs for messages. On many unix systems

        cd /var/adm

        ls -ltr

         Check for messages files recently updated.

 

 

   (e) Be prepared to send a portion of the alert<sid>.log at least 4 hours 

       prior to the problem, init<sid>.ora and any traces to support.

 

       For fastest service:

    - Move trace files to oracle-ftp 

   cd server/incoming

   mkdir <tar number>

   cd <tar number>

   place traces files, alert log and any other files here

   Or

         - each file should go into a separate email message

         - put the tar number and file description in the subject line

         - don't uuencode or attach the files if you don't have to

           (i.e. unless they're too big) - just read them into the

           message as ascii.

 

       If you can't comply with all of the above, just do your best.

 

       Faxing is also possible but may not be practical.

 

       At least, please have the information READY to communicate to

       support before you call.

 

 

 

3.  SPOOL OUT RELEVANT V$ VIEWS

 

    WHILE the database is hanging, run the following queries:

 

    SPOOL v_views.log;

 

    SELECT *

    FROM v$parameter;

 

    SELECT class, value, name

    FROM v$sysstat;

 

    SELECT sid, id1, id2, type, lmode, request

    FROM v$lock;

 

    SELECT l.latch#, n.name, h.pid, l.gets, l.misses,

       l.immediate_gets, l.immediate_misses, l.sleeps

    FROM v$latchname n, v$latchholder h, v$latch l

    WHERE l.latch# = n.latch#

    AND l.addr = h.laddr(+);

 

    SELECT *

    FROM v$session_wait

    ORDER BY sid;

 

    /* repeat last query 3 times - we want to see who's repeatedly waiting*/

 

    SPOOL OFF;

 

    If a specific session hangs do these queries:

 

 SPOOL session.log;

 

    - Get the sid from the pid or vice versa:

 

            SELECT s.sid, p.spid

            FROM v$session s, v$process p

            WHERE s.paddr = p.addr

            AND ... < p.spid = <os pid> or perhaps

                    s.sid = <sid from v$session> >

 

    - What SQL statement is being executed?

 

            SELECT s.sid, s.status, q.sql_text

            FROM v$session s, v$sqltext q

            WHERE s.sql_hash_value = q.hash_value

            AND s.sql_address = q.address

            AND s.sid = <sid>

            order by q.piece;

 

    -Get output from V$SESSION_WAIT

 

           column sid format 990

           column seq# format 99990

           column wait_time heading 'WTime' format 99990

           column event format a30

           column p1 format 9999999990

           column p2 format 9999999990

           column p3 format 9990 

           select sid,event,seq#,p1,p2,p3,wait_time from V$session_wait

           where sid=<SID>

           order by sid;

          

        *** It is needed to run this statement a few times to see if the wait events

            actually change

 

 

    - What is LOCKWAIT column set to in V$SESSION?

 

            SELECT lockwait

            FROM v$session

            WHERE sid = <sid>;

 

    - If not null, who's holding what lock on which object?

 

            col Username format A15

            col Sid format 9990 heading SID

            col Type format A4 

            col Lmode format 990 heading 'HELD'

            col Request format 990 heading 'REQ'

            col Id1 format 9999990

            col Id2 format 9999990 

 

            select SN.Username, M.Sid, M.Type,

            DECODE(M.Lmode, 0, 'None', 1, 'Null', 2, 'Row Share', 3, 'Row

            Excl.', 4, 'Share', 5, 'S/Row Excl.', 6, 'Exclusive',

            LTRIM(TO_CHAR(Lmode,'990'))) Lmode,

            DECODE(M.Request, 0, 'None', 1, 'Null', 2, 'Row Share', 3, 'Row

            Excl.', 4, 'Share', 5, 'S/Row Excl.', 6, 'Exclusive',

            LTRIM(TO_CHAR(M.Request, '990'))) Request,

            M.Id1, M.Id2 from V$SESSION SN, V$LOCK M

            WHERE (SN.Sid = M.Sid and M.Request ! = 0)

            or (SN.Sid = M.Sid and M.Request = 0 and Lmode != 4 and (id1, id2)

            in (select S.Id1, S.Id2 from V$LOCK S where Request != 0 and S.Id1

             = M.Id1 and S.Id2 = M.Id2) ) order by Id1, Id2, M.Request; 

 

        - What is LATCHWAIT column set to in V$PROCESS?

 

            SELECT latchwait

            FROM v$process

            WHERE spid = <pid>;

 

        - If not null, who's holding what latch?

 

            column name format a32 heading 'LATCH NAME'

            column pid heading 'HOLDER PID'

            select c.name,a.addr,a.gets,a.misses,a.sleeps,

            a.immediate_gets,a.immediate_misses,b.pid

            from v$latch a, v$latchholder b, v$latchname c

            where a.addr   = b.laddr(+) and a.latch# = c.latch#

            and c.name like '&latch_name%' order by a.latch#;

 

            

        - Did the lock holder or latch holder shut off his/her

          terminal without logging off?  This can leave a zombie shadow

          process holding the resource.  Kill it with "alter system

          kill session '<sid, serial# from v$session>'".  If on sqlnet 2.1

          or above, try setting sqlnet.expire_time in the sqlnet.ora

          on the server side.  (For more info, see page A-16 of

          "Understanding SQL*NET" v2.1).

 

        - If session is not frozen but slow, get session-specific

          statistics:

 

            SELECT s.sid, s.value, t.name

            FROM v$sesstat s, v$statname t

            WHERE s.statistic# = t.statistic#

            AND s.sid = <sid>;

 

        - If session is extremely slow or truly frozen, get session-

          specific wait info:

 

            SELECT *

            FROM v$session_wait

            where sid = <sid>;

 

            - repeat above query 3 times.

 

            SPOOL OFF;

 

- If this is distributed transaction run the following statements from EVERY 

  NODE involved in the transaction:

              

              SPOOL 2pc.log;

              SELECT * FROM dba_2pc_pending;

              SELECT * FROM pending_sessions$;

              SELECT * FROM pending_sub_sessions$;

              SELECT * FROM dba_2pc_neighbors;

              SPOOL OFF;

 

 The following spool will help determine the cause in MTS. 

     SPOOL mts.log

     - How busy the dispatchers are:

     

    select name,network,status,

      (busy /(busy + idle)) * 100 "% of time busy"   from v$dispatchers

 

-  query the shared servers view V$SHARED_SERVERS:

   select name,status,requests,  (busy /(busy + idle)) * 100 "% of time busy"

   from v$shared_servers

   Spool off

 

 4. GATHER OS-LEVEL INFO

 

    a. Briefly describe your architecture. Include number of

       CPUs / disks.  Are you using raw devices, NFS mounted

       files, shared disks, RAM disks...Are you mirroring?  etc.

 

    b. Measure unusual OS level activity: excessive CPU or i/o,

       paging, swapping, etc. There are other monitor tools such

       as TOP. Sun has utilities on their webpage that can be downloaded

       for monitoring the system.   

       If you're not comfortable with the

       utilities listed below, be sure to do step 3 and,

       if you have isolated a troubled process, the v$sesstat

       output in step.

 

    -  On unix: use SAR, VMSTAT and NETSTAT (man sar)

         - You can use truss, trace (man truss) or oradbx

           (call support) for tracing specific processes.

         - You can also unix debuggers such as dbx, xdb, adb or truss

           to dump process activity.  Example:

 

                dbx -a <shadow pid>  /*attach to hanging process*/

                where                /*dump Oracle stack */

                detach               /*detach from process*/

 

                 truss -o /tmp/mytruss -fae -p <pid of process>

 

 

    -  On vms: use MONITOR (help monitor)

        - you can use analyze/process (help analyze) or orambx

          (call support) for tracing specific processes.

 

    -  On novl: use LOAD MONITOR (gui - no online documentation)

 

    -  On Windows NT: use Performance Monitor, Event Monitor,

       Dr. Watson (see Resource kit for documentation).

         - use pview for monitoring specific processes

           (resource kit).

 

    c. Check the system log for additional information. On many platforms 

    it is at /var/adm the files are the message files.

 

 

 

5.  GET SYSTEMSTATE and HANGANALYZE DUMPS.

 

    This creates a large trace file in the user_dump_dest (30M or more is not 

    unusual).

    Note: the init<sid>.ora parameter MAX_DUMP_FILE_SIZE controls

    the maximum trace file size.  Using Oradebug and setting unlimit will 

    allow a complete dump which we will need.

 

    Do this step for sure if the entire database is frozen or nearly

    frozen and if this condition came on suddenly and there are no archive

    errors in the alert log. Please note:  As systemstate dumps are instance

    specific, they tend to be inconclusive with hanging problems involving

    Oracle Parallel Server (OPS) unless you get them from each node. You will 

    need 3 system state dumps from each node for OPS.

 

    NOTE: Do systemstate dump 3 times in a row, so we can determine whether

    the processes are hung or active.

  

    For Oracle 8.0.5.x to 8.1.7.x

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     $ svrmgrl

     svrmgr> connect internal

 

     svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME HANGANALYZE LEVEL 3';

             wait 90 seconds

     svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME HANGANALYZE LEVEL 3';

 

          EXIT ... then reconnect

 

     svrmgr>ALTER SESSION SET MAX_DUMP_FILE_SIZE=UNLIMITED;

     svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';

            wait 90 seconds

     svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';

            wait 90 seconds

     svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';

 

    For Oracle 9.2.0.1 or higher

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 

     $ sqlplus /nolog 

       connect / as sysdba                 

       oradebug setmypid  

       oradebug unlimit 

       oradebug dump systemstate 10 

 

 

6.  GET BSTAT-ESTAT OR (for 8.1.6 or higher) STATSPACK output

 

    If the entire database is suffering performance degradation .

    See [NOTE:62161.1]  System wide Tuning using UTLESTAT Reports in Oracle7/8

    See [NOTE:149113.1] Installing and Configuring StatsPack Package

    See [NOTE:149124.1] Creating a StatsPack performance report

 

    Send this report to support along with the other files requested.

 

7.  GENERATE A CORE DUMP AND RUN A DEBUGGER.

 

    This step is a last resort.  It is included to advise

    you of an option when others are exhausted (e.g. when

    systemstate dump hangs).  You should be in contact with

    support before proceeding. Do not send a core to Oracle unless requested.

 

 

                  Core:                Debugger:        Documentation:

 

    Unix:         kill -11             dbx, adb         man

    VMS:          N/A                  analyze/system   help

    Windows NT:   pview                Dr. Watson       resource kit

    Novl:         vdb3 or vdb4         vdb3 or vdb4     Appx of Ora7 Guide

                                                        for Netware

 

8.  If certain processes get PROCESSSTATE dump.

- Trace the process from the os level (see step 4).

 

        Get processstate dumps. Do 3 times. This generates a trace file in your 

        user_dump_dest

  (from sqlplus: show parameter user_dump_dest).

 

$ sqlplus "/as sysdba"

      oradebug setospid <process ID>

      oradebug unlimit

      oradebug dump processstate 10

        

        Get errorstacks from the process. Do 3 times. This generates a trace 

        file in your user_dump_dest

  (from sqlplus: show parameter user_dump_dest).

 

$ sqlplus "/as sysdba"

      oradebug setospid <process ID>

      oradebug unlimit

      oradebug dump errorstack 3

 

9.   Check List for diagnostics.

Go over the information you have gathered and then place it on Oracle External 

FTP site.

 

Please include all configuration information releases and options used.

Please include any changes you made to the system recently hardware or software.

Please include results from locking scripts if this may be a locking issue.

If you did an analyze please include results.

Did you run an analyze.log if so include it?

Did you include init parameters and config files?

Did you include the statements that are having the problem?

If this is a tool such as import, export, sqlloader did you include the exact 

statements used to run and the parameter files?

Did you answer the questions and include the answers?

Did you get the explain plan and tkprof output?

If you have a small reproducible test case did you include it?

n Statements that are being run.

n All table, index, view definitions.

n All constraints.  In short everything it would take for us to reproduce it.

Did you trace the processes with a debugger? If so did you include the trace 

from the debugger?

Have you included all trace files?

Have you included the alert<sid>.log?

If core dumps where generated did you translate them and send the stack? 

(Do not send core files.)

Have you sent the sql*net trace files?

Did you include v_views.log?

Have you included session.log;

Did you spool session-specific wait info:

If this is distributed transaction have you included 2PC.log?

If MTS have you included mts.log? 

Did you  gather os-level information on CPU, Memory and I/O?

Did you  GET SYSTEMSTATE DUMPS and palce these on ftp site?

Did you run BSTAT-ESTAT and include the report?

Did you include  PROCESSSTATE and ERRORSTACK dumps.

 

 

 

APPENDIX A: RECOMMENDED BULLETINS

   [NOTE:41634.1] TKPROF and Problem Solving

   [NOTE:1020282.6] TFTS FORMATTED SELECT OF EXPLAIN PLAN

 

1