Thursday, April 9, 2009

Read Consistency, "ORA-01555 snapshot too old" errors and the SCN_ASCENDING hint

Oracle's read consistency model uses a true multi-versioning approach that allows readers not to block writers and writers not to block readers. Obviously this great feature, which allows highly concurrent processing, doesn't come for free: the information needed to build multiple versions of the same data has to be stored somewhere.

Oracle uses the so-called undo information not only to roll back ongoing transactions but also to re-construct old versions of blocks when required. Very simplified: when reading data, Oracle knows the point in time (expressed as an internal counter called the SCN, System Change Number) that the data needs to be consistent with. In the default READ COMMITTED isolation level this point in time is defined when a statement starts to execute; you could say that at the moment a statement starts to run, its result is pre-ordained. When Oracle processes a block it checks whether the block is "old" enough. If it discovers that the block content is too new (it has been changed by other sessions, but the current access is not supposed to see that updated content according to the point in time assigned to the statement execution), it creates a copy of the block and uses the information available in the corresponding undo segment to re-construct an older version of the block. Note that this process can be iterative: if the re-constructed version of the block is still not sufficiently old, more undo information is applied to go further back in time.

Since the undo information of committed transactions is marked as re-usable, Oracle is free to overwrite the corresponding undo data under certain circumstances (e.g. no more free space left in the UNDO tablespace). If an older version of a block now needs to be created but the undo information required to do so has been overwritten, the infamous "ORA-01555 snapshot too old" error is raised, since the required read-consistent view of the data can no longer be generated.
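To make this mechanism more tangible, here is a minimal Python sketch of the idea (an illustration under my own simplifications, not Oracle's actual implementation): each block carries the SCN of its last change, committed changes leave records in a bounded undo buffer, and a reader walks the undo chain backwards to reconstruct the version as of its snapshot SCN, raising a "snapshot too old" error once the required undo has been overwritten.

```python
# Simplified model of undo-based read consistency (an illustration only,
# not Oracle's actual implementation).

class SnapshotTooOld(Exception):
    """Analogue of ORA-01555: the required undo has been overwritten."""

class Block:
    def __init__(self, value, scn):
        self.value = value
        self.scn = scn                 # SCN of the last change to this block

class UndoBuffer:
    """Bounded buffer: the oldest undo records get overwritten first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []              # list of (scn_before, old_value)

    def add(self, scn_before, old_value):
        self.records.append((scn_before, old_value))
        if len(self.records) > self.capacity:
            self.records.pop(0)        # overwrite the oldest undo record

def read_consistent(block, undo, snapshot_scn):
    """Walk the undo chain backwards until the version is old enough."""
    value, scn = block.value, block.scn
    for scn_before, old_value in reversed(undo.records):
        if scn <= snapshot_scn:
            return value
        value, scn = old_value, scn_before
    if scn <= snapshot_scn:
        return value
    raise SnapshotTooOld()

undo = UndoBuffer(capacity=2)
blk = Block(value="v1", scn=10)

snapshot_scn = 15                      # our query's point in time
undo.add(10, "v1"); blk.value, blk.scn = "v2", 20   # another session commits
assert read_consistent(blk, undo, snapshot_scn) == "v1"  # still reconstructable

undo.add(20, "v2"); blk.value, blk.scn = "v3", 30
undo.add(30, "v3"); blk.value, blk.scn = "v4", 40   # undo for SCN 10 overwritten
try:
    read_consistent(blk, undo, snapshot_scn)
except SnapshotTooOld:
    print("ORA-01555 analogue: snapshot too old")
```

The bounded buffer plays the role of the small UNDO tablespace: as long as enough history survives, the old version can be re-constructed; once it's gone, the reader has no consistent version to return.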

In order to avoid this error, from 10g on you only need a sufficiently large UNDO tablespace in automatic undo management mode, so that the undo information required to create old versions of the blocks doesn't get overwritten prematurely. In 9i you need to set the UNDO_RETENTION parameter according to the longest expected runtime of your queries, and of course have sufficient space in the UNDO tablespace to allow Oracle to adhere to this setting.

So until now Oracle would either provide a consistent view of the data according to its read-consistency model, or raise an error if the required undo data was no longer available.

Enter the SCN_ASCENDING hint: As already mentioned by Martin Berger and Chandra Pabba, Oracle officially documented the SCN_ASCENDING hint for Oracle 11.1.0.7 in Metalink Note 6688108.8 (Enhancement: Allow ORA-1555 to be ignored during table scan).

I've run some tests using this hint on 9.2.0.8, 10.2.0.4 and 11.1.0.7.

In order to increase the probability of running into the dreaded ORA-01555 error you should perform the following preparation steps (note this applies to all examples provided here):


-- create a small undo tablespace
create undo tablespace undo_small datafile '' size 2M;

-- activate small UNDO tablespace
alter system set undo_tablespace = 'UNDO_SMALL' scope = memory;

-- small cache so that old copies of the blocks won't survive in the buffer cache
-- and delayed block cleanout probability increases
alter system set db_cache_size = 2M scope = memory;


Note that all examples here use DBMS_JOB to simulate the simultaneous modification and reading of data, therefore you need to have the JOB_QUEUE_PROCESSES parameter set accordingly, otherwise the job won't get executed.

I've started with a variation of Tom Kyte's example of how to deliberately force an ORA-01555 error, which looks like this:


drop table t purge;

create table t
as
select
a.*, 1 as my_id
from
all_objects a
order by
dbms_random.random;

alter table t add constraint t_pk primary key (object_id);

exec dbms_stats.gather_table_stats(null, 'T', cascade=>true)

set serveroutput on timing on

alter session set nls_language = 'AMERICAN';

spool tomkyte_ora1555_error_demo_modified2.log

declare
cursor c
is
select
*
from
(
select /*+ first_rows */
my_id
from
t
order by
object_id
)
where rownum <= 3000;

l_my_id t.my_id%type;
l_rowcnt number := 0;
l_job_id binary_integer;

function submit_job(what in varchar2)
return binary_integer
is
pragma autonomous_transaction;
job_id binary_integer;
begin
dbms_job.submit(job_id, what);
commit;
return job_id;
end;
begin
select
distinct
my_id
into
l_my_id
from
t;

dbms_output.put_line('The MY_ID as of now: ' || l_my_id);

-- result of this query
-- is as of opening the cursor
-- so it needs to return the same MY_ID
-- for all rows as above query demonstrates
open c;

-- now start to wreck the undo
l_job_id := submit_job('
begin
for x in (
select
rowid as rid
from
t
where
rownum <= 10000
) loop
update
t
set
object_name = reverse(object_name),
my_id = 1 - my_id
where
rowid = x.rid;
commit;
end loop;
end;
');

-- start fetching from result set
loop
fetch c into l_my_id;
exit when c%notfound;
l_rowcnt := l_rowcnt + 1;
dbms_output.put_line('Row: ' || l_rowcnt || ' ID: ' || l_my_id);
dbms_lock.sleep(0.01);
end loop;
close c;
exception
when others then
dbms_output.put_line('Rows fetched: ' || l_rowcnt);
raise;
end;
/

spool off


What this code snippet basically does is the following:

1. It creates a table copy of the ALL_OBJECTS view ordered randomly, and adds a primary key index on the OBJECT_ID

2. It issues a query that uses the FIRST_ROWS hint to force an index access to the table via the available primary key index and the corresponding ORDER BY. It's one of the built-in heuristic rules of the (deprecated) FIRST_ROWS cost-based optimizer mode that an ORDER BY will use an index if possible to avoid a sort operation. This inefficient approach ensures that each block of the table will be accessed multiple times due to the random row access driven by the ordered index.

3. It then spawns a job simulating a separate session that starts to overwrite, row by row, the data the query is supposed to read. Specifically, the MY_ID column, which has been generated with value 1, will be set to 0. Because each single-row update is committed, the small undo tablespace eventually fills up and old undo data can be, and needs to be, overwritten due to insufficient space.

4. While the update loop is running, the data from the query is slowly fetched. Because each block is visited many times by the index access, it's almost guaranteed that the undo information required to re-construct the old version of a block has been overwritten (due to the artificially small undo tablespace), and therefore the ORA-01555 error will occur.

And sure enough, when running this in 11.1.0.7 with the pre-requisites met, the output looks similar to the following. Note that the first line shows what we expect the cursor query to return: only one distinct value, namely 1.


The MY_ID as of now: 1
Row: 1 ID: 1
Row: 2 ID: 1
Row: 3 ID: 1
Row: 4 ID: 1
Row: 5 ID: 1
Row: 6 ID: 1
Row: 7 ID: 1
Row: 8 ID: 1
Row: 9 ID: 1
Row: 10 ID: 1
.
.
.
Row: 1768 ID: 1
Row: 1769 ID: 1
Row: 1770 ID: 1
Row: 1771 ID: 1
Row: 1772 ID: 1
Row: 1773 ID: 1
Rows fetched: 1773
declare
*
ERROR at line 1:
ORA-01555: snapshot too old: rollback segment number 15 with name "_SYSSMU15$"
too small
ORA-06512: at line 83


Elapsed: 00:00:52.21


So you can see that the expected error occurred.

Now I've modified the example to use the SCN_ASCENDING hint for the query that fails:


.
.
.
declare
cursor c
is
select /*+ scn_ascending */
*
from
(
select /*+ first_rows */
my_id
from
t
order by
object_id
)
where rownum <= 3000;
.
.
.


Re-running the test case shows that you still get the same error, and obviously the hint doesn't help to avoid the error in this case.

Now if you read the Metalink note subject again, you might notice that it says: "Allow ORA-1555 to be ignored during table scan". Since our example doesn't use a full table scan but a table access by ROWID, presumably the hint simply doesn't work as expected here.

Let's modify our test case a little bit to use a full table scan instead of the index access path:


drop table t purge;

create table t
as
select
a.*, 1 as my_id
from
all_objects a
order by
dbms_random.random;

alter table t add constraint t_pk primary key (object_id);

exec dbms_stats.gather_table_stats(null, 'T', cascade=>true)

set serveroutput on timing on

alter session set nls_language = 'AMERICAN';

spool tomkyte_ora1555_error_demo_modified3_no_index_usage.log

declare
cursor c
is
select
*
from
(
select /*+ all_rows */
my_id
from
t
--order by
-- object_id

)
where rownum <= 5000;

l_my_id t.my_id%type;
l_rowcnt number := 0;
l_job_id binary_integer;

function submit_job(what in varchar2)
return binary_integer
is
pragma autonomous_transaction;
job_id binary_integer;
begin
dbms_job.submit(job_id, what);
commit;
return job_id;
end;
begin
select
distinct
my_id
into
l_my_id
from
t;

dbms_output.put_line('The MY_ID as of now: ' || l_my_id);

-- result of this query
-- is as of opening the cursor
-- so it needs to return the same MY_ID
-- for all rows as above query demonstrates
open c;

-- now start to wreck the undo
l_job_id := submit_job('
begin
for x in (
select
rowid as rid
from
t
where
rownum <= 10000
) loop
update
t
set
object_name = reverse(object_name),
my_id = 1 - my_id
where
rowid = x.rid;
commit;
end loop;
end;
');

-- start fetching from result set
loop
fetch c into l_my_id;
exit when c%notfound;
l_rowcnt := l_rowcnt + 1;
dbms_output.put_line('Row: ' || l_rowcnt || ' ID: ' || l_my_id);
dbms_lock.sleep(0.01);
end loop;
close c;
exception
when others then
dbms_output.put_line('Rows fetched: ' || l_rowcnt);
raise;
end;
/

spool off


So now we're simply reading the table row by row without the index, which results in a full table scan operation. Let's check the result without the SCN_ASCENDING hint:


The MY_ID as of now: 1
Row: 1 ID: 1
Row: 2 ID: 1
Row: 3 ID: 1
Row: 4 ID: 1
Row: 5 ID: 1
Row: 6 ID: 1
Row: 7 ID: 1
Row: 8 ID: 1
Row: 9 ID: 1
Row: 10 ID: 1
.
.
.
Row: 4662 ID: 1
Row: 4663 ID: 1
Row: 4664 ID: 1
Row: 4665 ID: 1
Row: 4666 ID: 1
Row: 4667 ID: 1
Row: 4668 ID: 1
Row: 4669 ID: 1
Rows fetched: 4669
declare
*
ERROR at line 1:
ORA-01555: snapshot too old: rollback segment number 12 with name "_SYSSMU12$"
too small
ORA-06512: at line 83


Elapsed: 00:01:03.41


OK, great. It takes a bit longer, and maybe you need to increase the ROWNUM limits accordingly to encounter the error, but it's still reproducible.

Let's try again with the hint:


.
.
.
declare
cursor c
is
select /*+ scn_ascending */
*
from
(
select /*+ all_rows */
my_id
from
t
--order by
-- object_id
)
where rownum <= 5000;
.
.
.


And here's the (scary) result:


The MY_ID as of now: 1
Row: 1 ID: 1
Row: 2 ID: 1
Row: 3 ID: 1
Row: 4 ID: 1
Row: 5 ID: 1
Row: 6 ID: 1
Row: 7 ID: 1
Row: 8 ID: 1
Row: 9 ID: 1
Row: 10 ID: 1
.
.
.
Row: 530 ID: 1
Row: 531 ID: 0
Row: 532 ID: 1
Row: 533 ID: 1
Row: 534 ID: 0
Row: 535 ID: 1
Row: 536 ID: 1
Row: 537 ID: 1
Row: 538 ID: 1
Row: 539 ID: 1
Row: 540 ID: 1
Row: 541 ID: 1
Row: 542 ID: 1
Row: 543 ID: 1
Row: 544 ID: 0
Row: 545 ID: 1
Row: 546 ID: 1
.
.
.
Row: 4973 ID: 1
Row: 4974 ID: 0
Row: 4975 ID: 1
Row: 4976 ID: 1
Row: 4977 ID: 1
Row: 4978 ID: 1
Row: 4979 ID: 1
Row: 4980 ID: 1
Row: 4981 ID: 1
Row: 4982 ID: 1
Row: 4983 ID: 1
Row: 4984 ID: 1
Row: 4985 ID: 1
Row: 4986 ID: 1
Row: 4987 ID: 1
Row: 4988 ID: 1
Row: 4989 ID: 1
Row: 4990 ID: 1
Row: 4991 ID: 1
Row: 4992 ID: 1
Row: 4993 ID: 1
Row: 4994 ID: 1
Row: 4995 ID: 1
Row: 4996 ID: 1
Row: 4997 ID: 1
Row: 4998 ID: 1
Row: 4999 ID: 1
Row: 5000 ID: 1

PL/SQL procedure successfully completed.

Elapsed: 00:00:54.26


It can be clearly seen that those 0s returned by the query shouldn't be there according to the first line of the output, so this scary feature seems to have worked in this case.
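My reading of the observed behaviour, expressed as a Python sketch (based purely on these test results, not on documented internals), is that with SCN_ASCENDING Oracle, instead of raising ORA-01555 when the undo for a block is gone, moves the query's snapshot SCN forward and keeps scanning, which is exactly what produces the mix of old and new values:

```python
# Sketch of what SCN_ASCENDING appears to do, based purely on the observed
# behaviour above (not on documented internals): instead of raising the
# "snapshot too old" error when the undo needed for a block is gone, the
# reader moves its snapshot SCN forward and keeps scanning, so some rows
# reflect the old state and some the new one.

def scan(rows, undo_available, snapshot_scn):
    """rows: list of (current_value, old_value, change_scn).
    undo_available: set of change SCNs whose undo still exists."""
    result = []
    for current, old, change_scn in rows:
        if change_scn <= snapshot_scn:
            result.append(current)    # block already old enough
        elif change_scn in undo_available:
            result.append(old)        # reconstruct old version from undo
        else:
            snapshot_scn = change_scn # undo gone: bump the snapshot SCN
            result.append(current)    # and accept the newer version
    return result

rows = [("new", "old", 20), ("new", "old", 30), ("new", "old", 40)]
# Undo for the change at SCN 30 was overwritten; a plain query would
# raise ORA-01555 here, but the "ascending SCN" scan returns a mix:
print(scan(rows, undo_available={20, 40}, snapshot_scn=10))
```

Running this prints ['old', 'new', 'old']: the middle row shows the committed new value even though the query's original snapshot should have returned the old one for every row.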

Interestingly you get the same result and behaviour when running the test case against 10.2.0.4, so although the hint is not documented for that version it seems to work there, too.

I couldn't reproduce this on 9.2.0.8, so obviously it wasn't backported there.

Here's another, slightly more complex but even more impressive test case, which basically does the same but introduces some further concepts. Note that it might require the following grants, issued as user SYS, to the user executing the test case. The grant on DBMS_LOCK is actually also required for Tom Kyte's demonstration code above:


grant execute on sys.dbms_pipe to cbo_test;

grant execute on sys.dbms_lock to cbo_test;


Here is the code:


drop table scn_ascending_demo purge;

create table scn_ascending_demo
as
select
1 as col1
, rpad('x', 100, 'x') as filler
from dual
connect by level <= 2000;

create index scn_ascending_demo_idx1 on scn_ascending_demo(filler);

drop table scn_ascending_demo_wreck_undo purge;

create table scn_ascending_demo_wreck_undo
as
select * from all_objects
where rownum <= 10000;

create or replace function slow_fetch (the_cursor sys_refcursor)
return sys.ku$_objnumset
-- use this in 9i
-- return mdsys.sdo_numtab authid current_user
pipelined
is
n_num number;
begin
loop
fetch the_cursor into n_num;
if the_cursor%notfound then
close the_cursor;
exit;
end if;
pipe row(n_num);
dbms_lock.sleep(0.01);
end loop;
return;
end slow_fetch;
/

set serveroutput on timing on

alter session set nls_language = 'AMERICAN';

spool scn_ascending_demo.log

declare
job_id binary_integer;
msg_buffer varchar2(2000);
pipe_id integer;
pipe_name constant varchar2(20) := 'scn_ascending_demo';
pipe_status integer;
c sys_refcursor;
l sys_refcursor;
n_result number;
n_row number;
function local_submit_job(what in varchar2)
return binary_integer
is
pragma autonomous_transaction;
job_id binary_integer;
begin
dbms_job.submit(job_id, what);
commit;
return job_id;
end;
begin
pipe_id := dbms_pipe.create_pipe(pipe_name);
open c for
select
col1
from
scn_ascending_demo;

job_id := local_submit_job('
declare
n_status integer;
begin
update
scn_ascending_demo
set
col1 = 1 - col1
, filler = rpad(''y'', 100, ''y'')
;
commit;
dbms_pipe.pack_message(''DONE'');
n_status := dbms_pipe.send_message(''' || pipe_name || ''');
exception
when others then
dbms_pipe.pack_message(''ERROR: '' || sqlerrm);
n_status := dbms_pipe.send_message(''' || pipe_name || ''');
end;
');

pipe_status := dbms_pipe.receive_message(pipe_name);
dbms_pipe.unpack_message(msg_buffer);
if msg_buffer != 'DONE' then
raise_application_error(-20001, 'Error in updating scn_ascending_demo: ' || msg_buffer);
end if;

job_id := local_submit_job('
declare
n_status integer;
snapshot_too_old exception;
pragma exception_init(snapshot_too_old, -1555);
no_space_left exception;
pragma exception_init(no_space_left, -30036);
begin
loop
begin
update
scn_ascending_demo_wreck_undo
set
owner = dbms_random.string(''a'', 30)
, object_name = dbms_random.string(''a'', 30)
, subobject_name = dbms_random.string(''a'', 30)
, object_type = dbms_random.string(''a'', 18)
;
commit;
exception
when snapshot_too_old or no_space_left then
commit;
end;
n_status := dbms_pipe.receive_message(''' || pipe_name || ''', 0);
exit when n_status != 1;
dbms_lock.sleep(0.5);
end loop;
commit;
dbms_pipe.pack_message(''DONE'');
n_status := dbms_pipe.send_message(''' || pipe_name || ''');
end;
');

begin
open l for
select
rownum as r_no
, value(d) as result
from
table(slow_fetch(c)) d;
loop
fetch l into n_row, n_result;
exit when l%notfound;
dbms_output.put_line('Row ' || n_row || ':' || n_result);
end loop;
close l;
/*
n_row := 0;
loop
fetch c into n_result;
exit when c%notfound;
n_row := n_row + 1;
dbms_output.put_line('Row ' || n_row || ':' || n_result);
dbms_lock.sleep(0.01);
end loop;
close c;
*/
exception
when others then
dbms_output.put_line('Error: ' || sqlerrm);
end;

dbms_pipe.pack_message('DONE');
pipe_status := dbms_pipe.send_message(pipe_name);
dbms_lock.sleep(5);
pipe_status := dbms_pipe.receive_message(pipe_name, 5);
pipe_id := dbms_pipe.remove_pipe(pipe_name);
declare
job_does_not_exist exception;
pragma exception_init(job_does_not_exist, -23421);
begin
dbms_job.remove(job_id);
commit;
exception
when job_does_not_exist then
dbms_output.put_line('Job: ' || job_id || ' does not exist any longer.');
end;
exception
when others then
pipe_id := dbms_pipe.remove_pipe(pipe_name);
declare
job_does_not_exist exception;
pragma exception_init(job_does_not_exist, -23421);
begin
dbms_job.remove(job_id);
commit;
exception
when job_does_not_exist then
dbms_output.put_line('Job: ' || job_id || ' does not exist any longer.');
end;
raise;
end;
/

spool off


This code does the following:

1. Creates two tables, one that will be modified and read, and another one whose sole purpose is to ensure that the undo will be overwritten

2. Uses a pipelined table function to fetch data slowly from a passed ref cursor. Note that this is purely optional and only serves to demonstrate a pipelined table function; as the commented-out part shows, fetching directly from the initial cursor achieves the same result.

3. Uses DBMS_PIPE to perform very rudimentary synchronisation between the spawned jobs and the main session.

4. The basic principle is similar to, but somewhat different from, the previous test case:
- We open a cursor. At that moment the result is pre-ordained.
- Then we spawn a separate job that modifies the complete table that the query is based on.
- Once this is successfully done we spawn another job that attempts to fill up and overwrite our small undo tablespace.
- While this job is running we start to fetch from the initially opened cursor.
- As soon as fetching completes, whether successfully or with an error, we tell the job to stop the update operation and finally clean up if the job for whatever reason is still running (which should not happen).

Here is the result from 11.1.0.7 (and 10.2.0.4 which behaves the same) without the SCN_ASCENDING hint:


Row 1:1
Row 2:1
Row 3:1
Row 4:1
Row 5:1
Row 6:1
Row 7:1
Row 8:1
Row 9:1
Row 10:1
.
.
.
Row 584:1
Row 585:1
Row 586:1
Row 587:1
Row 588:1
Row 589:1
Row 590:1
Row 591:1
Row 592:1
Row 593:1
Row 594:1
Error: ORA-01555: snapshot too old: rollback segment number 11 with name
"_SYSSMU11_1238392578$" too small
Job: 170 does not exist any longer.

PL/SQL procedure successfully completed.

Elapsed: 00:00:17.06


So we get the expected error. Now let's try with the SCN_ASCENDING hint:


.
.
.
open c for
select /*+ scn_ascending */
col1
from
scn_ascending_demo;
.
.
.


Here's the (even more obvious) result:


Row 1:1
Row 2:1
Row 3:1
Row 4:1
Row 5:1
Row 6:1
Row 7:1
Row 8:1
Row 9:1
Row 10:1
.
.
.
Row 456:1
Row 457:1
Row 458:1
Row 459:1
Row 460:1
Row 461:1
Row 462:1
Row 463:0
Row 464:0
Row 465:0
Row 466:0
Row 467:0
Row 468:0
Row 469:0
Row 470:0
.
.
.
Row 1993:0
Row 1994:0
Row 1995:0
Row 1996:0
Row 1997:0
Row 1998:0
Row 1999:0
Row 2000:0
Job: 181 does not exist any longer.

PL/SQL procedure successfully completed.

Elapsed: 00:10:38.18


So again the hint has worked and we can see the inconsistent reads that should have been ORA-01555 errors.

What if we change this test case slightly so that an index access is used?


.
.
.
open c for
select /*+ scn_ascending first_rows */
col1
from
scn_ascending_demo
order by
filler;

.
.
.


In this case again I couldn't prevent the ORA-01555 error, so this seems to corroborate the theory that only full table scans are able to use the SCN_ASCENDING request successfully.

So in summary I have to say that this feature seems quite questionable, maybe even buggy, and even when it works it looks quite scary, given the otherwise very robust multi-versioning capabilities of Oracle, which represent one of the cornerstones of its fundamental architecture.

I haven't checked yet whether the hint also modifies the behaviour of DML statements, but since these already employ "write" consistency, as Tom Kyte calls it, it's quite unlikely that the SCN_ASCENDING hint is applicable. Write consistency means that a DML statement (or SELECT FOR UPDATE) that discovers during processing that the data it accesses has been modified in the meantime by others will "restart": any changes already applied are rolled back, and the statement starts again from scratch based on the latest data. Note that this restart can happen multiple times, and yes, the amount of undo and redo generated increases if it happens, although Oracle seems to roll back the changes only once and from then on switch to a SELECT FOR UPDATE mode first. This is a bit similar to the SCN_ASCENDING behaviour, but the crucial difference is that the DML statement is able to restart its work, whereas a query might already have fetched numerous rows that have been processed by the client. So whereas the DML statement is still consistent because it starts all over again, the query results are potentially inconsistent, since no restart is possible once the client has processed part of the result set.

As a side note: The "restart" effect of the "write" consistency can actually lead to triggers being fired multiple times for the same row and is one of the reasons why you should never perform non-transactional operations (typically sending an email) from within a trigger. The non-transactional operation cannot be rolled back and therefore will be potentially repeated, e.g. sending out emails to the same recipient multiple times. One possible solution to this problem is to encapsulate the non-transactional operation into something that is transactional, e.g. a job submitted via DBMS_JOB, because DBMS_JOB is transactional and the job creation will be rolled back as part of the DML restart.

The "restart" behaviour of the SELECT FOR UPDATE statement is somewhat documented in the Advanced Application Developer's Guide.

12 comments:

Dion_Cho said...

Randolf.

It's an amazing and weird new feature. :)

As far as I know, this is the first (though not necessarily the last) exception to Oracle's proud read consistency model.

Madhu said...

Thanks for the breaking news. This is a great feature in our situation, except for the limitation of full table access.
Our system has queries that run 4 to 6hrs on tables (~200million recs) that change constantly, and sometimes they run like 12hrs. We don't care about consistency at all (it's not a financial system!) so this is a valuable "feature" for us, as most of the query time is wasted on constructing read-consistent blocks.

Randolf said...

Madhu,

are you sure that the queries you've mentioned are efficient? 200 million rows doesn't sound like a volume that should require multiple hours to process, although of course I don't have a clue what your queries are performing, what else is running simultaneously and what hardware you're using, and therefore your mileage might vary.

If you could speed up the queries you might have less trouble, but as you say, maybe the main reason for the queries being slow is that they're permanently hunting down the undo data to provide a read-consistent view of the data.

If this is the case you might want to reconsider your current approach how the data is loaded into the tables.

By the way, I don't think that simply using the SCN_ASCENDING hint would make your queries much faster, because I assume that Oracle still tries its best to re-construct the blocks as required and only breaks the consistency if the undo data is not available.

So in your particular case to benefit from the SCN_ASCENDING hint you might need to use an undo tablespace that is as small as possible to support the transactions, but overwrites the undo data as soon as possible so that Oracle can't re-construct the blocks anymore.

It sounds however like an odd approach I have to admit.

Best Regards,
Randolf

Daniel Rey said...

Randolf,

Thanks a lot for this interesting information.

In my opinion, not allowing the combination of index usage and the SCN_ASCENDING hint makes perfect sense. Otherwise the same row could get read multiple times.

Let's say we run an index range scan. Row A gets read; its value in the index is 2. Now an update runs on row A, in parallel to our query. It sets its value to 100 and commits. Meanwhile we scanned up to value 99. Now we will re-read row A with value 100. Of course the opposite can happen too.

This can not happen when using a full table scan.

Best Regards,
Daniel

Madhu said...

Randolf,

Thanks for pointing out, I think you are right - Oracle will ignore the snapshot too old and proceed further instead of erroring. So I guess, I have less to be excited about :-(

Part of the long query problems that we have are due to the way the tables are designed, and some are due to queries hitting complex/convoluted plsql code (packages procedures returning collections, for example). But these are long term fixes, that I don't have control on.

-Madhu

Hemant K Chitale said...

Do we know which Oracle customer raised the Enhancement Request and how it was justified ?
Letting the Hint loose in the public domain means giving licence to developers to write code that might break the much revered Read Consistency model, doesn't it ?

Randolf said...

@Madhu:

I definitely would think more than twice before using this scary and questionable hint, even if you say "we don't care about consistency at all" which sounds to me a bit odd (it sounds a bit like "we don't care about our content at all").

Using complex PL/SQL functions on such a data volume is for sure not a good idea in terms of performance. You probably could benefit most from replacing the PL/SQL logic with a similar plain SQL solution which is usually much, much faster.

@Daniel:
I think that it's for sure possible to read the same row twice even when using a full table scan and the SCN_ASCENDING hint. Consider the following testcase, which is a modified version of the first one from the post:

-- code start

drop table t purge;

drop table t_2 purge;

create table t
as
select
a.*, 1 as my_id
from
all_objects a
order by
dbms_random.random;

alter table t add constraint t_pk primary key (object_id);

exec dbms_stats.gather_table_stats(null, 'T', cascade=>true)

create table t_2
as
select * from t
where 1 = 2;

create or replace function wait_sec(
in_row_identifier in number
, in_wait_sec in number default 1
, in_dbms_output in boolean default false
) return number
as
begin
if in_dbms_output then
dbms_output.put_line('Row: ' || in_row_identifier);
end if;
dbms_lock.sleep(in_wait_sec);
return in_row_identifier;
end;
/

-- session 1
declare
snapshot_too_old exception;
pragma exception_init(snapshot_too_old, -1555);
no_space_left exception;
pragma exception_init(no_space_left, -30036);
begin
for i in 1..100 loop
begin
execute immediate 'truncate table t_2';
insert /*+ append */ into t_2
select
*
from
t
where
rownum <= 2000;
commit;
delete
from
t
where
object_id in (
select
object_id
from
t_2
);
commit;
insert into t
select
a.*
from
t_2 a
order by
dbms_random.random;
commit;
exception
when snapshot_too_old or no_space_left then
commit;
end;
dbms_lock.sleep(0.5);
end loop;
end;
/

-- session 2
select /*+ scn_ascending */
count(*) as cnt
, object_id
, max(substr(object_name, 1, 5)) as dummy1
, count(distinct fetch_slow) as dummy2
from (
select
object_name
, object_id
, wait_sec(object_id, 0.01) as fetch_slow
from
t
)
group by object_id
having count(*) > 1;

-- code end

The "session 1" replaces the table content in chunks (due to the limited undo space in the "UNDO_SMALL" tablespace available), and "session 2" slows down the processing artificially using a simple PL/SQL function while looking for duplicates of the primary key. Guess what it's going to return? It ought to be no rows, but it does return multiple duplicates.

But I think your point is heading in the right direction:
Consider an index access traversing the index root block -> branch blocks -> leaf blocks. It will probably be a mess if some of these index blocks are consistent and others are not.

If an index is used to filter data and the table block where the index entry points to is inconsistent with the index leaf block, wrong data might be returned.

Likewise, consider an execution plan that depends on the sort order of an index (no SORT ORDER BY operation, e.g. a FIRST_ROWS_n plan using the index to return data in a guaranteed order).

Now if some of the index leaf blocks are inconsistent, or the table blocks that the index points to are inconsistent then this potentially returns the data in the wrong order (or wrong data anyway), and therefore would lead to a wrong result set.

So, yes, there seem to be good reasons why this can't be used with indexes.

This basically begs for more tests, e.g. what happens if an IOT is involved, maybe with an OVERFLOW segment? An index fast full scan might be supported, but what happens with the overflow lookup? Hopefully IOTs are not supported at all, but I haven't tested that yet.

@Hemant:
I don't have a clue who initially requested this "enhancement", but as I've mentioned in the blog post I agree that it is a scary feature and I can't think of a reasonable justification why it should be used.

Do you think I should make this point clearer in the post?

It would be interesting to see which point release first knows about it. In 9.2.0.8 "SCN_ASCENDING" is not mentioned in V$RESERVED_WORDS, but it apparently is in 10.2.0.4. I don't have a pre-10.2.0.4 point release at hand at present to check earlier 10g releases for the keyword.

I was thinking more than twice if I should publish the post at all, but it has already been reported multiple times and it is officially documented on MetaLink, so I thought it's better to inform than to keep quiet about it.

But I agree that probably there is the possibility of mis-use, simply because some people don't understand the reason for the "snapshot too old" error and are happy to have found a way to avoid it...

Best Regards,
Randolf

Mark Freeman said...

Rudolph,

I want to be sure I understand the concept in your article as it applies to a real-world situation.

I have to run a one-time validation script on a table of mailing addresses. There is a cursor that does a full table scan on this table. The cursor is accessed via a FOR loop. For each record, the script does various look-ups using other tables and based upon what it finds it will either update the address record or insert a record into an exception table. The script does a COMMIT after every 20,000 iterations of the loop. This script takes 12 to 24 hours to run.

While the script is running, application users are free to update addresses in the table. We do not want to block them from making these updates, nor do we want to ignore these updates.

When this script processes each record, it needs to see the most recently committed version of that record, regardless of what that record looked like when the cursor was first opened. In other words, if a user updated an address after the script started but before the loop gets to that record, the new version of the record is what should be processed by the script.

Would use of the /*+ scn_ascending */ hint help me achieve the goal?
Would I be better off to COMMIT after processing each record?
Would doing a "SELECT ... INTO ... FROM address WHERE pk = cursor_rec.pk FOR UPDATE" for each record before processing the record help (or would it still see the original version of the record)?

Randolf said...

Mark,

I'm not going to raise here the question why it takes such a long time to process the data.

With your approach you're basically facing different issues:

- The long-running query might run into ORA-01555 errors if the undo required to re-construct the read-consistent version of the blocks accessed gets overwritten. The SCN_ASCENDING hint might help here, yes. Since, as you say, you're not interested in a strictly read-consistent view of the data, this might be an option to avoid the error.

Another option might be to switch your UNDO tablespace to RETENTION GUARANTEE while performing this batch run, which potentially requires a huge UNDO tablespace and carries the risk of failing DML activity in case of UNDO space shortage.
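For reference, switching the undo tablespace to guaranteed retention is a one-liner (UNDOTBS1 is a placeholder for your actual undo tablespace name):

```sql
-- requires appropriate privileges; UNDOTBS1 is a placeholder name
ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;
-- and back to the default after the batch run:
ALTER TABLESPACE undotbs1 RETENTION NOGUARANTEE;
```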

Another potential option would require you to be able to identify rows that have already been reconciled - in that case you could process your data in smaller chunks, lowering the risk of running into ORA-01555 errors, but with the potential downside of increased run time and I/O if you don't find an efficient way of reading the chunks / identifying already reconciled data.
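A rough sketch of such a chunked variant, assuming the job can maintain a PROCESSED flag column (all names hypothetical):

```sql
-- each chunk opens a fresh cursor, so the read-consistent snapshot
-- only needs to cover one chunk instead of the whole 12-24 hour run
DECLARE
  l_found BOOLEAN;
BEGIN
  LOOP
    l_found := FALSE;
    FOR rec IN (SELECT pk
                  FROM address
                 WHERE processed = 'N'
                   AND ROWNUM <= 20000) LOOP
      l_found := TRUE;
      -- ... reconcile the row, then mark it as done ...
      UPDATE address SET processed = 'Y' WHERE pk = rec.pk;
    END LOOP;
    COMMIT;
    EXIT WHEN NOT l_found;
  END LOOP;
END;
/
```

Note that without an efficient way to find the unprocessed rows (e.g. an index on the flag) each chunk re-scans the table, which is exactly the increased-I/O caveat mentioned above.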

So much for the CURSOR FOR loop you use to drive your batch.

- Regarding what you're trying to achieve in updating only the latest version of the record: what has been described in this post doesn't help you in that matter.

Usually you should attempt an "optimistic" locking approach. That means you either use a timestamp column (or, from 10g on, the ORA_ROWSCN pseudo-column - but that requires the table to have been created with ROWDEPENDENCIES enabled, which you probably don't have) or fetch all fields, and when attempting to update a particular row you compare that timestamp or all fields against the current row content, something like:

UPDATE address SET ...
WHERE pk = fetched_id
AND ora_rowscn = fetched_ora_rowscn

-- or, using the all-fields approach
AND field1 = fetched_field1
AND field2 = fetched_field2
AND field3 = fetched_field3
-- (taking NULLs into account etc.)

If the update modifies 0 rows, you know that the row has been modified by someone else in the meantime.
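In PL/SQL the check on the affected row count would look something like this (column and variable names are placeholders, and a LAST_MODIFIED timestamp column is assumed):

```sql
UPDATE address
   SET street = l_new_street
 WHERE pk = l_pk
   AND last_modified = l_fetched_last_modified;  -- optimistic check

IF SQL%ROWCOUNT = 0 THEN
  -- someone changed (or deleted) the row since it was fetched;
  -- re-read the current version and try again
  NULL;  -- conflict handling goes here
END IF;
```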

Then you can try to re-query that particular row - if you like, with FOR UPDATE to prevent anyone else from updating it while you do - and then perform your UPDATE; or you keep the optimistic approach for that row and loop until the row is left unmodified.

This approach avoids re-querying every row before performing the update, which would be the other option, as you mention.

By using optimistic locking you update all those rows that have not been modified without the need to re-query them, which should help to speed up that part of the job.

How often you COMMIT has nothing to do with getting the latest version of the data; it simply controls how soon other users are able to modify the data touched by your batch job - concurrency, so to speak.

It also influences, to a certain degree, the speed of your batch run, the amount of redo generated and the load caused by it.

Randolf

Mark Freeman said...

Randolf,

Thanks for the detailed response. It sounds like the hint will be useful in my situation without causing any problematic side effects. I have no control or influence over the UNDO settings in the production environment, so I just have to make do as best I can. Processing in smaller chunks would be a reasonable approach (fire off multiple jobs, each processing a specific range of PK values), but if the hint solves the problem, that seems far simpler.

As for the off-topic issues I raised, I am sure that ROWDEPENDENCIES has not been enabled on the table. I could read all fields and compare them in the UPDATE, but that would cause the UPDATE to be skipped if the record had been changed. I don't want to skip processing the row; I want to be sure I am processing the latest version of it. So I think I am better off doing the extra read at the top of the loop to read and lock the one record, and then COMMIT after each UPDATE. I sacrifice some speed, but I gain both thoroughness (all records get updated if needed) and safety (I won't accidentally overwrite unprocessed user-updated data with processed older data).

Randolf said...

Mark,

> I could read all fields and
> compare them in the UPDATE, but
> that would cause the UPDATE to be
> skipped if the record had been
> changed. I don't want to skip
> processing the row, but be sure I
> am processing the latest version
> of it

probably my explanation of my intended approach wasn't clear enough:

- You use all the fields in the WHERE clause of the UPDATE statement to make sure that you only update the row if it hasn't been modified
- Should the update return 0 affected rows, you know the row has been modified (or deleted)
- In this case you don't skip the row, but start special processing for this particular row: either a SELECT FOR UPDATE to make sure that no one else can modify the row, or another nested loop that re-reads the row content and attempts the update again with all fields in the WHERE clause, until the update returns one affected row (or the row is no longer available).
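That special processing could be sketched like this (table and variable names hypothetical):

```sql
-- fallback once the optimistic UPDATE reported 0 affected rows
BEGIN
  SELECT street INTO l_street
    FROM address
   WHERE pk = l_pk
     FOR UPDATE;                -- lock the current row version
  -- re-run the validation against the current content, then:
  UPDATE address SET street = l_fixed_street WHERE pk = l_pk;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    NULL;                       -- the row was deleted in the meantime
END;
```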

This way you save the overhead of querying each and every record.

If you expect only a few records to be updated in the meantime, this might save you some overhead - if you think that the majority of rows will be updated anyway, you can go for your approach of querying each and every record.

Randolf

Mark Freeman said...

If I put all the columns in the cursor, and within the loop I process the row and then do an UPDATE conditional on all the columns in the table being the same as in the cursor, I'd only have to do the SELECT FOR UPDATE and process the new version of the row if SQL%ROWCOUNT for the UPDATE was zero. If I move all the processing into a private procedure, that would be straightforward. The odds are against the data having been changed, so having to process the same row a second time (to process the updated data) would be rare.

Cool idea. Thanks!