String Decomposition with Regular Expressions

July 12, 2007


Once in a while an interesting problem pops up on Oracle OTN forums
http://forums.oracle.com/forums/thread.jspa?threadID=530584&tstart=0
where Joe Fuda published an inspiring solution leveraging Regular Expressions.

The fact that regular expressions are so powerful that can generate a very elegant solution is not surprising at all, because they are based on deep and fruitful Programming Language Theory. Furthermore, there is a certain similarity between database and language theory operators. They both have a union operator which have identical properties, and they both have join which is slightly different. This partially explains why regular expressions feature fits snugly into SQL.

Let’s see how regular expressions can be leveraged upon a standard SQL pattern — String Decomposition. Given a table of strings


create table sources (
num integer,
src varchar2(1000)
);
insert into sources values(1, '123,45,6,7,8');
insert into sources values(2, '8,9');
commit;
select * from sources;

 NUM   SRC 
 1   123,45,6,7,8 
 2   8,9 

the query should output a relation with tuples containing individual substrings that were “separated by commas” in the original table

 NUM   SRC   Decomposed String 
 1   123,45,6,7,8   8 
 1   123,45,6,7,8   123 
 1   123,45,6,7,8   7 
 1   123,45,6,7,8   6 
 1   123,45,6,7,8   45 
 2   8,9   8 
 2   8,9   9 

When we decompose a string into a set of substrings, we change concatenation into union. In our example, concatenation is defined via comma character, so what we want to do is to replace the comma with regular expression disjunction operator. Here is the query:


select distinct num, src,
regexp_substr
( src,
replace( src, ',', '|' ),
1,level
)
from sources, dual
connect by
regexp_substr
( src,
replace( src, ',', '|' ),
1,level
) is not null

An alternative solution leveraging regular expressions negation operator


select distinct src, regexp_substr(src,'[^,]+',1,level)
from sources
connect by regexp_substr(src,'[^,]+',1,level) is not null

is slightly shorter, but I fail to notice much theory behind it.

What makes the method less practical, however, is the fact that each string fragment is derived multiple times — a virtue of CONNECT BY execution. An alternative solution with MODEL operator is much more verbose, but also gives a programmer much more control how the execution is driven.

Advertisements

4 Responses to “String Decomposition with Regular Expressions”

  1. Volder Says:

    1. DUAL is redundant
    2. You’ll get less values than in the string, when elements are met more than once, e.g.

    SQL> with t as (select 1 num, '10,20,10,30,20'str from dual union all
      2             select 2 num, '15,13,14,13,16'str from dual)
      3  select distinct num,
      4                  str,
      5                  regexp_substr(str, replace(str, ',', '|'), 1, level)
      6    from t--, dual
      7  connect by regexp_substr(str, replace(str, ',', '|'), 1, level) is not null
      8  order by num
      9  /
    
           NUM STR            REGEXP_SUBSTR(STR,REPLACE(STR,
    ---------- -------------- ------------------------------
             1 10,20,10,30,20 10
             1 10,20,10,30,20 20
             1 10,20,10,30,20 30
             2 15,13,14,13,16 13
             2 15,13,14,13,16 14
             2 15,13,14,13,16 15
             2 15,13,14,13,16 16
    
    7 rows selected
    
    SQL>
    

  2. Indeed, I was unclear if I want decomposition into sets, bags, or lists (in the latter case one more column is needed).

  3. Joe Fuda Says:

    Glad to see my REGEXP technique inspired this neat string decomposition solution Vadim. 🙂 Very nice.

    BTW, if you use the approach shown below you won’t need DISTINCT and you won’t need an extra column to address the duplicate value issue Volder identified.

    select * from sources;

    NUM SRC
    ———- —————
    1 123,45,6,7,8
    2 8,9
    3 8,8,8

    3 rows selected.

    select
    num,
    src,
    regexp_substr
    ( src,
    replace( src, ‘,’, ‘|’ ),
    1,
    level
    ) as decomposed_string
    from sources
    connect by
    regexp_substr
    ( src,
    replace( src, ‘,’, ‘|’ ),
    1,
    level
    ) is not null and
    connect_by_root num = num
    order by 1,3 ;

    NUM SRC DECOMPOSED_STRING
    ———- ————— —————–
    1 123,45,6,7,8 123
    1 123,45,6,7,8 45
    1 123,45,6,7,8 6
    1 123,45,6,7,8 7
    1 123,45,6,7,8 8
    2 8,9 8
    2 8,9 9
    3 8,8,8 8
    3 8,8,8 8
    3 8,8,8 8

    10 rows selected.


    Joe Fuda


  4. Thank you Joe. With your correction we just plug in “hierarchical” reg exp string decomposition query instead of nasty model inner view for the original problem at
    http://forums.oracle.com/forums/thread.jspa?messageID=1955433&#1955433
    and it executes as quick if not even faster — I updated the thread.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: