Monday, April 27, 2009

ETL Process using Microsoft Excel

ETL project usually involves data cleanup. An architect many times acts as a bridge between Business Analyst and System DBA.

If you are familiar with Design Patterns, to Business Analyst, the architect is like a "Bridge" pattern; to DBA, the architect is like a "Builder" pattern;

Creating Query using Microsoft Excel
Task statements:
Business Analyst has provided (see figure below):
Column A: SchemaName and Table Name
Column B: Field Name
Column C: Type of data (From business point of view, PI is referred to Personal Information)
Column D: Commentys (From business point of view)
Column E: Data Type
Column F: New Value

Business Analyst wants to replace all PI data with the value of '0'. The list is not completed yet, but it will have more than 100 rows.
System DBA asks Architect to provide executable Oracle SQL scripts similar to the text in Column G.



Architect Solution: Use Excel functions to create the command which can be easily copy / paste to each row.

The Command in Column G:
=CONCATENATE("UPDATE TABLE ", REPLACE(INDIRECT(ADDRESS(ROW(), COLUMN()-6)),FIND("_", INDIRECT(ADDRESS(ROW(), COLUMN()-6))), 1,"."), " SET ", INDIRECT(ADDRESS(ROW(), COLUMN()-5)), " = ", INDIRECT(ADDRESS(ROW(), COLUMN()-1)), ";")

Excel Functions:
CONCATENATE()
In Excel, the Concatenate function allows you to join 2 or more strings together.
The syntax for the Concatenate function is:
Concatenate( text1, text2, ... text_n )
There can be up to 30 strings that are joined together.
Applies To:
Excel 2007, Excel 2003, Excel XP, Excel 2000

REPLACE()
In Excel, the Replace function replaces a sequence of characters in a string with another set of characters.
The syntax for the Replace function is:
Replace( old_text, start, number_of_chars, new_text )
old_text is the original string value.
start is the position in old_text to begin replacing characters.
number_of_chars is the number of characters to replace in old_text.
new_text is the replacement set of characters
Applies To:
Excel 2007, Excel 2003, Excel XP, Excel 2000

INDIRECT()
In Excel, the Indirect function returns the reference to a cell based on its string representation.
The syntax for the Indirect function is:
Indirect( string_reference, ref_style )
string_reference is a textual representation of a cell reference.
ref_style is optional. It is either a TRUE or FALSE value. TRUE indicates that string_reference will be interpreted as an A1-style reference. FALSE indicates that string_reference will be interpreted as an R1C1-style reference. If this parameter is omitted, the Indirect function will interpret string_reference as an A1-style.
Applies To:
Excel 2007, Excel 2003, Excel XP, Excel 2000

ADDRESS()
In Excel, the Address function returns a text representation of a cell address.
The syntax for the Address function is:
Address( row, column, ref_type, ref_style, sheet_name )
row is the row number to use in the cell address.
column is the column number to use in the cell address.
ref_type is optional. It is the type of reference to use. It can be any of the following values:
ValueExplanation
1Absolute referencing.For example: $A$1
2Absolute row; relative column.For example: $A1
3Relative row; absolute column.For example: A$1
4Relative referencing.For example: A1

If this parameter is omitted, the Address function assumes that the ref_type is set to 1.
ref_style is optional. It is the reference style to use: either A1 or R1C1. It can be any of the following values:
ValueExplanation
TRUEA1 style referencing
FALSER1C1 style referencing


Explanation
If this parameter is omitted, the Address function assumes that the ref_style is set to TRUE.
sheet_name is optional. It is the name of the sheet to use in the cell address. If this parameter is omitted, then no sheet name is used in the cell address.
Applies To:
Excel 2007, Excel 2003, Excel XP, Excel 2000

ROW()
In Excel, the Row function returns the row number of a cell reference.
The syntax for the Row function is:
Row( reference )
reference is optional. It is a reference to a cell or range of cells.
Note:
If the reference parameter is omitted, then the Row function assumes that the reference is the cell address in which the Row function has been entered in.

For example, the Row function used in the picture above returns 1 because the Row function has been entered in cell A1. Therefore, it assumes the following formula:
=Row(A1)
Applies To:
Excel 2007, Excel 2003, Excel XP, Excel 2000

COLUMN()
In Excel, the Column function returns the column number of a cell reference.
The syntax for the Column function is:
Column( reference )
reference is optional. It is a reference to a cell or range of cells.
Note:
If the reference parameter is omitted, then the Column function assumes that the reference is the cell address in which the Column function has been entered in.
For example, the Column function used in the picture above returns 1 because the Column function has been entered in cell A1. Therefore, it assumes the following formula:
=Column(A1)
Applies To:
Excel 2007, Excel 2003, Excel XP, Excel 2000

FIND()
In Excel, the Find function returns the location of a substring in a string. The search is case-sensitive.
The syntax for the Find function is:
Find( text1, text2, start_position )
text1 is the substring to search for in text2.
text2 is the string to search.
start_position is the position in text2 where the search will start. The first position is 1.
Note:
If the Find function does not find a match, it will return a #VALUE! error.
Applies To:
Excel 2007, Excel 2003, Excel XP, Excel 2000

1 comment:

  1. Yesterday I had problem with my working xls file. I couldn't solve it and went to bed. This morning my friend called me and by chance suggested - recover Excel file. He was absolutely right, the software resolved my trouble quite rapidly.

    ReplyDelete