January 28, 2013

Pentaho Data Integration - Logging (#007)


Open the job created in post #006: "jb_employees_incremental_load".


Right-click anywhere in the workspace/canvas and select "Job settings".


In the "Job properties" window go to the "Log" tab.


Log Connection: conn_oracle11g_orcl_dwh_dwh
Log schema: dwh
Log table: pentaho_log


Click "SQL" to create a log table.



Click "OK" and save your work.

Run the job and leave "Log level" as it is: "Basic logging"


Every event shown in the "Logging" tab is also written to the "log_field" column of the "pentaho_log" table.



SQL:
SELECT log_field FROM pentaho_log;
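The table also stores run metadata next to the log text. Assuming the generated columns sketched above, a quick run history can be pulled with:

SQL:
SELECT jobname, status, errors, startdate, enddate
FROM pentaho_log
ORDER BY id_job DESC;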

Once the logging table is ready, Pentaho will also show run data in the "History" tab.

An example:

January 27, 2013

Pentaho Data Integration – Variables (#006)


Pentaho Data Integration – How to Use a “Set Variables” step.
Pentaho Data Integration – How to Replace Variables in a SQL statement


Explore the repository and create a new folder "employees_incremental_load"



Inside "employees_incremental_load" create a "eil_transformations" subfolder.


Create three new objects:
- job: "jb_employees_incremental_load"
- transformation: "tr_eil_dates"
- transformation: "tr_eil_new_employees_load"

The main job naming convention:
jb_[folder_name], i.e.: "jb_employees_incremental_load"

The job will be saved in the main folder: "employees_incremental_load"
The transformations will be saved in the subfolder "eil_transformations"






The ETL process will work as follows:
The main job will include two transformations. The first transformation will pass a date value (via a "Set Variables" step) to the second transformation. The second transformation will receive the date value and use it as a parameter in the SELECT statement.


The "tr_eil_dates" transformation

Add two steps to the workspace area:
- From the "Input" folder "Table input"
- From the "Job" folder "Set Variables"

Adjust the "Input table" step:


SQL:
SELECT '2007-01-01' AS date_value FROM DUAL
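The date is hard-coded for this exercise. In a real incremental load the same step could derive the cut-off date from the target instead; a sketch, assuming the "new_employees" target table created later in this post:

SQL:
-- fall back to a fixed date while the target is still empty
SELECT TO_CHAR(NVL(MAX(hire_date), DATE '2007-01-01'), 'YYYY-MM-DD') AS date_value
FROM new_employees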

Edit the "Set Variables" step" and click "Get Fields"


The transformation structure:


The "tr_eil_new_employees_load" transformation:

Add two steps to the workspace area:
- From the "Input" folder "Table input"
- From the "Output" folder "Table output"

Adjust the "Table input" step:

Remember to check the "Replace variables in script" option.

SQL:
SELECT
  EMPLOYEE_ID
, FIRST_NAME
, LAST_NAME
, EMAIL
, PHONE_NUMBER
, HIRE_DATE
, JOB_ID
, SALARY
, COMMISSION_PCT
, MANAGER_ID
, DEPARTMENT_ID
FROM EMPLOYEES
WHERE HIRE_DATE < TO_DATE('${date_value_in}', 'YYYY-MM-DD')
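With that option checked, PDI substitutes "${date_value_in}" before execution, so with the value produced by "tr_eil_dates" the statement effectively runs as (column list abbreviated):

SQL:
SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, HIRE_DATE
FROM EMPLOYEES
WHERE HIRE_DATE < TO_DATE('2007-01-01', 'YYYY-MM-DD')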

Adjust the "Table output" step:


Click the "SQL" button to create the "new_employees" table in the "dwh" schema


The variable in the script causes trouble here: the SQL generator cannot resolve "${date_value_in}".

Comment out the WHERE clause in order to create the table. After the table is created, restore the original statement.
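For example, the statement used only while generating the DDL:

SQL:
SELECT
  EMPLOYEE_ID
, FIRST_NAME
, LAST_NAME
, EMAIL
, PHONE_NUMBER
, HIRE_DATE
, JOB_ID
, SALARY
, COMMISSION_PCT
, MANAGER_ID
, DEPARTMENT_ID
FROM EMPLOYEES
-- WHERE HIRE_DATE < TO_DATE('${date_value_in}', 'YYYY-MM-DD')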



Right-click anywhere in the workspace/canvas and select "Transformation settings"


In the "Transformation properties" window select the "Parameters" tab and add a "date_value_in" parameter without a default value.


The transformation structure:


The "jb_employees_incremental_load" job:


From the "General" folder add three steps to the workspace area:
- "START" and two "Transformations"

Adjust the "Transformation" in the following way:


Adjust the "Transformation 2" in the following way:


The job has the following structure:


Run and review the job’s results.





January 23, 2013

Windows CMD - Useful (#005)

Count lines in multiple files using Windows command prompt

for %G in (*.sql) do find /c /v "~~~" %G

(FIND /C /V "~~~" counts the lines that do not contain the string "~~~", which in practice is every line. The single %G form is for the interactive prompt; inside a .bat file, double the percent signs: %%G.)

Source: http://ora-00001.blogspot.ca/2011/07/count-lines-in-multiple-files-using.html






Add date and time to log files created from Windows command prompt

:: change the window's title
TITLE date_in_a_file_name

:: print the current date and time
ECHO current_date: %date%
ECHO current_time: %time%

@echo off
:: build a YYYYMMDD stamp from %date%
:: NOTE: the substring offsets below assume the "Ddd MM/DD/YYYY" locale
:: format; adjust them if %date% prints differently on your machine
SET currdate=%date%
SET dd=%currdate:~7,2%
SET mm=%currdate:~4,2%
SET yyyy=%currdate:~10,4%
SET yyyymmdd=%yyyy%%mm%%dd%

:: build an HHMISS stamp from %time% ("H:MM:SS.hh" or "HH:MM:SS.hh")
SET currtime=%time%
SET hh=%currtime:~0,2%
SET mi=%currtime:~3,2%
SET ss=%currtime:~6,2%
SET hhmiss=%hh%%mi%%ss%
:: before 10:00 the hour carries a leading space; replace it with a zero
:: so the stamp stays a fixed six characters
SET hhmiss=%hhmiss: =0%

ECHO test_string > C:\result_file_with_a_date_%yyyymmdd%_%hhmiss%.log
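With the values printed above, the ECHO line creates a file named like C:\result_file_with_a_date_20130128_142530.log (the timestamp here is only illustrative).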



Pentaho Data Integration - Useful (#004)


Windows Environment Variables

PENTAHO_JAVA_HOME=C:\Program Files\Java\jre7
The PENTAHO_JAVA variable will then be set automatically by the PDI startup scripts.

Links

Pentaho INFOCENTER:
http://infocenter.pentaho.com/help/index.jsp

Latest Pentaho Data Integration (aka Kettle) Documentation:
http://wiki.pentaho.com/display/EAI/Latest+Pentaho+Data+Integration+%28aka+Kettle%29+Documentation

kettle-cookbook:
http://code.google.com/p/kettle-cookbook/

Pentaho Data Integration - How To Use a Job (#003)


In the post "Pentaho Data Integration - How To Use a Transformation" a transformation was made to create and load the "countries" table in a data warehouse schema (dwh).
Repeat the formula for each table in the "hr" schema.


Add one option: in each target table (tgt_table_name) definition select "Truncate table", so the table is truncated before each load.


SQL:
SELECT
table_name,
CASE WHEN table_name <> 'COUNTRIES' THEN 'to do' ELSE NULL END AS create_tr
FROM user_tables
ORDER BY table_name;
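Run in the source "hr" schema, the query produces a to-do checklist. Against a schema holding the standard seven HR tables the output would look like:

TABLE_NAME    CREATE_TR
-----------   ---------
COUNTRIES
DEPARTMENTS   to do
EMPLOYEES     to do
JOBS          to do
JOB_HISTORY   to do
LOCATIONS     to do
REGIONS       to do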


After creating the six additional transformations, the repository should contain the following objects:


Now let's create a new job which will include all previously created transformations.


Save it as "jb_hr_initial_load".

Jobs' naming convention:
jb_[project_name]_[task]



From the "General" Folder on the "Design" tab select, drag and drop the following steps:
- "START"
- "Transformation" times seven
- "Success"
Save the progress.


Double click the "Transformation" step.


Select "Specify by name and directory".
Click a transformation sign and select the first transformation "tr_hr_countries_inital_load"
Save the progress.




Now the job has the following structure:


Repeat the previous step for each transformation in the "hr" folder.


Run the job and review the results.




SQL (generates one row-count query per table; run in SQL*Plus):
SET LINESIZE 150

SELECT 'SELECT ''' || table_name || ''' AS table_name, COUNT(1) AS r_count FROM ' || table_name || ' UNION ALL' AS q_in_q
FROM user_tables
ORDER BY table_name;
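Paste the generated lines into one statement, drop the trailing UNION ALL from the last line and add a semicolon; tidied up for the seven HR tables it reads:

SELECT 'COUNTRIES'   AS table_name, COUNT(1) AS r_count FROM COUNTRIES   UNION ALL
SELECT 'DEPARTMENTS' AS table_name, COUNT(1) AS r_count FROM DEPARTMENTS UNION ALL
SELECT 'EMPLOYEES'   AS table_name, COUNT(1) AS r_count FROM EMPLOYEES   UNION ALL
SELECT 'JOBS'        AS table_name, COUNT(1) AS r_count FROM JOBS        UNION ALL
SELECT 'JOB_HISTORY' AS table_name, COUNT(1) AS r_count FROM JOB_HISTORY UNION ALL
SELECT 'LOCATIONS'   AS table_name, COUNT(1) AS r_count FROM LOCATIONS   UNION ALL
SELECT 'REGIONS'     AS table_name, COUNT(1) AS r_count FROM REGIONS;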