January 28, 2013
Pentaho Data Integration - Logging (#007)
Open the job created in post #006: "jb_employees_incremental_load".
Right-click anywhere in the workspace/canvas and select "Job settings".
In the "Job properties" window go to the "Log" tab.
Log Connection: conn_oracle11g_orcl_dwh_dwh
Log schema: dwh
Log table: pentaho_log
Click "SQL" to create a log table.
Click "OK" and save your work.
Run the job and leave "Log level" as it is: "Basic logging"
Every event shown in the "Logging" tab is also stored in the "log_field" column of the "pentaho_log" table.
SQL:
SELECT log_field FROM pentaho_log;
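Beyond the raw log text, the generated log table can also answer "which runs happened and did they succeed". The column names below follow Kettle's default job log table layout; verify them against the table the "SQL" button actually created:

```sql
-- one row per job run, newest first
SELECT id_job
     , jobname
     , status     -- e.g. start / end / stop
     , errors
     , logdate
FROM pentaho_log
ORDER BY logdate DESC;
```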
After the logging table is ready, Pentaho will also show run data in the "History" tab.
A sample of the logged output:
January 27, 2013
Pentaho Data Integration – Variables (#006)
Pentaho Data Integration – How to Use a “Set Variables” step.
Pentaho Data Integration – How to Replace Variables in a SQL statement
Explore the repository and create a new folder "employees_incremental_load"
Inside "employees_incremental_load" create an "eil_transformations" subfolder.
Create three new objects:
- job: "jb_employees_incremental_load"
- transformation: "tr_eil_dates"
- transformation: "tr_eil_new_employees_load"
The main job naming convention:
jb_[folder_name], i.e.: "jb_employees_incremental_load"
The job will be saved in the main folder: "employees_incremental_load"
The transformations will be saved in the subfolder "eil_transformations"
The ETL process will work as follows:
The main job will include two transformations. The first transformation will pass a date value (via the "Set Variables" step) to the second transformation. The second transformation will receive the date value and use it as a parameter in its SELECT statement.
The "tr_eil_dates" transformation
Add two steps to the workspace area:
- From the "Input" folder "Table input"
- From the "Job" folder "Set Variables"
Adjust the "Table input" step:
SQL:
SELECT '2007-01-01' AS date_value FROM DUAL
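The hard-coded '2007-01-01' is fine for a first run. For a true incremental load the date would typically be derived from the target table instead; a possible variant (assuming the "new_employees" target table created later in this post, and adjusted to whichever comparison the second transformation uses):

```sql
-- take the newest already-loaded hire date; fall back to a fixed
-- start date when the target table is still empty
SELECT TO_CHAR(NVL(MAX(hire_date), DATE '1900-01-01'), 'YYYY-MM-DD') AS date_value
FROM dwh.new_employees
```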
Edit the "Set Variables" step and click "Get Fields".
The transformation structure:
The "tr_eil_new_employees_load" transformation:
Add two steps to the workspace area:
- From the "Input" folder "Table input"
- From the "Output" folder "Table output"
Adjust the "Table input" step:
Remember to check the "Replace variables in script" option.
SQL:
SELECT
EMPLOYEE_ID
, FIRST_NAME
, LAST_NAME
, PHONE_NUMBER
, HIRE_DATE
, JOB_ID
, SALARY
, COMMISSION_PCT
, MANAGER_ID
, DEPARTMENT_ID
FROM EMPLOYEES
WHERE HIRE_DATE < TO_DATE('${date_value_in}', 'YYYY-MM-DD')
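With "Replace variables in script" checked, Pentaho substitutes the variable before sending the statement to Oracle, so at run time the WHERE clause above effectively becomes:

```sql
WHERE HIRE_DATE < TO_DATE('2007-01-01', 'YYYY-MM-DD')
```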
Adjust the "Table output" step:
Click the "SQL" button to create the "new_employees" table in the "dwh" schema
Variables in the script cause trouble when generating the DDL.
Comment the WHERE clause out in order to create the table; after the table is created, restore the previous statement.
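For reference, the generated DDL should look roughly like the following (column types taken from Oracle's sample HR.EMPLOYEES table; the exact output of the "SQL" button may differ):

```sql
CREATE TABLE dwh.new_employees
(
  employee_id    NUMBER(6)
, first_name     VARCHAR2(20)
, last_name      VARCHAR2(25)
, phone_number   VARCHAR2(20)
, hire_date      DATE
, job_id         VARCHAR2(10)
, salary         NUMBER(8,2)
, commission_pct NUMBER(2,2)
, manager_id     NUMBER(6)
, department_id  NUMBER(4)
);
```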
Right-click anywhere in the workspace/canvas and select "Transformation settings"
In the "Transformation properties" window select the "Parameters" tab and add a "date_value_in" parameter without a default value.
The transformation structure:
The "jb_employees_incremental_load" job:
From the "General" folder add three steps to the workspace area:
- "START" and two "Transformations"
Adjust the first "Transformation" step in the following way:
Adjust the second "Transformation" step in the following way:
The job has the following structure:
Run the job and review the results.
January 23, 2013
Windows CMD - Useful (#005)
Count lines in multiple files using Windows command prompt
for %G in (*.sql) do find /c /v "~~~" %G
The find /c /v combination counts the lines that do not contain the string "~~~". Note: in a batch file, double the percent signs: for %%G in (*.sql) do find /c /v "~~~" %%G
Source: http://ora-00001.blogspot.ca/2011/07/count-lines-in-multiple-files-using.html
Add date and time to log files created from Windows command prompt
:: change window's title
TITLE date_in_a_file_name
:: print current date and time
ECHO current_date: %date%
ECHO current_time: %time%
@echo off
:: format of date is YYYYMMDD
:: (substring offsets assume the "Ddd mm/dd/yyyy" regional date format)
SET currdate=%date%
SET dd=%currdate:~7,2%
SET mm=%currdate:~4,2%
SET yyyy=%currdate:~10,4%
SET yyyymmdd=%yyyy%%mm%%dd%
:: format of time is HHMiSS
SET currtime=%time%
SET hh=%currtime:~0,2%
SET mi=%currtime:~3,2%
SET ss=%currtime:~6,2%
SET hhmiss=%hh%%mi%%ss%
:: strip the leading space that %time% produces for hours < 10
SET hhmiss=%hhmiss: =%
ECHO test_string > C:\result_file_with_a_date_%yyyymmdd%_%hhmiss%.log
Pentaho Data Integration - Useful (#004)
Windows Environment Variables
PENTAHO_JAVA_HOME=C:\Program Files\Java\jre7
The PENTAHO_JAVA variable will be set automatically
Links
Pentaho INFOCENTER:
http://infocenter.pentaho.com/help/index.jsp
Latest Pentaho Data Integration (aka Kettle) Documentation:
http://wiki.pentaho.com/display/EAI/Latest+Pentaho+Data+Integration+%28aka+Kettle%29+Documentation
kettle-cookbook:
http://code.google.com/p/kettle-cookbook/
Pentaho Data Integration - How To Use a Job (#003)
In the post "Pentaho Data Integration - How To Use a Transformation" a transformation was made to create and load the "countries" table in a data warehouse schema (dwh).
Repeat the procedure for each table in the "hr" schema.
Add one option: in each target (tgt_table_name) definition select "Truncate table", so the table is truncated before each load.
SQL:
SELECT
table_name,
CASE WHEN table_name <> 'COUNTRIES' THEN 'to do' ELSE NULL END AS create_tr
FROM user_tables
ORDER BY table_name;
After creating the six additional transformations, the repository should contain the following objects:
Now let's create a new job which will include all previously created transformations.
Save it as "jb_hr_initial_load".
Jobs' naming convention:
jb_[project_name]_[task]
From the "General" Folder on the "Design" tab select, drag and drop the following steps:
- "START"
- "Transformation" (seven times)
- "Success"
Save the progress.
Double click the "Transformation" step.
Select "Specify by name and directory".
Click the transformation icon and select the first transformation, "tr_hr_countries_initial_load".
Save the progress.
Now the job has the following structure:
Repeat the previous step for each transformation in the "hr" folder.
Run the job and review the results.
SQL:
SET LINESIZE 150
SELECT 'SELECT ''' || table_name || ''' AS table_name, COUNT(1) AS r_count FROM ' || table_name || ' UNION ALL' AS q_in_q
FROM user_tables
ORDER BY table_name;
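Pasting the generated lines together (and removing the trailing UNION ALL from the last one) gives a single verification query; assuming the seven standard HR sample-schema tables were loaded, it looks like this:

```sql
SELECT 'COUNTRIES'   AS table_name, COUNT(1) AS r_count FROM COUNTRIES   UNION ALL
SELECT 'DEPARTMENTS' AS table_name, COUNT(1) AS r_count FROM DEPARTMENTS UNION ALL
SELECT 'EMPLOYEES'   AS table_name, COUNT(1) AS r_count FROM EMPLOYEES   UNION ALL
SELECT 'JOBS'        AS table_name, COUNT(1) AS r_count FROM JOBS        UNION ALL
SELECT 'JOB_HISTORY' AS table_name, COUNT(1) AS r_count FROM JOB_HISTORY UNION ALL
SELECT 'LOCATIONS'   AS table_name, COUNT(1) AS r_count FROM LOCATIONS   UNION ALL
SELECT 'REGIONS'     AS table_name, COUNT(1) AS r_count FROM REGIONS;
```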