Quantcast
Channel: Tutorial – Marco Ghislanzoni's Blog
Viewing all articles
Browse latest Browse all 10

KNIME for beginners – Part 4

$
0
0

In Part 4 we will learn how to read data from an external file and apply to them some advanced transformation and analysis capabilities provided by KNIME. This tutorial picks up where Part 3 left. You may also want to read Part 1 and Part 2.

Reading Data from an Excel File

In most real-life data analysis tasks the input information to work with will be provided to you in form of a file, often an Excel sheet or a CSV (Comma Separated Values) text file. KNIME comes with a variety of IO nodes that are capable of reading and writing the most common types of data file. Additionally, KNIME is capable of connecting to databases to read data and write back results.

To start with a clean slate, create a new workflow and name it Advanced KNIME workflow. We are about to read some input data from an Excel file containing historical sales information from a fictious company. If you are following along as recommended, you can download the file from this link and save it in a convenient location on your hard-drive. Make a  note of the location where you saved the file because you will need it shortly.

Go to the Node Repository and enter “xls” in the search box. By default KNIME comes with four Excel related nodes. Two of them, the XLS Reader and the XLS Writer, can respectively read and write Excel compatible files, both in XLS and in XLSX format. The reader node reads the content of a specific sheet (tab) from an Excel file while the writer nodes writes content to a specific sheet (tab) of an Excel file. The other two are helper nodes. Read XLS Sheet Names looks into an Excel file and returns the list of sheets it is made of. XLS Sheet Appender adds a new sheet to an existing Excel file or creates a new file with the sheet if it doesn’t exist already.

Add the XLS Reader node to the new workflow you just created, then open its configuration dialog.

From now on I will assume you are familiar with all basic KNIME operations such as searching for nodes in the Node Repository, adding a node to a workflow, configuring it and connecting it to other nodes properly. If not, please go through Part 1, 2 and 3 of this tutorial first, then come back here.

In the configuration dialog for the XLS Reader node click the Browse button and navigate to the location where you saved the data file. In my case the file is under /User/marco/Documents/KNIME but yours may be different. As soon as you locate the file KNIME reads it and tries to “interpret” its content to generate the preview at the bottom of the window.

The options under Adjust Settings: in the configuration dialog are useful to tell KNIME a bit more about the structure and content of the Excel file we want to bring in.

If the file has multiple Sheets in it, the Select the sheet to read option allows you to select which sheet KNIME has to read. Each XLS Reader node can read one sheet and one sheet only, but you can use multiple XLS Reader nodes to read additional Sheets from the same Excel file. Since our sample file has only one sheet, this option can be left unchanged.

The next option, Column Names, tells KNIME which row of the Excel sheet contains the names (headers) of the columns. By default KNIME assumes there are no column headers. By looking at the preview you can note that Row0 actually contains the column headers while the data start on Row1. We need to tell KNIME that the headers are in Row0, the first row of the file. This can be done by ticking the Table contains column names in row number: option to make it active and entering 1 in the text box next to it.

It is indeed a bit confusing that the headers are in Row0 but we had to enter a 1 in this option. Think of the 1 as indicating the first row of our data which KNIME labels, by default, Row0.

As you activate the option above and enter the 1, note the red message in the preview suggesting to click refresh to update it.

As you click refresh, the preview is updated to take into account the fact that the first row contains the column headers. The columns are not longer named “Col0”, “Col1” and so on but are now named with the actual headers from the Excel file. You can scroll the preview to the right to check that all columns have been recognised properly. You can also scroll down to check how the other rows have been imported.

Note that KNIME assigns a data type to each column based on an own interpretation of their content. The ID column for example has been assigned a Number (integer) data type. This may or may not be appropriate, in any case it can be changed later one by applying a data type transformation to the column.

Since from the preview we can tel the file will be read correctly, there is no need to tweak the other options. Please take some time to read the description of the XLS Reader node to familiarise yourself with the additional options in case you need them for your own projects.

Click Ok to confirm the configuration for the XLS Reader node, then execute it to load the file into memory. Once the node has been executed, you can use its internal viewing capabilities to display the data table as read from the Excel file. This is what it should look like.

The file we just read into KNIME contains a number of sales transactions for the years from 2005 to 2008. Each transaction has an ID, a Sales value (in USD), a Cost (also in USD), a Category and a Product, a Sales Date, a Quarter and a Year for the sales, the name of the Sales Rep who closed the deal and finally the Region and State where the sales took place.

We are going to use this data to perform various analysis and derive conclusions on the sales performances of our company and sales representatives in these years.

Changing a Column Data Type

Before starting with our analysis, we want to fix the fact that the ID column has a wrong data type. We asked our business controller and he confirmed that the ID column, despite looking numeric, should indeed be a String because later on two additional letters were appended to the ID and we want to maintain the compatibility with the new way of working.

In few words we need to change the data type of the ID column from Number (integer) to String. This can be accomplished in KNIME by using a manipulation node. Through some careful search in the Node Repository we find, under Column —> Convert & Replace, a node named Number to String. Note that we need to convert the data type of the entire column, hence we need a Column manipulation node. There are also a number of Row manipulation nodes which can be used for other tasks which are Row related. Take some time to familiarise yourself with the many other Convert & Replace node in the same group. You never know when one of them may become useful!

Add the Number to String node to the workflow and connect it to the XLS Reader node, then open its configuration window.

Note that by default KNIME has added to the Include list all columns which have a Number data type, integer like ID and Year or double precision like Sales and Cost. We want to convert only the ID column, so we need to remove all the other columns from the Include list. SHIFT + click on Sales, Cost and Year to select them all at the same time and click on the << remove button to move them to the Exclude list on the left. This indicates that only the ID column will be converted from Numeric to String.

Click Ok to confirm the configuration and Execute the Number to String node to make the conversion happen. Display its internal view to confirm that the ID column has now a String data type.

Adding a Calculated Column

The data table we are working with contains a Sales value and an associated Cost in USD for each sales. It would be interesting to also have a column with the Profit (or Margin) for each sales, where Profit = Sales – Cost. If you are familiar with Excel, this could be done by entering a simple formula into an empty column, then copying the formula down to all rows. Since there are 7085 rows in this data table, this may require quite some scrolling. In KNIME we can achieve the same result with much less pain by using a Math Formula node.

A Math Formula node can be used to add a new calculated column to any data table in KNIME or to replace an existing column with the result of a mathematical operation involving that column as well as other columns in the same table. The Math Formula node is also part of the Manipulation —> Column —> Convert & Replace group.

Add a Math Formula node to the workflow, connect its input to the output of the Number to String node and open its configuration. At this point your workflow should look something like this.

The configuration window of the Math Formula node may look complicated at first sight, so let’s go through it section by section.

The top left box named Column List contains the list of columns that are available to be added to the mathematical expression we are going to write shortly. Note that this list only includes those columns that have a Number data type. These are the only one we can “calculate” something with. Note also that there are two extra entries in the list, ROWINDEX and ROWCOUNT, which do not correspond to any column in our original table. We will explain more about them after having understood how the Math Formula node works.

Under the Column List is the Flow Variable List. Flow Variables are an advanced KNIME concept we will cover in one of the future tutorials and for the moment you don’t need to bother with them.

In the central section is a drop-down box which allows to filter through all the available functions as listed in the Function box. You can use any of these functions to build very powerful mathematical expression to be applied to your data. By clicking on each function a short description on its use appears in the Description box.

The Expression box is where we are going to write in few moments the mathematical expression needed to calculate the Margin from the Sales and Cost values.

Append Column indicates that the result of the mathematical expression should be stored in a new additional column of the data table. The box next to it allows to enter the desired name for the new column.

If you want instead to store the result inside an existing column, which has the effect of overwriting its original content, you can select the Replace Column option and choose from the dropdown list next to it which column will be overwritten.

Last but not list, convert to Int forces the result of the mathematical expression to be a Number (integer) data type. This means that any calculation result which is a fractional number will be forcefully converted to an integer before being stored back in the table.

Now that we understand each configuration option, it’s time to write the Expression that calculates the Profit and adds it as an extra column to the original data table.

To build the expression instead of writing it from scratch, first double click on Sales in the Column list.

The value $Sales$ is added to the Expression. This is a placeholder for the content of the Sales column. Now type a “-“ minus symbol with the keyboard, then double click on Cost inside the Column List. You should end up with an Expression like this.

All we need at this point is to indicate that the result of the expression (or formula) should be added to a new column in our data table. Click on the Append Column option to activate it and enter “Margin” as the name of the column to append. Finally click OK to confirm and close the configuration.

As you Execute the Math Formula node, the Expression will be applied to the data table row by row and the result stored in the corresponding row of a new column called Margin.

The Expression we just entered in the configuration tells the node to take the Sales value for a row, subtract to it the Cost value from the same row and store the result in the corresponding row of the new Margin column. Pretty straightforward, isn’t it?

If you didn’t do it already, Execute the Math Formula node and display its internal view to check that the new column Margin has been created and that the Margin value for each row corresponds indeed to the difference between Sales and Cost values for the same row. By default an appended column is the rightmost one in the table, so you may need to expand the window horizontally to display the Margin column.

For an easier comparison with the Sales and Cost columns, you can grab the Margin column by its header and drag it to be next to the Cost column. This is just a temporary effect and as you close the view the Margin column will “go back” to the rightmost side of the table.

Now that we know how to use the Math Formula node and how it works, let’s go back to the ROWCOUNT and ROWINDEX entries in the Column List. ROWCOUNT is easier to understand because it is just a placeholder for the total number of rows in our table, 7085 in this case.

ROWINDEX on the other hand changes to reflect the index of the current row as the Math Formula node goes through each row, one by one, applying the Expression formula to calculate the result. When the first row is being calculated ROWINDEX will be 1, then it becomes 2 when the second row is calculated an so on until the last row when it equals ROWCOUNT.

ROWINDEX is useful for example if you want to add a column with a sequential index of the rows, from 1 to ROWCOUNT (7805 in the case of our table). In this case the Expression is simply like this.

Note that since ROWCOUNT and ROWINDEX are special values, they are enclosed in a double $ symbol when added to an Expression opposite to column names that are enclosed in a single $ symbol.

Enough on the Math Formula node, let’s move to the next step.

Filtering Rows

A common task in data analysis is to filter data according to some criteria in order to work, from that point on, only with a defined subset of them. In the case of our sales data, let’s say we are interested in analysing only the results of the last available sales year, 2008. This can be achieved by taking out all rows in the data table except those whose year is equal to 2008 or, turning it around, allowing only those rows whose year is equal to 2008 to pass through the filter.

Since the filtering has to be applied row by row, we need a Row Filter node. KNIME comes with many different Row and Column filters. You can find them in the Node Repository under Manipulation —> Column —> Filter and Manipulation —> Row —> Filter. Take some time to read their description and familiarise yourself with their function.

When you are ready, add a Row Filter node to the workflow and connect its input to the output of the Math Formula node, then open its configuration window.

By default the Row Filter works by including (outputting) rows matching a set of criteria. Rows not matching the criteria are excluded, meaning they will stopped by the filter and will not be part of its output. This is appropriate for what we want to achieve with our sales data. Inclusion can also be based on row number or row ID. The Exclusion options behave exactly the opposite of the Include ones. They block all the rows that match the criteria and let all rows that do not match go through.

The matching is done using the content of a specific column in the data table, indicated by the Column to test: parameter. In our case we want to filter based on the sales Year, so go on and choose Year from the drop-down box. Since Year is a Number (integer) data type, the option filter based on collection elements is automatically greyed out.

Next we need to indicate the Matching criteria to select the sales year we want to work on, namely 2008. In this case we can use both pattern matching and enter 2008 in the text box below it or range checking and enter 2008 both as lower bound and as upper bound. Let’s use the latter since range checking is more appropriate for working with numeric values.

Click OK to confirm the configuration, then execute the Row Filter node and display its output.

Note that the total number of rows in the data table has been reduced from 7805 to 1751. Scroll down to confirm that only the sales related to year 2008 have been retained. Filtering in KNIME is a non destructive operation. The unfiltered table is always available from the node before the one applying the filter. Multiple filters can be applied one after the other to narrow down even further your data set.

A useful alternative to the Row Filter node is the Row Splitter node. The Row Splitter node splits the rows into two sets, one matching the filtering criteria and one not matching them. Both sets are available from its two output ports for further analysis. For example, you may want to split a sales data set into US sales and RoW (Rest of the World) sales, with RoW sales being all sales that are not US sales, to analyse the two sets separately and then compare the results. A Row Spitter node is exactly what you need in this case.

Grouping Rows (aka Pivoting)

Now that we have the exact subset of data we want to analyse, it is time to start getting some insights. We would like to answer questions like “What were the total sales for each product?”, “Which Sales Rep sold the most?”, “What were the total Sales, Cost and Margin in each quarter?” and many others.

A fundamental task in data analysis is the capability to aggregate similar data along different dimensions to describe them in terms of total (sum), mean, standard deviation, quartiles, etc. KNIME provides a powerful and versatile way to aggregate data and produce summary results, the GroupBy node. If you are familiar with Pivot Tables in Excel, think of the GroupBy node as a Pivot Table generator on steroids.

Let’s add a GroupBy node to our workflow. The GroupBy node can be found under Manipulation —> Row —> Transform. Connect its input to the output of the Row Filter and open its configuration window.

Within the Settings tab there are four tabs, Groups, Manual Aggregation, Pattern Based Aggregation and Type Based Aggregation. For our task we will focus on Groups and Manual Aggregation.

The options in the Groups tab are used to indicate to KNIME which columns you want to group on. By default no grouping column is defined (the right list is empty).

Let’s say we want to know how much each Sales rep sold in total in 2008.  We add the Sales Rep column to the right side to the KNIME we intend to group (aggregate) over each Sales Rep. You can simply double click the Sales Rep entry in the left list to move it to the right list.

Now we need to tell KNIME which column we want to aggregate per Sales Rep and how we want the aggregation to be performed. Should KNIME calculate totals, means, standard deviations or something else? Click on the Manual Aggregation tab to setup the aggregation columns and criteria.

Note that the Sales Rep column does not appear among the Available columns since it is being used for the Grouping and you cannot aggregate on the same column(s) you group on.

Since we want to calculate the total Sales, double click Sales in the Available columns to add it to the right list.

Note that by default Mean is chosen as the Aggregation method for Sales. In this case we want total sales, not average ones, so we need to change it. Click on Mean within the right column to display a list of available aggregation methods. Scroll down until you find Sum.

As you hover with the mouse cursor over the Sum method a tooltip appears explaining how the method works. Select Sum to switch the aggregation method for Sales to total sales, then click OK to confirm the configuration and close the window. Execute the GroupBy node and display its internal view to check how the aggregation has worked.

We now have a list of 76 sales representatives each one with their total sales for your 2008. Sweet! To see who the top performers were, you can use the embedded sorting capabilities of this view. Click on the Sum(Sales) column header and choose Sort Descending from the pop-up menu.

You can now see the top performers.

Note that this sorting is only temporary and the table will go back to its natural sorting order (row order) once the view is closed.

As an additional exercise to gain practice with the GroupBy node, you can try to answer the other questions that were formulated earlier. At this point your workflow will look something like this.

Output to an Excel File

As already mentioned, not only KNIME can read input data from an Excel file, it is also able to write output results to an Excel file. As a final step let’s add an XLS Writer node to our workflow and use it to store the table with the total sales per sales rep to an Excel file.

The configuration of the XLS Writer node is pretty straightforward.

By default all available columns are pre-added to the Include list. Once you have provided the output file name and checked, if useful, the add column headers under the Add names and IDs option, you can click OK to confirm the configuration and close the window. Then Execute the node and check in the destination folder that the output file has been properly created. If you have Microsoft Excel, you can try to open it to check its content. You may even quickly add a bar chart to produce a visual representation of the performances of your sales force in 2008. Congratulations to Mrs. Jane Holden!

At the end of this Part your workflow should look like this.

Remember to save your workflow before closing KNIME.

Conclusions

This concludes the series of tutorials KNIME for Beginners. I hope you found them instructive and useful at the same time. If I get requests for it, I may decide to start a new series of KNIME tutorials focusing this time on more advanced topics like Flow variables, Meta Nodes and Loops.

The post KNIME for beginners – Part 4 appeared first on Marco Ghislanzoni's Blog.


Viewing all articles
Browse latest Browse all 10

Latest Images

Trending Articles





Latest Images