Note: This task was very generously contributed by Timo Grossenbacher - Geographer and Data Specialist extraordinaire. Thanks very much Timo!
Often, you have to combine two or more different data sets because they contain complementary information. Or, for example, the data come in chunks from the server and need to be reassembled on the client side.
Combining or merging data may involve one of the following tasks:
Let's say we have a list of the following articles:
var articles = [{
"id": 1,
"name": "vacuum cleaner",
"weight": 9.9,
"price": 89.9,
"brand_id": 2
}, {
"id": 2,
"name": "washing machine",
"weight": 540,
"price": 230,
"brand_id": 1
}, {
"id": 3,
"name": "hair dryer",
"weight": 1.2,
"price": 24.99,
"brand_id": 2
}, {
"id": 4,
"name": "super fast laptop",
"weight": 400,
"price": 899.9,
"brand_id": 3
}];
And of the following brands:
var brands = [{
"id": 1,
"name": "SuperKitchen"
}, {
"id": 2,
"name": "HomeSweetHome"
}];
As you can see, in each article, brand_id
points to a particular brand, whose details are saved in another data set - which can be considered a lookup table in this case. This is often how separate data schemes are stored in a server-side database. Also note that the last article in the list has a brand_id
for which no brand is stored in brands
.
What we want to do now is to combine both datasets, so we can reference the brand's name
directly from an article. There are several ways to achieve this.
Array
functionsWe can implement a simple join (left outer join in database terms) using native, i.e., already existing Array
functions as follows. The method presented here modifies the articles
array in place by adding a new key-value-pair for brand
.
articles.forEach(function(article) {
var result = brands.filter(function(brand) {
return brand.id === article.brand_id;
});
delete article.brand_id;
article.brand = (result[0] !== undefined) ? result[0].name : null;
});
console.log(articles);
=> [{
"id": 1,
"name": "vacuum cleaner",
"weight": 9.9,
"price": 89.9,
"brand": "HomeSweetHome"
}, {
"id": 2,
"name": "washing machine",
"weight": 540,
"price": 230,
"brand": "SuperKitchen"
}, {
"id": 3,
"name": "hair dryer",
"weight": 1.2,
"price": 24.99,
"brand": "HomeSweetHome"
}, {
"id": 4,
"name": "super fast laptop",
"weight": 400,
"price": 899.9,
"brand": null
}];
First, we loop over each article
, where we take its brand_id
to look up the corresponding brand
using the native filter
function. Note that this function returns an array and we expect it to have only one element. In case there is no corresponding brand
, result[0]
will be undefined
, and in order to prevent an error (something like result[0] is undefined
), we use the ternary operator.
Also, as we no longer need brand_id
after the lookup has been done, we can safely delete it.
If we want to join by more than one attribute, we can modify the filter function to achieve this. Hypothetically, this might look something like:
innerArray.filter(function(innerArrayItem) {
return (
innerArrayItem.idA === outerArrayItem.idA &&
innerArrayItem.idB === outerArrayItem.idB
);
});
A more generic, and also more performant version of a join is proposed below (abbreviated from this StackOverflow answer). Its output is equivalent to the one of the above method.
function join(lookupTable, mainTable, lookupKey, mainKey, select) {
var l = lookupTable.length,
m = mainTable.length,
lookupIndex = [],
output = [];
for (var i = 0; i < l; i++) { // loop through l items
var row = lookupTable[i];
lookupIndex[row[lookupKey]] = row; // create an index for lookup table
}
for (var j = 0; j < m; j++) { // loop through m items
var y = mainTable[j];
var x = lookupIndex[y[mainKey]]; // get corresponding row from lookupTable
output.push(select(y, x)); // select only the columns you need
}
return output;
};
Because above defined function creates an index for the lookupTable
(in our case brands
) in the first iteration, it runs considerably faster than the previously shown method. Also, via a callback, it allows us to directly define which keys (or "attributes") we want to retain in the resulting, joined array (output
). It is used like so:
var result = join(brands, articles, "id", "brand_id", function(article, brand) {
return {
id: article.id,
name: article.name,
weight: article.weight,
price: article.price,
brand: (brand !== undefined) ? brand.name : null
};
});
console.log(result);
=> [{
"id": 1,
"name": "vacuum cleaner",
"weight": 9.9,
"price": 89.9,
"brand": "HomeSweetHome"
}, {
"id": 2,
"name": "washing machine",
"weight": 540,
"price": 230,
"brand": "SuperKitchen"
}, {
"id": 3,
"name": "hair dryer",
"weight": 1.2,
"price": 24.99,
"brand": "HomeSweetHome"
}, {
"id": 4,
"name": "super fast laptop",
"weight": 400,
"price": 899.9,
"brand": null
}];
Note that we don't modify articles
in place but create a new array.
Let's say we want to load a huge data set from the server, but because of network performance reasons, we load it in three chunks and reassemble it on the client side. With D3v5 and later, we can use Promise.all() to run many requests concurrently, combining them after they have finished downloading using d3.merge(). Note that Promise.all()
takes an array of Promises, in this case supplied by calls to d3.csv()
.
Promise.all([
d3.csv("/data/big_data_1.csv"),
d3.csv("/data/big_data_2.csv"),
d3.csv("/data/big_data_3.csv")
]).then(function(allData) {
console.log(d3.merge(allData));
});
=> [{"a": "1", "b": "2"},{"a": "3", "b": "4"},{"a": "5", "b": "6"}]
Note that the argument passed to d3.merge
must be an array itself, which is why we use the square brackets.
In the last case, we have two or more data sets that contain attributes describing the same observations, or conceptual entities, and they need to be combined. This implies that all data sets have the same length. For example, dataset_1
below contains two observations of attribute type
and attribute model
, while dataset_2
contains the same two entities, but observed through attributes price
and weight
.
var dataset_1 = [{
'type': 'boat',
'model': 'Ocean Queen 2000'
}, {
'type': 'car',
'model': 'Ferrari'
}];
var dataset_2 = [{
'price': 23202020,
'weight': 5656.9
}, {
'price': 59988,
'weight': 1.9
}];
So in both data sets we essentially have separate information about the same conceptual entities, thus it makes sense to "merge" them, for which we can use lodash's merge
function:
var result = _.merge(dataset_1, dataset_2);
console.log(result);
=> [{
'type': 'boat',
'model': 'Ocean Queen 2000',
'price': 23202020,
'weight': 5656.9
}, {
'type': 'car',
'model': 'Ferrari',
'price': 59988,
'weight': 1.9
}];