Some of these questions have been asked on various forums (mail/GitHub), so I thought of addressing them here:
Question 1. When I looked into the appliance-level data (3.csv, 4.csv, etc.) and compared it with the mains, I observed that each appliance's starting time is different. For example, the fridge starts from 7th June and the washing machine from 10th June, whereas the mains have data from 22nd May.
This is correct. We used jPlugs to collect our appliance-level data. Our smart meter deployment for collecting mains data started a few days before data collection for the individual appliances. jPlugs collected data only when the appliance was ON. Thus, the washing machine data starts from 10th June, the day it was first used during data collection.
Question 2. What is the best start and end time for using electricity data?
The nilmtk HDF5 file I created uses data between 13 July 2013 and 4 August 2013. This period has the maximum amount of sensor data available.
Question 3. What preprocessing have you done in nilmtk HDF5?
In light of questions 1 and 2, I did the following preprocessing for the nilmtk HDF5:
- Chose the start and end dates as 13 July 2013 and 4 August 2013
- Downsampled the data to 1-minute resolution
- Ignored the water motor, which didn't have sufficient data in this time window
- Filled gaps where data was missing with zeros (indicating the appliance was not in use)
- Ensured that all appliances and both mains have an equal amount of data
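A minimal pandas sketch of these preprocessing steps, using a fabricated power series (this is not the actual iAWE loading code, just the same operations on toy data):

```python
import numpy as np
import pandas as pd

# Fabricated appliance power series at 10 s sampling, with an
# artificial gap standing in for missing sensor data.
idx = pd.date_range("2013-07-13", periods=360, freq="10s")
power = pd.Series(np.linspace(0.0, 100.0, len(idx)), index=idx)
power = power.drop(power.index[100:150])  # simulate a sensor outage

# Downsample to 1-minute resolution using the mean of each minute
power_1min = power.resample("1min").mean()

# Fill missing minutes with zeros: no data implies the appliance was off
power_1min = power_1min.fillna(0.0)
```

The zero-fill is safe here precisely because jPlugs only recorded while an appliance was ON, so absence of data means absence of consumption.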
Question 4. Your paper claims to have more data than what is provided on the dataset page.
This is correct. We post-processed the data a while after collection and found that:
- The CT data is probably not very useful: there was interference among the different CTs, so nearby CTs report the same (incorrect) values.
- The phone data collected via FunF was huge in size. We assumed data collection was going well when we saw the large volume of data being generated. During post-processing, however, we found that FunF can produce enormous amounts of data in even an hour; the application had actually run for only a very short period, so we don't think the phone data would be very useful.
- Some of the sensors failed midway through the deployment, so we removed their data.
Question 5. The difference between the sum of appliances and the mains is very large (on average >350 W).
This is expected. We didn't monitor non-plug loads such as fans and lighting, which can consume up to 350 W. The idea was to monitor them using CTs on the MCB; however, as mentioned in question 4, this didn't quite work out. That said, nilmtk reports that about 74% of the energy is submetered, which is a good number, comparable to or better than many existing datasets.
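As a toy illustration of what that submetered fraction means (nilmtk computes it for you, if I recall correctly via `MeterGroup.proportion_of_energy_submetered()`; the numbers below are made up, not the actual iAWE totals):

```python
# Fabricated energy totals (kWh) over the analysis window; these are
# illustrative numbers, not the actual iAWE measurements.
mains_energy = 100.0
appliance_energy = [40.0, 20.0, 14.0]  # per-appliance submetered totals

# Fraction of mains energy accounted for by the submetered appliances
submetered_fraction = sum(appliance_energy) / mains_energy
print(submetered_fraction)  # 0.74
```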
Question 6. The NILMTK paper says the fraction of total energy assigned is around 0.89; we are getting a different number.
The NILMTK paper was written when nilmtk was at v0.1. Since then, nilmtk has evolved, and various filtering procedures have been applied. If the exact same conditions are met (same test/train split, etc.), you should get the same numbers.
Question 7. If my understanding is correct, the ambient database consists of two main datasets, labelled i) Light temp and ii) PIR. Both are presented in .csv format: Light_temp.csv contains 4 columns, while Pir.csv has 2 columns. Would it be possible to get information about those columns?
The information schema for the multisensor can be found here.
I am repeating it here with more information. Light_temp.csv has the following four columns: timestamp, node_id, light, temp. Here, timestamp is Unix epoch time; node_id is the ID of the sensor (2-7, where each sensor node is kept in a different room); light is light intensity on a scale of 0-100; temp is temperature in Fahrenheit.
Pir.csv has the following two columns: timestamp, node_id. Here, node_id is the same as in the previous case, and a row means that motion was detected at that node_id and timestamp.
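A small sketch of parsing this schema, assuming the files are headerless CSVs in the column order given above (the rows below are fabricated, not actual dataset values):

```python
import csv
import io
from datetime import datetime, timezone

# Fabricated rows in the Light_temp.csv layout described above:
# timestamp (Unix epoch), node_id (2-7), light (0-100), temp (Fahrenheit)
light_temp = io.StringIO("1370000000,2,55,82\n1370000001,3,10,79\n")

rows = list(csv.reader(light_temp))
ts, node_id, light, temp_f = (int(rows[0][0]), int(rows[0][1]),
                              float(rows[0][2]), float(rows[0][3]))

when = datetime.fromtimestamp(ts, tz=timezone.utc)  # epoch -> datetime
temp_c = (temp_f - 32) * 5 / 9                      # Fahrenheit -> Celsius
```

Pir.csv would be parsed the same way, keeping only the first two columns.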
Question 8. While collecting data, was any kind of correlation measured/observed between this ambient dataset and the electricity data?
I think there is a very strong relationship between the ambient and electricity data, though I haven't yet looked into it in detail. However, the following figure (Figure 11 in our paper describing the dataset) should be helpful.
Question 9. In the paper you mention an annotated dataset. Where can I find it?
I noted events for a particular day here.
Question 10. In the paper, section 3.1, under the sub-section "Ambient monitoring", it is said that the data was collected at a rate of 1 Hz. If that is the case, we should be getting data every second. However, I see from the data files that the data was dumped irregularly (i.e., not at 1 Hz) for both files. Only for 4th August 2013 was the data collected every second. Maybe my observation is wrong.
I think you are right. There are some limitations in the OpenZWave stack that we used for data collection. We had to hack our way into polling data from these sensors, which happened in a round-robin fashion. I think we set our program to collect data at the quickest possible rate, but the stack itself had issues preventing this. That said, downsampling the data to something like 1 minute, using features such as max, min, or median, should be sufficient, given that these ambient parameters don't change much in the absence of events.
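A pandas sketch of that downsampling, with fabricated, irregularly spaced readings:

```python
import pandas as pd

# Fabricated, irregularly sampled temperature readings (Fahrenheit)
idx = pd.to_datetime([1375562400, 1375562407, 1375562455, 1375562502],
                     unit="s")
temp = pd.Series([80.0, 81.0, 79.5, 80.5], index=idx)

# Summarise each minute with a few robust statistics
summary = temp.resample("1min").agg(["min", "max", "median"])
```

Any of the aggregated columns can then stand in for the 1-minute value of the ambient parameter.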
Question 11. Are any sound sensors placed in those 5 rooms? If so, I couldn't find any sound-level data.
We did use phones across all 5 rooms, collecting data with the FunF journal. FunF generates a huge amount of data even for small time intervals. I didn't pay enough heed to this and thought that data collection had been going on smoothly, which wasn't the case. As a result, when I analysed the data collected from the phones recently, I found it to be useless and thus didn't put it up publicly. Also see question 4.
Question 12. There are negative timestamps in the data. Should we ignore those parts and consider only positive time epochs?
These can be safely ignored.
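For example, a simple filter (with fabricated rows) that keeps only positive epochs before any further processing:

```python
# Fabricated (timestamp, node_id) rows; a few negative epochs crept in
rows = [(-5, 2), (1370000000, 2), (-1, 3), (1370000010, 3)]

# Keep only rows with a positive Unix timestamp
clean = [r for r in rows if r[0] > 0]
```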
Question 13. In the electricity data, are the first two files, 1.csv and 2.csv (both marked as "mains" in the label.dat file), the same? I opened both files and found that their VLN and frequency values match at the same timestamps, while the other columns differ. If my observations are correct, could you please explain what causes these differences?
The instrumented home had 2 meters installed by the utility; 1.csv and 2.csv correspond to these two meters. The frequency and voltage measured by them are the same, as both get the same supply from the grid. The meters cater to disjoint sets of loads within the home; for instance, the two ACs are on different meters.
Question 14. From the picture above (taken from your research article), can we tell which ambient node_id was placed in which room, or near which appliances?
The mapping between rooms and node id is here.
Within a room, you need to look at the location of the multisensor to see which appliances are close by. The ground floor big room contains the TV. The ground floor small room contains no monitored appliance and is on the other side of the motor. The first floor big room contains an AC and the electric iron. The first floor small room contains an AC and the laptop.
Question 15. I see only five multisensors (multisensor + Android phone combos) placed in 5 rooms. Is my observation correct?
You are right. Node #5 broke down and thus was removed. Initially, we had planned to install it in the kitchen.
Question 16. Is the "Geyser" data included in the electricity data? Its name is not in the label.dat file. However, I believe a jPlug was attached to the geyser to collect appliance-level data. In your paper, you mention that 3 jPlugs stopped functioning during your deployment; is that why the geyser data was not recorded?
You are right. The geyser jPlug malfunctioned.
Question 17. Looking at water_meter.csv, I see multiple timestamps per second in some cases, which would be much too fast for litre pulses, let alone 10-litre pulses. From the looks of iawe-website/mapping.py, the first column probably represents which meter is being logged.
We did have issues with the water meter data collection. As far as I remember, we ended up discarding events that took place within a very short interval, treating them as false positives. I know this is not ideal, but we didn't find any other way to resolve the issue. The overhead tank was on the rooftop, and we had to run a 12-foot wire down to the floor below, where we logged the pulses on a Raspberry Pi. The plots on this GitHub issue indicate that choosing a suitable time window and discarding multiple events within it was reasonable. Unfortunately, I am not yet able to find the script where I did these calculations and made these plots; I will look on my older laptop to see if I can find something.
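A sketch of such a debouncing filter; the 2-second threshold and the pulse times are illustrative, not the values we actually used:

```python
# Discard pulses that arrive within `min_gap` seconds of the previously
# accepted pulse, treating them as false positives.
def debounce(timestamps, min_gap=2.0):
    kept = []
    for t in timestamps:
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept

# Fabricated pulse times (seconds); the closely spaced ones are dropped
pulses = [100.0, 100.1, 100.2, 105.0, 105.05, 112.0]
print(debounce(pulses))  # [100.0, 105.0, 112.0]
```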
Question 18. From the looks of iawe-website/mapping.py the first column probably represents which meter is being logged. What does the third column of zeros and ones represent?
The third column (0 or 1) represents whether an event occurred. Earlier, we used an interrupt-based mechanism to log the data, where we needed the third column. Later, we shifted to polling and continued using the same schema.
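A sketch of reading this schema, assuming the column order is meter_id, Unix timestamp, event flag (the rows below are fabricated):

```python
import csv
import io

# Fabricated rows in the assumed water_meter.csv layout:
# meter_id, Unix timestamp, event flag (1 = event, 0 = no event)
data = io.StringIO("1,1371168000,1\n1,1371168001,0\n2,1371168005,1\n")

# Keep only the rows flagged as events
events = [(int(m), int(t)) for m, t, flag in csv.reader(data)
          if flag == "1"]
```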
Question 19. The paper also says that there is 1 day of fully labelled data for 18 water fixtures. Where can I find this?
The following link contains information on annotated events. Regarding the 18 fixtures: I think I did a metadata collection where I'd open one water fixture in isolation for a fixed amount of time and note the water consumed from the home water meter. I did record this somewhere, but am unable to find it currently.
Aside: Have a look at Figure 9 in http://arxiv.org/pdf/1404.7227v3.pdf and the corresponding text.
This figure captures the motor's energy-water relationship: turning on the water motor increases the pumping rate from 1 l/min to 20 l/min, at the additional expense of 1 horsepower.
Question 20. I actually did see nipunbatra/Home_Deployment#24 before. Are events just grouped based on how close together they are? Looking at the most zoomed-out plot (the last of the three), I see that from around 15:30 to 16:00 a half hour's worth of events are lumped together, while in a similarly sized window from 13:00 to 13:30 the events are separated into a few different groupings. Where can I find this example in the data? Based on the comment, I am assuming it is June 14, 2013, but I don't know whether you are using GMT or your local time. What does the "Reading" column in the table mean?
The way these pulse-based meters work is that a pulse is sent whenever the reading modulo 1 is less than 0.1. Thus, when we poll such data, it is possible that water usage has stopped at that point yet the meter keeps sending pulses. This is probably what is happening in the most zoomed-out plot. I used local time (GMT+5:30) for my data collection.
Thus, the grouping only shows that there was no water usage during that time. Or, being more technically correct, the water usage didn't change by more than 0.1 of the resolution of the meter (1 l or 10 l for the different meters).
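A toy simulation of this pulse behaviour (the readings and the 0.1 threshold follow the description above; the values themselves are fabricated):

```python
# A poll registers a "pulse" whenever the meter reading modulo the
# resolution is below the threshold.
def pulse_seen(reading, resolution=1.0, threshold=0.1):
    return (reading % resolution) < threshold

# Usage stops at 42.05 litres: every subsequent poll still sees a pulse,
# which is what produces the lumped-together event groups in the plot.
readings = [41.95, 42.02, 42.05, 42.05, 42.05]
flags = [pulse_seen(r) for r in readings]
print(flags)  # [False, True, True, True, True]
```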
The table shows the reading in litres, which I noted manually from the water meter.
Question 21. Is your dataset a good one for water disaggregation?
In light of the above questions, the answer is probably not.