

MultiSensor-Home: Multi-modal multi-view dataset and benchmarks for action recognition in home environments
A wide-area multi-modal multi-view dataset for action recognition and transformer-based sensor fusion research.
This dataset is introduced in a paper published in Pattern Recognition (IF: 7.6) and presented at the IEEE FG conference. For detailed methodology, experimental results, and technical insights, please refer to the publication.
A simple way to download the dataset:
# Make sure hf CLI is installed: pip install -U "huggingface_hub[cli]"
hf download thanhhff/MultiSensor-Home1 --repo-type=dataset --local-dir dataset/home1
hf download thanhhff/MultiSensor-Home2 --repo-type=dataset --local-dir dataset/home2
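Alternatively, the same repositories can be fetched programmatically with the huggingface_hub Python API. The sketch below mirrors the CLI commands above and only assumes the two dataset repositories and local directories already listed.
# Programmatic download with huggingface_hub (pip install -U huggingface_hub)
from huggingface_hub import snapshot_download

# Fetch both recording environments into the same layout used by the CLI commands above.
for repo_id, local_dir in [
    ("thanhhff/MultiSensor-Home1", "dataset/home1"),
    ("thanhhff/MultiSensor-Home2", "dataset/home2"),
]:
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)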
MultiSensor-Home is a comprehensive multi-modal multi-view action recognition dataset captured in real home environments.
Note: The original high-resolution dataset (4000×3000 pixels) is available upon request. Please contact: nguyent [at] cs.is.i.nagoya-u.ac.jp

Home1 and Home2 floor plan showing camera positions and room layout
This layout file is essential for understanding the spatial relationships between different camera views and the overall recording environment.
MultiSensor-Home1/
├── 01/ # Recording session 1
├── 02/ # Recording session 2
├── 03/ # Recording session 3
├── 04/ # Recording session 4
├── 05/ # Recording session 5
├── 06/ # Recording session 6
├── 07/ # Recording session 7
├── 08/ # Recording session 8
├── all_labels.json # Complete annotations
├── train.json # Training split annotations
├── test.json # Test split annotations
└── README.md # This file
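After downloading, the layout can be checked with a short script. This is a minimal sketch that only assumes the session directories and JSON files shown in the tree above.
from pathlib import Path

root = Path("dataset/home1")  # local directory used in the download commands above

# Recording sessions are two-digit directories (01 ... 08).
sessions = sorted(p for p in root.iterdir() if p.is_dir() and p.name.isdigit())
for session in sessions:
    videos = sorted(session.glob("*.mp4"))
    print(f"Session {session.name}: {len(videos)} video files")

# The annotation files sit next to the session directories.
for name in ("all_labels.json", "train.json", "test.json"):
    print(name, "found" if (root / name).exists() else "missing")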
Videos follow the pattern: {id}-View{view}-Part{part}.mp4
Examples:
00-View1-Part1.mp4 - ID 00, View 1, Part 1
15-View3-Part2.mp4 - ID 15, View 3, Part 2
23-View5-Part1.mp4 - ID 23, View 5, Part 1
The dataset contains 16 action classes covering various human activities in the home environment.
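For programmatic use, the naming pattern above can be parsed directly. The helper below is a minimal sketch; the function name parse_video_name is ours and not part of the dataset tooling.
import re

# Parse "{id}-View{view}-Part{part}.mp4", e.g. "00-View1-Part1.mp4".
VIDEO_NAME = re.compile(r"^(?P<id>\d+)-View(?P<view>\d+)-Part(?P<part>\d+)\.mp4$")

def parse_video_name(filename):
    """Return (id, view, part) parsed from a MultiSensor-Home video file name."""
    m = VIDEO_NAME.match(filename)
    if m is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return m.group("id"), int(m.group("view")), int(m.group("part"))

# Example: parse_video_name("15-View3-Part2.mp4") -> ("15", 3, 2)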
Each video segment is annotated with:
{
    "video_url_1": "01/00-View1-Part1.mp4",
    "video_url_2": "01/00-View2-Part1.mp4",
    "video_url_3": "01/00-View3-Part1.mp4",
    "video_url_4": "01/00-View4-Part1.mp4",
    "video_url_5": "01/00-View5-Part1.mp4",
    "tricks": [
        {
            "start": 3.2472731152647976,
            "end": 6.1332581718146235,
            "labels": ["Sitdown"]
        },
        {
            "start": 7.524156360433797,
            "end": 59.07342151340292,
            "labels": ["ReadBook"]
        }
    ]
}
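A minimal sketch of reading these annotations, assuming train.json (and likewise test.json / all_labels.json) contains a list of records in the format shown above; no fields beyond those shown are assumed.
import json
from pathlib import Path

root = Path("dataset/home1")  # local directory used in the download commands above

# Load the training split; each record links the five synchronized camera views
# and a list of labeled temporal segments under the "tricks" key.
with open(root / "train.json", "r", encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    # Paths to the five camera views for this record, relative to the dataset root.
    view_paths = [root / record[f"video_url_{v}"] for v in range(1, 6)]
    # Labeled temporal segments: start/end are in seconds, labels is a list of class names.
    for segment in record["tricks"]:
        print(view_paths[0].name, segment["start"], segment["end"], segment["labels"])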
The original dataset at full resolution (4000×3000 pixels) is available upon request; please send requests to the contact address listed below.
When using this dataset, please cite our papers:
@article{nguyen2026PR,
  title     = {MultiSensor-Home: Multi-modal multi-view dataset and benchmarks for action recognition in home environments},
  author    = {Trung Thanh Nguyen and Yasutomo Kawanishi and Vijay John and Takahiro Komamizu and Ichiro Ide},
  journal   = {Pattern Recognition},
  pages     = {113810},
  year      = {2026},
  issn      = {0031-3203},
  doi       = {10.1016/j.patcog.2026.113810},
  url       = {https://www.sciencedirect.com/science/article/pii/S0031320326007752}
}

@inproceedings{nguyen2025multisensor,
  title     = {MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion},
  author    = {Trung Thanh Nguyen and Yasutomo Kawanishi and Vijay John and Takahiro Komamizu and Ichiro Ide},
  booktitle = {Proceedings of the 19th IEEE International Conference on Automatic Face and Gesture Recognition},
  year      = {2025},
  doi       = {10.1109/FG61629.2025.11099071},
  note      = {Best Student Paper Award}
}

@inproceedings{nguyen2025multisensor_DS,
  title     = {MultiSensor-Home: Benchmark for Multi-modal Multi-view Action Recognition in Home Environments},
  author    = {Trung Thanh Nguyen},
  booktitle = {Proceedings of the 7th ACM International Conference on Multimedia in Asia},
  year      = {2025},
  doi       = {10.1145/3743093.3771657},
  note      = {Doctoral Symposium}
}
We welcome contributions and feedback. If you find any issues or have suggestions for improvements, please contact us.
For questions about the dataset, paper, or to request the original high-resolution version:
Email: nguyent [at] cs.is.i.nagoya-u.ac.jp
This work was partly supported by Japan Society for the Promotion of Science (JSPS) KAKENHI JP21H03519 and JP24H00733.
This dataset is designed to advance research in multi-view action recognition, sensor fusion, and transformer-based approaches for understanding human activities in real-world environments.