We are seeking excellent candidates with a PhD in Computer Science, Remote Sensing or a similar field who are interested in conducting research on the detection of archaeological sites by applying machine learning techniques to multispectral aerial images. Contract duration: 22 months.
The context of this project is an ongoing collaboration between the Catalan Institute of Classical Archaeology (ICAC) in Tarragona and the Computer Vision Center (CVC) in Barcelona. The contract is linked to the project Mapping Archaeological Heritage in South Asia (MAHSA), funded by the Arcadia Fund and developed as a collaboration between the University of Cambridge, Pompeu Fabra University and the Catalan Institute of Classical Archaeology.
Research background
In our previous research we developed an algorithm in JavaScript for Google Earth Engine that detects archaeological sites using combined satellite data that are both multi-source (Sentinel 1 SAR and Sentinel 2 multispectral) and multitemporal (merging 6 years of data for each of the included bands).
The bands included in the composite raster are as follows:
- Sentinel 1 (4 bands, 1427 merged images): ascending VV and VH, descending VV and VH
- Sentinel 2 (10 bands, 2914 images): B2-B8, B8A, B11 and B12
The algorithm used a Random Forest with 300 trees in probability mode and was trained using pixels from the composite image at points where we knew there were archaeological sites (Orengo et al. 2020). In this new project, we are interested in modifying this algorithm to further automate the process and include new bands such as SMTVI (Orengo and Petrie 2017), MSRM (Orengo and Petrie 2018) and thermal data derived from Landsat 7 that we think may be useful to differentiate archaeological sites. Dividing these data into seasonal groups will show significant changes in vegetation cover, humidity, and temperature that may help to distinguish archaeological sites. This will lead to a significant increase in the number of bands of the composite raster. We would make these new band additions in collaboration with the candidate, if applicable.
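To illustrate what "probability mode" means here, the following is a minimal sketch in pure Python. It is not the Random Forest used in the previous research: it swaps in a toy bagged ensemble of single-band threshold rules (decision stumps), and the function names, the two-band pixel values and the seed are all hypothetical. The point it shows is that the output for a pixel is the fraction of ensemble members voting "site", a value between 0 and 1, rather than a hard class label.

```python
import random


def fit_stump(samples, labels, band):
    """Find the threshold on one band that best separates sites (1) from background (0)."""
    best = None
    for s in samples:
        t = s[band]
        for polarity in (1, -1):
            # A pixel is predicted "site" when polarity * (value - t) >= 0.
            errors = sum(
                (1 if polarity * (x[band] - t) >= 0 else 0) != y
                for x, y in zip(samples, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, t, polarity)
    _, threshold, polarity = best
    return band, threshold, polarity


def fit_ensemble(samples, labels, n_trees=300, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and one randomly chosen band."""
    rng = random.Random(seed)
    n_bands = len(samples[0])
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(samples)) for _ in samples]  # bootstrap resample
        band = rng.randrange(n_bands)                         # random band per stump
        stumps.append(
            fit_stump([samples[i] for i in idx], [labels[i] for i in idx], band)
        )
    return stumps


def site_probability(stumps, pixel):
    """Probability output: the fraction of stumps that vote 'site' for this pixel."""
    votes = sum(1 for band, t, pol in stumps if pol * (pixel[band] - t) >= 0)
    return votes / len(stumps)


# Made-up two-band training pixels: known sites (1) and background (0).
sites = [(0.80, 0.90), (0.70, 0.80), (0.90, 0.70), (0.85, 0.75)]
background = [(0.10, 0.20), (0.20, 0.10), (0.30, 0.25), (0.15, 0.30)]
samples = sites + background
labels = [1] * len(sites) + [0] * len(background)

stumps = fit_ensemble(samples, labels)
p_site = site_probability(stumps, (0.95, 0.95))
p_background = site_probability(stumps, (0.05, 0.05))
print(p_site, p_background)
```

In practice a library Random Forest with full decision trees would replace the stumps; the averaging of votes into a per-pixel probability works the same way.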
Additions to the algorithm would be:
- Automatic classification of the multiband raster of the study area to divide it into areas with a similar environment. An unsupervised classification would suffice. We have already done this in Google Earth Engine, and it would be possible to export the classified areas from there.
- Automatic selection and classification of the training data provided by the user (for each of the areas in which the previous process has divided the raster) to ensure that these are complete and that they correspond to the bands of the multiband image and not subjective visual perceptions.
- Initial classification test to select the bands with a real impact on the classification result. This way the classification can be performed using only the significant bands, saving considerable processing.
- Probabilistic classification of the various sectors of the study area (0-1) using the training data provided by the user.
- Ability to select new positive and negative data to improve classification in successive iterations.
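As an illustration of the first addition in the list above (unsupervised division of the raster into areas with a similar environment), here is a minimal sketch of k-means clustering on per-pixel band vectors, written in pure Python. The choice of k-means, the function name `kmeans`, the deterministic initialisation and the sample values are assumptions made for this sketch; the text only asks for some unsupervised classification.

```python
import math


def kmeans(pixels, k, n_iter=20):
    """Lloyd's algorithm on a list of per-pixel band vectors.

    Returns (labels, centroids), where labels[i] is the zone of pixels[i].
    """
    # Deterministic start (an assumption of this sketch): take k pixels
    # spread evenly through the input order as initial centroids.
    step = len(pixels) / k
    centroids = [list(pixels[int(i * step)]) for i in range(k)]
    labels = [0] * len(pixels)
    for _ in range(n_iter):
        # Assignment step: each pixel joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in pixels
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids


# Made-up two-band pixels from two visibly different environments.
pixels = [
    (0.10, 0.10), (0.12, 0.09), (0.90, 0.95),
    (0.88, 0.90), (0.11, 0.13), (0.92, 0.91),
]
labels, centroids = kmeans(pixels, k=2)
print(labels)
```

Training data could then be handled separately for each resulting zone, as the second item in the list describes.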
The addition of new bands (up to a total of 28-34) would increase the computational load and prevent the continued use of Google Earth Engine for processing. Therefore, the candidate will need to adapt the algorithm to use their own HPC resources or cloud services. This may involve porting the algorithm to another language, such as Python. However, the code should still connect to the Google Earth Engine platform to obtain the multiband raster, as this is the only platform that can provide this type of information. Earth Engine can be accessed using Python code from a Jupyter Notebook, Colab or similar.
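A possible starting point for fetching the composite from Python is sketched below. It assumes the `earthengine-api` package is installed and that `ee.Initialize()` has already been called with valid credentials. The dataset IDs, the mean and median reducers, the output band names and the function names are assumptions made for this sketch, not details of the original JavaScript algorithm, and no cloud masking is applied.

```python
S1_PASSES = ["ASCENDING", "DESCENDING"]
S1_POLARISATIONS = ["VV", "VH"]
S2_BANDS = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12"]


def composite_band_names():
    """Hypothetical output names for the 4 Sentinel 1 + 10 Sentinel 2 bands."""
    s1 = [f"S1_{orbit[:3]}_{pol}" for orbit in S1_PASSES for pol in S1_POLARISATIONS]
    s2 = [f"S2_{band}" for band in S2_BANDS]
    return s1 + s2


def build_composite(region, start, end):
    """Build a 14-band multitemporal composite as a server-side ee.Image.

    region: an ee.Geometry; start, end: date strings such as '2015-01-01'.
    """
    import ee  # imported lazily; requires a prior, authenticated ee.Initialize()

    layers = []
    for orbit in S1_PASSES:
        for pol in S1_POLARISATIONS:
            layer = (
                ee.ImageCollection("COPERNICUS/S1_GRD")
                .filterBounds(region)
                .filterDate(start, end)
                .filter(ee.Filter.eq("instrumentMode", "IW"))
                .filter(ee.Filter.eq("orbitProperties_pass", orbit))
                .filter(ee.Filter.listContains("transmitterReceiverPolarisation", pol))
                .select(pol)
                .mean()  # assumption: temporal mean as the merge step
            )
            layers.append(layer)

    s2 = (
        ee.ImageCollection("COPERNICUS/S2_HARMONIZED")
        .filterBounds(region)
        .filterDate(start, end)
        .select(S2_BANDS)
        .median()  # assumption: temporal median as the merge step
    )
    layers.append(s2)

    # Stack all layers and rename them in the same order as they were built.
    return ee.Image.cat(*layers).rename(composite_band_names())


print(composite_band_names())
```

The returned image is only a description of a server-side computation. Getting the pixels onto HPC or cloud storage would require an additional export or download step, which is not shown here.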
The last step would be to create a front-end that gives access to the algorithm and resources for its training (uploading files with location of sites, selection of new data in successive iterations, etc.). This last step, however, would not fall within the functions of the position. Subcontracting this service will be considered. In any case, the algorithm should be ready to serve as a back-end.
References
- Orengo et al. 2020. https://www.pnas.org/content/117/31/18240
- Orengo and Petrie 2017. https://www.mdpi.com/2072-4292/9/7/735
- Orengo and Petrie 2018. https://onlinelibrary.wiley.com/doi/full/10.1002/esp.4317