GSOC2024 ML/DL Starter Problem
This is a simple toy problem meant as a pre-application exercise for the GSOC2024 project Astronomical data enhancement with DL.
Overview
This is a simplified version of the actual project, with no time information. We have measurements of galaxies in different wavelengths (i.e., broadband filters) in five different fields, and the task is to bring them all onto the same wavelength footing. A simple notebook to read the galaxy data in the initial filters is in this repository (gsoc-ML-exercise/ReadCandels.md).
Link to the problem published on the Open Astronomy GitHub: Published Problem
Instructions
- Clone this repo and check out the branch gsoc-ML-exercise.
- Write code.
- Required: A simple way to combine all five fields in optical and NIR filters and output one file in the requested wavelengths.
- Optional: Use ML or DL to do this combination.
- Optional: Add plots to show what you did makes sense.
- Optional: Use prior information in ML by grouping galaxies at similar redshifts/stellar masses/etc., which are columns in the initial catalogs.
- Required: Write text. 300 words max, included as a '.md' file.
- Explain what you did and how you approached the problem.
- For any code that you did not write but would if you had more time, write down what you would do.
- Required: Open a PR with your code and writeup to merge to the main branch. Have GSoC2024 in the title of your PR.
Proposed Solution
Here I present the solution I implemented for this DL Starter Problem (GSoC2024).
Link to GitHub: Solution Pull Request
Solution Description
First, I implemented a function that linearly interpolates the measurements; the results offered a good first approximation. I then implemented polynomial interpolation, exploring which degree of the polynomial works best. This method draws on a wider range of information from the measurements and improves on the linear solution.
Subsequently, I searched for ML models better suited to predicting these measurements. I started with a Support Vector Regression (SVR) model, which outperformed the previous methods after brief parameter tuning: it harnessed more detailed information from the data, achieving superior fits and capturing complex relationships more accurately. Finally, I experimented with deep neural networks, but overfitting, high computational demands, and the poor approximations obtained made them less feasible and less appropriate for the project's scope.
Results
For each interpolation method, the following graphs show three sample galaxies, comparing the original measurements with the predicted ones; the fitted interpolation function is also plotted to show how each method adjusts to the measurements.
Linear Interpolation
We can clearly see how the interpolated points at the common wavelengths are obtained directly from the linear interpolation. This is a very simple way to obtain the measurements, although more advanced interpolation methods can offer a more reliable solution based on a wider range of information from the available measurements.
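As an illustration, this is roughly what the linear step looks like with NumPy; the array names and values below are placeholders for one galaxy's filter wavelengths and fluxes, not the actual catalog columns:

```python
import numpy as np

# Placeholder per-galaxy data: effective wavelengths (Angstrom) of the
# filters available in this galaxy's field, and the fluxes measured there.
filter_wavelengths = np.array([4350.0, 5450.0, 7930.0, 12500.0, 16000.0])
fluxes = np.array([0.8, 1.1, 1.5, 2.0, 1.7])

# Common wavelength grid onto which all five fields are mapped.
target_wavelengths = np.array([5000.0, 9000.0, 14000.0])

# np.interp performs piecewise-linear interpolation; points outside the
# measured range are clipped to the edge values.
interpolated = np.interp(target_wavelengths, filter_wavelengths, fluxes)
```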
Polynomial Interpolation
The outcomes obtained with polynomial interpolation appear to be more grounded in the information and potential relationships between measurements than those of linear interpolation. This method offers a well-adjusted interpolation that aligns closely with both the general trend and the specific data points.
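A sketch of the degree exploration, reusing the placeholder arrays from above; here the best degree is picked by leave-one-out error, which is one reasonable way to balance under- and overfitting (the selection criterion in my actual code may differ):

```python
import numpy as np

filter_wavelengths = np.array([4350.0, 5450.0, 7930.0, 12500.0, 16000.0])
fluxes = np.array([0.8, 1.1, 1.5, 2.0, 1.7])
target_wavelengths = np.array([5000.0, 9000.0, 14000.0])

# Rescale wavelengths so higher-degree fits stay well conditioned.
x = filter_wavelengths / 1e4
x_target = target_wavelengths / 1e4

def loo_error(degree):
    # Leave-one-out error: refit without each point, then predict it back.
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        coeffs = np.polyfit(x[mask], fluxes[mask], degree)
        errs.append((np.polyval(coeffs, x[i]) - fluxes[i]) ** 2)
    return np.mean(errs)

# Keep the candidate degree with the lowest held-out error.
best_degree = min(range(1, 4), key=loo_error)
coeffs = np.polyfit(x, fluxes, best_degree)
interpolated = np.polyval(coeffs, x_target)
```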
SVR Interpolation
Interpolating with a Support Vector Regression (SVR) model yields significantly improved outcomes, indicating that this approach effectively leverages a broader spectrum of information. Unlike simpler interpolation methods, SVR captures complex relationships within the data by learning a detailed curve that better represents the underlying wavelength dependence. The total computation time with this model is affordable, at roughly five minutes.
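A minimal sketch of the SVR fit with scikit-learn, again on the placeholder arrays; the kernel choice and the C/epsilon values shown are assumptions standing in for the briefly tuned parameters mentioned earlier:

```python
import numpy as np
from sklearn.svm import SVR

filter_wavelengths = np.array([4350.0, 5450.0, 7930.0, 12500.0, 16000.0])
fluxes = np.array([0.8, 1.1, 1.5, 2.0, 1.7])
target_wavelengths = np.array([5000.0, 9000.0, 14000.0])

# Rescale wavelengths so the RBF kernel length-scale is well conditioned.
x = (filter_wavelengths / 1e4).reshape(-1, 1)
x_target = (target_wavelengths / 1e4).reshape(-1, 1)

# RBF-kernel SVR fitted per galaxy; C and epsilon control the
# flexibility of the learned curve.
model = SVR(kernel="rbf", C=10.0, epsilon=0.01)
model.fit(x, fluxes)
predicted = model.predict(x_target)
```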
DNN Interpolation
With a deep neural network model, the results are less accurately fitted. Under extensive training the model tends to overfit, producing curves that resemble the linear interpolation, while insufficient training leads to outcomes that do not align well with the expected values. Moreover, the training demands are high for the scope of this project, presenting practical constraints in time and computational resources. This suggests that, although deep neural networks offer powerful modeling capabilities, they may not be the most efficient or effective choice for projects with limited resources or for data with these specific underlying patterns.
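For completeness, a small stand-in for this experiment using scikit-learn's MLPRegressor instead of a full deep-learning framework; the architecture and max_iter below are illustrative assumptions, with max_iter being the knob behind the over-/undertraining trade-off discussed above:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

filter_wavelengths = np.array([4350.0, 5450.0, 7930.0, 12500.0, 16000.0])
fluxes = np.array([0.8, 1.1, 1.5, 2.0, 1.7])
target_wavelengths = np.array([5000.0, 9000.0, 14000.0])

x = (filter_wavelengths / 1e4).reshape(-1, 1)
x_target = (target_wavelengths / 1e4).reshape(-1, 1)

# A small fully connected network; more iterations tend to overfit,
# fewer leave the curve underfitted.
model = MLPRegressor(hidden_layer_sizes=(32, 32), activation="relu",
                     max_iter=2000, random_state=0)
model.fit(x, fluxes)
predicted = model.predict(x_target)
```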
Final Conclusion
Upon reviewing the outcomes of the various interpolation methods, the Support Vector Regression (SVR) model stands out as the most promising. This approach encapsulates a broader array of information from the data measurements, demonstrating superior adaptability and precision in its fit compared to the other techniques. Unlike polynomial interpolation, which required careful balancing between degrees to avoid overfitting or underfitting and whose results are limited by the properties of polynomials, and the deep neural network model, which faced challenges with overfitting and high computational demands, the SVR model effectively captures the complex relationships within the data while keeping training costs affordable.
Future Improvements
With more time, the parameters of the ML methods could be fine-tuned via cross-validation on the data we have, to improve the results further. I could also investigate other ML models that have performed well on similar problems in the past. Finally, another way to approach this problem could be to use pretrained interpolation models, or to search for similar data with which to train the ML models and improve their results.
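As a sketch of what that fine-tuning could look like, here is a cross-validated grid search over SVR hyperparameters with scikit-learn; the synthetic X and y below merely stand in for the stacked (wavelength, flux) pairs from the real catalogs, and the parameter grid is an assumption:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic placeholder data standing in for stacked catalog measurements.
rng = np.random.default_rng(0)
X = rng.uniform(0.4, 1.6, size=(60, 1))                # wavelengths / 1e4 Angstrom
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=60)   # fluxes

param_grid = {"C": [1.0, 10.0, 100.0],
              "epsilon": [0.001, 0.01, 0.1],
              "gamma": ["scale", 0.1, 1.0]}

# 5-fold cross-validated grid search over the SVR hyperparameters.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```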