In this blog series, we’re proud to shine a light on some of the top Capstone projects from the fifth graduating class of Data Science for All / Colombia. Capstone projects are a critical component of the program’s curriculum. Teams work on projects together to apply what they learn during the program to a real-world data problem. These projects were sourced directly from public and private entities in Colombia and solve a real problem these entities are facing. Through these projects, our graduates learn practical, job-oriented data skills and give back to their community using the power of data science & AI.
Meet the Team
The project is the work of a multidisciplinary team, which allows have different points of view about how to solve the problem for the Institute. Our team members combined their strengths and skills to accomplish tasks and support other members, the team cohesion was key factor in our team. The team is composed by:
Industrial engineer, Master of Science in engineering focused on applied statistics. areas of interest: linear regression, multivariate statistics, machine learning and astrostatistics. Member of the International Astrostatistics Association (IAA). He is coauthor of some articles related to statistics applied to astronomy and mechanical engineering.
Administrative engineer, graduate diploma in economy and MBA, she likes create strategies based on data analysis. Her background includes lead areas of marketing and business intelligence in different economic sectors.
Business administrator with background in Customer services. He likes data science.
About the Project:Instituto Geográfico Agustín Codazzi
The Project is called CATS, which stands for Clasificador Automático Taxonómico de Suelos in Spanish, or Automatic Taxonomic Soil Classifier.
Soil Taxonomic Classification Automation because we want help to drive Agriculture development.
This project was conceived out of the necessity from the Instituto Geográfico Agustín Codazzi (IGAC) of having a tool capable of performing automatic soil taxonomy classification, using data collected in the field.
This application seeks to aid and make easier the process of taxonomic soil classification performed by the edaphologists from the Instituto, whose goal is to make an inventory and cartography of the soils in Colombia. For this purpose, several models were fitted and tested using a database with more than 12,000 observations, which recorded 209 different variables, collected in the Cundiboyacense Plateau.
Click to read the datafolio
Finalist Capstone Project Presentation
What was the most exciting/surprising findings from your project?
After fitting, validating, and testing the model accuracy was about 94% despite of data quality. Building a model that give such accuracy make us to feel proud about our data cleaning and the excellent job in the EDA.
What were some challenges you faced and how did you overcome them?
The main challenge was to clean the data which had 199 features because poor data quality and many null values, doing the analysis and the modeling tasks difficult. In fact, there were variables that we need to recategorize since has about 200 categories. On the other hand, design a data model for the application to help improve data quality and avoid future data problems was an interesting challenge.
What do you view as the impact of your project?
We created an application that is scalable, so we see it in the future working in two sites, one private and when the IGAC load new data, and make maintenance of the model(s) and based on that a public site where farmers can go to know about the taxonomical classifications of their soils and look for the best crops to plant in their farms based on the soil classification.
Congratulations to this team, their mentors, and TA, for this accomplishment!
If you're interested in joining our Data Science for All mission to recruit our Data Science for All fellows or to become aMentor,please get in touch.