App for AWS Elastic Beanstalk


Project flow, teamwork, and function implementations for a complete data science project within Lambda School Labs.

Putting it all together in a nutshell
I spent two months working on an inherited project with a team of web developers and data scientists: we collected and cleaned new data, performed data analysis and feature engineering, built new features with machine learning modeling techniques, delivered them via FastAPI endpoints, and eventually deployed the app to AWS Elastic Beanstalk backed by an AWS RDS database.

Requirements, Concepts, Architecture, and Planning
CitySpire is an app that analyzes city data such as population, cost of living, rental rates, crime rates, walk score, and a variety of other social and economic factors that influence where people choose to live. The aim of CitySpire is to provide users with the most up-to-date city details in one convenient location. The app must present critical city data in a user-friendly and easy-to-understand format.

Our task as data scientists was to gather and analyze city data such as housing costs, weather patterns, school data, job listings, and other variables, make predictions and forecasts for these factors, and recalculate a new livability score.

The inherited app already exposed API endpoints for crime rates, walk score, traffic, population, rental prices, air quality, city recommendations, and a livability score.

The first step was to design an app architecture on top of the one that was already in place:

Here’s a link to the Whimsical diagram for a closer look at the entire software architecture, including web and iOS, for the project: Diagram of the CitySpire Architecture.

The First Steps
Since our project’s objective had already been established, the next and most crucial step was to locate the data we would need. We had to gather housing, weather, job opening, and school data for each city and state, then clean and feature engineer it. After we divided tasks to work more effectively, I was in charge of finding weather data for each city for predictions and forecasting, while the other members of the team were tasked with sourcing housing rates, job listings, and education information.

Feature Engineering and Exploratory Data Analysis
Finding data online takes the most time; I spent the first two weeks searching for and collecting weather data for each region, then combining it: Github. The next step was to clean the data, handle missing values, and perform feature engineering. Since some features had more than half of their values missing, I decided to drop them. Other features had less than 20% missing values, so I filled them in with the mean or median. Before feeding the data to a model, it was also necessary to encode categorical values and scale all of the data. I used OrdinalEncoder for encoding and StandardScaler from the sklearn library for scaling. The GitHub repository for the exploratory data analysis and feature engineering process can be found here.
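As a small sketch of that preprocessing on toy data (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# toy frame standing in for the weather data (hypothetical columns)
df = pd.DataFrame({
    'AvgTemp': [21.0, np.nan, 19.5, 23.0],
    'Humidity': [40.0, 55.0, np.nan, 60.0],
    'Condition': ['sunny', 'rainy', 'sunny', 'cloudy'],
})

# fill numeric gaps with the mean (the median works the same way via .median())
for col in ['AvgTemp', 'Humidity']:
    df[col] = df[col].fillna(df[col].mean())

# encode the categorical column, then scale everything
df['Condition'] = OrdinalEncoder().fit_transform(df[['Condition']]).ravel()
scaled = StandardScaler().fit_transform(df)
```

After scaling, each column has zero mean and unit variance, which keeps features on comparable scales for modeling.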

Modeling of Data
I decided to create two functions: one for weather temperature forecasting and the other for weather conditions (sunny, rainy, cloudy, and snowy days) per year. Time series modeling was used for the first function, and aggregation for the second.

I chose Facebook Prophet because I needed to run many time series models. It is a tool that generates automatic forecasts that can be fine-tuned by hand. It handles non-linear patterns well and is robust to missing values and outliers, which is exactly what our project needed. Its only requirement is that the target column be renamed to ‘y’ and the date column to ‘ds’. In the code below I grouped the data by city so that I could produce a forecast for each city separately. Then I ran daily forecasts for the next two years and saved the results in CSV format.
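The rename itself is a one-line pandas operation; the original column names used here are hypothetical:

```python
import pandas as pd

# Prophet expects the date column to be named 'ds' and the target 'y'
# (the original column names below are hypothetical)
df = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02'],
                   'AvgTemp': [12.0, 14.5]})
df = df.rename(columns={'Date': 'ds', 'AvgTemp': 'y'})
df['ds'] = pd.to_datetime(df['ds'])
```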

# group cities for all data
cities = df_final.groupby(['City', 'State'], as_index=False)

# forecast for each city
for city in cities.groups:
    group = cities.get_group(city)

    # define model
    model = Prophet()

    # fit the data
    model.fit(group)

    # make a forecast for the next two years
    future = model.make_future_dataframe(periods=365*2, freq='D')
    forecast = model.predict(future)
    forecast = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
    forecast['City'] = city[0]
    forecast['State'] = city[1]

    # append the csv with the forecasted data
    forecast.to_csv('data/weather_2year_forecast.csv', mode='a',
                    index=False, header=False)

The following is an example of a forecasted temperature dataset, where ‘yhat’ is the expected average temperature, ‘yhat_lower’ the minimum predicted temperature, and ‘yhat_upper’ the maximum predicted temperature:

I used the sklearn library’s mean absolute error (MAE) and root mean square error (RMSE) metrics to evaluate the model’s performance.
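On toy numbers, the two metrics can be computed with sklearn like this (the values are illustrative, not the project’s actual results):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# actual vs. predicted temperatures (toy values)
y_true = np.array([20.0, 22.5, 19.0, 25.0])
y_pred = np.array([21.0, 22.0, 18.0, 24.5])

mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
```

RMSE weights large misses more heavily than MAE, which is why both together give a fuller picture of how the model handles extreme values.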

Sacramento, California, has decent results:

The performance is depicted in the graph below. The model forecasts average temperatures well, but it does not always catch extreme values. Overall, in my opinion, its performance is excellent.

The whole FB Prophet workflow can be found on GitHub.

Aggregating Weather Conditions
For the weather conditions, I grouped the values into four labels (sunny, cloudy, rainy, and snowy) and aggregated four years of historical weather records to generate a new CSV file with this info. Since some conditions (snow, for example) occur only in certain regions, I filled the resulting NaN values with zero. The complete code can be found on GitHub.
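A sketch of the labeling step, with hypothetical raw condition names and an illustrative mapping:

```python
import pandas as pd

# map raw condition strings onto the four labels (mapping is illustrative)
label_map = {
    'clear': 'sunny', 'partly cloudy': 'cloudy', 'overcast': 'cloudy',
    'drizzle': 'rainy', 'thunderstorm': 'rainy', 'sleet': 'snowy',
}

df = pd.DataFrame({'Condition': ['clear', 'overcast', 'sleet', 'drizzle']})
df['Label'] = df['Condition'].map(label_map)

# count days per label; cities lacking a condition (e.g. snow) end up as 0
counts = pd.get_dummies(df['Label']).sum()
counts = counts.reindex(['sunny', 'cloudy', 'rainy', 'snowy'], fill_value=0)
```

The `reindex(..., fill_value=0)` call plays the role of filling NaN with zero for conditions a city never sees.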

The following code snippet demonstrates how I computed the mean for each city and combined the results:

# create a new df with the calculated means
conditions_df = pd.DataFrame()

# group cities
cities = new_df.groupby(['City', 'State'], as_index=False)

for city in cities.groups:
    group = cities.get_group(city)

    # get the city and state columns
    series1 = pd.Series({'City': city[0], 'State': city[1]})

    # get the mean, round, and convert float to int
    series2 = pd.Series(group.mean().round().astype('int64')[:4])

    # concatenate the two series
    concatenated = pd.concat([series1, series2])

    # append to the df
    conditions_df = conditions_df.append(concatenated, ignore_index=True)

The following is an example of aggregated weather data:

Weather Conditions dataset

Configuring FastAPI
FastAPI was easy to set up because we inherited the project and the previous data science team left us excellent guidance. It was just a matter of cloning the git repository and setting up the local environment, which downloaded and installed all of the project’s dependencies.

Elastic Beanstalk on AWS
To deploy to Elastic Beanstalk, I first had to install the AWS CLI on my computer, using the installer for my operating system. After installing it, I ran aws --version to make sure I got a version number rather than a “command not found” error. The next step was to install the AWS Elastic Beanstalk CLI with pip install awsebcli and verify it with eb --version.

I needed to follow these steps to deploy the local FastAPI to AWS EB with the newly developed endpoints:

Then, to configure the AWS CLI, I had to run aws configure and enter my access keys when prompted. Once that is done, subsequent commits can be re-deployed with just eb deploy, and eb open shows the deployed changes.

AWS RDS Postgres
Setting up a database in AWS is not difficult. Log into the AWS console and select a region. Then, in the EC2 service, create a security group and add an inbound rule of type ‘PostgreSQL’ with source ‘Anywhere,’ leaving the rest of the settings alone. Then, in the RDS service, create a standard database with the Postgres engine, and in the VPC security group settings choose your newly created security group via “Choose existing.”

I used python-dotenv to read the database address from our .env file and SQLAlchemy to create a database connection and populate the database with the new data.
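A minimal sketch of that wiring (the DATABASE_URL variable name and the table name are assumptions; it falls back to in-memory SQLite so it can run without a live Postgres instance):

```python
import os
import pandas as pd
from sqlalchemy import create_engine

# python-dotenv loads DATABASE_URL from the .env file when available
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

def populate_table(df, table_name, url=None):
    """Write a DataFrame to the database; SQLite fallback for local runs."""
    engine = create_engine(url or os.getenv('DATABASE_URL', 'sqlite://'))
    df.to_sql(table_name, engine, if_exists='replace', index=False)
    return engine
```

With the real Postgres URL in .env, the same call pushes the forecast CSVs into RDS unchanged.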

Endpoints of the FastAPI
Since I was working on the weather forecasting model and the weather conditions aggregation, I had to build endpoints to serve this data. I introduced two endpoints, /api/weather_daily_forecast and /api/weather_monthly_forecast, which provide daily and monthly temperature forecasts for the next two years, and a third endpoint, /api/weather_conditions, which provides the average number of sunny, cloudy, rainy, and snowy days per year. My teammates took care of the rest of the endpoints. Once the project was deployed to Elastic Beanstalk and the link to RDS was properly set up, all of our endpoints (marked in the red square) were up and running:

I’ve also added two endpoints to visualize the monthly weather temperature forecasts, /api/weather_forecast_graph, and the weather conditions per year, /api/weather_conditions_graph:

Visualization
I used the Plotly library to visualize the weather forecasts and conditions for the next two years. I generated graph objects with Plotly and converted them to JSON for the API endpoints; the front-end app then renders them as interactive graphs with the react-plotly.js library:

Is there anything else that could be done?
CitySpire is still a work in progress; its functionality can be expanded in many directions. Unfortunately, we didn’t have enough time to finish release 2, which required revisiting the current livability score and seeing whether we could improve it by incorporating the newly developed features such as housing costs, weather, job listings, and school data. We intended to reuse the existing functionality for a livability score that included our features. The web team also proposed letting users adjust the variables most relevant to them, assigning weights based on user preferences, and calculating the livability score from those weights. Furthermore, we could add labels to make it easier for users to decide whether a city is the place of their dreams.

What other features do you think you should incorporate in the future? There are a few that come to mind:

Show socioeconomic variables such as wealth, wages, occupation, religion, education level, family size, purchasing habits, and so on.
Show historical and cultural elements, as well as recreational areas, such as tourist attractions, museums, landmarks, and parks. It may be a list of the top ten places to visit in each city or state;

Social supports, such as child care, medical care, community protection, and others, are also critical considerations.

Sources of Information:
Weather: Data on the weather
Job postings: Info on job postings
Housing Prices: Info on Housing Prices
Boards of Education: Boards of Education data