Graduation is a few months away, and it is important to me to know what awaits me. The dataset I chose consists of Data Scientist job listings taken from Glassdoor.
Objectives
- In what kinds of jobs do salaries tend to be higher? (Job Title, Salary Estimate)
- Do salary levels differ according to a job’s location?
- Which kinds of companies will pay a higher salary? (Company, Size, Industry, Revenue)
Limitations and assumptions:
- Salary estimates are taken from Glassdoor and are not necessarily indicative of actual salaries.
- The dataset was published in July 2020 and only reflects the listings from that time period.
Hypothesis:
Even if people lost their jobs in the middle of the pandemic, there are still job openings.
I started by importing the libraries, loading my data from a local file, and cleaning the dataset. I removed unnecessary columns using .drop(). Luckily, the first check did not show any missing values.
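The loading and dropping steps can be sketched roughly as follows. This is only an illustration: the in-memory CSV, its sample rows, and the dropped column names ('index', 'Job Description') are assumptions, since the original file name and the removed columns are not given.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the local CSV file (made-up values).
csv_text = io.StringIO(
    "index,Job Title,Salary Estimate,Job Description\n"
    "0,Data Scientist,$100K-$150K (Glassdoor est.),Build models\n"
    "1,Data Analyst,$60K-$90K (Glassdoor est.),Write reports\n"
)
ds = pd.read_csv(csv_text)

# Drop columns that are not needed for the analysis (column names assumed here).
ds = ds.drop(columns=['index', 'Job Description'])

# Count missing values per column.
missing_counts = ds.isnull().sum()
print(missing_counts)
```

With the real file, pd.read_csv would take the path to the local file instead of the in-memory buffer.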
# Remove the rating value that follows the newline in Company Name.
ds['Company Name'] = ds['Company Name'].str.split('\n', n=1).str[0]
# .str[0] keeps the first part of the split; the second part (the rating) is discarded.
# str.split(pat, n): n is the maximum number of splits; the default -1 means all occurrences.
# Drop the parenthesized note that follows the salary range.
ds['Salary Estimate'] = ds['Salary Estimate'].str.split('(', n=1).str[0]
# Exclude hourly salaries.
ds = ds[~ds['Salary Estimate'].str.contains(' Per Hour')].reset_index(drop=True)
# Split the salary range into two columns: minimum salary and maximum salary.
# lstrip removes leading characters; rstrip removes trailing characters.
salary_parts = ds['Salary Estimate'].str.split('-', n=1, expand=True)
ds['Min_Salary'] = salary_parts[0].str.strip().str.lstrip('$').str.rstrip('K').fillna(0).astype('int')
ds['Max_Salary'] = salary_parts[1].str.strip().str.lstrip('$').str.rstrip('K').fillna(0).astype('int')
# For the analysis, each salary range is summarized by one number: Est_Salary = (Min_Salary + Max_Salary) / 2
ds['Est_Salary'] = (ds['Min_Salary'] + ds['Max_Salary']) / 2
# Company size could be summarized the same way (not used): Est_Size = (Min_Size + Max_Size) / 2
#ds['Est_Size']=(ds['Min_Size']+ds['Max_Size'])/2
# Separate 'City' and 'State' from the job 'Location'.
ds[['City', 'State']] = ds['Location'].str.split(', ', n=1, expand=True)
# Clean up State values that still contain a city or county name.
ds['State']=ds['State'].replace('Arapahoe, CO','CO')
ds['State']=ds['State'].replace('Los Angeles, CA','CA')
ds['State']=ds['State'].replace('NY (US), NY','NY')
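As an alternative to chaining strip calls, the salary bounds could also be pulled out with a regular expression. The sketch below is not the method used above. It assumes the salary strings look like the made-up samples shown, and it would fail on values that do not match the pattern.

```python
import pandas as pd

# Made-up sample values in the format the cleaning code above expects.
sample = pd.DataFrame({
    'Salary Estimate': [
        '$111K-$181K (Glassdoor est.)',
        '$10-$26 Per Hour(Glassdoor est.)',
        '$75K-$131K (Glassdoor est.)',
    ],
})

# Drop hourly listings before parsing.
sample = sample[~sample['Salary Estimate'].str.contains('Per Hour')].reset_index(drop=True)

# Capture the two numbers in "$<min>K-$<max>K"; expand=True returns one column per group.
bounds = sample['Salary Estimate'].str.extract(r'\$(\d+)K\s*-\s*\$(\d+)K', expand=True).astype(int)
sample['Min_Salary'] = bounds[0]
sample['Max_Salary'] = bounds[1]

# Midpoint of the range, in thousands of dollars.
sample['Est_Salary'] = (sample['Min_Salary'] + sample['Max_Salary']) / 2
print(sample)
```

The pattern makes the expected format explicit, so an unexpected value raises an error at the astype(int) step instead of being silently turned into 0.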
I rechecked for missing values.
Job Title 0
Salary Estimate 0
Rating 405
Company Name 0
Location 0
Headquarters 240
Size 229
Founded 970
Type of ownership 229
Industry 543
Sector 543
Revenue 229
Competitors 2743
Easy Apply 3725
Min_Salary 0
Max_Salary 0
Est_Salary 0
City 0
State 0
dtype: int64
As you can see in the output above, there are a lot of missing values. ‘Easy Apply’ and ‘Competitors’ have the highest number of missing values (more than 50% of the rows).
My assumption is that most companies have null values in Easy Apply because they are not hiring through that option at the moment.
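To check a claim like "more than 50% missing", the share of missing values per column can be computed directly. The following is a minimal sketch on a made-up five-row frame; the values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; NaN marks a missing entry.
toy = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Analyst', 'ML Engineer', 'Data Engineer', 'Statistician'],
    'Competitors': [np.nan, 'A, B', np.nan, np.nan, 'C'],
    'Easy Apply': [np.nan, np.nan, np.nan, True, np.nan],
})

# isnull() gives True/False per cell; the column mean is the fraction missing.
missing_share = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Columns where more than half of the rows are missing.
mostly_missing = missing_share[missing_share > 50].index.tolist()
print(missing_share)
print(mostly_missing)
```

On the real data, the same two lines applied to ds would show which columns cross the 50% threshold.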
I looked at a few aspects to gauge my chances of finding a good job, using bar charts and explanatory visualizations: for example, the current openings, the top 20 cities with their minimum and maximum salaries, and the size of employees vs. the number of companies.
This bar graph shows the current openings.
The openings are distributed roughly uniformly across the top 10 companies. The chart shows the top 10 companies hiring for Data Analyst roles in 2020.
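The counts behind a "top companies by openings" chart can be obtained with value_counts. This is a sketch on invented company names, assuming each row of the dataset is one job listing.

```python
import pandas as pd

# Invented company names; each row stands for one job listing.
toy = pd.DataFrame({
    'Company Name': ['Acme', 'Globex', 'Acme', 'Initech', 'Globex', 'Acme'],
})

# value_counts sorts by frequency in descending order; head keeps the top entries.
top_companies = toy['Company Name'].value_counts().head(2)
print(top_companies)
```

With the real data, head(10) would give the top 10 companies, and top_companies.plot(kind='bar') would draw the bar graph if matplotlib is installed.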
This chart shows the minimum and maximum salaries.
The minimum salary distribution is unimodal; the maximum salary distribution is asymmetric and bimodal.
From this, we learn that we have a greater chance of receiving a salary in the maximum salary range.
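A table of minimum and maximum salaries per city, such as the one behind the top 20 cities view, could be built with a groupby. The sketch below uses invented cities and salaries (in thousands of dollars) and ranks cities by number of openings, which is an assumption about how "top" was defined.

```python
import pandas as pd

# Invented listings; salaries are in thousands of dollars.
toy = pd.DataFrame({
    'City': ['Austin', 'Austin', 'Austin', 'Boston', 'Boston', 'Chicago'],
    'Min_Salary': [80, 90, 85, 100, 70, 60],
    'Max_Salary': [120, 140, 130, 150, 110, 95],
})

# Per city: lowest minimum, highest maximum, and number of listings.
city_salary = (
    toy.groupby('City')
       .agg(
           Min_Salary=('Min_Salary', 'min'),
           Max_Salary=('Max_Salary', 'max'),
           Openings=('Min_Salary', 'size'),
       )
       .sort_values('Openings', ascending=False)
       .head(2)
)
print(city_salary)
```

On the full dataset, head(20) would keep the 20 cities with the most listings.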
This bar graph shows the job titles with the most openings that are currently hiring.
The chart is not skewed to the right; job titles are categories with no inherent order or correlation, so skewness does not really apply here.
In conclusion, the bar graph indicates that the most openings are in the Data Science field, so as a new graduate looking to be hired, I’ll start with that field.