Predicting Travel Using Flickr

 

People love taking photographs of beautiful places. In chasing the perfect shot, many travel to new places and record their experiences. In the following predictive analysis, I used a database of 100 million photographs and videos from Flickr to predict where people will choose to travel over time.

 

Preprocessing, Cleaning and Manipulation of Information

The Flickr database includes the following fields of data: 

  • Photo/video ID
  • User NSID, User nickname
  • Date taken
  • Date uploaded
  • Capture device
  • Title, Description
  • User tags (comma-separated), Machine tags (comma-separated)
  • Longitude, Latitude
  • Accuracy
  • Photo/video page URL, Photo/video download URL
  • License name, License URL
  • Photo/video server identifier, Photo/video farm identifier
  • Photo/video secret, Photo/video secret original
  • Photo/video extension original
  • Photos/video marker (0 = photo, 1 = video)

Cleaning was broken down into the following steps:

  1. Choosing only photographs that have a date taken listed as after 1850
  2. Removing any camera brand with the word "scan" in its name, which eliminates photographs that were scanned in
  3. Binning the remaining camera brands, putting any that account for less than 1% of photos into an "Other" category
  4. Analyzing the dataset from 2000 until 2014, which roughly corresponds to the beginning of Flickr onward
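
The four steps above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the analysis: the record layout (dicts with `date_taken` and `device` keys), the function name `clean_records`, and the `min_share` parameter are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def clean_records(records, min_share=0.01):
    """Apply the four cleaning steps to a list of photo records.

    Each record is assumed to be a dict with a `date_taken` datetime and a
    `device` string (hypothetical field names, not the Flickr schema).
    """
    # Step 1: keep only photos with a date taken after 1850.
    kept = [r for r in records if r["date_taken"].year > 1850]

    # Step 2: drop devices with "scan" in the name (scanned-in photos).
    kept = [r for r in kept if "scan" not in r["device"].lower()]

    # Step 3: bin devices that make up less than `min_share` of photos as "Other".
    counts = Counter(r["device"] for r in kept)
    total = len(kept)
    binned = [
        {**r, "device": r["device"] if counts[r["device"]] / total >= min_share else "Other"}
        for r in kept
    ]

    # Step 4: restrict to the 2000-2014 analysis window.
    return [r for r in binned if 2000 <= r["date_taken"].year <= 2014]
```

At the scale of 100 million records, the same logic would more likely be expressed with a dataframe library or in chunks. The list-based version is only meant to make the order of the steps explicit.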
 

Which Camera Brands Dominate?

As Flickr expanded from the early 2000s onward, Canon took the lead among camera brands. Once the Apple iPhone was introduced in 2007, however, Apple began to take a larger slice of the pie.

 

Clustering Analysis

Given the size of the database, I ran into long runtimes during analysis. To narrow the analysis to a region I could assess, I limited it to the United States and Central America. I used K-Means clustering to break the area up into regions based on the number of photos. To choose the number of clusters, I computed the silhouette score for each cluster count between 2 and 30.

To preserve distinct regions throughout North America while still achieving a reasonable silhouette score, I picked 15 clusters.
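
A silhouette sweep of this kind might look like the sketch below, which uses scikit-learn's `KMeans` and `silhouette_score`. The synthetic coordinates, the helper name `silhouette_by_k`, and the use of plain Euclidean distance on longitude/latitude pairs are assumptions for illustration, not details taken from the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(coords, k_values, seed=0):
    """Return {k: silhouette score} for K-Means fits on (lon, lat) points."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(coords)
        scores[k] = silhouette_score(coords, labels)
    return scores


# Synthetic stand-in for photo coordinates: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[-122.0, 47.0], [-74.0, 40.7], [-90.0, 15.0]])
coords = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

scores = silhouette_by_k(coords, range(2, 8))
best_k = max(scores, key=scores.get)  # cluster count with the highest score
```

The highest-scoring count is only a starting point. As described above, the final choice also weighed how well the clusters matched recognizable regions. On a very large dataset, the `sample_size` argument of `silhouette_score` can keep the pairwise-distance computation manageable.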

[Figure: K-Means clustering with 15 clusters (kmeans_15clusters.png)]

Year by Year

Photos are most concentrated on both coasts and gradually fill in the interior of the continent over time.

 

Linear Regression Analysis for Prediction

Once the points are grouped by cluster, they are sliced by year. The resulting time series show the trend in each region. Each group of five consecutive years is used to predict the following year. On average, the predictions have an R-squared of 86.2% and a root mean square error of 11.9%.
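
One way to read "five years predict the next" is to fit a straight line to each five-year window and extrapolate it one year ahead. The sketch below takes that reading, which is an assumption, and uses a closed-form least-squares fit in plain Python instead of a statistics library. The yearly counts are made up, and the names `fit_line`, `rolling_forecast`, and `rmse` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rolling_forecast(years, counts, window=5):
    """Fit each `window`-year span and extrapolate one year ahead.

    Returns {target_year: predicted_count}; the last entry is the year
    after the series ends.
    """
    forecasts = {}
    for start in range(len(years) - window + 1):
        xs = years[start:start + window]
        ys = counts[start:start + window]
        slope, intercept = fit_line(xs, ys)
        target = xs[-1] + 1
        forecasts[target] = slope * target + intercept
    return forecasts


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


# Made-up yearly photo counts for a single cluster.
years = list(range(2000, 2015))
counts = [120, 150, 210, 260, 340, 450, 610, 820, 990, 1150, 1300, 1280, 1210, 1100, 950]

forecasts = rolling_forecast(years, counts)

# Back-test: compare forecasts against the years that have actual counts.
backtest_years = [y for y in forecasts if y in years]
error = rmse(
    [counts[years.index(y)] for y in backtest_years],
    [forecasts[y] for y in backtest_years],
)
```

Running this per cluster gives one forecast series per region. The back-test error here is in raw photo counts, so it would need to be normalized before being compared with a percentage figure.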

 

So, What Will Happen in 2019?

[Figure: prediction for cluster 0 (prediction_cluster0.jpg)]

Based on the analysis, the Pacific Northwest looks set to remain the most popular location for photographs, a position it has held since 2000.

Hawaii and the South will be the least popular locations.

Central America and California are becoming more trendy.

Next Steps

This analysis is based on a simple K-Means clustering with a tuned number of clusters. The data was then sliced into a simple year-by-year time series and analyzed using linear regression.

More Diversified Data

Using a database of only Flickr photos introduces biases into the data and the prediction. For example, Flickr's relative popularity peaked around 2010-2011 and has noticeably declined since. The rise of other photo-sharing services such as Instagram, Twitter, and Facebook has affected the total number of photos uploaded to Flickr.

To improve the prediction, the information from these sources would need to be added and adjusted. There will continue to be biases based on the demographics of each user base, and how the services are used.

More Models

It would be interesting to apply K-Medians to the area to find the denser locations.

The model used above to predict the number of photos is a linear regression model from statsmodels. I also tried support vector regression and linear support vector regression models as a check. They were less stable given the limited data and produced less accurate forecasts.

Smaller Time Slices

Slicing the time series into months would perhaps increase the noise in the data, but also provide more data points to define the forecast accurately.
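
As a rough sketch of what monthly slicing could look like, the helper below buckets photo dates by (year, month). The function name `monthly_counts` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(dates):
    """Count photos per (year, month) bucket, in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return dict(sorted(counts.items()))


# Made-up capture dates for illustration.
sample_dates = [
    datetime(2010, 1, 5),
    datetime(2010, 1, 20),
    datetime(2010, 2, 1),
    datetime(2009, 12, 31),
]
per_month = monthly_counts(sample_dates)
```

Months with no photos do not appear in the result, so they would need to be filled in with zeros before fitting a regression on the series.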