SmellsLikeML

AHA: Apartment Hunting App

08/13/17

Housing Crisis, Engineering Apartment Hunting

If you are considering a move in the SF Bay Area, plan to start early. The housing market in this part of the country is notoriously competitive. Anecdotes abound of engineers opting to sleep at work. If you don't come to the tour with all your paperwork, you've already missed the deal.

Some time back, a friend forwarded Justin's post. More recently, I found Vik's post in a newsletter and was inspired to build my own scraper to regain the competitive edge.

Taking inspiration from the simplicity of the "Tinder swipe", my partner and I set about creating a similar interface below.

With AWS, text-based data was stored in a DynamoDB table and images were stored in an S3 bucket. This information was loaded into a simple Flask application, and left/right swipes were logged back to DynamoDB.
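As a rough sketch of how a swipe might be written back to DynamoDB, the snippet below builds a record and stores it with boto3. The table name, the attribute names (post_id, direction, swiped_at), and both function names are assumptions made for illustration; the original application's schema is not shown in this post.

```python
import time


def swipe_item(post_id, direction, timestamp=None):
    """Build the record for one swipe. Attribute names are hypothetical."""
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    when = time.time() if timestamp is None else timestamp
    return {"post_id": post_id, "direction": direction, "swiped_at": int(when)}


def log_swipe(table_name, post_id, direction):
    """Write one swipe to a DynamoDB table (requires AWS credentials)."""
    # Imported lazily so swipe_item() can be used without boto3 installed.
    import boto3

    table = boto3.resource("dynamodb").Table(table_name)
    table.put_item(Item=swipe_item(post_id, direction))
```

In a Flask app, log_swipe would presumably be called from the route that handles the swipe gesture.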

Craigslist ads don't conform to a rigid structure, but after gaining familiarity with typical posts you can pick up on common patterns referencing concepts like park, lake, BART, laundry, square footage, etc. Additional information, like drive times to points of interest, can be computed from each listing's lat/long pair using the Google Maps Distance Matrix API.
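To make the pattern idea concrete, here is a minimal sketch that flags a few of the concepts above and pulls out a square footage figure. The keyword list, the regex, and the function name extract_features are my own illustrative choices, not the exact rules used in the app.

```python
import re

# Hypothetical keyword list, taken from the concepts named above.
KEYWORDS = ("park", "lake", "bart", "laundry")

# Matches e.g. "750 sqft", "1,200 sq ft", "900 ft2", "850 square feet".
SQFT_RE = re.compile(
    r"(\d[\d,]*)\s*(?:sq\.?\s*ft\.?|ft2|square\s+feet)",
    re.IGNORECASE,
)


def extract_features(text):
    """Return keyword flags and square footage (or None) for one post."""
    lowered = text.lower()
    features = {
        kw: bool(re.search(r"\b" + kw + r"\b", lowered)) for kw in KEYWORDS
    }
    match = SQFT_RE.search(text)
    features["sqft"] = int(match.group(1).replace(",", "")) if match else None
    return features
```

Word boundaries keep "parking" from counting as "park", which matters a lot in apartment ads.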

For a quick overview of the structure of the posts, we can experiment with word embeddings. Below, I have taken the posts, lowercased all the characters, split each post on whitespace, and trained a Word2Vec model. I set a minimum threshold of 5 appearances in the text, a window of 3, and an embedding dimension of 30, all chosen for the simplicity of a small corpus of rather formulaic text.
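The preprocessing and training step described above might look roughly like this. I am assuming gensim 4.x here, where the dimension argument is called vector_size (older releases call it size); the function names are hypothetical.

```python
def tokenize(posts):
    """Lowercase each post and split it on whitespace."""
    return [post.lower().split() for post in posts]


def train_embeddings(posts):
    """Train Word2Vec with the settings described in the text."""
    # Imported lazily so tokenize() works without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(
        sentences=tokenize(posts),
        vector_size=30,  # embedding dimension
        window=3,        # context window
        min_count=5,     # drop tokens seen fewer than 5 times
    )
```

After training, model.wv.most_similar("laundry") is a quick way to check whether the neighbors make sense.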

Next, I applied t-SNE for dimensionality reduction and visualization. Here I reduced the perplexity parameter from its default down to 5, slightly increased the early_exaggeration parameter to 15, and reduced the learning rate to 10. I used PCA initialization and allowed up to 10000 iterations, with the method set to the more accurate 'exact' and up to 600 iterations allowed without improvement. These choices were made after finding that the default parameter settings spread the projected points equidistantly over a ball. The resulting projection of 500 points communicates much of the intuitive structure of an apartment post.
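For reference, those settings map onto scikit-learn's TSNE roughly as follows. This is a sketch, assuming a 2-D projection and scikit-learn as the implementation. The iteration cap is named max_iter in recent releases and n_iter in older ones, so the code checks which one is available. The function names are hypothetical.

```python
import inspect


def tsne_settings():
    """t-SNE settings from the text; n_components=2 is an assumption."""
    return {
        "n_components": 2,
        "perplexity": 5,
        "early_exaggeration": 15,
        "learning_rate": 10,
        "init": "pca",
        "method": "exact",
        "n_iter_without_progress": 600,
    }


def project(vectors, max_iterations=10000):
    """Project embedding vectors (a 2-D NumPy array) down to 2-D."""
    # Imported lazily so tsne_settings() works without scikit-learn.
    from sklearn.manifold import TSNE

    params = tsne_settings()
    names = inspect.signature(TSNE.__init__).parameters
    params["max_iter" if "max_iter" in names else "n_iter"] = max_iterations
    return TSNE(**params).fit_transform(vectors)
```

Note that method='exact' scales quadratically with the number of points, which is fine for 500 words but not for a large vocabulary.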

Fortunately, there is not a lot of creativity in expressing the idea of "1 bedroom", e.g. 1 bdrm, 1x1, 1 room, etc. While I could use proximity in the embedded space to associate each variant with the "1 bedroom" concept representative, simple regex pattern matching suffices.
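A minimal sketch of that pattern matching, covering the variants listed above plus a few common abbreviations. The regex and the function name bedroom_count are illustrative assumptions, and the "NxM" form is read as bedrooms x bathrooms.

```python
import re

# First alternative: "1 bedroom", "1 bdrm", "2BR", "1 bed", "1 room".
# Second alternative: "1x1", read here as bedrooms x bathrooms.
BEDROOM_RE = re.compile(
    r"\b(\d)\s*(?:bedrooms?|bdrms?|beds?|br|rooms?)\b|\b(\d)\s*x\s*\d\b",
    re.IGNORECASE,
)


def bedroom_count(text):
    """Return the bedroom count found in the text, or None."""
    match = BEDROOM_RE.search(text)
    if match is None:
        return None
    return int(match.group(1) or match.group(2))
```

This only handles single-digit counts written as numerals; spelled-out forms like "one bedroom" would need an extra rule.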

At this point, we could call it a day and start swiping, but I don't want to waste any time setting up a viewing. I made an "About Us" email template and used Amazon's SES to fire away with each right swipe.
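Hooking the email into the swipe could be sketched like this with boto3's SES client. The function names and the way the template is passed in are assumptions; the sender address would need to be verified in SES, and AWS credentials must be configured.

```python
def build_email(sender, recipient, subject, body):
    """Assemble the keyword arguments for the SES send_email call."""
    return {
        "Source": sender,
        "Destination": {"ToAddresses": [recipient]},
        "Message": {
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    }


def send_on_right_swipe(direction, sender, recipient, subject, body):
    """Send the 'About Us' email only when the swipe was to the right."""
    if direction != "right":
        return None
    # Imported lazily so build_email() works without boto3 installed.
    import boto3

    client = boto3.client("ses")
    return client.send_email(**build_email(sender, recipient, subject, body))
```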

Done, right? Well, Craigslist works by showing recent posts at the top. When a home is unoccupied for a while, this means it will likely be reposted. It turns out that some places were posted over 100 times over the course of my data collection.

Roughly 30% of the targeted posts appeared more than once! That is a lot of extra swipes. Instead, I should deduplicate the corpus so I can get back to something more profitable.
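One simple way to deduplicate, sketched below, is to hash a normalized copy of each post body and keep only the first post per hash. This only catches reposts whose text is identical up to case and whitespace; it is an illustrative baseline and the function names are hypothetical.

```python
import hashlib
import re
from collections import Counter


def fingerprint(text):
    """Hash the post after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(posts):
    """Keep the first occurrence of each distinct post, in order."""
    seen = set()
    unique = []
    for post in posts:
        key = fingerprint(post)
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


def repost_rate(posts):
    """Fraction of distinct posts that appear more than once."""
    counts = Counter(fingerprint(post) for post in posts)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c > 1) / len(counts)
```

Reposts with lightly edited text would slip through; catching those would take a fuzzier comparison, such as shingling or comparing the attached images.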

Going beyond an improved application, I want to train an ML algorithm to identify my visual preferences. Then I can rerank the apartments to prioritize those which I am likely to right swipe. Ultimately, I'd like to automate the entire process and simply wait until I hear back about scheduling a viewing. In the companion post, I will set up a classifier to learn my apartment preferences.