Do You Need To Clean An Airbnb

How to deal with messy location and text data in Python

Source: airbnbeazy.com

Project background

Airbnb fascinates me. I previously worked for a year and a half at an Airbnb property management company, as head of the team responsible for pricing, revenue and analysis. One thing I find particularly interesting is how to figure out what price to charge for a listing on the site. Although 'it's a 2 bedroom in Manchester' will get you reasonably far, there are actually a huge number of factors that can influence a listing's price.

As part of a bigger project on using deep learning to predict Airbnb prices, I found myself thrown back into the murky world of property data. Geospatial data can be very complex and messy, and user-entered geospatial data doubly so. This post will explain how I went about sourcing and preparing the data for my project, including some thoughts on dealing with UK geographic data (it's surprisingly complicated) and extracting relevant information from long text strings.

The dataset

The dataset used for this project comes from Insideairbnb.com, an anti-Airbnb lobby group that scrapes Airbnb listings, reviews and calendar data from multiple cities around the world. The dataset was scraped on 9 April 2019 and contains information on all London Airbnb listings that were live on the site on that date (about 80,000).

Cleaning and preparing the data

For the full exciting details (disclosure: level of excitement may vary) of data cleaning, feel free to check out my GitHub repo. In the interests of brevity, I'll discuss three particular areas of data pre-processing that might be of interest.

Features that weren't included (but I would have liked to include)

The original dataset contained 106 features, including quite a few text columns of all the different description fields that you can fill in for an Airbnb listing. Due to time constraints I did not do any natural language processing (NLP) in this model, and so all these features were dropped. However, an interesting avenue of future development for this model would be to augment it with NLP: perhaps for sentiment analysis, or looking for keywords, or some sort of fancy Word2Vec type situation that looks for similar listing descriptions and uses this to help guess price based on similar listings.

Another potential direction of future work could include reviews. Insideairbnb.com also scrapes reviews, which can be matched to listings with their listing IDs. Although most guests tend to give most listings high ratings, more nuanced ratings could perhaps be derived from the reviews themselves.
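A minimal sketch of how that matching might look in pandas. The tiny tables and the column names (`id` on the listings side, `listing_id` and `comments` on the reviews side) are assumptions for illustration, not something confirmed by this post:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the listings and reviews tables.
listings = pd.DataFrame({"id": [101, 102, 103], "price": [80, 120, 65]})
reviews = pd.DataFrame({
    "listing_id": [101, 101, 103],
    "comments": ["Great flat", "Very clean", "Noisy street"],
})

# Count reviews per listing, then attach the counts to the listings table.
review_counts = (
    reviews.groupby("listing_id").size().rename("n_reviews").reset_index()
)
listings_with_reviews = listings.merge(
    review_counts, how="left", left_on="id", right_on="listing_id"
).drop(columns="listing_id")

# Listings with no reviews get a count of zero rather than NaN.
listings_with_reviews["n_reviews"] = (
    listings_with_reviews["n_reviews"].fillna(0).astype(int)
)
```

A left merge keeps every listing, including those without any reviews. The same join would work for any per-listing feature derived from the review text.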

Dealing with London geography (TLDR: London was not mapped with data scientists in mind)

Postcodes in the UK are complicated and messy. They can be various lengths, and consist of letters and numbers in various orders. The first half of a postcode is called the outcode or postcode district, and refers to areas as shown here:

London postcode districts. Source: https://en.wikipedia.org/wiki/London_postal_district

Just to make things more complicated, the main geographic division of London is into 32 boroughs plus the City of London (technically a Corporation rather than a borough due to some quirks of 12th Century English history), which do not align with postcode districts (because that would be too easy, right?):

London postcode districts (red), layered over London boroughs (black lines). Source: https://en.wikipedia.org/wiki/London_postal_district

There also aren't any easy ways of classifying London areas on a less granular level. In fact, there is not even agreement on what counts as 'inner London':

What even is London? Source: https://en.wikipedia.org/wiki/Inner_London

And to make matters even worse, it turns out Airbnb allows hosts to enter postcodes in a free text entry box, precluding any easy separation of parts of postcodes, and allowing hosts to write all kinds of nonsense (my favourite is just the word 'no').
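To give a flavour of the kind of regex clean-up this involves, here is a minimal sketch that tries to pull the outcode out of a free-text postcode. The pattern and the helper name `extract_outcode` are illustrative assumptions, not the expressions actually tried in the project, and the pattern only covers the common postcode shapes:

```python
import re
from typing import Optional

# Outcode: 1-2 letters, a digit, then an optional digit or letter (e.g. E1, SW1A, N16).
# Incode (optional here): a digit followed by two letters (e.g. 6AN).
POSTCODE_RE = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})?$")


def extract_outcode(raw: Optional[str]) -> Optional[str]:
    """Return the outcode of a free-text postcode, or None if it doesn't look like one."""
    if not isinstance(raw, str):  # covers None and NaN from missing values
        return None
    match = POSTCODE_RE.match(raw.strip().upper())
    return match.group(1) if match else None


# Hypothetical examples of what hosts might type into the box.
samples = ["SE1 9SG", "sw1a1aa", " e1 6an ", "no", None]
outcodes = [extract_outcode(value) for value in samples]
```

Even a pattern like this leaves ambiguous cases (is 'E16' the district E16, or E1 followed by a truncated second half?), which is part of why free-text postcodes are so painful to work with.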

In the end, after discarding a bunch of regex experimentation with postcodes, I settled on using borough as the unit of geography. Location is very important for Airbnb listings, and so I was not entirely happy about having to use borough. It is not on a particularly fine-grained level, and does not always express well whether a property is in central London or out in the sticks, which makes a huge difference to price. For example, the famous Shard skyscraper is in Southwark, but so is Dulwich, where the tube doesn't even reach (disclaimer: Dulwich is actually lovely, but is probably less well known to tourists in London).

Both in Southwark, but perhaps with different Airbnb prices? Left: the Shard (source: https://www.visitbritainshop.com/world/the-view-from-the-shard/). Right: Dulwich high street (source: https://de.wikipedia.org/wiki/Dulwich_(London)).

I did also experiment with using latitude and longitude instead of borough in order to get more fine-grained results, but as a future blog post will show, it was not entirely successful.

Amenities (so very many amenities)

In the dataset from Insideairbnb.com, amenities were stored as one big block of text.

In order to figure out what the various options were and which listings had them, I first made a giant string of all the amenities values, tidied it up a bit, split out the individual amenities separated by commas, and created a set of the resultant list. Fortunately the dataset was small enough to allow this, but I would have needed a more efficient way to do this with a much larger dataset.
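Those steps can be sketched roughly as follows. The sample strings, and in particular the braces-and-quotes layout of the raw text, are assumptions for illustration:

```python
# Hypothetical sample of raw amenities strings; the braces-and-quotes layout is an assumption.
raw_amenities = [
    '{TV,"Cable TV",Wifi,Kitchen}',
    '{Wifi,"Air conditioning",Kitchen,"Hair dryer"}',
    '{}',
]

# 1. Join every listing's amenities into one giant string.
giant_string = ",".join(raw_amenities)

# 2. Tidy it up: strip the braces and double quotes.
for char in '{}"':
    giant_string = giant_string.replace(char, "")

# 3. Split on commas, and 4. keep the unique, non-empty values.
amenities_set = {item.strip() for item in giant_string.split(",") if item.strip()}
```

This relies on no amenity name containing a comma. For a much larger dataset, updating the set one listing at a time would avoid building the giant intermediate string.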

And here's a list of all the amenities it is possible to have:

'24-hour check-in',
'Accessible-height bed',
'Accessible-height toilet',
'Air conditioning',
'Air purifier',
'Alfresco bathtub',
'Amazon Echo',
'Apple TV',
'BBQ grill',
'Baby bath',
'Baby monitor',
'Babysitter recommendations',
'Balcony',
'Bath towel',
'Bathroom essentials',
'Bathtub',
'Bathtub with bath chair',
'Beach essentials',
'Beach view',
'Beachfront',
'Bed linens',
'Bedroom comforts',
'Bidet',
'Body soap',
'Breakfast',
'Breakfast bar',
'Breakfast table',
'Building staff',
'Buzzer/wireless intercom',
'Cable TV',
'Carbon monoxide detector',
'Cat(s)',
'Ceiling fan',
'Ceiling hoist',
'Central air conditioning',
'Changing table',
"Chef's kitchen",
"Children's books and toys",
"Children's dinnerware",
'Cleaning before checkout',
'Coffee maker',
'Convection oven',
'Cooking basics',
'Crib',
'DVD player',
'Day bed',
'Dining area',
'Disabled parking spot',
'Dishes and silverware',
'Dishwasher',
'Dog(s)',
'Doorman',
'Double oven',
'Dryer',
'EV charger',
'Electric profiling bed',
'Elevator',
'En suite bath',
'Espresso machine',
'Essentials',
'Ethernet connection',
'Exercise equipment',
'Extra pillows and blankets',
'Family/kid friendly',
'Fax machine',
'Fire extinguisher',
'Fire pit',
'Fireplace guards',
'Firm mattress',
'First aid kit',
'Fixed grab bars for shower',
'Fixed grab bars for toilet',
'Flat path to front door',
'Formal dining area',
'Free parking on premises',
'Free street parking',
'Full kitchen',
'Game console',
'Garden or backyard',
'Gas oven',
'Ground floor access',
'Gym',
'HBO GO',
'Hair dryer',
'Hammock',
'Handheld shower head',
'Hangers',
'Heat lamps',
'Heated floors',
'Heated towel rack',
'Heating',
'High chair',
'High-resolution computer monitor',
'Host greets you',
'Hot tub',
'Hot water',
'Hot water kettle',
'Indoor fireplace',
'Internet',
'Iron',
'Ironing Board',
'Jetted tub',
'Keypad',
'Kitchen',
'Kitchenette',
'Lake access',
'Laptop friendly workspace',
'Lock on bedroom door',
'Lockbox',
'Long term stays allowed',
'Luggage dropoff allowed',
'Memory foam mattress',
'Microwave',
'Mini fridge',
'Mobile hoist',
'Mountain view',
'Mudroom',
'Murphy bed',
'Netflix',
'Office',
'Other',
'Other pet(s)',
'Outdoor kitchen',
'Outdoor parking',
'Outdoor seating',
'Outlet covers',
'Oven',
"Pack 'n Play/travel crib",
'Paid parking off premises',
'Paid parking on premises',
'Patio or balcony',
'Pets allowed',
'Pets live on this property',
'Pillow-top mattress',
'Pocket wifi',
'Pool',
'Pool cover',
'Pool with pool hoist',
'Printer',
'Private bathroom',
'Private entrance',
'Private gym',
'Private hot tub',
'Private living room',
'Private pool',
'Projector and screen',
'Propane barbeque',
'Rain shower',
'Refrigerator',
'Roll-in shower',
'Room-darkening shades',
'Safe',
'Safety card',
'Sauna',
'Security system',
'Self check-in',
'Shampoo',
'Shared gym',
'Shared hot tub',
'Shared pool',
'Shower chair',
'Single level home',
'Ski-in/Ski-out',
'Smart TV',
'Smart lock',
'Smoke detector',
'Smoking allowed',
'Soaking tub',
'Sound system',
'Stair gates',
'Stand alone steam shower',
'Standing valet',
'Steam oven',
'Step-free access',
'Stove',
'Suitable for events',
'Sun loungers',
'TV',
'Table corner guards',
'Tennis court',
'Terrace',
'Toilet paper',
'Touchless faucets',
'Walk-in shower',
'Warming drawer',
'Washer',
'Washer / Dryer',
'Waterfront',
'Well-lit path to entrance',
'Wheelchair accessible',
'Wide clearance to bed',
'Wide clearance to shower',
'Wide doorway',
'Wide entryway',
'Wide hallway clearance',
'Wifi',
'Window guards',
'Wine cooler',
'toilet',

In the list above, some amenities are more important than others (e.g. a balcony is more likely to increase price than a fax machine), and some are likely to be fairly uncommon (e.g. 'Electric profiling bed'). Based on previous experience in the industry, and further research into which amenities guests consider more important, I extracted a selection of the more important amenities. From these, I then chose which to include in the final model depending on how sparse the data was. For example, if it turns out that almost all properties have (or do not have) a particular amenity, that feature will not be very useful in differentiating between listings or helping explain differences in prices.

The whole convoluted code for this can be found on GitHub, but the final step was to remove columns where over 90% of the listings either had or did not have a particular amenity.
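A minimal sketch of that kind of filter, assuming the amenities have already been turned into 0/1 columns. The column names, the toy data and the `THRESHOLD` constant are hypothetical and are not taken from the GitHub code:

```python
import pandas as pd

# Hypothetical binary amenity columns (1 = listing has the amenity) for ten listings.
df = pd.DataFrame({
    "balcony":     [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],  # 40% have it: keep
    "wifi":        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # 100% have it: drop
    "fax_machine": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # 0% have it: drop
    "parking":     [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],  # 60% have it: keep
})

THRESHOLD = 0.9  # drop a column when more than 90% of listings share one value

# For 0/1 columns the mean is the share of listings that have the amenity.
share_with = df.mean()
too_sparse = share_with[(share_with > THRESHOLD) | (share_with < 1 - THRESHOLD)].index

df_filtered = df.drop(columns=too_sparse)
```

Working from the column means keeps the check to a single pass over the data, and the threshold is easy to tune if 90% turns out to be too strict or too lenient.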

These are the amenities that I ended up keeping:

  • Balcony
  • Bed linen
  • Breakfast
  • TV
  • Coffee machine
  • Basic cooking equipment
  • White goods (specifically a washer, dryer and/or dishwasher)
  • Child-friendly
  • Parking
  • Outdoor space
  • Greeted by host
  • Internet
  • Long term stays allowed
  • Pets allowed
  • Private entrance
  • Safe or security system
  • Self check-in
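Several of the features above combine more than one raw amenity (white goods, parking, TV). One way to build such grouped 0/1 columns is sketched below. The feature names, the groupings and the sample strings are hypothetical, and this is not necessarily how the original code did it:

```python
import re

import pandas as pd

# Hypothetical raw amenities strings, one per listing.
df = pd.DataFrame({"amenities": [
    '{TV,Wifi,Washer,"Free parking on premises"}',
    '{Wifi,Dishwasher,"Patio or balcony"}',
    '{"Cable TV",Kitchen}',
]})

# Map each (hypothetical) feature name to the raw amenity names that count towards it.
feature_groups = {
    "white_goods": ["Washer", "Dryer", "Dishwasher"],
    "parking": ["Free parking on premises", "Free street parking"],
    "tv": ["TV"],
}

for feature, names in feature_groups.items():
    # Escape each name so characters like '(' or '/' are matched literally.
    pattern = "|".join(re.escape(name) for name in names)
    df[feature] = df["amenities"].str.contains(pattern, regex=True).astype(int)
```

Note that this is case-sensitive substring matching, so 'TV' also picks up 'Cable TV' and 'Smart TV'. That is convenient for a grouped TV feature, but worth checking for each group.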

Summary

After these (and many other) cleaning and pre-processing steps, the Airbnb data was in a suitable form to begin exploration and modelling. You can read more about this in my next post on data exploration, and in another post I wrote about building a predictive model.

If you found this post interesting or helpful, please let me know via the medium of claps and/or comments, and you can follow me in order to be notified about future posts. Thanks for reading!

Source: https://towardsdatascience.com/predicting-airbnb-prices-with-deep-learning-part-1-how-to-clean-up-airbnb-data-a5d58e299f6c

Posted by: archielablight.blogspot.com
