An exploration of data from OpenStreetMap (OSM) and FourSquare, together with an analysis of the same places on the ground, showed discrepancies between addresses, names, and actual locations. In other words, the same place had different names in different data sources, so the data needs to be normalized.
Summary metrics:
Here, I list the data collected from our sources:
Technique Used:
We use three features for merging the data.
- Location
- Name Matching
- Address Matching
A brief summary of how each of these fields is used, along with the important chunks of code, follows:
Location:
As a first step, we loop over the FourSquare entries and fetch all places within 200 meters of the current FourSquare entry. This is achieved using a query similar to:
# Point for the current FourSquare entry as WKT (longitude first), SRID 4326
pt <- paste0("ST_GeomFromText('POINT(", placeFourData$long[i], " ",
             placeFourData$lat[i], ")', 4326)")
# All places within distance d of that point, nearest first
query2 <- paste0(
  "SELECT *, ",
  "ST_Distance(", pt, ", place_geometry) AS dist, ",
  "ST_AsText(ST_GeometryN(place_geometry, 1)) AS geom ",
  "FROM place ",
  "WHERE ST_DWithin(", pt, ", place_geometry, ", d, ") ",
  "ORDER BY ST_Distance(", pt, ", place_geometry);")
where d is the search distance in the units of the geometry's coordinate system (degrees for SRID 4326, not radians), and placeFourData$long[i] and placeFourData$lat[i] are the longitude and latitude of the current FourSquare location. The ORDER BY clause returns the place closest to our point of interest first. The distance is also returned as the column “dist” for further use. A confidence interval depending on the least euclidean
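Since the query compares geometries in SRID 4326, a 200-meter radius has to be expressed in degrees before it is passed as d. The following is only a rough sketch: metersToDegrees is a hypothetical helper, not part of the original code, and the figure of about 111,320 meters per degree is an approximation (a degree of longitude is shorter away from the equator).

```r
# Hypothetical helper: approximate conversion of a radius in meters to degrees.
# Assumes roughly 111,320 meters per degree, which is only an approximation.
metersToDegrees <- function(meters, metersPerDegree = 111320) {
  meters / metersPerDegree
}

# Search radius of 200 meters, expressed in degrees (roughly 0.0018)
d <- metersToDegrees(200)
print(d)
```

If exact metric distances matter, casting to PostGIS geography or projecting to a metric coordinate system would be a more robust choice than this approximation.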
Name Matching:
Names in FourSquare are taken as the base for name matching. Each FourSquare name and the names of the nearby places are tokenized and part-of-speech tagged, and conjunctions and determiners are removed before comparison. Optimal string alignment with a maximum distance of two is used to match each token of the FourSquare name against the names in the OSM data. A confidence level is calculated by dividing the number of votes received by each OSM name by the number of words in that name; in the example below, "Rose and Crown" receives 2 votes for its 3 words, giving 0.67. An example of the function is shown below:
FourSquareName <- "The Rose & Crown Pub"
OSMName <- c("Santa Clara County",
"NA","East Palo Alto",
"Rose and Crown","Thaiphoon",
"The Goldsmith")
print(nameMatch2(FourSquareName,OSMName))
## FourSquareName1 osmName combVotes probName
## 1 The Rose & Crown Pub Santa Clara County 0 0.0000000
## 2 The Rose & Crown Pub NA 0 0.0000000
## 3 The Rose & Crown Pub East Palo Alto 0 0.0000000
## 4 The Rose & Crown Pub Rose and Crown 2 0.6666667
## 5 The Rose & Crown Pub Thaiphoon 0 0.0000000
## 6 The Rose & Crown Pub The Goldsmith 0 0.0000000
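The source of nameMatch2 is not shown here, so the following is only a minimal sketch of the idea in base R, under several assumptions: a small hard-coded stop-word list stands in for part-of-speech tagging, a vote is counted for each OSM token that lies within the maximum distance of some FourSquare token, and the divisor is the number of words in the raw OSM name (inferred from the 2/3 in the output above). The names osaDistance, tokenize, stopWords and nameMatchSketch are hypothetical.

```r
# Optimal string alignment distance: edit distance that also allows
# swapping two adjacent characters at a cost of 1.
osaDistance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x); m <- length(y)
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # substitution
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        # transposition of two adjacent characters
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1L)
      }
    }
  }
  d[n + 1, m + 1]
}

# Lower-case the name and split it on anything that is not a letter or digit.
tokenize <- function(name) {
  tokens <- strsplit(tolower(name), "[^a-z0-9]+")[[1]]
  tokens[nzchar(tokens)]
}

# Assumption: a fixed list of determiners and conjunctions replaces POS tagging.
stopWords <- c("the", "a", "an", "and", "or", "but")

nameMatchSketch <- function(fourSquareName, osmNames, maxDist = 2) {
  fsTokens <- setdiff(tokenize(fourSquareName), stopWords)
  rows <- lapply(osmNames, function(osmName) {
    osmAll <- tokenize(osmName)
    osmTokens <- setdiff(osmAll, stopWords)
    # One vote per OSM token that is close to at least one FourSquare token
    votes <- sum(vapply(osmTokens, function(o) {
      any(vapply(fsTokens, function(f) osaDistance(f, o) <= maxDist, logical(1)))
    }, logical(1)))
    prob <- if (length(osmAll) > 0) votes / length(osmAll) else 0
    data.frame(FourSquareName = fourSquareName, osmName = osmName,
               combVotes = votes, probName = prob)
  })
  do.call(rbind, rows)
}

print(nameMatchSketch("The Rose & Crown Pub", c("Rose and Crown", "Thaiphoon")))
```

With these assumptions, "Rose and Crown" gets 2 votes and a score of 2/3, while "Thaiphoon" gets none. A distance threshold of two is generous for very short tokens, so a real implementation may want to scale it with token length.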
Address Matching:
The third and final metric used for matching FourSquare data with OSM data is the address. OSM entries that contain an address have the house number as their first field, so we only take the house number into account when matching. Since we only match against places that are located close by, they are on the same street with high probability. A pseudo-example of this function in action is shown below:
OSMAddresses <- c("NA","NA","547 CA,
United States","NA","NA","543 CA, United States",
"541 CA, United States",
"215 CA, United States",
"538 Ramona Street, CA, United States")
fourSquareAddress <- "547 Emerson St"
addressMatch(fourSquareAddress,OSMAddresses)
## FourAddress AddressRawMatch probMatch
## 1 547 Emerson St NA 0.0000000
## 2 547 Emerson St NA 0.0000000
## 3 547 Emerson St 547 CA,\n United States 1.0000000
## 4 547 Emerson St NA 0.0000000
## 5 547 Emerson St NA 0.0000000
## 6 547 Emerson St 543 CA, United States 0.3333333
## 7 547 Emerson St 541 CA, United States 0.0000000
## 8 547 Emerson St 215 CA, United States 0.0000000
## 9 547 Emerson St 538 Ramona Street, CA, United States 0.0000000
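The body of addressMatch is not shown, so the sketch below only illustrates the house-number idea. It assumes a scoring rule in which the score falls off linearly with the gap between house numbers and reaches zero at a gap of 6; that rule is a guess that happens to agree with the sample output above and is not confirmed as the original method. The names leadingNumber, addressMatchSketch and maxGap are hypothetical.

```r
# Extract the leading house number of an address, or NA if there is none.
leadingNumber <- function(address) {
  m <- regmatches(address, regexpr("^[[:space:]]*[0-9]+", address))
  if (length(m) == 0) NA_integer_ else as.integer(trimws(m))
}

# Assumed scoring: 1 for identical house numbers, decreasing linearly
# to 0 once the numbers are maxGap or more apart.
addressMatchSketch <- function(fourSquareAddress, osmAddresses, maxGap = 6) {
  target <- leadingNumber(fourSquareAddress)
  numbers <- vapply(osmAddresses, leadingNumber, integer(1), USE.NAMES = FALSE)
  prob <- pmax(0, 1 - abs(numbers - target) / maxGap)
  prob[is.na(prob)] <- 0  # entries without a house number cannot match
  data.frame(FourAddress = fourSquareAddress,
             AddressRawMatch = osmAddresses,
             probMatch = prob)
}

print(addressMatchSketch("547 Emerson St",
                         c("NA", "547 CA,\n United States",
                           "543 CA, United States", "541 CA, United States")))
```

Under this assumed rule the four candidates score 0, 1, 1/3 and 0 respectively.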