Intro to Google Refine
- https://www.youtube.com/watch?v=B70J_H_zAWM
- https://www.youtube.com/watch?v=cO8NVCs_Ba0
- https://www.youtube.com/watch?v=5tsyz3ibYzk
Workshop by John Little
- Create Project > Web Addresses (URLs) > https://raw.githubusercontent.com/colinrobinsonuib/openrefine-workshop/refs/heads/main/data/building-permits.csv
- Click
Next >> - Select: Columns are separated by
commas (CSV) - Change the Project Name to
Building Permitsand clickCreate Project >>(top-right)
- How many rows are in this data set?
-
Answer
21,982
-
- Click the column header:
authorized_work > Facet > Text facet - How many facets are there?
- You may see a program warning, change the maximum number of choices show in the text facet to 8000 and proceed.
-
Answer
7633
- In the Facet box, click count
- What is the 4th most popular type of authorized work?
-
Answer
SCREEN PORCH
-
- How many permits (rows) exist for that type fo authorized work??
-
Answer
233
-
- What is the 4th most popular type of authorized work?
- To find spelling clusters, first make sure none of the facets are selected. Click the
Clusterbutton in the facet box. ClickClusteragain. - Click the Select All button, then the Merge Selected & Re-Cluster button, to merge all terms (accepting the default: Method =
key collision; Keying Function =fingerprint) - Repeat previous step using the
ngram-fingerprintKeying Function, then close the Cluster & Edit dialog box - How many SCREEN PORCH facets now exist?
-
Answer
238
-
- Select the DECK Facet
- Select the SCREEN PORCH facet combined with the DECK Facet. (Hover your mouse over the facet, click include)
- How many matching rows exist?
-
Answer
578
-
- On the "county" column, make a text facet and limit to Durham County (DURH)
- How many "Single Family" homes received permits in Durham County for Screen Porches or Decks?
-
Answer
3
-
- Make a Text Facet on the
work_type_descriptioncolumn - There are two facets for new buildings:
NEW BUILDINGandNew Building - How many
NEW BUILDINGrows exist?-
Answer
3
-
- How many
New Buildingrows exist?-
Answer
9668
-
- Select
NEW BUILDINGfacet, limiting to 3 matching rows. To the right of theNEW BUILDINGfacet, hover your mouse over theeditfeature; clickeditand alter the text to:New Building - How many
New Buildingrows exist?-
Answer
9671
-
- Mass edit
OTHER&Otherso they have the same value - Mass edit
ALTERATIONS/REPAIRSandAlterations/repairsso they have the same value - How many Facet choices exist?
-
Answer
15
-
address> Edit column > Split into several columns…- Separator =
(
- Separator =
address 2> Edit column > Split into several columns…- Separator =
,
- Separator =
address 2 1> Edit column > rename this column >latitudeaddress 2 2> Edit column > rename this column >longitudelongitude> Edit cells > Transform >value.replace(")", "")longitude> Edit cells > Commons transforms > To number
- Create Project > Web Addresses (URLs) > https://raw.githubusercontent.com/colinrobinsonuib/openrefine-workshop/refs/heads/main/data/police-incidents.csv
- Click
Next >> - Select: Columns are separated by
commas (CSV) - Change the Project Name to
Police Indicentsand clickCreate Project >>(top-right)
Question: Which year is missing the most location data?
- Isolate data: you have incident data without location information
- Location > Facet > Customized facets > Facet by blank
- Click true (True = cell is blank ; False = cell has data)
- How many matching rows have no location data?
- Explanation for the next step:
You’ll create a custom facet (by Year) without altering any data. To do this, you’ll use a very simple expression which takes a slice of data based on fixed position, starting in position 6 (counting begins at 0) and ending four character positions later at 10 (6 + 4=10). Given the data, find the year value within the “INC DATETIME” field.
- To take the year "2014", use expression:
value[6,10] - To take the hour, use
value[11,13] - To take PM (or AM), use
value[20,22]
- Create a Custom Facet to Sort by Year
INC DATETIME> Facet >Custom text facet- Expression = value[6,10]
- Which year is missing the most location data?
-
Answer
2005
-
- Keep only the data with location values, i.e. delete all rows that have no location data
- Remove ("X out") the Year facet
- All > Edit rows > Remove all matching rows
- Remove( "X out") the Location facet
- How many rows remain?
-
Answer
75967
-
Learn more about regular expressions here and test them out here
Create a new column to extract the year from the INC DATETIME column:
INC DATETIME> Edit column > Add column based on this column- New Column Name = YEAR
- Expression =
value.match(/.*\/(\d\d\d\d).*/)[0] - Click
OK
Question: What is the most common bicycle incident?
Text Filter: Explore your data to find how many incident descriptions involve bicycles
LCR DESC> Text filter- Enter:
bicycle - How many rows match?
-
Answer
251
-
- Enter:
Facet: Make a Text Facet on the LCR DESC column
LCR DESC> Facet > Text Facet- How many bicycle categories are there?
-
Answer
6
-
- What is the most common bicycle incident?
-
Answer
LARCENY/BICYCLES ($50-$199) - 119
-
- How many bicycle categories are there?
Question: How many bicycle incident were not felonies?
FELONY> Facet > Text facetFELONY> Edit cells > Cluster and edit- Method =
key collision; Key Function =metaphone3 - For each row, check "Merge?" and change the "New Cell Value"
- to
Felonyin the first row - to
Not Felonyin the second row
- to
- click the "Merge Selected & Re-Cluster" button
- Try other "Methods" and “Keying Functions”. “Merge Selected & Re-Cluster” after each operation
- Method =
- How many bicycle incidents were not felonies?
-
Answer
There are 234 items after clustering using both Methods "key collision" and "nearest neighbor"
-
- Create Project > Web Addresses (URLs) > https://raw.githubusercontent.com/colinrobinsonuib/openrefine-workshop/refs/heads/main/data/authors.csv
- Click
Next >> - Select: Columns are separated by
commas (CSV) - Change the Project Name to
Authorsand clickCreate Project >>(top-right)
Question: How old were each of these authors when they died?
name > Reconcile > Start reconciling
The first time you use this feature, you will need to add a new service. Click "Add Standard Service" and enter the URL: https://wikidata.reconci.link/en/api
- Select the service >
Next >> - Pick an instance type to reconcile against:
human (Q5) - Start reconciling
- Facet by
noneto see which authors were not matched - Select
Search for matchunder each unmatched author and find a match - Remove all facets
name> Edit column > Add columns from reconciled values- Add property
date of birth - Add property
date of death - Click
OK
date of death> Edit column > Add column based on this column- New Column Name =
age - Expression =
diff(value, cells["date of birth"].value, "years") - Click
OK
Use the current time as the date of death for alive authors
- Undo creating the
Agecolumn - Facet
date of deathby blank - Select all blank rows
date of death> Edit cells > Transform- Expression =
now() - Remove all facets
- Create the
Agecolumn as specified above