(English version)
Neulich mußte ich eine JSON-Datenstruktur aus einer Web-Quelle holen und in die Datenbank einfügen. Das Besondere an diesen Daten ist, daß sie einfach in einem zweidimensionalen Array liegen, die Webseite interpretiert die Daten selbst. Es sind also keine Feldnamen im JSON.
In R ging es recht einfach:
require(rjson)
# JSON-Array laden, List daraus machen
global2k_lst <- fromJSON(file="http://www.forbes.com/ajax/load_list/?type=organization&uri=global2000&year=2014");
# Die Liste in einen eindimensionalen Vektor konvertieren
global2k <- unlist(global2k_lst)
# Durch die Angabe der Dimensionen wird eine Matrix daraus
dim(global2k) <- c(12,2000)
# Aus der Matrix dann ein Data Frame machen und die Felder benennen
global2k <- as.data.frame(t(global2k))
names(global2k) <- c("rank", "urlname", "name", "logoname", "sort",
"sales", "profits", "assets", "marketvalue",
"country", "industry", "us_state")
write.csv2(global2k, "/tmp/forbes2000.csv", row.names=FALSE)
Andererseits hat PostgreSQL seit Version 9.3 viele JSON-Funktionen. Wie funktioniert es also direkt in der Datenbank?
Erst mal habe ich den ganzen JSON-String in eine Tabelle ins Feld „forbes2k_json“ eingelesen. (PostgreSQL hat ja auch mit 244 kByte Daten in einem Feld kein Problem.) Der Inhalt des Arrays läßt sich dann mit PostgreSQL-Mitteln mit etwas Aufwand als Tabelle ausgeben:
-- JSON-Array in 2000 Zeilen aufspalten
with lines as (
select json_array_elements(forbes2k_json::json) as company
from forbes2k_json
),
-- Die Elemente der Zeilen in jeweils 12 Zeilen aufspalten, die Original-Zeilen nummerieren
entries as (
select json_array_elements(company)::text as entry,
row_number() over () as entrynr
from lines
),
-- Die Felder nummerieren
fields as (
select *,
row_number() over (partition by entrynr) as fieldnr
from entries
)
-- Mit 12 self joins die Tabellenstruktur aufbauen und die Felder konvertieren
select f1.entry::integer as rank,
regexp_replace(f2.entry::text, '^"(.*)"$', '\1') as urlname,
regexp_replace(f3.entry::text, '^"(.*)"$', '\1') as company,
regexp_replace(f4.entry::text, '^"(.*)"$', '\1') as logoname,
f5.entry::integer as sort,
regexp_replace(regexp_replace(f6.entry::text, ',', ''), '^"(.*)"$', '\1')::numeric(10,3) as sales,
regexp_replace(regexp_replace(f7.entry::text, ',', ''), '^"(.*)"$', '\1')::numeric(10,3) as profits,
regexp_replace(regexp_replace(f8.entry::text, ',', ''), '^"(.*)"$', '\1')::numeric(10,3) as assets,
regexp_replace(regexp_replace(f9.entry::text, ',', ''), '^"(.*)"$', '\1')::numeric(10,3) as marketvalue,
regexp_replace(f10.entry::text, '^"(.*)"$', '\1') as country,
regexp_replace(f11.entry::text, '^"(.*)"$', '\1') as industry,
regexp_replace(f12.entry::text, '^"(.*)"$', '\1') as us_state
from fields f1
inner join fields f2 on f2.entrynr = f1.entrynr and f2.fieldnr = 2
inner join fields f3 on f3.entrynr = f1.entrynr and f3.fieldnr = 3
inner join fields f4 on f4.entrynr = f1.entrynr and f4.fieldnr = 4
inner join fields f5 on f5.entrynr = f1.entrynr and f5.fieldnr = 5
inner join fields f6 on f6.entrynr = f1.entrynr and f6.fieldnr = 6
inner join fields f7 on f7.entrynr = f1.entrynr and f7.fieldnr = 7
inner join fields f8 on f8.entrynr = f1.entrynr and f8.fieldnr = 8
inner join fields f9 on f9.entrynr = f1.entrynr and f9.fieldnr = 9
inner join fields f10 on f10.entrynr = f1.entrynr and f10.fieldnr = 10
inner join fields f11 on f11.entrynr = f1.entrynr and f11.fieldnr = 11
inner join fields f12 on f12.entrynr = f1.entrynr and f12.fieldnr = 12
where f1.fieldnr = 1
order by rank, sort
Kompliziert ist es nicht, nur viel zu tippen (und copy/paste-en). Da die Daten selbst keine Feld-Metadaten wie Name und Typ enthalten, muß man sie selbst festlegen.
Processing a JSON array in PostgreSQL
I needed to get a JSON data structure from the web and put its contents into a database. These data were a bit special in the sense that the JSON only contained only a two dimensional array interpreted by the website, without any field names.
In R it was quite easy:
require(rjson)
# Load JSON array and convert to a list
global2k_lst <- fromJSON(file="http://www.forbes.com/ajax/load_list/?type=organization&uri=global2000&year=2014");
# Convert to a one-dimensional vector
global2k <- unlist(global2k_lst)
# Change to a matrix by specifying the dimensions
dim(global2k) <- c(12,2000)
# Convert the matrix to a data frame and specify the field names
global2k <- as.data.frame(t(global2k))
names(global2k) <- c("rank", "urlname", "name", "logoname", "sort",
"sales", "profits", "assets", "marketvalue",
"country", "industry", "us_state")
write.csv2(global2k, "/tmp/forbes2000.csv", row.names=FALSE)
On the other hand, PostgreSQL has lots of JSON functions since release 9.3. So how to do it directly in the database?
The first step is reading the entire JSON string into a table in a field called „forbes2k_json“. (PostgreSQL can easily store 244 kBytes of data in a single field.) A not too complex query converts the array contents into a normal table:
-- JSON array into 2000 lines
with lines as (
select json_array_elements(forbes2k_json::json) as company
from forbes2k_json
),
-- Existing lines converted into 12 lines each, with an additional row number for
-- the original lines
entries as (
select json_array_elements(company)::text as entry,
row_number() over () as entrynr
from lines
),
-- Row numbers of the entry lines
fields as (
select *,
row_number() over (partition by entrynr) as fieldnr
from entries
)
-- 12 self joins for building the table structure; convert field types
select f1.entry::integer as rank,
regexp_replace(f2.entry::text, '^"(.*)"$', '\1') as urlname,
regexp_replace(f3.entry::text, '^"(.*)"$', '\1') as company,
regexp_replace(f4.entry::text, '^"(.*)"$', '\1') as logoname,
f5.entry::integer as sort,
regexp_replace(regexp_replace(f6.entry::text, ',', ''), '^"(.*)"$', '\1')::numeric(10,3) as sales,
regexp_replace(regexp_replace(f7.entry::text, ',', ''), '^"(.*)"$', '\1')::numeric(10,3) as profits,
regexp_replace(regexp_replace(f8.entry::text, ',', ''), '^"(.*)"$', '\1')::numeric(10,3) as assets,
regexp_replace(regexp_replace(f9.entry::text, ',', ''), '^"(.*)"$', '\1')::numeric(10,3) as marketvalue,
regexp_replace(f10.entry::text, '^"(.*)"$', '\1') as country,
regexp_replace(f11.entry::text, '^"(.*)"$', '\1') as industry,
regexp_replace(f12.entry::text, '^"(.*)"$', '\1') as us_state
from fields f1
inner join fields f2 on f2.entrynr = f1.entrynr and f2.fieldnr = 2
inner join fields f3 on f3.entrynr = f1.entrynr and f3.fieldnr = 3
inner join fields f4 on f4.entrynr = f1.entrynr and f4.fieldnr = 4
inner join fields f5 on f5.entrynr = f1.entrynr and f5.fieldnr = 5
inner join fields f6 on f6.entrynr = f1.entrynr and f6.fieldnr = 6
inner join fields f7 on f7.entrynr = f1.entrynr and f7.fieldnr = 7
inner join fields f8 on f8.entrynr = f1.entrynr and f8.fieldnr = 8
inner join fields f9 on f9.entrynr = f1.entrynr and f9.fieldnr = 9
inner join fields f10 on f10.entrynr = f1.entrynr and f10.fieldnr = 10
inner join fields f11 on f11.entrynr = f1.entrynr and f11.fieldnr = 11
inner join fields f12 on f12.entrynr = f1.entrynr and f12.fieldnr = 12
where f1.fieldnr = 1
order by rank, sort