Local Install
- Prerequisites:
  - a UNIX-like environment (e.g. Linux, BSD, win32+msys/cygwin, ...)
  - perl (debian package "perl")
  - cpanm (debian package "cpanminus"); install this first
- install DiaCollo and any missing dependencies with:
$ cpanm DiaColloDB DiaColloDB::WWW
- test your installation:
$ dcdb-create.perl --version
dcdb-create.perl version 0.12.006 by Bryan Jurish
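Before going further, it can help to confirm that the tools named above are actually on your PATH. The following is a minimal sketch; `check_tools` is a hypothetical helper written for this illustration and is not part of DiaColloDB.

```shell
#!/bin/sh
# check_tools is a hypothetical helper (not part of DiaColloDB):
# it reports whether each named tool can be found on PATH and
# returns non-zero if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Tools named in this section; the hint is printed if any is missing.
check_tools perl cpanm dcdb-create.perl || echo "install the missing tools before continuing"
```

If `dcdb-create.perl` is reported missing after a successful `cpanm` run, the directory into which cpanm installs scripts is probably not on your PATH.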
Local Corpus Data
Download Corpus Data
... e.g. from the Deutsches Textarchiv
- Download each work as "TCF (tokenisiert, serialisiert, lemmatisiert, normalisiert)", i.e. tokenized, serialized, lemmatized, normalized; e.g. for BOOK=kant_aufklaerung_1784:
- create a file-list; e.g. if you saved all and only the corpus TCF files in directory "mycorpus.d", run:
$ find mycorpus.d -name '*.tcf' -print > mycorpus.files
- ... or just grab and unpack the pre-compiled toy data archive (Kant + Hegel + Marx + Nietzsche)
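If you built the file list yourself with find, a quick sanity check before indexing can catch interrupted downloads. The sketch below is only an illustration: `check_filelist` is a hypothetical helper, and the demo uses placeholder files in a scratch directory rather than real TCF data.

```shell
#!/bin/sh
# check_filelist is a hypothetical helper: it reads a file list
# (one path per line) and reports every listed path that is
# missing or empty, followed by a one-line summary.
check_filelist() {
  bad=0
  while IFS= read -r path; do
    if [ ! -s "$path" ]; then
      echo "missing or empty: $path"
      bad=$((bad + 1))
    fi
  done < "$1"
  echo "$(wc -l < "$1" | tr -d ' ') file(s) listed, $bad problem(s)"
}

# Self-contained demo with placeholder files (not real TCF data).
tmp=$(mktemp -d)
mkdir "$tmp/mycorpus.d"
echo '<placeholder/>' > "$tmp/mycorpus.d/a.tcf"
: > "$tmp/mycorpus.d/b.tcf"    # empty file, should be reported
find "$tmp/mycorpus.d" -name '*.tcf' -print > "$tmp/mycorpus.files"
check_filelist "$tmp/mycorpus.files"
rm -rf "$tmp"
```

On a real corpus you would call `check_filelist mycorpus.files` directly.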
Create Local Corpus Index
... using dcdb-create.perl:
$ dcdb-create.perl -dclass=TCF -l dta-khmn-tcf.files -o=dta-khmn.d
dcdb-create.perl[13408] INFO: DiaColloDB.Corpus: using document parser class DiaColloDB::Document::TCF
dcdb-create.perl[13408] INFO: DiaColloDB: create(dta-khmn.d) v0.12.006
dcdb-create.perl[13408] INFO: DiaColloDB: (term x document) matrix modelling via DiaColloDB::Relation::TDF enabled.
dcdb-create.perl[13408] INFO: DiaColloDB: create(): processing 16 corpus file(s)
dcdb-create.perl[13408] INFO: DiaColloDB: create(): processing files [ 0%]: dta-khmn-tcf.d/hegel_logik0101_1812.tcf
...
dcdb-create.perl[13408] INFO: DiaColloDB: create(): processing files [ 94%]: dta-khmn-tcf.d/nietzsche_zarathustra04_1891.tcf
dcdb-create.perl[13408] INFO: DiaColloDB: create(): building attribute frequency filter (fmin_l=2)
dcdb-create.perl[13408] TRACE: DiaColloDB.Utils: CMD cut -d" " -f1 dta-khmn.d/atokens.dat | sort -n | uniq -c |
dcdb-create.perl[13408] INFO: DiaColloDB: create(): filter (fmin_l=2) pruning 16619 of 34324 attribute value type(s) (48.42%)
dcdb-create.perl[13408] INFO: DiaColloDB: create(): building attribute frequency filter (fmin_p=2)
dcdb-create.perl[13408] TRACE: DiaColloDB.Utils: CMD cut -d" " -f2 dta-khmn.d/atokens.dat | sort -n | uniq -c |
dcdb-create.perl[13408] INFO: DiaColloDB: create(): filter (fmin_p=2) pruning 0 of 10 attribute value type(s) (0.00%)
dcdb-create.perl[13408] INFO: DiaColloDB: create(): populating global term enum (tfmin=2)
dcdb-create.perl[13408] TRACE: DiaColloDB.Utils: CMD cut -d " " -f -2 dta-khmn.d/atokens.dat | sort -n | uniq -c |
dcdb-create.perl[13408] INFO: DiaColloDB: create(): will prune 19783 of 40745 term tuple type(s) (48.55%)
dcdb-create.perl[13408] INFO: DiaColloDB: create(): filtering corpus tokens & assigning term-IDs
dcdb-create.perl[13408] INFO: DiaColloDB: create(): assigned 20962 term tuple-IDs to 616008 of 635790 tokens (pruned 3.11%)
...
dcdb-create.perl[13408] INFO: DiaColloDB: creating co-frequency index dta-khmn.d/cof.* [dmax=5, fmin=2]
dcdb-create.perl[13408] TRACE: DiaColloDB.Utils: CMD | sort -nk1 -nk2 -nk3 | uniq -c - dta-khmn.d/cof.dat
dcdb-create.perl[13408] TRACE: DiaColloDB.Relation.Cofreqs: create(): stage1: generate pairs (dmax=5)
dcdb-create.perl[13408] TRACE: DiaColloDB.Relation.Cofreqs: create(): stage2: load pair frequencies (fmin=2)
...
dcdb-create.perl[13408] INFO: DiaColloDB: create(): DB dta-khmn.d created.
dcdb-create.perl[13408] INFO: DiaColloDB: close(dta-khmn.d)
dcdb-create.perl[13408] INFO: DiaColloDB: operation completed in 1m31.841s; db size = 1.5MB
Command-Line Interface
Query Local Index
- Simple term query using dcdb-query.perl:
$ dcdb-query.perl dta-khmn.d Vernunft -slice=50 -kbest=4
#1:N #2:f1 #3:f2 #4:f12 #5:score #6:label #7:lemma
1024502 14189 8957 565 9.643632 1750 rein
...
2373560 108 62 2 8.590609 1850 würgen
- compare authors using dcdb-query.perl (Kant vs. Hegel, TDF relation):
$ dcdb-query.perl dta-khmn.d -tdf -slice=0 '*=2 #has[author,/Kant/]' '*=2 #has[author,/Hegel/]'
#1:Na #2:Nb #3:f1a #4:f1b #5:f2a #6:f2b #7:f12a #8:f12b #9:scorea #10:scoreb #11:diff #12:label #13:lemma
539813 539813 115944 140187 1251 1251 937 1251 8.033354 4.212118 3.821236 0-0 möglich
...
539813 539813 115944 140187 1356 1356 2 1356 -0.839843 8.251063 -9.090906 0-0 Bestimmtheit
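In the two comparison rows quoted above, the diff column (#11) equals scorea (#9) minus scoreb (#10). The sketch below checks that relationship with awk. It rests on assumptions: `check_diff` is a hypothetical helper, the text output is taken to be tab-separated with 13 columns, and nothing here claims that every score function defines diff this way.

```shell
#!/bin/sh
# check_diff is a hypothetical helper. It assumes tab-separated
# comparison output with the 13 columns shown above, skips header
# lines starting with '#', and reports per lemma whether
# column 11 matches column 9 minus column 10.
check_diff() {
  awk -F '\t' '
    /^#/ || NF < 13 { next }
    {
      d = $9 - $10 - $11
      if (d < 0) d = -d
      printf "%s\t%s\n", $13, (d < 0.00001 ? "ok" : "mismatch")
    }' "$1"
}

# Demo on the two rows quoted above (tabs assumed as separators).
tmp=$(mktemp)
printf '#1:Na\t#2:Nb\t#3:f1a\t#4:f1b\t#5:f2a\t#6:f2b\t#7:f12a\t#8:f12b\t#9:scorea\t#10:scoreb\t#11:diff\t#12:label\t#13:lemma\n' > "$tmp"
printf '539813\t539813\t115944\t140187\t1251\t1251\t937\t1251\t8.033354\t4.212118\t3.821236\t0-0\tmöglich\n' >> "$tmp"
printf '539813\t539813\t115944\t140187\t1356\t1356\t2\t1356\t-0.839843\t8.251063\t-9.090906\t0-0\tBestimmtheit\n' >> "$tmp"
check_diff "$tmp"
rm -f "$tmp"
```

A positive diff means the item scores higher in the first (Kant) query, a negative diff that it scores higher in the second (Hegel) query.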
Query Remote Index
- using dcdb-query.perl:
$ dcdb-query.perl http://kaskade.dwds.de/dstar/dta/diacollo Vernunft -slice=50 -kbest=4 -date='1600:1899'
#1:N #2:f1 #3:f2 #4:f12 #5:score #6:label #7:lemma
27259072 5678 7969 46 6.787266 1600 Verstand
...
128594838 9651 6238 17 5.131722 1850 Kant
- using curl:
$ curl -sSL http://kaskade.dwds.de/dstar/dta/diacollo/profile -F query=Vernunft -F date=1600-1899 -F slice=50 -F format=text
#1:N #2:f1 #3:f2 #4:f12 #5:score #6:label #7:lemma #8:pos
27259072 5678 7969 46 6.787266 1600 Verstand NN
...
128594838 9651 15504 17 4.468905 1850 Gemeinschaft NN
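To fetch profiles for several lemmata, the curl call above can be wrapped in a small loop. The sketch below is a dry run: it only prints the commands, so nothing is sent to the server until you decide to run them. `profile_cmd` is a hypothetical helper; the endpoint and form fields are copied unchanged from the example above, and the second lemma and the output file names are arbitrary choices for illustration.

```shell
#!/bin/sh
# Endpoint copied from the curl example above.
BASE=http://kaskade.dwds.de/dstar/dta/diacollo

# profile_cmd is a hypothetical helper: it prints (does not run)
# the curl command for one simple, single-word query term.
profile_cmd() {
  printf 'curl -sSL %s/profile -F query=%s -F date=1600-1899 -F slice=50 -F format=text' "$BASE" "$1"
}

# Dry run: show one command per lemma, each redirecting to its own file.
for lemma in Vernunft Verstand; do
  echo "$(profile_cmd "$lemma") > $lemma.tsv"
done
```

Piping the printed lines to `sh` would execute them; terms containing spaces or shell metacharacters would need quoting first.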
Query Multiple Indices
- using dcdb-query.perl:
$ dcdb-query.perl 'list://http://kaskade.dwds.de/dstar/dta/diacollo http://kaskade.dwds.de/dstar/kern/diacollo' Vernunft -slice=50 -kbest=4
#1:N #2:f1 #3:f2 #4:f12 #5:score #6:label #7:lemma
905020 321 666 7 7.860449 1500 menschlich
...
139727118 17869 12174 69 6.233783 1950 Kant
WWW GUI Wrappers
... using dcdb-www-server.perl
Wrap a local index
- start the local server:
$ dcdb-www-server.perl dta-khmn.d
- point your browser at http://localhost:6066
- when done, press Ctrl+C to shut down the server (or close its terminal)
Wrap a remote index
- start the local server:
$ dcdb-www-server.perl -port=6067 http://kaskade.dwds.de/dstar/dta/diacollo
- point your browser at http://localhost:6067
- when done, press Ctrl+C to shut down the server (or close its terminal)
Wrap multiple indices (lazy union)
- start the local server:
$ dcdb-www-server.perl -port=6068 "list://http://kaskade.dwds.de/dstar/dta/diacollo http://kaskade.dwds.de/dstar/kern/diacollo"
- point your browser at http://localhost:6068
- when done, press Ctrl+C to shut down the server (or close its terminal)
Load Query Results into R
- Store data in a local file (WWW GUI: use the "Text" format's "Raw URL" link):
$ dcdb-query.perl dta-khmn.d Vernunft -slice=50 >vernunft.tsv
- load stored file into R:
$ R
> data <- read.delim("vernunft.tsv", header=FALSE, quote="", comment.char="#", col.names=c("N","f1","f2","f12","score","label","lemma"))
> # ... stuff happens ...
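For a quick look at a stored result file without starting R, awk is often enough. The sketch below prints the highest-scoring collocate per slice. It is an illustration under assumptions: `best_per_slice` is a hypothetical helper, and the file is taken to be tab-separated with the seven columns N, f1, f2, f12, score, label, lemma.

```shell
#!/bin/sh
# best_per_slice is a hypothetical helper. It assumes tab-separated
# columns N, f1, f2, f12, score, label, lemma, skips header lines
# starting with '#', and prints label, lemma and score of the
# highest-scoring row for each slice label.
best_per_slice() {
  awk -F '\t' '
    /^#/ || NF < 7 { next }
    !($6 in best) || $5 + 0 > best[$6] {
      best[$6] = $5 + 0   # numeric copy used for comparison
      raw[$6] = $5        # original text, printed unchanged
      lemma[$6] = $7
    }
    END { for (s in best) printf "%s\t%s\t%s\n", s, lemma[s], raw[s] }
  ' "$1" | sort -n
}

# Demo on the two rows quoted in the local query example
# (tabs assumed as separators).
tmp=$(mktemp)
printf '#1:N\t#2:f1\t#3:f2\t#4:f12\t#5:score\t#6:label\t#7:lemma\n' > "$tmp"
printf '1024502\t14189\t8957\t565\t9.643632\t1750\trein\n' >> "$tmp"
printf '2373560\t108\t62\t2\t8.590609\t1850\twürgen\n' >> "$tmp"
best_per_slice "$tmp"
rm -f "$tmp"
```

On a real result you would call `best_per_slice vernunft.tsv`.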