DiaCollo: How-To

Local Install

  • Prerequisites:
    • a UNIX-like environment (e.g. Linux, BSD, win32+msys/cygwin, ...)
    • perl (debian package "perl")
  • first, install cpanm (debian package "cpanminus")
  • install DiaCollo and any missing dependencies with:
    $ cpanm DiaColloDB DiaColloDB::WWW
  • test your installation:
    $ dcdb-create.perl --version
    dcdb-create.perl version 0.12.006 by Bryan Jurish
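
Before running cpanm it can help to confirm that the prerequisites are actually on your PATH. The following is a minimal sketch; the helper name check_prereqs is hypothetical and not part of DiaCollo:

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per missing command and
# returns non-zero if at least one command was not found.
check_prereqs() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Example for this how-to (not executed here):
#   check_prereqs perl cpanm dcdb-create.perl
```

If dcdb-create.perl is reported missing after installation, the directory into which cpanm installs scripts is probably not on your PATH.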

Local Corpus Data

Download Corpus Data

... e.g. from the Deutsches Textarchiv

Create Local Corpus Index

... using dcdb-create.perl:
$ dcdb-create.perl -dclass=TCF -l dta-khmn-tcf.files -o=dta-khmn.d
dcdb-create.perl[13408] INFO: DiaColloDB.Corpus: using document parser class DiaColloDB::Document::TCF
dcdb-create.perl[13408] INFO: DiaColloDB: create(dta-khmn.d) v0.12.006
dcdb-create.perl[13408] INFO: DiaColloDB: (term x document) matrix modelling via DiaColloDB::Relation::TDF enabled.
dcdb-create.perl[13408] INFO: DiaColloDB: create(): processing 16 corpus file(s)
dcdb-create.perl[13408] INFO: DiaColloDB: create(): processing files [  0%]: dta-khmn-tcf.d/hegel_logik0101_1812.tcf
...
dcdb-create.perl[13408] INFO: DiaColloDB: create(): processing files [ 94%]: dta-khmn-tcf.d/nietzsche_zarathustra04_1891.tcf
dcdb-create.perl[13408] INFO: DiaColloDB: create(): building attribute frequency filter (fmin_l=2)
dcdb-create.perl[13408] TRACE: DiaColloDB.Utils: CMD cut -d" " -f1 dta-khmn.d/atokens.dat | sort -n | uniq -c |
dcdb-create.perl[13408] INFO: DiaColloDB: create(): filter (fmin_l=2) pruning 16619 of 34324 attribute value type(s) (48.42%)
dcdb-create.perl[13408] INFO: DiaColloDB: create(): building attribute frequency filter (fmin_p=2)
dcdb-create.perl[13408] TRACE: DiaColloDB.Utils: CMD cut -d" " -f2 dta-khmn.d/atokens.dat | sort -n | uniq -c |
dcdb-create.perl[13408] INFO: DiaColloDB: create(): filter (fmin_p=2) pruning 0 of 10 attribute value type(s) (0.00%)
dcdb-create.perl[13408] INFO: DiaColloDB: create(): populating global term enum (tfmin=2)
dcdb-create.perl[13408] TRACE: DiaColloDB.Utils: CMD cut -d " " -f -2 dta-khmn.d/atokens.dat | sort -n | uniq -c |
dcdb-create.perl[13408] INFO: DiaColloDB: create(): will prune 19783 of 40745 term tuple type(s) (48.55%)
dcdb-create.perl[13408] INFO: DiaColloDB: create(): filtering corpus tokens & assigning term-IDs
dcdb-create.perl[13408] INFO: DiaColloDB: create(): assigned 20962 term tuple-IDs to 616008 of 635790 tokens (pruned 3.11%)
...
dcdb-create.perl[13408] INFO: DiaColloDB: creating co-frequency index dta-khmn.d/cof.* [dmax=5, fmin=2]
dcdb-create.perl[13408] TRACE: DiaColloDB.Utils: CMD | sort -nk1 -nk2 -nk3 | uniq -c - dta-khmn.d/cof.dat
dcdb-create.perl[13408] TRACE: DiaColloDB.Relation.Cofreqs: create(): stage1: generate pairs (dmax=5)
dcdb-create.perl[13408] TRACE: DiaColloDB.Relation.Cofreqs: create(): stage2: load pair frequencies (fmin=2)
...
dcdb-create.perl[13408] INFO: DiaColloDB: create(): DB dta-khmn.d created.
dcdb-create.perl[13408] INFO: DiaColloDB: close(dta-khmn.d)
dcdb-create.perl[13408] INFO: DiaColloDB: operation completed in  1m31.841s; db size = 1.5MB
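
The file list passed via -l above has to be created beforehand. The sketch below assumes that the list is a plain text file with one corpus filename per line and that the downloaded TCF files live under dta-khmn-tcf.d/ (as the log output suggests). The helper name make_file_list is hypothetical:

```shell
# Write a sorted list of all *.tcf files below a directory,
# one filename per line, to the given output file.
make_file_list() {
    dir=$1
    out=$2
    find "$dir" -type f -name '*.tcf' | sort > "$out"
}

# Example for this how-to (not executed here):
#   make_file_list dta-khmn-tcf.d dta-khmn-tcf.files
```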

Command-Line Interface

Query Local Index

  • simple term query using dcdb-query.perl:
    $ dcdb-query.perl dta-khmn.d Vernunft -slice=50 -kbest=4
    #1:N	#2:f1	#3:f2	#4:f12	#5:score	#6:label	#7:lemma
    1024502	14189	8957	565	9.643632	1750	rein
    ...
    2373560	108	62	2	8.590609	1850	würgen
  • compare authors using dcdb-query.perl (Kant vs. Hegel, TDF relation):
    $ dcdb-query.perl dta-khmn.d -tdf -slice=0 '*=2 #has[author,/Kant/]' '*=2 #has[author,/Hegel/]'
    #1:Na	#2:Nb	#3:f1a	#4:f1b	#5:f2a	#6:f2b	#7:f12a	#8:f12b	#9:scorea	#10:scoreb	#11:diff	#12:label	#13:lemma
    539813	539813	115944	140187	1251	1251	937	1251	8.033354	4.212118	3.821236	0-0	möglich
    ...
    539813	539813	115944	140187	1356	1356	2	1356	-0.839843	8.251063	-9.090906	0-0	Bestimmtheit

Query Remote Index

  • using dcdb-query.perl:
    $ dcdb-query.perl http://kaskade.dwds.de/dstar/dta/diacollo Vernunft -slice=50 -kbest=4 -date='1600:1899'
    #1:N	#2:f1	#3:f2	#4:f12	#5:score	#6:label	#7:lemma
    27259072	5678	7969	46	6.787266	1600	Verstand
    ...
    128594838	9651	6238	17	5.131722	1850	Kant
  • using curl:
    $ curl -sSL http://kaskade.dwds.de/dstar/dta/diacollo/profile -F query=Vernunft -F date=1600-1899 -F slice=50 -F format=text
    #1:N	#2:f1	#3:f2	#4:f12	#5:score	#6:label	#7:lemma	#8:pos
    27259072	5678	7969	46	6.787266	1600	Verstand	NN
    ...
    128594838	9651	15504	17	4.468905	1850	Gemeinschaft	NN
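
For repeated remote queries, the curl call above can be wrapped in a small function. This is only a sketch: the function name diacollo_profile and the CURL override variable are hypothetical, and the form fields are exactly those of the command shown above:

```shell
# Send a profile request to a DiaCollo web service.
# Arguments: BASE_URL QUERY DATE SLICE
# The CURL variable may name a replacement command (e.g. for dry runs).
diacollo_profile() {
    base=$1
    query=$2
    date=$3
    slice=$4
    # ${CURL:-curl} is intentionally unquoted so that a replacement
    # command with arguments is split into words.
    ${CURL:-curl} -sSL "$base/profile" \
        -F "query=$query" \
        -F "date=$date" \
        -F "slice=$slice" \
        -F "format=text"
}

# Example for this how-to (not executed here):
#   diacollo_profile http://kaskade.dwds.de/dstar/dta/diacollo Vernunft 1600-1899 50
```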

Query Multiple Indices

  • using dcdb-query.perl:
    $ dcdb-query.perl 'list://http://kaskade.dwds.de/dstar/dta/diacollo http://kaskade.dwds.de/dstar/kern/diacollo' Vernunft -slice=50 -kbest=4
    #1:N	#2:f1	#3:f2	#4:f12	#5:score	#6:label	#7:lemma
    905020	321	666	7	7.860449	1500	menschlich
    ...
    139727118	17869	12174	69	6.233783	1950	Kant

WWW GUI Wrappers

... using dcdb-www-server.perl

Wrap a local index

  • start the local server:
    $ dcdb-www-server.perl dta-khmn.d
  • point your browser at http://localhost:6066
  • when done, press Ctrl+C to shut down the server (or close its terminal)

Wrap a remote index

  • start the local server:
    $ dcdb-www-server.perl -port=6067 http://kaskade.dwds.de/dstar/dta/diacollo
  • point your browser at http://localhost:6067
  • when done, press Ctrl+C to shut down the server (or close its terminal)

Wrap multiple indices (lazy union)

  • start the local server:
    $ dcdb-www-server.perl -port=6068 "list://http://kaskade.dwds.de/dstar/dta/diacollo http://kaskade.dwds.de/dstar/kern/diacollo"
  • point your browser at http://localhost:6068
  • when done, press Ctrl+C to shut down the server (or close its terminal)

Load Query Results into R

  • store query results in a local file (WWW GUI: use the "Text" format's "Raw URL" link):
    $ dcdb-query.perl dta-khmn.d Vernunft -slice=50 >vernunft.tsv
  • load stored file into R:
    $ R
    > data <- read.delim("vernunft.tsv", header=FALSE, quote="", comment.char="#", col.names=c("N","f1","f2","f12","score","label","lemma"))
    > # ... stuff happens ...
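
Before loading a stored file into R, a quick check of how many rows each slice contributed can catch truncated or empty results. The sketch below again assumes the 7-column layout with the slice label in column 6. The helper name rows_per_slice is hypothetical:

```shell
# Print "label<TAB>count" for each slice label (column 6)
# found in a 7-column DiaCollo TSV file, sorted numerically by label.
rows_per_slice() {
    awk -F '\t' '
        /^#/ { next }                 # skip the "#"-prefixed header line
        NF >= 7 { n[$6]++ }           # count data rows per slice label
        END { for (k in n) print k "\t" n[k] }
    ' "$1" | sort -n
}

# Example for this how-to (not executed here):
#   rows_per_slice vernunft.tsv
```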