Tutorial: Quickstart¶
Let’s say we have a very short piece of text stored in input.txt. It looks something like:
Mrs. Bennet deigned not to make any reply, but, unable to contain herself, began scolding one of her daughters.
What are some of the tools in textkit that we can use on this text?
Convert Text to Tokens¶
Tokenization is the process of splitting text into chunks. These chunks can be sentences, words, or even parts of words.
Textkit converts a text file into a token document, where each line contains a single token.
textkit text2words input.txt
This command converts our input.txt text file into a token document where each token is a word.
The output would look something like:
Mrs.
Bennet
deigned
not
to
make
any
reply
,
but
,
unable
to
contain
herself
,
began
scolding
one
of
her
daughters
.
This is typically the first thing we want to do when using textkit, as textkit is all about working with tokens.
The output goes to standard out by default. You can redirect it to a file by using >.
textkit text2words input.txt > words.txt
This would put our words into words.txt.
We can also get bigrams (two-word tokens).
textkit text2words input.txt | textkit words2bigrams > bigrams.txt
Here we first convert the text to word tokens and use that as the input for the bigram tokenization.
The contents of bigrams.txt would look like:
Mrs. Bennet
Bennet deigned
deigned not
not to
to make
make any
any reply
reply ,
, but
but ,
, unable
unable to
to contain
contain herself
herself ,
, began
began scolding
scolding one
one of
of her
her daughters
daughters .
Note the use of | for piping one textkit command into another.
With no file passed in, many textkit commands default to standard in.
Standard in can also be indicated explicitly with a dash (-):
cat input.txt | textkit text2words -
Commands in textkit that begin with text transform text into tokens of some sort.
Any command that uses words expects to work with token documents that have one word per line.
A bigram is just a special case of an n-gram, so let’s make some ngrams of size 5:
textkit text2words input.txt | textkit words2ngrams -n 5
Which produces:
Mrs. Bennet deigned not to
Bennet deigned not to make
deigned not to make any
not to make any reply
to make any reply ,
make any reply , but
any reply , but ,
reply , but , unable
, but , unable to
but , unable to contain
, unable to contain herself
unable to contain herself ,
to contain herself , began
contain herself , began scolding
herself , began scolding one
, began scolding one of
began scolding one of her
scolding one of her daughters
one of her daughters .
Notice the -n argument, which indicates the number of words to include in each ngram.
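The sliding-window transformation that words2bigrams and words2ngrams perform can be pictured in a few lines of Python. This is only a sketch of the idea, not textkit’s actual implementation:

```python
def ngrams(tokens, n=2, sep=" "):
    """Join each run of n consecutive tokens into a single n-gram string."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["Mrs.", "Bennet", "deigned", "not", "to", "make"]
print(ngrams(words, n=2))  # bigrams: ['Mrs. Bennet', 'Bennet deigned', ...]
print(ngrams(words, n=5))  # 5-grams, like the output above
```

Each n-gram overlaps its neighbor by n - 1 words, which is why the sample output above repeats most words n times.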
With all textkit commands, the --help flag shows all possible arguments for a command.
textkit words2ngrams --help
Usage: textkit words2ngrams [OPTIONS] [TOKENS]
Tokenize words into ngrams. ngrams are n-length word tokens. Punctuation
is considered as a separate token.
Options:
--sep TEXT Separator between words in bigram output. [default: ]
-n, --length INTEGER Length of the n-gram [default: 2]
--help Show this message and exit.
Filter Tokens¶
textkit includes a number of filtering capabilities that can be useful for tweaking your tokens.
Notice that our word and ngram tokens above include commas and periods? Let’s remove them using filterpunc.
textkit text2words input.txt | textkit filterpunc
If we don’t want to pipe these commands together, we can also run the filters directly on words.txt, the saved word token file.
textkit filterpunc words.txt
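The effect of this filter can be sketched in Python: drop any token made up entirely of punctuation, so standalone commas and periods disappear while words like Mrs. survive. Again, this is an illustration of the behavior, not textkit’s implementation:

```python
import string

def filter_punc(tokens):
    """Drop tokens that consist entirely of punctuation characters."""
    return [t for t in tokens
            if not all(c in string.punctuation for c in t)]

print(filter_punc(["Mrs.", "Bennet", ",", "reply", "."]))
# ['Mrs.', 'Bennet', 'reply']
```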
In natural language processing, stop words are words so common that they provide little information about a document, and so are often removed. Textkit’s filterwords will remove stop words from our token output.
textkit filterwords words.txt
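Conceptually, stop word filtering is a set-membership check against a word list. The sketch below uses a tiny illustrative list; real stop word lists (including the one textkit ships with) are much longer:

```python
# A small stop word list for illustration only; textkit uses its own,
# much longer list.
STOP_WORDS = {"to", "not", "any", "but", "of", "her", "one"}

def filter_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercased form appears in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

print(filter_words(["not", "to", "make", "any", "reply"]))
# ['make', 'reply']
```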
We can also filter out words that are fewer than a certain number of characters long:
textkit filterlengths -m 5 words.txt
This would produce:
Bennet
deigned
reply
unable
contain
herself
began
scolding
daughters
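The -m flag here acts as a minimum length: every token shorter than 5 characters (Mrs., not, to, and so on) is dropped. As a sketch of that behavior, assuming -m sets the minimum:

```python
def filter_lengths(tokens, minimum):
    """Keep only tokens at least `minimum` characters long."""
    return [t for t in tokens if len(t) >= minimum]

print(filter_lengths(["Mrs.", "Bennet", "not", "reply"], minimum=5))
# ['Bennet', 'reply']
```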
Transform Tokens¶
There are a number of tools in textkit to transform tokens in various ways.
Ensuring the casing of our tokens is consistent is a common text analysis preprocessing step. This is done in textkit using tokens2lower and tokens2upper. These commands work on tokens as well as raw text.
textkit tokens2lower input.txt
mrs. bennet deigned not to make any reply, but, unable to contain
herself, began scolding one of her daughters.
textkit tokens2upper words.txt
MRS. BENNET DEIGNED NOT TO MAKE ANY REPLY, BUT, UNABLE TO CONTAIN
HERSELF, BEGAN SCOLDING ONE OF HER DAUGHTERS.
Token Information and Stats¶
textkit is also great for finding out interesting stuff about your text.
Count unique tokens with tokens2counts, which produces CSV-like output containing each token and the count of that token in the document.
textkit tokens2counts words.txt
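The counting itself is what Python’s collections.Counter does. The sketch below shows the idea of the token,count output; the exact ordering and formatting of textkit’s output may differ:

```python
from collections import Counter

def token_counts(tokens):
    """Count unique tokens, yielding CSV-like 'token,count' lines."""
    return [f"{token},{count}"
            for token, count in Counter(tokens).most_common()]

print(token_counts(["to", "contain", "to", "herself"]))
# ['to,2', 'contain,1', 'herself,1']
```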
TODO: topbigrams
TODO: tokens2pos
Package¶
Once the tokens are set up and transformed the way you want them, it can be useful to package a set of documents into a single file for downstream visualization or other uses.
textkit tokens2json words1.txt words2.txt > out.json
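The packaging idea is to read each token document (one token per line) and bundle them into one JSON structure. The sketch below uses a hypothetical schema keyed by filename purely for illustration; textkit defines its own JSON format:

```python
import json

def package_tokens(paths):
    """Bundle several token documents into one JSON object keyed by
    filename. Hypothetical schema for illustration; textkit's actual
    tokens2json output format may differ."""
    docs = {}
    for path in paths:
        with open(path) as f:
            docs[path] = [line.rstrip("\n") for line in f]
    return json.dumps(docs)
```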