Contributing

If you are interested in contributing to textkit we would love your help!

Here is a bit more about the structure of the codebase and how to contribute.

Code Structure

Each command is implemented in its own file. These command files are organized into sub-directories:

  • tokenize
  • filter
  • transform
  • stats
  • package

The use of these sub-directories is primarily for developer convenience and commands can be moved around if a better structure is found.

Commands

textkit uses Click. to handle command line arguments and inputs. Click uses decorators to define these arguments and options in a succinct way.

textkit strives to use text as an input and text as an output. Raw text can be processed using commands that start with text2 like text2words.

Token documents (text files with a token on each line) can be used and produced by commands that include words in the name.

Utilities

There are a very small set of utility functions that are useful in keeping textkit

These are contained in the utils.py file. Some that you might find helpful:

read_tokens will convert a token document into a list of tokens. Use this to process the input file if your input is a token document.

output is a light wrapper around the output capabilities of Click that prevents error messages if the command is exited early (like when piping to head).

Writing New Commands

Want to contribute a new command? Great!

textkit uses GitHub Pull Requests to incorporate other developer’s work.

Fork the repo and then create a branch for your new command. Create and test it, then submit a Pull Request.